The best practice is to plan for as many failure scenarios as possible
In our 24/7/365 world, computing infrastructure outages can kill a CIO’s reputation and career prospects swiftly and dramatically. Outages have attained an extremely high profile in most organizations because they visibly and quickly:
- Cost revenue.
- Undermine customer service.
- Cause work to grind to a halt.
- Undermine brand reputation.
Computing infrastructure outages occur for many reasons, including:
- Insufficient capacity.
- Failing to monitor end-to-end response time.
- Sloppy server management.
- Gaps in configuration management processes.
- External and internal network issues.
- Database administrator (DBA) fat-finger errors.
- Flaky application execution.
- External and internal electrical power outages.
- Scheduled maintenance taking too long.
At the recent Collision from Home virtual conference, Sebastien Stormacq, Principal Developer Advocate at Amazon Web Services (AWS), explored design patterns to achieve high availability. AWS is a well-known supplier of cloud computing infrastructure. He said, “Modern computing infrastructures embrace failure rather than trying to avoid it. Best-practice systems are designed to handle and recover from unexpected conditions.”
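A common expression of this embrace-failure mindset is to retry transient failures with exponential backoff and jitter instead of failing outright. The sketch below illustrates the general pattern; the function and parameter names are hypothetical, not part of any AWS API.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Retry a flaky operation, backing off exponentially between tries.

    Illustrative sketch only: names and defaults are assumptions,
    not a real cloud SDK interface.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter spreads retries out so many clients recovering
            # at once do not hammer the service in lockstep.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
```

For example, an operation that fails twice with a transient error and then succeeds would return normally on the third attempt, with the caller never seeing the intermediate failures.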
The best practices for minimizing outages and achieving high availability of applications focus on consciously planning for as many failure scenarios as possible. Below are practical measures that CIOs can implement to:
- Enhance their reputation and preserve their career prospects.
- Reduce the risk of computing infrastructure outages.
- Improve computing infrastructure resilience.
- Achieve high availability.
Migrate applications to the cloud
Most organizations struggle to achieve continuous high availability for their on-premise computing infrastructure. It’s expensive to buy the components and then implement them. It’s difficult to justify the wide range of technical specialists required to operate with high availability because most specialists are not needed full-time.
A better approach is to contract with suppliers of cloud computing infrastructure. They have accumulated deep experience and employ capable technical teams to achieve the high availability that often eludes individual organizations. Because these suppliers operate at a larger scale, where the cost of technical specialists is amortized over many customers, both the cost per customer and the results are attractive.
Buy failover services
Organizations can implement a failover environment for their on-premise computing infrastructure. However, seamlessly switching to the failover environment during an outage requires changes to applications and operating procedures, which can be challenging to implement. Unfortunately, some organizations discover gaps in their failover configuration only during their first outage, with significant negative consequences.
A better approach is to buy one of the levels of failover service that suppliers of cloud computing infrastructure all offer. These automatic failover services often eliminate or at least minimize the impact of the following:
- Computing infrastructure component failures.
- Internet backbone outages.
- Distributed denial of service (DDoS) attacks.
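The core idea behind these services can be sketched simply: route traffic to the highest-priority endpoint that passes a health check, and fall over to a standby when it does not. This toy illustration assumes a caller-supplied health check; real cloud failover services perform this at the DNS or load-balancer layer.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint, in priority order.

    A toy illustration of failover routing; `endpoints` and
    `is_healthy` are hypothetical, not a real cloud API.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    # No endpoint passed its health check: surface the total outage.
    raise RuntimeError("no healthy endpoints available")
```

With endpoints ordered `["primary", "standby"]`, traffic flows to the primary until its health check fails, at which point the standby is selected automatically.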
Architect applications for high availability
The post-incident review of computing infrastructure outages most often discovers:
- Single points of failure in the computing infrastructure.
- Application software defects.
- Missing application functionality for resiliency.
Architecting applications for high availability typically includes the following features:
- Extensive data validation on data input.
- An implemented backup and recovery strategy.
- Mechanisms to detect and prevent data loss or corruption before the fact, not after.
- Awareness of the active servers in the cluster.
- Application segmentation onto multiple servers to distribute functions such as web services, authentication, computing, database access, content management, email, and reporting.
Upgrade your on-premise network
At many organizations, the on-premise network works reasonably well. However, it is at risk of outages due to the following:
- Bottlenecks created by high-volume, local traffic.
- Single points of failure.
- Broadcast storms originating from defective Ethernet network cards and switches.
Upgrade your on-premise network to achieve high availability even when you have migrated most of your applications to the cloud. Network upgrades to consider include:
- Fewer end-user devices per switch to minimize sharing of the local network segment capacity.
- Multiple paths from switches to the edge of the local network to create multiple paths for network redundancy.
- More cable for Gigabit Ethernet and less Wi-Fi to improve network reliability and performance.
- Use of subnets to split a network into multiple smaller, interconnected networks to isolate local traffic to the local subnet whenever possible.
- Load balancing routers at the edge of your network to spread network traffic and server workload.
- Multiple ISP connections to spread the external network traffic and create redundant paths to the Internet.
- Active network monitoring to detect intrusions.
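The subnetting recommendation above hinges on a simple fact: traffic between two hosts on the same subnet stays on the local segment, while traffic between subnets must cross a router. Python's standard `ipaddress` module makes the membership check easy to demonstrate:

```python
import ipaddress


def same_subnet(host_a, host_b, network):
    """Check whether two hosts fall inside the same subnet.

    If both addresses belong to `network`, their traffic stays on
    the local segment instead of crossing a router.
    """
    net = ipaddress.ip_network(network)
    return (ipaddress.ip_address(host_a) in net
            and ipaddress.ip_address(host_b) in net)
```

For example, `192.168.1.10` and `192.168.1.20` share the `192.168.1.0/24` subnet, while `192.168.2.20` does not, so traffic to it would be routed rather than switched locally.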
Addressing the shortcomings in your computing infrastructure will keep your customers happy and preserve your reputation in our 24/7/365 world.
Yogi Schulz has over 40 years of information technology experience in various industries. Yogi works extensively in the petroleum industry. He manages projects that arise from changes in business requirements, the need to leverage technology opportunities, and mergers. His specialties include IT strategy, web strategy and project management.
The opinions expressed by our columnists and contributors are theirs alone and do not inherently or expressly reflect the views of our publication.
© Troy Media
Troy Media is an editorial content provider to media outlets and its own hosted community news outlets across Canada.