Scalability, high availability, containers, fault tolerance and eventual consistency. Tech terms can be confusing to those new to server administration or development. In the coming weeks, we’ll be breaking down common — yet potentially confusing — terms you will undoubtedly come across in your learning journey.
Closely related to a system’s uptime and functionality is the concept of high availability. A highly available system is designed to remain operational and usable for a higher-than-normal share of the time for the users who are accessing your website, application or program.
Key points of high availability for a system are locating and removing all single points of failure, preventing data loss, and ensuring operational stability at peak times. This is primarily achieved through redundancy.
Redundancy allows a system to fail over to a working copy of the failed element, whether this is a database, an instance in a cluster, or a backup filesystem. There are two common forms of redundancy: active/active and active/passive.
Active/active redundancy involves multiple items of the same kind working together to share the load and to detect and bypass failures. When one instance fails, one or more of the other operating instances take over for it and handle its requests.
While active/active redundancy relies primarily on clustered systems to share and take over any requests in the event of failure, active/passive systems keep one or more up-to-date secondary systems that are brought online only after a failure.
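To make the difference concrete, here is a minimal Python sketch of both patterns. The names `Node`, `ActiveActivePool` and `ActivePassivePair` are hypothetical and exist only for illustration; a real deployment would typically rely on a load balancer or cluster manager rather than in-process objects.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical server instance with a simple health flag."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActiveActivePool:
    """All nodes share the load; unhealthy nodes are bypassed."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # round-robin position

    def handle(self, request: str) -> str:
        # Try each node at most once, starting at the round-robin position.
        for offset in range(len(self.nodes)):
            index = (self._next + offset) % len(self.nodes)
            node = self.nodes[index]
            if node.healthy:
                self._next = (index + 1) % len(self.nodes)
                return node.handle(request)
        raise RuntimeError("no healthy nodes available")


class ActivePassivePair:
    """Only the active node serves traffic; the standby is promoted on failure."""

    def __init__(self, primary: Node, standby: Node):
        self.active = primary
        self.standby = standby

    def handle(self, request: str) -> str:
        if not self.active.healthy:
            # Promote the standby; the failed node becomes the new standby.
            self.active, self.standby = self.standby, self.active
        return self.active.handle(request)
```

In the active/active pool every healthy node receives traffic all the time, while in the active/passive pair the standby does no work until the primary is marked unhealthy.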
Uptime and Availability
Uptime and availability are often closely related, but they are not the same. An application may be accessible to users, but with certain features broken due to network failure or otherwise; this application may not be experiencing downtime, per se, but since it is not available in the expected manner for users, it is not considered highly available.
Uptime is measured as the percentage of time an application is up over an entire year. It is also commonly expressed using the “nines” system.
| Percent Uptime | Nines | Downtime Per Year |
|---|---|---|
| 90% | one nine | 36.5 days |
| 99% | two nines | 3.65 days |
| 99.9% | three nines | 8.76 hours |
| 99.99% | four nines | 52.56 minutes |
| 99.999% | five nines | 5.26 minutes |
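The downtime figures follow from simple arithmetic: multiply the length of a year by the fraction of time the application is allowed to be down. Here is a small Python sketch, assuming a 365-day year; the helper name `downtime_per_year` is ours.

```python
from datetime import timedelta


def downtime_per_year(uptime_percent: float, days_in_year: float = 365.0) -> timedelta:
    """Return the yearly downtime allowed by a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    fraction_down = (100 - uptime_percent) / 100
    return timedelta(days=days_in_year * fraction_down)


if __name__ == "__main__":
    for percent in (90, 99, 99.9, 99.99, 99.999):
        print(f"{percent}% uptime allows {downtime_per_year(percent)} of downtime per year")
```

For example, three nines leaves 0.1% of 365 days, which is 0.365 days or 8.76 hours.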
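Those familiar with AWS may recall the “eleven nines” of durability that S3 features. Note that durability is not the same as uptime: the figure means S3 is designed to preserve 99.999999999% of stored objects over a given year, that is, to avoid losing them. How often the service is reachable is a separate availability figure.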
Planning for High Availability
A highly available system is one that, beyond removing single points of failure, has reliable failure detection, considers system failure from all points, and often relies on automation to respond to failures. Single points of failure are usually eliminated through redundancy, addressed above; however, for failover or redundancy to function properly, the system needs to know when there is a failure. Because of this, failure detection is a must, and automation should be set up so the system takes any needed actions to keep failures from affecting the user experience.
Moreover, when planning for high availability, environmental concerns should also be addressed. Where is your hardware located? Are you prepared to fail over in the event of an issue at a data center, or is all your hardware in one place? If you have immediate access to your hardware, are you prepared to do any emergency replacements yourself in the event of hardware failure?
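As a rough illustration of failure detection paired with automation, the sketch below counts consecutive failed health checks and triggers a failover action once a threshold is reached. The `FailureDetector` class, its `probe` and `on_failure` callbacks, and the default threshold of three are assumptions made for this example, not a prescribed design.

```python
from typing import Callable


class FailureDetector:
    """Calls on_failure once after `threshold` consecutive failed probes."""

    def __init__(
        self,
        probe: Callable[[], bool],
        on_failure: Callable[[], None],
        threshold: int = 3,
    ):
        self.probe = probe              # returns True when the system is healthy
        self.on_failure = on_failure    # automated action, e.g. promote a standby
        self.threshold = threshold
        self.consecutive_failures = 0
        self.tripped = False            # True once failover has been triggered

    def check_once(self) -> bool:
        """Run one health check and return whether the system looked healthy."""
        try:
            ok = bool(self.probe())
        except Exception:
            # A probe that raises is treated the same as a failed check.
            ok = False

        if ok:
            # Recovery resets the counter and re-arms the detector.
            self.consecutive_failures = 0
            self.tripped = False
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.tripped:
            self.tripped = True
            self.on_failure()
        return False
```

In practice `probe` might wrap an HTTP health check and `check_once` would run on a timer. Requiring several consecutive failures is one way to avoid failing over because of a single dropped request.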
Costs and Concerns
Unfortunately, high availability comes at a cost. While planning for high availability for simple systems can seem easy, the nature of creating a highly available system itself will add complexity to your design, and complexity adds cost.
Additional instances, backup hardware, reserved servers, and other elements you may need to craft your system need to be weighed against the benefits of maintaining a highly available system. Consider the nature of your application’s downtime and what you can allow for and afford.
While working toward a highly available application can look like a daunting task, knowing the ins and outs of your system to locate and resolve any failure points can provide your users a positive, downtime-free experience, allow you to rest easy knowing you have automation in place, and increase your uptime overall.