What is Fault Tolerance?

Scalability, high availability, containers, fault tolerance and eventual consistency. Tech terms can be confusing to those new to server administration or development. In the coming weeks, we’ll be breaking down common — yet potentially confusing — terms you will undoubtedly come across in your learning journey.

One of the primary themes of this series, and a keystone of working as a systems administrator, engineer, or developer, is not just getting servers to run, but keeping them running. While we’ve already addressed scalability, high availability, and eventual consistency, another piece of the puzzle is not just avoiding failure, but planning for failures and navigating them gracefully when they happen. For this, we design fault-tolerant systems.

One of the features of high availability is that the system does not have any single point of failure. Fault tolerance takes this a step further: a fault-tolerant system continues working despite hardware failure, data corruption, software errors, or operator mistakes. The system or application may have decreased throughput, but it should still be functional. A fault-tolerant system also needs to isolate the fault and contain it, so that it does not spread to other working environments.
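As a rough illustration of "degraded but still functional," the sketch below shows a request handler that contains a failure in one non-critical component instead of letting it take down the whole response. Every name here (`fetch_recommendations`, `render_page`, the `degraded` flag) is hypothetical and invented for this example; it is a minimal sketch, not a pattern taken from any particular framework.

```python
def fetch_recommendations(user_id):
    """Hypothetical non-critical dependency that is currently failing."""
    raise ConnectionError("recommendation service unavailable")


def render_page(user_id, recommender=fetch_recommendations):
    """Build a page, isolating failures in the optional recommender."""
    page = {
        "user": user_id,
        "content": "core content",  # the critical part of the response
        "recommendations": [],
        "degraded": False,
    }
    try:
        page["recommendations"] = recommender(user_id)
    except Exception:
        # Contain the fault: serve the core content without the extras
        # and flag the response so operators can see the degradation.
        page["degraded"] = True
    return page


if __name__ == "__main__":
    print(render_page(42))
```

The page still renders when the recommender is down; it simply carries less content. A real system would also log the error and probably stop calling the failing component for a while, which this sketch leaves out.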

In most cases, fault-tolerant systems rely on either replication or redundancy, both of which are addressed in our high availability post. Replication means running multiple, identical servers that share the workload; redundancy means keeping multiple, identical servers on standby, waiting to take over should the original server fail. Redundancy is also used on the hardware side: physical servers may have backup CPUs, RAM, data drives, and more, prepared to take over in the event of failure. (If not, a failed part can sometimes be replaced manually through a process known as hot swapping, or switching certain pieces of hardware while the system runs.)
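The difference between the two approaches can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Server`, `ReplicatedPool`, and `FailoverPair` classes are hypothetical names made up for this example, and "failure" is simulated with a simple `healthy` flag rather than real health checks.

```python
class Server:
    """Toy server; `healthy=False` simulates a failed machine."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ReplicatedPool:
    """Replication: every server is active and shares the workload."""

    def __init__(self, servers):
        self.servers = servers
        self._next = 0  # round-robin position

    def handle(self, request):
        # Try each replica at most once, starting from the next in line.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # skip the failed replica
        raise RuntimeError("all replicas failed")


class FailoverPair:
    """Redundancy: the standby sits idle until the primary fails."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except ConnectionError:
            if self.standby is None:
                raise  # no spare left to promote
            # Promote the standby and retry the request on it.
            self.active, self.standby = self.standby, None
            return self.active.handle(request)


if __name__ == "__main__":
    pool = ReplicatedPool([Server("a"), Server("b", healthy=False), Server("c")])
    print([pool.handle(f"r{i}") for i in range(1, 4)])

    pair = FailoverPair(Server("primary", healthy=False), Server("standby"))
    print(pair.handle("r1"))
```

In the replicated pool, losing one server reduces capacity but requests keep flowing to the others. In the failover pair, the standby does no work until the primary fails, at which point it is promoted and takes over.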

Although ideal, a completely fault-tolerant environment is often impossible; keeping redundant components for every part of a system vastly complicates it and tends toward the prohibitively expensive. On a practical level, when considering fault tolerance for your environment, identify which parts of your system are most critical and plan around those components.

Fault tolerance is just one part of keeping a successful system running, but a well-planned, fault-tolerant system, even one that targets only critical components, can give users, system administrators, and developers alike a sound basis for handling issues as they appear.

Elle K

Elle is a technical writer and Linux aficionado at Linux Academy.
