What is fault tolerance?
Neither humans nor software are free from errors. In interfaces and systems, functionality can therefore only be guaranteed if not every fault has devastating effects. The key factor is fault tolerance: the ability of a system to keep working when part of it fails.
If developers take care over systems programming and interface design, most errors, including the worst ones, are eliminated during development to ensure full operability. But even a rigorous and comprehensive testing process can never be thorough enough to rule out all errors completely.
The larger the system, the more likely it is that some part of it will fail. Nevertheless, even gigantic, complex and interwoven systems such as Google and Facebook remain operational around the clock with very little downtime, and not because errors are impossible. Rather, established fault tolerance ensures that individual failures do not lead to a complete system failure.
What does fault tolerance mean in practice?
Fault-tolerant systems exist in different IT contexts (in hardware, in software, in networks and in distributed systems), and each context calls for its own solution to make the system sufficiently immune to errors. A few simple examples illustrate this:
Hardware: Hardware systems can be backed by identical or equivalent systems, so that mirrored hard disks or servers running in parallel take over the tasks of the faulty system in the event of a failure until the fault has been eliminated. RAID storage solutions fall into this category.
Software: Databases, storage systems or program versions can be made error-resistant through constant backups, so that the failure of the primary software does not lead to a complete crash. Instead, a secondary copy can be used.
Energy supply: A special part of hardware protection is securing the energy supply, which is why critical servers often rely on emergency generators in order to be prepared for disasters.
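The redundancy idea behind these examples can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how RAID or any real storage product works: the classes Replica and MirroredStore, and the healthy flag used to simulate a failure, are hypothetical names invented for this sketch.

```python
class ReplicaUnavailable(Exception):
    """Raised when a replica cannot serve a request."""


class Replica:
    """Hypothetical in-memory stand-in for one disk or server."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True  # flipped to False to simulate a failure

    def write(self, key, value):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        self.data[key] = value

    def read(self, key):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return self.data[key]


class MirroredStore:
    """Writes to every healthy replica, reads from the first that answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                written += 1
            except ReplicaUnavailable:
                continue  # tolerate the fault and carry on with the others
        if written == 0:
            raise ReplicaUnavailable("no replica accepted the write")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except ReplicaUnavailable:
                continue  # fail over to the next replica
        raise ReplicaUnavailable("no replica could serve the read")


primary, mirror = Replica("primary"), Replica("mirror")
store = MirroredStore([primary, mirror])
store.write("invoice-1", "paid")
primary.healthy = False          # simulate a failed disk or server
value = store.read("invoice-1")  # still answered, by the mirror
```

The caller never sees the failure of the primary. Bringing a repaired replica back up to date is deliberately left out of the sketch; a real system has to handle that step as well.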
Human error source – error tolerance in interfaces
To err is human, so it should come as no surprise that users are responsible for a large number of errors, both anticipated and unpredictable. It is therefore particularly important to build fault tolerance into interfaces so that user-facing software cannot be crashed by unintentional or deliberate incorrect entries. Fault tolerance thus also offers protection against deliberately malicious input, such as DoS attacks.
Error categories in interfaces
Avoidable Mistakes
This form of error is usually based on a flaw in the UI design: the target group was not correctly assessed, or navigation behavior was not adequately investigated. Users then behave differently in practice than expected, fill out forms incorrectly or click on the wrong link. Experience and, above all, extensive testing can usually prevent these errors.
Known but unavoidable mistakes
This type of error is known at the interface, but cannot be completely ruled out. Users leave forms incomplete, submit forms without checking them or fail to tick required declarations. Such errors should not reset the interface completely or lead to an error status. Instead, they should lead to a page on which the user can correct the entry.
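How such a known error can be handled without resetting the interface is sketched below in Python. The example is a sketch under assumptions: the form fields email and accept_terms, the function names and the shape of the returned dictionary are all invented for illustration and do not belong to any particular web framework.

```python
def validate_signup(form):
    """Return (cleaned_values, errors). Bad input is reported, not raised."""
    cleaned = {}
    errors = {}

    # Incomplete or malformed field: record a message instead of failing
    email = str(form.get("email") or "").strip()
    if "@" not in email:
        errors["email"] = "Please enter a valid e-mail address."
    else:
        cleaned["email"] = email

    # Required declaration that users often forget to tick
    if form.get("accept_terms") is not True:
        errors["accept_terms"] = "Please accept the terms to continue."
    else:
        cleaned["accept_terms"] = True

    return cleaned, errors


def handle_submit(form):
    cleaned, errors = validate_signup(form)
    if errors:
        # Show the form again with the user's entries kept and errors marked
        return {"status": "needs_correction", "values": dict(form), "errors": errors}
    return {"status": "ok", "values": cleaned}


# The user typed a valid address but forgot to tick the declaration
result = handle_submit({"email": "alice@example.com"})
```

Here result asks for a correction of accept_terms only, and the e-mail address the user already typed is preserved rather than discarded.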
Unpredictable errors
The most serious type of error is the one that cannot be foreseen: programmers simply cannot predict every situation, and an unanticipated one ultimately leads to an error at the interface. Attacks are also a frequent cause of unforeseeable errors.
How robustness is created at the software level
In order to keep software and interfaces running, fault tolerance usually takes one of two paths.
Forward error correction (often called forward error recovery) does not stop at the error. Instead, it replaces the faulty inputs or results with empirical or expected values. What exactly the error was does not matter, as the program simply continues as usual. This works in a similar way to auto-correction on a smartphone, in which an incorrect entry is replaced or completed with predicted values.
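A minimal Python sketch of this forward path follows. The function name parse_quantity, the default of 1 and the allowed range of 1 to 99 are assumptions chosen purely for illustration.

```python
DEFAULT_QUANTITY = 1  # assumed fallback value, chosen for illustration


def parse_quantity(raw, default=DEFAULT_QUANTITY, minimum=1, maximum=99):
    """Forward recovery: never fail, always return a usable quantity."""
    try:
        value = int(str(raw).strip())
    except ValueError:
        return default  # unreadable input: substitute the expected value
    # Out-of-range input is clamped to the nearest valid value
    return max(minimum, min(maximum, value))


quantities = [parse_quantity(x) for x in ["3", " 7 ", "abc", "", "500", None]]
# quantities == [3, 7, 1, 1, 99, 1]
```

The trade-off is visible in the last four entries: the program keeps running, but the substituted values may not be what the user meant, so this path suits inputs where a plausible default is acceptable.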
Backward error correction (often called backward error recovery) takes note of the error and resets the software to a state before the error occurred, so that work can continue from the last known correct state. An example would be a RAID hard drive system in which a failed drive is bypassed and the data is loaded from the last saved state on the drives that are still operational.
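The backward path can be sketched as a checkpoint followed by a rollback. The Account class below is hypothetical and keeps its state in memory; a real system would typically rely on database transactions or persisted snapshots instead.

```python
import copy


class Account:
    """Hypothetical state holder with checkpoint and rollback."""

    def __init__(self, balance):
        self.state = {"balance": balance, "history": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Remember the current state as the last known correct one
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything that happened since the checkpoint
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, operations):
        """Apply all operations, or none of them if any one fails."""
        self.checkpoint()
        try:
            for amount in operations:
                new_balance = self.state["balance"] + amount
                if new_balance < 0:
                    raise ValueError("balance would become negative")
                self.state["balance"] = new_balance
                self.state["history"].append(amount)
            return True
        except ValueError:
            self.rollback()  # return to the last known correct state
            return False


account = Account(100)
first = account.apply([-30, 20])    # succeeds, balance is now 90
second = account.apply([-50, -60])  # second step fails, first step is undone
```

After the failed batch the balance is 90 again, exactly as it was at the checkpoint, and the partially applied withdrawal of 50 has left no trace.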
Creating fault tolerant systems in practice
In practical programming, there are various methods for developing fault-tolerant software, interfaces and systems. One example is the circuit breaker, which stops sending requests to a failing service and redirects them to a fallback instead. Netflix is a well-known user of this pattern: failed service requests are redirected there. Libraries such as Hystrix (developed at Netflix, now in maintenance mode) and Resilience4j are available for this.
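The circuit breaker pattern can be sketched in Python as follows. This is a simplified illustration and not the API of Hystrix or Resilience4j: the CircuitBreaker class, its parameters failure_threshold and reset_timeout, and the two example functions are assumptions made for this sketch.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the failing service
            # Timeout elapsed: half-open, let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result


calls = []


def flaky_service():
    calls.append("attempt")
    raise ConnectionError("service down")


def cached_answer():
    return "cached recommendations"


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
results = [breaker.call(flaky_service, cached_answer) for _ in range(5)]
```

All five calls return the fallback, but the failing service is only contacted twice. Once the threshold is reached, the breaker answers from the fallback directly, which gives the service time to recover instead of burying it under further requests.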
Another example of practical fault tolerance is load balancing, in which excess load is diverted to redundant capacity. On the software side, high utilization is distributed over several nodes in order to manage resources better and prevent downtime. HAProxy and Nginx are popular tools for this.
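The principle can be illustrated with a small round-robin balancer in Python. The sketch is an assumption-laden toy: the RoundRobinBalancer class and the node names are invented here, and tools such as HAProxy and Nginx do this at the proxy level through their own configuration rather than through code like this.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through nodes and skips any that are marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        # Look at each node at most once per call
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [balancer.next_node() for _ in range(3)]
balancer.mark_down("app-2")  # simulate a failed node
after = [balancer.next_node() for _ in range(4)]
# before == ["app-1", "app-2", "app-3"]
# after  == ["app-1", "app-3", "app-1", "app-3"]
```

Requests are spread evenly while all nodes are up, and the failed node is simply skipped afterwards. In a real deployment, the calls to mark_down and mark_up would be driven by periodic health checks.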
Theoretical fault tolerance is practical robustness
No interface, no software, no system and no web service is free from errors. As complexity and duration of operation increase, it becomes less and less possible to rule errors out.
All the more attention should therefore be paid to fault tolerance, error correction and system robustness with minimal downtime. Ideally, fault tolerance is so high that users don’t even notice that an error has occurred behind the scenes or at the keyboard.