What is Fault Tolerance in Distributed Systems

December 8, 2017 Author: rajesh

The use of technology has grown enormously, and today computer systems are interconnected through many different communication media. Distributed systems have become part of our day-to-day activities because they let nodes share their resources with other connected systems and devices, giving people access to geographically distributed computing facilities. However, a distributed system has many points of failure, and failures at several of them can make a service unavailable.

Definition of Fault Tolerance

In a broad sense, fault tolerance is associated with reliability, with successful operation, and with the absence of breakdowns. A fault-tolerant system should be able to handle faults in individual hardware or software components, power failures or other kinds of unexpected disasters and still meet its specification.

Fault tolerance is the ability of a system to continue performing its intended tasks correctly after hardware or software faults occur. Research on fault-tolerant systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, military and space systems, and distribution and service systems. Applying fault tolerance in a system aims to improve how efficiently and how well it performs when faults are present. Over the last few decades, many researchers have contributed to fault tolerance techniques for computer systems that are prone to software or hardware failures.

To discuss fault tolerance, the following terms are commonly used:


Figure 1: View of Error, Failure and Fault

Fault- A fault, sometimes called a bug, is the identified or hypothesized cause of a software failure. Software faults can be classified as (i) design faults and (ii) operational faults, according to the phase in which they are created. Most software faults are design faults, which occur when a designer either misunderstands a specification or simply makes a mistake.

Error- An error is the part of the system state which is liable to lead to a failure. It is an intermediate stage between a fault and a failure.

Failure- A failure occurs when the user perceives that a software program is unable to deliver the expected service. A failure mode is an identifiable weakness in the system's design or manufacture. Failures can be classified into severity classes, e.g. critical, major, and minor.

A fault-tolerant system may be able to tolerate one or more types of fault, including:

  • Transient, intermittent or permanent hardware faults
  • Software and hardware design errors
  • Operator errors
  • Externally induced upsets or physical damage
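
Of the fault types listed above, transient faults are the ones most often masked simply by trying the operation again. The following is a minimal Python sketch of that idea, not a technique prescribed by this article: `TransientFault`, `call_with_retry` and `make_flaky` are hypothetical names invented for illustration. A permanent fault would exhaust the attempts and still reach the caller as a failure.

```python
import time


class TransientFault(Exception):
    """Hypothetical marker for a fault that may disappear on a retry."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Run `operation`, retrying when it raises TransientFault.

    Re-raises the last TransientFault if every attempt fails, which is
    how a permanent fault would show up to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == attempts:
                raise  # out of attempts: the fault is not masked
            time.sleep(delay_s)  # wait before the next attempt


def make_flaky(failures_before_success):
    """Build a test operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault(f"call {state['calls']} failed")
        return "ok"

    return operation
```

With `make_flaky(2)` and three attempts the caller never sees the two faults; with `make_flaky(5)` the same call raises `TransientFault`, because the fault outlasts the retry budget.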


Figure 2: The Transition of Fault, Error and Failure in a Software Life Cycle
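
The chain from fault to error to failure can be made concrete with a small Python sketch. The functions `mean_faulty` and `report` are hypothetical and purely illustrative. The fault is a dormant design mistake, the error is the bad state it produces once it is activated, and the failure is what the user would see if that error were not handled.

```python
def mean_faulty(values):
    # Fault: the designer forgot that `values` may be empty.
    # The fault stays dormant for every non-empty input.
    return sum(values) / len(values)


def report(values):
    """Format the mean for a user, containing the error if it occurs."""
    try:
        return f"mean = {mean_faulty(values):.2f}"
    except ZeroDivisionError:
        # Error: an empty input activated the fault and the computation
        # reached an invalid state. Handling it here stops the error
        # from becoming a failure visible to the user.
        return "mean = n/a (no samples)"
```

Calling `mean_faulty([])` directly lets the error propagate as an unhandled exception, which is the failure; `report([])` detects the same error and still delivers a sensible answer.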

Fault tolerance is a vital issue in distributed computing: it keeps the system working in the presence of failures. The most important point is that the system keeps functioning even if one of its parts stops working or becomes faulty. Fault tolerance is closely related to dependability. Dependability covers several useful requirements for a fault-tolerant system: availability, reliability, safety, and maintainability.

  • Availability: The system is in a ready state and can deliver its functions to its users. A highly available system is one that is most likely to be working at any given instant in time.
  • Reliability: The ability of a computer system to run continuously without failure. Unlike availability, reliability is defined over a time interval instead of at an instant in time. A highly reliable system works for a long period of time without interruption.
  • Safety: When a system temporarily fails to carry out its processes correctly, nothing catastrophic happens.
  • Maintainability: How easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically.
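
The difference between availability and reliability is easier to see with numbers. The Python sketch below uses two common textbook models that this article does not itself state, so treat them as assumptions: steady-state availability as MTTF / (MTTF + MTTR), and reliability over an interval as exp(-t / MTTF) under a constant failure rate. MTTF is the mean time to failure, MTTR is the mean time to repair, and the function names are illustrative.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up, assuming steady state."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be > 0 and mttr_hours >= 0")
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours, mttf_hours):
    """Probability of running t_hours without any failure.

    Assumes a constant failure rate of 1 / mttf_hours
    (exponentially distributed time to failure).
    """
    if t_hours < 0 or mttf_hours <= 0:
        raise ValueError("t_hours must be >= 0 and mttf_hours > 0")
    return math.exp(-t_hours / mttf_hours)
```

For example, a system that goes down for about one millisecond every hour has `availability(1.0, 1 / 3_600_000)` above 99.9999 percent, yet `reliability(24.0, 1.0)` is practically zero, because it almost never runs a full day without interruption.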

As we have seen, a fault-tolerant system is one that keeps running correctly and executing its programs properly in the event of a partial failure. The faults involved are usually narrowed down to hardware or software failure (node failure) or unauthorized access (machine error).
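
One common way to keep functioning through a partial failure is redundancy: the same data is held on several nodes, and a request moves on to the next node when one is down. The Python sketch below illustrates that idea only. `NodeFailure`, `make_replica` and `read_with_failover` are hypothetical names, and real replicas would be remote services, not local functions.

```python
class NodeFailure(Exception):
    """Hypothetical signal that a replica could not answer."""


def make_replica(data, alive=True):
    """Build a stand-in for a node that serves `data` while it is alive."""
    def replica(key):
        if not alive:
            raise NodeFailure("replica is down")
        return data[key]

    return replica


def read_with_failover(replicas, key):
    """Return the answer of the first replica that responds.

    Raises NodeFailure only when every replica has failed, i.e. when
    the failure is no longer partial.
    """
    for replica in replicas:
        try:
            return replica(key)
        except NodeFailure:
            continue  # this node is down: try the next one
    raise NodeFailure(f"all {len(replicas)} replicas failed for key {key!r}")
```

With one dead replica followed by a live one, `read_with_failover` still returns the stored value, so the node failure stays invisible to the caller.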


