CS 626: Fault Tolerant Computing Systems

Course Contents:

The course discusses the principles and practice of fault tolerance in software and distributed systems.

Topics to be covered include: system model (errors, failures, and faults), software fault tolerance, Byzantine agreement, fail-stop processors, stable storage, reliable and atomic broadcasting, process resiliency, data resiliency and recovery, commit protocols, reliability modeling and performance evaluation, crash recovery in databases, and voting methods.

Books and References:

P. Jalote. Fault Tolerance in Distributed Systems, Prentice Hall Inc., 1994.

Various research papers.