This course covers the principles and practice of fault tolerance in software and distributed systems.
Topics to be covered include: the system model (faults, errors, and failures), software fault tolerance, Byzantine agreement, fail-stop processors, stable storage, reliable and atomic broadcast, process resiliency, data resiliency and recovery, commit protocols, reliability modeling and performance evaluation, crash recovery in databases, and voting methods.
P. Jalote, Fault Tolerance in Distributed Systems, Prentice Hall Inc., 1994.
Various research papers.