This work was done from August 2000 to December 2001 for my MS thesis titled
Design and Evaluation of a Distributed Diagnosis Algorithm for Arbitrary Network Topologies in Dynamic Fault Environments. To explain its contents in layman's terms, consider a network of computers or hosts or processors, which we will call
nodes. Any two nodes may not have a communication link directly connecting them (arbitrary network topologies), and the nodes can fail and recover (dynamic fault environment). The problem to be solved is - Each node must keep track of the current state (failed or working) of all other nodes in the network. This problem is known as the distributed diagnosis problem.
A synchronous system model is assumed, and the network is assumed to be reliable. The synchronous system model means that all computations and message transmissions take place in bounded time,. The fault model assumed for the nodes is the crash fault model. When nodes fail, they do not send or receive messages nor do they perform any computation. In essence, they could be considered to have been switched off. When they recover, they begin from a certain initial state regarding their views of other nodes in the network.
For nodes to keep track of the status of other nodes, messages must obviously be exchanged. Since each node is required to independently diagnose other nodes in an arbitrarily connected network, messages must be propagated in such a way that it can be guaranteed that a message sent by a node is received by all other working nodes in the network. This places some restrictions on the minimum amount of time a node must remain in a particular state (working or failed), called the
state holding time. In a dynamic fault environment, nodes must buffer messages for reliable propagation, and the size of these buffers must be kept bounded. In addition, due to the arbitrary-topology nature of the network, messages may arrive in duplicate or out of order, potentially causing nodes to incorrectly detect changes in state.
The thesis contains a distributed diagnosis algorithm that solves this problem. Precise bounds on the diagnosis latency and other parameters were derived, and the algorithm was thoroughly analyzed using discrete-event simulation techniques. My thesis can be downloaded
here. A paper based on this research appeared in the May 2004 issue of the IEEE Transactions on Parallel and Distributed Systems. The paper can be downloaded
here.