Articles
- Main
Computer Security
- Detecting compromised servers in quorum systems
Fault Tolerance
General / Other
Distributed Diagnosis for Arbitrary Network Topologies in Dynamic Fault Environments
This work was done from August 2000 to December 2001 for my
MS thesis titled Design and Evaluation of a Distributed Diagnosis Algorithm for
Arbitrary Network Topologies in
Dynamic Fault Environments.
To explain its contents in layman's terms, consider a
network that connects several computers or hosts or processors, which we
will call nodes.
Any two nodes may not have a communication link directly connecting
them (arbitrary network topologies), and
the nodes can fail and recover (dynamic fault environment).
Each node must keep track of the current state (failed or working) of all other nodes in the network,
and this problem is known as the
A synchronous system model is assumed, and the network is assumed to be reliable. This means all computations and message transmissions and such take place in bounded time, and the network links never fail. The fault model assumed is the crash fault model. When nodes fail, they do not send or receive messages nor do they perform any computation. In essence, they could be considered to have been switched off. When they recover, they begin from a certain initial state regarding their views of other nodes in the network.
In such a model, a definition for the correctness of a distributed diagnosis algorithm is required. This was developed by my thesis advisor, Dr. Douglas Blough. For nodes to keep track of the status of other nodes, messages must obviously be exchanged. Since each node is required to independently diagnose other nodes in an arbitrarily connected network, messages must be propagated in such a way that it can be guaranteed that a message sent by a node is received by all other working nodes in the network. This places some restrictions on the minimum amount of time a node must remain in a particular state (working or failed), called the state holding time. In a dynamic fault environment, nodes must buffer messages for reliable propagation, and the size of these buffers must be kept bounded. In addition, due to the arbitrary-topology nature of the network, messages may arrive in duplicate or out of order, potentially causing nodes to incorrectly detect changes in state.
The thesis contains a distributed diagnosis algorithm that meets the correctness criteria. Precise bounds on the diagnosis latency and other parameters were derived, and the algorithm was thoroughly analyzed using discrete-event simulation techniques. My thesis can be downloaded here. A paper based on this research appeared in the May 2004 issue of the IEEE Transactions on Parallel and Distributed Systems. The paper can be downloaded here.
This page was last modified on Thursday, December 27, 2007.