Articles
- Main
Computer Security
- Detecting compromised servers in quorum systems
Fault Tolerance
General / Other
Detection of Compromised Servers in Byzantine Quorum Systems
In this work, the problem of detecting compromised data storage servers was considered. The setup used was a set of servers that is used for data storage and retrieval. Data is replicated at the various servers (as opposed to coding schemes). The first part of this writeup explains Byzantine quorum systems, while the latter half explains the exact problem and the solution developed.
To read and write data, a user accesses a subset or quorum of all the servers present. Since the reads and writes take place to subsets or quorums of servers instead of all the servers, it is possible to get outdated responses during reads. That is, some servers will have outdated copies of a data object as they may not have been written to in the last write to that data object. If servers are assumed to be vulnerable to security attacks, then compromised servers can even fabricate arbitrary responses. To counter this, first an upper bound on the number of faulty or compromised servers is assumed. Denote this upper bound, also known as the fault threshold, by the letter b. Next, timestamps must be used to distinguish old and new values. The value read will be the response returned by at least b+1 servers with the same highest timestamp. The quorums chosen for reads and the quorums chosen for writes must overlap in at least 2b+1 servers for this approach to work. This technique is called Byzantine Quorum Systems. An enormous amount of literature exists on different variations and optimizations of this technique.
To explain it with an example, consider a system consisting of 12 servers, denoted by S1, S2, ..., S12. Lets say the current fault threshold is b=2, and that reads and writes take place to quorums or subsets of size 9 servers. Now say a user writes a new value to a data object V to servers S1, S2, ..., S9. When the user wants to retrieve the data later, he reads from (say) servers S4, S5, ..., S12. Servers S4 to S9 will return the value that was last written to V, while servers S10, S11, and S12 will return outdated responses.
Amongst the servers S4 to S9, a maximum of two servers can be compromised because the fault threshold is assumed to be two. Assume S4 and S5 are the faulty or compromised servers, and say they return the outdated value returned by servers S10, S11, and S12. The user will now have servers S4, S5, S10, S11, and S12 return responses with a certain timestamp, and servers S6, S7, S8, and S9 return responses with a higher timestamp. Note that four servers (S6, S7, S8, S9) return responses with the highest timestamp, while the fault threshold is only two. So at least one server in (S6, S7, S8, S9) is not faulty, and the response returned by it is legitimate. Since such a response has the highest timestamp, it must be the one that was last written by the user.
That explains in a nutshell how Byzantine quorum systems work. The precise problem tackled in this work is how to detect a faulty or malicious server that returns incorrect responses, perhaps occasionally. Since writes do not take place to all the servers, there is a good chance that even fault-free or uncorrupted servers return incorrect (outdated) responses.
Two modifications to the system setup are done to tackle this problem. First, users do not directly contact a quorum of servers; instead, they contact an arbitrarily chosen server, called the proxy server, and the proxy servers forwards the requests / responses to the servers / users. This allows proxy servers to monitor read responses and thus detect which servers returned incorrect responses and which ones returned correct responses. The second modification to the system model is the inclusion of a secure, fault tolerant, dedicated server called the diagnosis node. The diagnosis node receives diagnostic reports from the proxy servers periodically, and it runs a fault detection algorithm to determine which servers are faulty and which are not. Based on the decisions, the faulty servers could be removed or the fault threshold could be increased, or any other appropriate action can be taken.
At the end of a read, a proxy server can determine which servers returned correct responses and which ones did not, using the same algorithm a user would follow. This is monitored over a sequence of read operations. Over a given number of consecutive read operations, fault-free servers are expected to give only a certain fraction of incorrect responses, since they do not participate in all writes. Servers that give a considerably larger number of incorrect responses are determined to be faulty servers, and reported to the diagnosis node.
Since all these are probabilistic techniques, there is a certain probability that proxy servers incorrectly determine another data server to be compromised. Worse, a malicious proxy server can blindly accuse other data servers as compromised. The diagnosis node therefore employs a voting mechanism to determine if a suspected data server is indeed compromised.
A paper based on this work appeared in the 22nd International Symposium on Reliable Distributed Systems (SRDS), October 2003, which can be downloaded here. The paper gives the complete mathematical treatment to the above algorithm, and also gives the read and write protocols that take into account the removal of faulty or compromised servers. Simulation results bringing out the benefits of this approach is presented. It is also shown via simulations that the above described fault detection algorithm can tolerate some degree of concurrency between reads and writes, which has the potential to skew the fault detection algorithm.
This page was last modified on Thursday, December 27, 2007.