LEINWEBER

EECS454

Project Report


Unreliable Failure Detectors for Reliable Distributed Systems

This paper by Tushar Deepak Chandra and Sam Toueg analyzes and solves some problems with distributed systems, in which a network of computers is organized to work together. The strength of a distributed system, over a single powerful computer, is that the distributed system can continue to function even if some processors have failed or become inaccessible; however, this strength can be exploited only if there are algorithms to identify and weed out the failed processors. The dilemma comes with the requirement that any processor failure can be accommodated. It would be relatively simple to appoint a supervisor to identify failed machines, but then the supervisor becomes the weak link. So these distributed systems algorithms treat all machines as equal peers and carry on reliably in spite of a number of failed processors.

Two-Army Problem

The introduction to the concepts of distributed systems and reliability often begins with a description of the classic two-army problem. In this military analogy, two allied armies, the blue armies, are camped on hills separated by a valley. The opposing red army is camped in the valley. Each blue army has 3,000 soldiers while the red army has 5,000. If the blue armies coordinate their attacks, they will win. If they attack separately, the red army wins. The blue army on the first hill has a scout on a horse who can deliver messages to the army on the other hill, but to get there, he must pass through the valley occupied by the red army, so there is a significant risk in sending a message that it will not get through.

To coordinate their attack, of course, the first blue general sends a message to suggest an attack plan to the blue general on the other hill. Until the scout returns, the first general cannot be sure that the second blue general agreed to the plan. Even then, the second general cannot be sure that the first general received his reply. The scout could continue to relay confirming messages, but the last general who sent him could never be sure he got to the other general.

In the analogy, the blue armies represent two processors and the path through the valley represents the communication channel between the processors. The scout embodies the message. The moral of the story is that there is no way to absolutely guarantee agreement between processors via an unreliable communication channel.

The model used in this paper assumes reliable channels but unreliable failure detectors, which may be equivalent. Whether a processor fails or the channel to a processor fails, the result is the same in that the processor cannot be relied upon by the processor at the other end of the channel. The two-army problem is used here to introduce the concepts and problems associated with reliable distributed systems in the presence of some unreliable components.

Byzantine Generals Problem

The other classic introductory analogy for distributed systems is another military scenario. In this problem, a group of Byzantine generals has telephone communication so each general can send a message to each other general. Again, the goal is to coordinate an attack against an enemy, but now there are traitors who will lie or mislead the loyal generals. This is not a logic puzzle, but the model is that each general will report troop levels, accurately or not, and the loyal generals must reach consensus on the number of soldiers that the loyal generals command.

The original version of this problem was solved with a commander and a group of lieutenants, so the officers were not all peers. The strategy was that each officer would report his own troop level to every other officer. Then, in a second round, each officer would relay all the reports from the first round to every other officer. The reason for the second round was that if an officer lied and gave inconsistent reports in the first round, the inconsistency would be found as a contradiction in the second round. Also, if an officer lied in the second round by falsifying the reports he received in the first round, this type of lie would be caught as long as the majority of reports each officer received were correct. So the solution requires a majority of loyal officers.
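The two-round exchange can be sketched in code. This is a hypothetical simulation, not the paper's own pseudocode: a traitorous officer equivocates in round 1, everyone relays honestly in round 2 for simplicity, and each officer decides every troop level by majority vote over all the reports it holds.

```python
# Hypothetical sketch of the two-round exchange (illustrative, not the
# original algorithm). A traitor equivocates in round 1; for simplicity,
# everyone (even the traitor) relays honestly in round 2.
from collections import Counter

def majority(values):
    # Deterministic majority: most common value, smallest value on a tie.
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def exchange(troops, traitors):
    n = len(troops)
    # Round 1: officer i tells officer j its troop level; a traitor
    # tells each officer a different false number.
    round1 = [[troops[i] if i not in traitors else 1000 + j
               for j in range(n)]
              for i in range(n)]
    # Round 2: every officer k relays "i told me round1[i][k]", so each
    # officer j decides i's troop level by majority over all n reports.
    return [[majority([round1[i][k] for k in range(n)])
             for i in range(n)]
            for j in range(n)]
```

With four officers and one traitor (the 3n + 1 case for n = 1), the loyal officers reach identical decisions about every officer, including the equivocating traitor, because the second round gives them all the same set of reports to vote over.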

The size of this majority is widely reported as requiring a total of 3n + 1 officers if there are n traitors. At first glance that seems excessive: a loyal officer needs only n + 1 true reports to outvote n false reports, so a total of 2n + 2 officers would appear sufficient. But a traitor can report different values to different officers and can falsify the reports he relays, so the simple vote-counting argument understates the problem; with only oral reports, 3n + 1 officers are in fact necessary.

In this model, the officers represent the processors and the telephone lines are the communication channels. Here, the channels are reliable, but the processors are unreliable, even malicious. In fact, the solution seems to require that the defective processors misbehave. If a traitor reports his troop level consistently but inaccurately, he will seem loyal. The analogy to distributed systems is used to solve the problem of consensus, where a result needs to be agreed on. It is outside the scope of the problem to determine if the value is ultimately correct.

The Byzantine Generals Problem considers the problem of unreliable processors in a distributed system, but the paper does not directly address this either. The paper considers processors that crash and assumes that they stay crashed, that they never recover. But the Byzantine Generals Problem introduces the idea of using reports about other processors as votes and making decisions based on a majority of received reports. This forms the basis of many distributed systems solutions to the Consensus problem.

Distributed Systems

A distributed system may be defined as a group of processors connected by communication channels. The paper calls these processes, but that term suggests processes sharing a single processor, an environment in which communication within a process and between processes uses the same mechanism. A distributed system is more typically a group of independent processors that communicate via a network and work together to solve a problem. The goal of this organization is to provide a system robust against any local failure, to provide local access over a wide area, and to make efficient use of physically separate resources.

Types of Failure

Distributed systems suffer from two types of failure: communication problems, as illustrated in the two-army problem, and processor problems, as illustrated in the Byzantine generals problem. Unreliable communication includes messages that are lost or corrupted. Processor problems include crashing, unresponsiveness for long periods, and deliberate mischief.

These problems are solved with two classes of algorithms called Atomic Broadcast and Consensus. To provide a model of reliable communication in an environment of unreliable components, Atomic Broadcast provides all working processors with the same set of messages in the same order so that all processors are synchronized with the same information. To cope with unreliable processors, Consensus provides a way for all working processors to agree on a value, and by extension, all the data the distributed system needs.

Scope of This Solution

The solution discussed in this paper is based on a particular model of distributed systems, which is limited in several ways because the general problem is not solvable. First, the only type of failure considered is a crashed processor that stays crashed; there is no provision for a processor to recover. There is also nothing about a processor behaving maliciously, behaving correctly in some circumstances and incorrectly in others to the detriment of the well-behaved processors. Although the paper does not address this type of failure, the algorithms may be robust enough to work regardless of how the processors fail.

The paper assumes communication channels are reliable but allows for failed processors. In some cases, an unreliable channel may be equivalent to a failed processor in that the other processors should respond in either case by not relying on the failed processor.

According to the paper, the processors may run asynchronously, using their own local clocks, for example, but they must have a way to synchronize eventually, within some finite period of time. So there is no issue of metastability and no need to distinguish between a processor that has actually crashed and one that is unresponsive indefinitely; the two cases are treated the same.

Finally, there are two particular requirements in the paper. At minimum every processor that has crashed is detected as such by at least one working processor. Also there must be at least one working processor that is recognized as such by all working processors. These are very peculiar requirements that are the linchpins of the theories presented in the paper.

Failure Detectors

The model used in the paper is characterized by the concept of an unreliable failure detector, a module attached to a processor that attempts to inform it of the crash status of some other processors. The paper does not require that every failure detector monitor every other processor, or that every processor have communication channels to every other processor, which adds a great deal of complexity to the algorithms. But it is assumed either that the working processors can reliably relay information, and thereby spread it by diffusion, or that any processors in a subnetwork isolated by failed processors are effectively failed as well. So there is no need to consider issues of partial networks; a fully connected network topology, with each processor in direct contact with every other, will be assumed.

Each processor has a failure detector with which to determine the crash state of each other processor, but this information is unreliable, and the model in the paper allows a processor's belief about the crash state of another processor to change repeatedly. The paper calls these processors "suspected" of crashing, but here the term "abandoned" will be used, since suspicion serves only one purpose in the algorithms presented: to abandon communication with the suspected processor. It is more precise to describe the condition ("to suspect") than the response ("to abandon"), but since the purpose here is to understand these algorithms, and the response is always the same, the broader term is unnecessary.

Since each processor has a failure detector to tell it when to abandon each other processor, there is really no need to identify the failure detector as distinct from the processor. The processor simply has knowledge, albeit unreliable knowledge, of the crash state of each other processor. But an important focus of the paper is to catalog various classes of failure detectors by their reliability so the concept of failure detectors is maintained here.

The paper also refers to processes that have crashed and processes that have not crashed. For simplicity, these will be called down processors and up processors.

Completeness & Accuracy

Two terms are defined to characterize the reliability of failure detectors: completeness and accuracy. Completeness is that a down processor is abandoned. Accuracy is that an up processor is not abandoned. In both cases, it is only up processors that do the abandoning or not abandoning, since a down processor cannot do anything.

The term "completeness" seems to be named from the point of view of abandoning down processors even if too many are abandoned. Making sure too many are not abandoned is the need for "accuracy." Clearly, completeness and accuracy are propositional inverses here; unfortunately, these terms do not convey the dichotomy.

Function Definitions

The paper introduces a great deal of notation, much of which does not seem to clarify the problems or their solutions. There are only two properties at issue: whether or not a processor is abandoned because another processor thinks it is down, and whether or not it is actually down.

These conditions can be described with two functions. abandons(p, q, t) is a Boolean function that returns true if processor p abandons processor q at time t, and isDown(q, t) is a Boolean function that returns true if processor q is in fact down at time t. Time is measured in discrete units. These functions are simplified versions of the notation used in the paper.
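The two functions can be modeled directly over a recorded execution. The data structures below (crash_time, suspect_log) are bookkeeping invented for this illustration, not the paper's notation:

```python
# Illustrative model of the two predicates over one recorded execution.
# crash_time and suspect_log are assumed bookkeeping for this sketch.

crash_time = {"q": 3}   # processor q crashes at time 3; crashes are permanent

# suspect_log[t][p] = set of processors that p abandons at time t
suspect_log = {
    0: {"p": set()},
    1: {"p": set()},
    2: {"p": set()},
    3: {"p": set()},
    4: {"p": {"q"}},    # p's failure detector abandons q one step later
}

def isDown(q, t):
    # True if q has crashed by time t (a down processor stays down).
    return q in crash_time and crash_time[q] <= t

def abandons(p, q, t):
    # True if p abandons (suspects) q at time t.
    return q in suspect_log[t].get(p, set())
```

Note that the detector is unreliable: here p does not abandon q until one step after the actual crash, and nothing stops a log from recording an up processor being wrongly abandoned.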

Completeness

As previously stated, completeness is that a down processor is abandoned. The paper gives two versions of this property. Strong completeness is that every down processor is eventually abandoned by every up processor. Weak completeness is that every down processor is eventually abandoned by at least one up processor; that is, there exists some particular up processor that abandons all down processors. If it were known which up processor that was, the algorithms would be trivial, since all processors could rely on it. The paper does not require that information, but this is an indication of how weak the property is.

These properties can be stated algebraically. The concept of "eventually" is that there is a time t0 after which the property is always true. The only difference between strong and weak completeness is whether all or just one of the up processors abandon the down processors. Otherwise, completeness is simply that if a processor is down, it is abandoned.

Strong Completeness: ∃t0 ∀t > t0 ∀p, q: isDown(q, t) ∧ ¬isDown(p, t) ⇒ abandons(p, q, t)

Weak Completeness: ∃t0 ∃p ∀t > t0 ∀q: ¬isDown(p, t) ∧ isDown(q, t) ⇒ abandons(p, q, t)
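Over a finite recorded run, the two completeness properties can be checked mechanically. The sketch below approximates "eventually" by looking only at the final recorded step, an assumption of this illustration rather than anything in the paper:

```python
# Sketch: check the two completeness properties at the final step of a
# finite run. "Eventually" is approximated by the last recorded time.

def strong_completeness(procs, last_t, isDown, abandons):
    # Every up processor abandons every down processor.
    return all(abandons(p, q, last_t)
               for q in procs if isDown(q, last_t)
               for p in procs if not isDown(p, last_t))

def weak_completeness(procs, last_t, isDown, abandons):
    # At least one up processor abandons every down processor.
    return any(all(abandons(p, q, last_t)
                   for q in procs if isDown(q, last_t))
               for p in procs if not isDown(p, last_t))
```

In a run where only processor p abandons the crashed processor q while processor r does not, weak completeness holds but strong completeness does not.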

Accuracy

The inverse of completeness, accuracy is that an up processor is not abandoned. The paper describes weak and strong accuracy, as with the two types of completeness. For accuracy, there is a further distinction between "perpetual" and "eventual" accuracy, making four combinations of accuracy in total. So strong perpetual accuracy is that no up processor is ever abandoned by any processor. Strong eventual accuracy is that eventually no up processor is abandoned by any processor. Weak perpetual accuracy is that at least one up processor is never abandoned by any processor. And weak eventual accuracy is that at least one up processor is eventually never abandoned by any processor.

Strong Perpetual Accuracy: ∀t ∀p, q: ¬isDown(q, t) ⇒ ¬abandons(p, q, t)

Strong Eventual Accuracy: ∃t0 ∀t > t0 ∀p, q: ¬isDown(q, t) ⇒ ¬abandons(p, q, t)

Weak Perpetual Accuracy: ∃q ∀t ∀p: ¬isDown(q, t) ∧ ¬abandons(p, q, t)

Weak Eventual Accuracy: ∃q ∃t0 ∀t > t0 ∀p: ¬isDown(q, t) ∧ ¬abandons(p, q, t)
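The perpetual/eventual distinction shows up directly when checking these properties over a finite run: perpetual quantifies over every recorded time, while eventual needs only some t0 after which the property holds at every later recorded time. The sketch below (again an approximation of "eventually" for illustration) shows the strong variants:

```python
# Sketch: perpetual vs. eventual strong accuracy over a finite run.

def strong_perpetual_accuracy(procs, times, isDown, abandons):
    # At every recorded time, no up processor is abandoned by anyone.
    return all(not abandons(p, q, t)
               for t in times
               for q in procs if not isDown(q, t)
               for p in procs)

def strong_eventual_accuracy(procs, times, isDown, abandons):
    # Some t0 (with later steps recorded) after which no up processor
    # is abandoned by anyone.
    return any(all(not abandons(p, q, t)
                   for t in times if t > t0
                   for q in procs if not isDown(q, t)
                   for p in procs)
               for t0 in times if t0 < max(times))
```

A detector that wrongly abandons an up processor once, early on, and then never again fails the perpetual check but passes the eventual one.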

Classes of Failure Detectors

With two types of completeness and four types of accuracy, the paper catalogs eight classes of failure detectors, but they are just the combinations of three two-state properties: completeness (strong or weak), accuracy scope (strong or weak), and accuracy timing (perpetual or eventual).

                      Strong Perpetual   Weak Perpetual   Strong Eventual   Weak Eventual
                          Accuracy          Accuracy          Accuracy         Accuracy
Strong Completeness          P                 S                 ◇P               ◇S
Weak Completeness            Q                 W                 ◇Q               ◇W

Reducibility (Emulation)

Having defined eight classes of failure detectors, the paper uses a common simplification technique in computer science in preparation for proving algorithms, first proving that some of these apparently distinct properties are functionally equivalent.

In the case of the failure detectors, it is clear that a strong failure detector is at least as capable as a weak detector, in completeness or accuracy. Weak detectors require the property hold for at least one processor while a strong detector must hold for all processors. Certainly if a property holds for all, it holds for at least one.

Conversely, and unexpectedly, a weak completeness detector can emulate a strong completeness detector because up processors can communicate with one another. Down processors cannot communicate, of course, but since down processors cannot do anything, their inability to communicate does not diminish the emulation. Because up processors can share information, every processor can accumulate a list of all the processors abandoned by any processor, so a processor abandoned by one eventually becomes abandoned by all. Any processor that is actually down will thus eventually be abandoned everywhere. Since abandoning every processor abandoned by any processor errs only on the side of abandoning too many, this technique allows weak completeness to emulate strong completeness.
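One round of this diffusion can be sketched as follows. This is a hypothetical simplification of the reduction, not the paper's exact algorithm: every up processor broadcasts its suspect list, and every receiver adopts what it hears while clearing suspicion of the sender it just heard from.

```python
# Sketch of one gossip round of the weak-to-strong completeness
# emulation (a simplification, not the paper's exact algorithm).

def gossip_round(suspects, up):
    # suspects[p] = set of processors p currently abandons; only up
    # processors send and receive messages.
    new = {p: set(s) for p, s in suspects.items()}
    for sender in up:
        for receiver in up:
            if receiver != sender:
                new[receiver] |= suspects[sender]  # adopt sender's suspects
                new[receiver].discard(sender)      # sender is evidently up
    return new
```

If p alone abandons the crashed q (weak completeness), one round is enough to make r abandon q as well, moving toward every up processor abandoning q (strong completeness).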

Completeness Classes Are Equivalent

Since weak and strong completeness detectors can emulate each other, each is said to be reducible to the other, and the two are equivalent: if a property is proved about either one, it is true of the other. Only strong completeness is considered further, since its greater capability makes properties concerning completeness easier to prove.

This leaves only four distinct classes of failure detectors, distinguished by their accuracy property: P, strong perpetual; S, weak perpetual; ◇P, strong eventual; and ◇S, weak eventual.

Relationship of Accuracy Classes

Since strong is more powerful than weak, a strong accuracy failure detector can emulate a weak accuracy detector. Similarly, a perpetual accuracy failure detector can emulate an eventual accuracy detector. This is simple to prove because eventual requires that there be a time, t0, after which the property is always true. Perpetual must always be true, so if t0 is chosen arbitrarily, perpetual fulfills the requirement for eventual.

Relationship of Failure Detector Classes

The most powerful failure detector is P, strong perpetual accuracy, which is more powerful than both S, weak perpetual accuracy, and ◇P, strong eventual accuracy. The least powerful failure detector is ◇S, weak eventual accuracy, which is less powerful than both S, weak perpetual accuracy, and ◇P, strong eventual accuracy.

These relationships mean that a property proven for the least powerful class, ◇S, is true of a system whose processors use any of the eight failure detectors discussed in the paper. More powerful failure detectors are required for some algorithms, but if, as will appear later, a property can be proven for ◇S, the property is also true for P.

The Consensus Problem

With the properties of certain failure detectors well established, the paper turns to the algorithms that can be carried out. The Consensus problem is the prototypical problem for distributed systems: the processors still running must reach agreement on a value in spite of some failed components. One value can represent an arbitrarily large set of data, and repeated consensus can meet all the data-sharing needs of a distributed system, so consensus on one value is the fundamental problem.
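The agreement requirement itself is easy to state as a check on a finished run. The representation of decisions below is an assumption of this sketch, with None standing for a processor that never decided, for example because it crashed:

```python
# Sketch: the agreement requirement of Consensus, checked against the
# decisions of a finished run (None = a processor that never decided).

def agreement(decisions):
    decided = [v for v in decisions.values() if v is not None]
    return len(set(decided)) <= 1
```

A run where p and q both decide 5 while the crashed r decides nothing satisfies agreement; a run where two up processors decide different values does not.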