Distributed Computation Algorithms
The subject of distributed algorithms is vast, and it actually includes parallel algorithms as a special case (at least those written for interconnection network models). While massively parallel machines implementing centrally controlled algorithms continue to be important (and this importance may grow as technology advances), more recent trends in concurrent computing environments (networked workstations, grids, clusters, the internet) have emphasized the need for distributed algorithms that do not necessarily assume a central control. In their most general form, distributed algorithms are simply algorithms implemented on a collection of (possibly) heterogeneous processors connected in some sort of network and working on problems using various degrees of synchronization.
In recent years the notion of a “computational grid” and the concomitant “grid computing” have emerged. A computational grid can be viewed as an infrastructure of multiple CPUs distributed over a local or wide area network that appears to an end user as one large computing resource. In general, there is no central control assumed in a computational grid, and the end user may or may not directly control the computing resources in the grid. Examples of government-sponsored computational grids include NSF’s TeraGrid, NASA’s Information Power Grid, and DOE’s Science Grid. Open-source software packages like the Globus Toolkit are available to give the end user a programming environment that utilizes a large computational grid. Commercial vendors are also rapidly entering the grid computing scene with software packages that enable businesses to run their applications over private-sector computational grids.
In one model of grid computing, the processors might be a (possibly dynamically changing) collection of thousands of widely separated PCs or workstations connected via the internet, whose idle cycles are captured and utilized to process data in a purely asynchronous fashion. An example of this model of grid computing is the Search for Extraterrestrial Intelligence (SETI) project, which captures the idle cycles of thousands of processors to analyze radio telescope data looking for signs of intelligent life in the universe. A contrasting model of grid computing consists of a set of (usually homogeneous) processors making up what is known as a Beowulf cluster, typically rack-mounted in a single physical location (see Figure 18.1) and connected by gigabit Ethernet or more specialized network systems such as Myrinet. In contrast to projects such as SETI, where standalone processors are available to the project only when idle, the processors in a Beowulf cluster are dedicated solely to working on problems hosted by the cluster.
Figure 18.1 Beowulf Clusters in the LINC Lab (Laboratory for Integrated Networked Computing), University of Cincinnati
In this chapter we introduce distributed algorithms by focusing on the special case of processors communicating with one another using message passing. The message passing paradigm captures most of the relevant issues in the theory of distributed algorithms, at least from a high-level point of view. A communication step (sending or receiving a message) between processors will be assumed to be far more time consuming than a computation step performed locally by a processor. Hence, as in our study of parallel algorithms on interconnection network models, we must take communication complexity into account when analyzing the performance of a distributed algorithm. Moreover, communication complexity is now more difficult to measure, since a communication step in the message passing environment may not involve tight synchronization between the sending and the receiving of a message. In any case, the following key fact always holds.
Key Fact. Good speedup in the distributed algorithm environment will invariably require computation steps to dominate communication steps.
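As a rough illustration of this key fact (using a deliberately simplified cost model rather than measurements from any particular machine), suppose a distributed algorithm performs a total of W computation steps divided evenly among the p processes, and that each process also performs C communication steps, where a communication step costs roughly k times as much as a computation step. The running time is then roughly W/p + kC computation-step units, so the speedup over a single process is approximately W/(W/p + kC). This approaches the ideal value p only when W/p is much larger than kC, that is, only when local computation dominates communication.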
In contrast to our study of parallel algorithms, we will now assume that the input to our algorithm also includes the number of processors p, which is fixed in the sense that p is independent of the input size n of the problem. This assumption is made to reflect the situation actually encountered in practice, where the number of processors p is typically of much smaller order than the input size n of an algorithm. Making the number of processors independent of the input size also reflects today’s commonly encountered application domain, where an algorithm is implemented on a Beowulf cluster using the MPI library of functions (see Appendix F). The MPI environment specifically requires the programmer to specify (for example, at the command line) the number of processes to be utilized by the program. It is possible to have a processor run more than one process, so we will typically use the term “process” instead of “processor” in the remainder of this chapter. In fact, you can run an MPI job using p > 1 processes on a single processor, but, of course, no actual parallelism is then present. However, running on a single processor can be useful for debugging and testing, or for instructional purposes in environments where a multiple-processor cluster is not available.
18.1 A Distributed SPMD Programming Model
Of the many different models of distributed computing that could be considered, in this text we will focus our attention on a model that is somewhat complementary to the SIMD model of parallel computation. In particular, we will not assume that processes are executing instructions in a tightly synchronized lock-step fashion under the guidance of a central control process. We will concentrate on distributed algorithms suitable for implementation on a network of (possibly) heterogeneous processes, which typically are executing asynchronously, and which communicate with one another by message passing. This does not mean that synchronization is never required, but it does mean that any synchronization must be explicitly implemented algorithmically, as opposed to the SIMD parallel model, where tight synchronization is largely a consequence of the manner in which single instructions are sent to all the processes by the central control. In fact, we will assume a Single-Program-Multiple-Data (SPMD) model of distributed computation, where each process has a copy of the entire code to be executed. Thus, our pseudocode for distributed algorithms will actually look just like sequential code, with the exception of the utilization of some built-in functions that we will introduce in support of message passing. (Our pseudocode resembles MPI code; each of our pseudocode functions has a closely related MPI function.) However, each process has a unique ID, so that parallelism is achieved by program logic that utilizes these IDs. Not only do these IDs allow processes to communicate with one another, they can also be used in the program to enable a given process to execute a given block of code (or not) in a manner dependent on its ID. In our pseudocode, we will assume that each process can determine its ID by calling the built-in function myid. Typically, the value of myid will be assigned to the variable myrank, since MPI refers to process IDs as ranks.
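For concreteness, here is a minimal sketch of this SPMD structure written in C with the MPI library (the actual translation of our pseudocode to MPI code is taken up in Appendix F; the sketch below is only meant to show that every process runs the same program text and obtains its own rank, and it omits details such as error checking).

/* Minimal SPMD sketch in C with MPI: every process runs this same
   program, and the rank returned by MPI_Comm_rank plays the role of
   the pseudocode function myid.                                      */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, p;

    MPI_Init(&argc, &argv);                  /* start the MPI environment     */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* this process's ID (rank)      */
    MPI_Comm_size(MPI_COMM_WORLD, &p);       /* p = total number of processes */

    printf("process %d of %d reporting\n", myrank, p);

    MPI_Finalize();                          /* shut down the MPI environment */
    return 0;
}

Such a program is compiled with an MPI wrapper compiler (for example, mpicc) and launched with a command along the lines of mpiexec -n 4 ./a.out; the exact launch syntax varies from one MPI installation to another.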
Of course, as in the SIMD model, each process has its own assigned portion of the input data to the problem. While the SPMD model actually allows (via a large case construction) each process to execute entirely different code (that is, a MIMD-type scenario), in most of our examples all processes execute identical code on their individual data portions (but not necessarily synchronously). The one major and very important exception is the master-worker scenario, in which one process is singled out as the master, which executes code that supervises and coordinates the actions of all the other processes, called workers. Typically the workers all execute identical code (but, of course, on their local versions of the data). Other commonly used names for the master-worker model are manager-worker, supervisor-worker, master-slave, and so forth. While the worker code is identical for each worker, workers usually execute their code relatively independently of one another, in a largely asynchronous manner.
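As a rough sketch of how process IDs support the master-worker pattern under the SPMD model (the supervision details here are purely illustrative placeholders, not a prescribed protocol), a single branch on the rank separates the master’s code from the workers’ code:

/* Master-worker sketch under SPMD: one program, with a branch on the
   rank separating the master's code from the workers' code.          */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (myrank == 0) {
        /* master: supervises and coordinates the workers */
        printf("master supervising %d workers\n", p - 1);
        /* ... hand out tasks and collect results ... */
    } else {
        /* workers: all execute this same block, each on its local data */
        printf("worker %d working on its portion of the data\n", myrank);
        /* ... process the local portion and report to the master ... */
    }

    MPI_Finalize();
    return 0;
}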
18.2 Message Passing
We will assume that the processes in our distributed computing model communicate with one another via message passing. Thus, the basic communication functions are send and receive, with appropriate parameter lists for identifying sender and receiver, as well as the data to be transmitted. Actually, our send function will have two variants, Ssend and Bsend, depending on whether the message passing is synchronous (non-buffered and blocking) or asynchronous (buffered and non-blocking). When a process executes an Ssend instruction, further execution of its code is blocked until the corresponding receive has been acknowledged. On the other hand, when a process executes a Bsend instruction, the message is stored in a buffer (the “B” in Bsend stands for “buffered”), and the process is not blocked and can continue executing its code. The actions of Ssend and Bsend are further explained in subsections 18.2.1 and 18.2.5, respectively. In this section we simply use send to denote both variants. For our purposes, and to keep things simple, we will only use a blocking receive instruction, although MPI has a non-blocking version as well.
Assume that we have p processes with IDs simply given by 0, 1, …, p – 1 (so that the processes P0, P1, …, Pp-1 are indexed by their IDs in our discussions). We start the indexing with zero since this is the MPI standard method of assigning ranks to processes. Recall that we assume that each process can determine its ID by invoking the myid built-in function. Our distributed pseudocode for (both variants of) the send instruction is of the form
send(data, dest, tag)
where the parameter data is the data to be sent (which can be an array or a more complicated data structure), process Pdest is the destination (target) of the message, and tag contains information that may be used for such things as providing an additional criterion for determining a match by a receive instruction, or providing information on the type of action that should be taken upon receiving the message. While tag is always included in the parameter list of the send instruction, it might not contain any useful information, in which case it can be ignored by the receiving process.
Pseudocode for the receive instruction has the form
receive(data, src, tag, status)
where the parameter src refers to the (source) process Psrc that should at some point issue a send instruction that will match the receive instruction. The special wildcard variable ANY_SOURCE can be used as the second parameter, which (assuming the values of the tag parameters match) allows a receive instruction to match a send from any process (wildcards are not allowed in the send instruction). The third parameter tag plays a role similar to that of the tag parameter in the send instruction. Similar to ANY_SOURCE, a special wildcard variable ANY_TAG can be used for the tag parameter of the receive instruction in order to match the tag parameter of any send message. The last parameter is status, which is an output parameter pointing to a structure that contains two fields, source and tag. Then status→source is the ID of the process that was the source of the message received, and status→tag is the tag parameter of the message sent by the source process. The following key fact summarizes the matching situation between send and receive instructions when specific values are used for the parameters src or tag in the receive instruction (as opposed to using ANY_SOURCE or ANY_TAG).
KEY FACT. If a specific value src is used as the second parameter in a receive instruction, then a send message will not match this receive unless it comes from the process whose ID is src. In particular, the value status→source given by the fourth parameter status provides no additional information, since it will have the value src if the message is matched and received. Similarly, if a specific value tag is used as the third parameter in a receive instruction, then a send message will not match this receive unless the value of its tag parameter matches the value tag of the receive instruction. In particular, the value status→tag given by the fourth parameter status provides no additional information, since it will have the same value as the tag of the matching send message if the message is matched and received.
Our pseudocode has been chosen to be consistent with the corresponding send and receive commands MPI_Ssend, MPI_Bsend, and MPI_Recv of the MPI library (C version), but has been simplified in order to highlight the ideas without the encumbrance of the additional detail required by MPI. For example, MPI_Ssend has six parameters and MPI_Recv has seven. The translation from our pseudocode to MPI code is straightforward, and is illustrated in Appendix F. Note that each process can send a message directly to any other process without the need to worry about routing. In this sense, our distributed model is analogous to an interconnection network modeled on a complete graph. However, there are issues relating to synchronization that must be considered, and even the dreaded deadlock scenario is possible. Thus, we must go into some detail concerning the protocol relating to our send and receive functions and their variants.
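As a small illustrative sketch of this correspondence (the extra MPI parameters, such as the count, datatype, and communicator, are among the details deferred to Appendix F), the following C program has process 1 send one integer to process 0 with MPI_Ssend, while process 0 receives it using the wildcards and then reads the actual source and tag out of the status structure; it assumes the job is run with at least two processes.

/* Sketch: the pseudocode send(data, dest, tag) and
   receive(data, src, tag, status) rendered with MPI_Ssend and MPI_Recv.
   Process 1 sends one int to process 0; process 0 receives from any
   source with any tag and inspects the status fields.  Run with at
   least two processes.                                                 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 1) {
        data = 42;                                           /* illustrative payload */
        MPI_Ssend(&data, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);  /* dest = 0, tag = 7    */
    } else if (myrank == 0) {
        MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* status.MPI_SOURCE and status.MPI_TAG correspond to the
           pseudocode's status->source and status->tag                */
        printf("received %d from process %d with tag %d\n",
               data, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}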
18.2.1 Synchronous (Blocking) Message Passing
We will assume that our Ssend instruction is what is known as a synchronous or blocking send, which does not return until the matching receive instruction has actually received the corresponding message. In other words, code execution cannot proceed to the next instruction following an Ssend until the data has actually been received by the matching receive. Similarly, our receive instruction is blocking in the sense that it does not return until the message corresponding to the matching send has actually been received. Thus, these functions enforce highly synchronous message passing. More precisely, when an Ssend is executed, the target process is notified that it has a pending message. This “posting” of a pending message is done in the background, and does not interrupt the execution sequence of the target process. If the posting of an Ssend message occurs before the matching receive is encountered by the target process, the source process must block (suspend) execution until the matching receive is executed by the target. When executing a receive, the target process checks to see if a matching Ssend has been posted. If not, the target must block (suspend) its execution sequence until such a posting has been made. Given such a posting, the target then acknowledges the posting to the source, and the source process then transfers the message to the target. This synchronization is illustrated in Figure 18.2, which shows how the Ssend and receive functions must suspend their respective execution sequences until the message passing has been completed.
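Because Ssend and receive are both blocking, the relative order in which a process issues its sends and receives matters. As a cautionary sketch (illustrative only, and assuming a job of exactly two processes), consider two processes that wish to exchange values: if both executed their Ssend first, each would suspend waiting for the other’s receive, which would never be reached, producing the deadlock scenario mentioned earlier. Ordering the calls so that one process receives first breaks the cycle:

/* Pairwise exchange between processes 0 and 1 using the blocking
   MPI_Ssend / MPI_Recv.  If both processes called MPI_Ssend first,
   each would block waiting for the other's receive (deadlock);
   ordering the calls by rank avoids the problem.  Run with two
   processes.                                                       */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myrank, mine, theirs;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    mine = myrank;                 /* illustrative payload */

    if (myrank == 0) {
        MPI_Ssend(&mine, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    } else if (myrank == 1) {
        MPI_Recv(&theirs, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Ssend(&mine, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}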