Implementing the MPI Standard

CSE 6306 Advanced Operating Systems - Research Paper

The University of Texas at Arlington

Satya Sagar Salapaka

06 April 1999

INTRODUCTION

Message Passing is a programming paradigm used widely on parallel computers, especially Scalable Parallel Computers (SPCs) with distributed memory, and on Networks of Workstations (NOWs). In a remarkably short time, the Message-Passing Interface (MPI) has become what it was intended to be: a de facto standard Application Programming Interface (API) for message-passing parallel programming. Beginning in 1992, a group of researchers, academicians, and vendors met to quickly create a portable, efficient, and full-featured standard. All meetings and discussions of this Message Passing Interface Forum were open and accessible over the Internet. They met their goals: the first version of the standard was released in May 1994 and revised in June 1995, a second version followed in July 1997, and MPI has become the API of choice for most new parallel programming projects. The focus of this paper is the problems, issues, and solutions that pertain to implementing the MPI standard on several representative computer architectures and operating systems. This paper is based on the work done by John M. Linebarger of The University of New Mexico.

MPI

Reasons for MPI were legion. Existing message-passing libraries were mostly vendor-proprietary and restricted to a limited range of platforms. Those that were not were usually research-oriented, and thus not designed to be a production standard or even necessarily backward compatible with earlier releases. The feature sets of message-passing libraries were often incomplete; more annoying still, the gaps often fell in different areas from one library to the next.

MPI addressed this situation by providing a feature-rich, portable message-passing standard that would allow for efficient implementation on a wide variety of platforms. Both point-to-point and collective communication are supported by MPI, as are user-defined data types, virtual process topologies, and ANSI C and Fortran 77 language bindings. Multiple architectures and programming models are embraced by the MPI standard, including the SPMD and MPMD parallel programming models, distributed and shared memory architectures, and networks of heterogeneous workstations. A rich variety of communication modes is provided for sending messages, each of which exists in both blocking and non-blocking versions. A particular contribution of MPI is the concept of a communicator, an abstraction that combines a communication context and a process group for better security and separation in message passing.
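
The flavor of these features can be conveyed with a short example. The following sketch (illustrative only, not drawn from any implementation discussed in this paper) splits MPI_COMM_WORLD into two communicators and performs a blocking point-to-point transfer within one of them; it assumes at least four processes so that each communicator contains two.

    /* Illustrative only: a derived communicator plus blocking point-to-point
       communication within it. Run with at least four processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int world_rank, sub_rank, value = 0;
        MPI_Comm half;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split MPI_COMM_WORLD into two communicators; a message sent in one
           cannot be matched by a receive posted in the other. */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half);
        MPI_Comm_rank(half, &sub_rank);

        if (sub_rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, half);          /* blocking send */
        } else if (sub_rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, half, &status); /* blocking receive */
            printf("process %d received %d\n", world_rank, value);
        }

        MPI_Comm_free(&half);
        MPI_Finalize();
        return 0;
    }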

IMPLEMENTING MPI

Numerous implementations of MPI exist, both public domain and commercial. The most important of the public domain implementations is MPICH (MPI/CHameleon), developed by Argonne National Laboratory. Other portable public domain implementations include Local Area Multicomputer (LAM) from the Ohio Supercomputer Center and Common High-level Interface to Message-Passing (CHIMP/MPI) from the Edinburgh Parallel Computing Centre (EPCC). Commercial implementations are available from several supercomputer and workstation vendors, including IBM, Cray Research, Intel, Silicon Graphics, and Hewlett-Packard.

Figure 1: A Conceptual View of MPI Implementation

As depicted above, the conceptual goal of implementation is to map MPI to the machine. In general, the MPI standard and the facilities provided by the target platform represent fixed, given quantities; the job of the implementor is to make the best of what is provided (in terms of operating system and hardware features) in order to fulfill the requirements of the MPI standard. The most efficient hardware and software primitives should be used to directly implement corresponding MPI calls, emulating in software those functions for which the vendor does not directly provide support.
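
As an illustration of such software emulation (a hypothetical sketch, not taken from any particular implementation), consider a platform with no native broadcast support. A naive fallback builds the broadcast out of point-to-point sends; production implementations would normally use a tree-structured algorithm instead.

    /* Hypothetical sketch: a naive linear broadcast built from point-to-point
       sends, the kind of software emulation an implementor might supply when
       the platform lacks native collective support. */
    #include <mpi.h>

    static int naive_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
    {
        int rank, size, i;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (i = 0; i < size; i++)
                if (i != root)
                    MPI_Send(buf, count, type, i, 99, comm);   /* one send per rank */
        } else {
            MPI_Recv(buf, count, type, root, 99, comm, &status);
        }
        return MPI_SUCCESS;
    }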

In practical terms, however, most MPI implementations to date have been ports of the reference implementation, MPICH. The implementations surveyed for this paper fall into three categories: ports of MPICH, MPI interfaces added to existing message-passing systems, and "roll your own" implementations written from scratch.

Figure 2: MPICH Implementation Architecture

Since implementations of MPICH are so pervasive, a look at its architecture is in order. As indicated in the figure above, MPICH takes a two-layer approach to implementing MPI. All MPI functions proper are contained in the upper, device-independent layer. This layer accesses the lower, device-dependent layer through the Abstract Device Interface (ADI), which is implemented as compiler macros. The ADI provides four main functions: Sending and receiving, data transfer, queuing, and device-dependent operations. The job of the implementers is thus to tailor the lower, device-dependent layer to the target machine; the upper, device-independent layer remains virtually unchanged.
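
The layering can be pictured as follows. This is not the actual MPICH ADI, which, as noted above, is realized as compiler macros; it is merely a sketch of the idea that the device-independent layer is written against a small set of device operations, so that porting MPICH means supplying those operations for a new platform. All names are illustrative.

    /* Not the actual MPICH ADI (which is realized as compiler macros); an
       illustrative table of the kinds of operations the device-dependent
       layer must supply for each new platform. */
    typedef struct adi_device {
        /* sending and receiving */
        int (*post_send)(const void *buf, int len, int dest, int tag);
        int (*post_recv)(void *buf, int len, int src, int tag);
        /* data transfer for messages that have already been matched */
        int (*transfer)(void *dst, const void *src, int len);
        /* queuing of unexpected messages */
        int (*enqueue_unexpected)(int src, int tag, const void *data, int len);
        /* device-dependent operations (startup, shutdown, ...) */
        int (*init)(int *argc, char ***argv);
        int (*finalize)(void);
    } adi_device;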

Since one of the goals of MPICH is to demonstrate the efficiency of MPI, several optimizations are included. One of them is optimization by message length. Four send protocols are supported. The short send protocol piggybacks the message data inside the message envelope. The eager send protocol delivers the message data without waiting for the receiver to request it, on the assumption that the probability is high that the receiver will accept the message. The rendezvous protocol does not deliver the data until the receiver explicitly requests it, thus allowing the setup time necessary to send large messages at high bandwidth. Finally, the get protocol is used for shared memory implementations, where the receiver uses shared memory operations to retrieve the message.
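
The dispatch among these protocols might look roughly as follows; the thresholds and names are hypothetical, not those used by MPICH.

    /* Hypothetical protocol selection by message length, in the spirit of
       the four MPICH send protocols described above; the thresholds are
       invented for illustration. */
    enum protocol { PROTO_SHORT, PROTO_EAGER, PROTO_RENDEZVOUS, PROTO_GET };

    #define SHORT_LIMIT   128     /* data fits inside the message envelope */
    #define EAGER_LIMIT 16384     /* deliver optimistically, unasked       */

    static enum protocol choose_protocol(int len, int shared_memory)
    {
        if (shared_memory)
            return PROTO_GET;         /* receiver pulls via shared memory   */
        if (len <= SHORT_LIMIT)
            return PROTO_SHORT;       /* piggyback data in the envelope     */
        if (len <= EAGER_LIMIT)
            return PROTO_EAGER;       /* send data before it is requested   */
        return PROTO_RENDEZVOUS;      /* wait until the receiver asks       */
    }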

COMMON IMPLEMENTATION ISSUES

Every MPI implementor wrestles with a number of common issues. Chief among them is message buffering, because of the enormous leverage it has on overall performance. The goal is to reduce (or even eliminate) the number of intermediate buffers required to transmit a message, thus avoiding the significant overhead of memory-to-memory copies. Related issues are optimization by message length, the mechanism used to handle unexpected messages, and the often-unavoidable tradeoff between latency and bandwidth.

Other common implementation issues include reducing the number of layers and context switches to improve performance, handling non-contiguous data, making the best use of available communication coprocessors, resolving semantic clashes between vendor-supplied message-passing facilities and the MPI standard, overcoming asymmetric performance of vendor-supplied primitives, compensating for the lack of native support for collective operations, adhering to the spirit of the MPI standard at times instead of the letter, and even deciding how much of the MPI standard to implement.
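
To make one of these issues concrete, non-contiguous data is visible even at the MPI API level in the form of derived datatypes, which the implementation must transmit efficiently. A minimal example (illustrative only) describes one column of a row-major matrix with MPI_Type_vector and sends it in a single call.

    /* Illustrative only: one column of a 10x10 row-major matrix described
       as a strided MPI derived datatype and sent in a single call; the
       implementation must transmit this non-contiguous data efficiently. */
    #include <mpi.h>

    void send_column(double a[10][10], int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        /* 10 blocks of 1 double each, with a stride of 10 doubles */
        MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Send(&a[0][0], 1, column, dest, 0, comm);

        MPI_Type_free(&column);
    }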

IMPLEMENTATION CASE STUDIES

Though common issues are faced, each MPI implementation is essentially unique. In order to demonstrate the wide variety of implementation options, several case studies of MPI implementations on representative parallel processing architectures are presented below. These architectures include massively parallel processors using both distributed and distributed shared memory, networks of workstations, a hybrid architecture of networked multiprocessors that uses shared memory within each node and distributed memory between nodes, and uniprocessors running a flavor of Microsoft Windows.

The Meiko CS/2 is a 64-node distributed memory parallel computer equipped with an Elan communication coprocessor. As part of a Master's thesis project, a student at the University of California at Santa Barbara developed a custom implementation of MPI with the goal of achieving low latency. Optimization by message length was provided: an optimistic "transfer before match" strategy decreased latency for small messages, while large messages tapped DMA hardware for higher bandwidth once message tag matching was completed.

But the key observation in reducing message-passing latency was that message tag matching was one of its primary components. Two options were explored to reduce the impact. After analyzing the performance of the MPICH implementation for the Meiko CS/2, which used the 10 MHz Elan coprocessor to do the matching in the background, the thesis author decided to design his own implementation to task the node processor (a 40 MHz SPARC Viking 3) with the matching. The reasoning was that although assigning the matching chores to the Elan reduced the load on the SPARC processor, the much slower clock speed of the Elan might actually be increasing latency overall. Performance comparisons revealed that the best choice depended on the characteristics of the application, so a compile-time option was added to specify which processor should be used for message-tag matching.
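
For reference, the matching work at issue can be sketched as follows: each incoming message envelope is compared against a queue of posted receives on context (communicator), source, and tag. The design question in this case study was simply which processor should run this loop; the data structures and names below are hypothetical.

    /* Hypothetical sketch of message tag matching: an incoming envelope is
       compared against the queue of posted receives on context, source, and
       tag; a match removes the receive from the queue. */
    #include <stddef.h>
    #include <mpi.h>

    struct posted_recv {
        int src, tag, context;        /* src/tag may be MPI_ANY_SOURCE/MPI_ANY_TAG */
        void *buf;
        int len;
        struct posted_recv *next;
    };

    static struct posted_recv *match_envelope(struct posted_recv **queue,
                                              int src, int tag, int context)
    {
        struct posted_recv **p, *r;

        for (p = queue; (r = *p) != NULL; p = &r->next) {
            if (r->context == context &&
                (r->src == src || r->src == MPI_ANY_SOURCE) &&
                (r->tag == tag || r->tag == MPI_ANY_TAG)) {
                *p = r->next;         /* unlink the matched receive */
                return r;
            }
        }
        return NULL;                  /* no match: message is unexpected */
    }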

The Meiko has rudimentary native support for collective communication, but its use is restricted to contiguous data transmission between contiguously numbered processors. Implementing MPI collective communication required the resolution of this semantic clash by packing and unpacking non-contiguous data in order to use the native Meiko collective communication widget; however, the communicator was required to consist only of contiguously numbered processors.
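
The pack-then-transmit resolution described above can be pictured with MPI's own packing routines; the actual Meiko implementation worked below this level, so the following is only an illustration of the idea.

    /* Illustration of the pack-then-transmit idea using MPI's own routines;
       "column" is any non-contiguous derived datatype (such as the strided
       column type shown earlier). */
    #include <mpi.h>

    void pack_and_send(double a[10][10], MPI_Datatype column,
                       int dest, MPI_Comm comm)
    {
        char packed[10 * sizeof(double) + 64];
        int pos = 0;

        /* Gather the strided data into one contiguous buffer... */
        MPI_Pack(&a[0][0], 1, column, packed, (int)sizeof(packed), &pos, comm);
        /* ...so that a contiguous-only native transport can carry it. */
        MPI_Send(packed, pos, MPI_PACKED, dest, 0, comm);
    }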

The Intel Paragon is another example of a massively parallel distributed memory supercomputer, with an architecture that supports more than 1800 nodes. At Sandia National Laboratories in Albuquerque, NM, implementing MPI on the Paragon tapped the unique features of Puma, an operating system for parallel processors that is optimized to provide flexible, lightweight, high-performance message-passing. Portals are the structures used in Puma to realize message passing; a portal is an abstraction for a wrinkle in space, an opening in user process address space into which messages can be delivered. In essence, Puma uses user process address space as the buffer space for message transmission, avoiding gratuitous message copies. Thus the very architecture of Puma itself is designed to avoid two of the primary obstacles to message-passing performance: Kernel context switches and memory-to-memory copies.

Portals were used as the foundation for the MPI implementation on the Paragon under Puma. Starting with MPICH as the base implementation, point-to-point message passing was built directly on top of portals, with minimal assistance (evidently to reduce path length) from the Puma portals library. Optimizations were performed by message length, with the goal of low latency for short messages and high bandwidth for long ones; short, long, and synchronous message protocols were provided. MPI collective communication operations were mapped to native Puma collective functions, which are themselves built on top of portals. In addition, one-sided communication (a.k.a. “remote memory addressing,” which enables writing into and reading out of the memory of a remote process without its direct involvement) was implemented in terms of reply portals, in anticipation of the MPI2 standard (see below). Note that this particular approach implemented MPI almost exclusively in terms of operating system features, deferring hardware issues to the implementation of Puma itself.
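
The one-sided communication anticipated here took the form of window objects and MPI_Put/MPI_Get calls in the MPI-2 standard. A minimal sketch (assuming at least two processes) of writing into the memory of a remote process without its direct involvement:

    /* Minimal sketch of MPI-2 one-sided communication: process 0 writes one
       double into the exposed memory of process 1 without process 1 issuing
       a receive. Run with at least two processes. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;
        double local[100] = {0};
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose "local" as a window other processes may access. */
        MPI_Win_create(local, sizeof(local), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 3.14;
            /* Write one double into element 5 of process 1's window. */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 5, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);        /* completes the transfer everywhere */

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }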

The Cray T3D is a massively parallel supercomputer supporting up to 2048 processors, but with a different architecture than the machines previously described. The T3D is a physically distributed shared memory computer; memory exists on each local node, but is globally addressable. From a programming perspective, the availability of shared memory simplifies implementation because standard shared memory operations can be used to transfer messages. However, maintaining cache consistency at each processor is the responsibility of the application.

One MPI implementation for the T3D has taken the following approach. Using MPICH as the base distribution, a message header buffer was implemented as a collection of arrays on the shared heap. The common optimization for long and short messages was provided. The message transfer mechanism chosen was particularly novel. Performance profiling revealed an asymmetry in the performance of the vendor-supplied shared memory operations; specifically, put (shmem_put) outperformed get (shmem_get) by nearly a factor of two. As a result, an entirely put-based message transfer mechanism was designed; in other words, sending and receiving was accomplished by a sender push, not a receiver pull.

Several technical problems created by the choice of a put-based message transfer mechanism had to be resolved. The sender bore the burden of validating the delivery address in advance, since the destination address was usually local to the receiving processor, not a global address. The receiver cache was automatically invalidated upon message delivery, in order to maintain cache coherency. And temporary buffers had to be used to transfer data that was either not aligned properly or whose size was not a multiple of four, because shmem_put() is limited to transferring data that is four-byte aligned.
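
A hedged sketch of the sender-push transfer and its alignment workaround follows. The routine remote_put() stands in for the vendor put primitive; its signature and the four-byte restriction are taken from the description above rather than from actual Cray headers, and the staging-buffer scheme is only one plausible arrangement.

    /* Hedged sketch of the put-based (sender-push) transfer. remote_put()
       stands in for the vendor put primitive; the four-byte alignment and
       length restriction is taken from the description above. */
    #include <string.h>
    #include <stdlib.h>

    extern void remote_put(void *dst, const void *src, size_t nbytes, int pe);

    static void push_message(const void *src, size_t len, int pe,
                             void *dst, void *dst_staging)
    {
        if ((size_t)src % 4 == 0 && (size_t)dst % 4 == 0 && len % 4 == 0) {
            /* Fast path: everything aligned, push straight to the destination. */
            remote_put(dst, src, len, pe);
        } else {
            /* Slow path: pad and stage through aligned temporary buffers; the
               receiver later copies from its staging buffer into place. */
            size_t padded = (len + 3) & ~(size_t)3;
            void *tmp = malloc(padded);
            memcpy(tmp, src, len);
            remote_put(dst_staging, tmp, padded, pe);
            free(tmp);
        }
    }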

Using shared or distributed memory does not represent a mutually exclusive choice in implementing MPI. The next case considered exhibits an interesting hybrid architecture. SGI's Array 2.0 product is targeted at multiprocessor nodes running the 64-bit IRIX 6.2 operating system and connected by HiPPI switches with a TCP/IP fallback path. Up to eight nodes are supported in the array, and each node can contain up to 24 processors. Both PVM and MPI implementations are included in Array 2.0. The goal is to get massively parallel processor (MPP)-like performance from a networked array of high-end multiprocessor workstations.

A hybrid message passing mechanism is employed in Array 2.0. Shared memory is used to transmit messages between processors on the same node, and distributed memory is used for messages sent between nodes over the HiPPI link. Optimizations by message length are provided, but in an unusual way: The latency of short messages is reduced by bypassing the overhead of the HiPPI framing protocol, and the bandwidth of large messages is increased by transmitting them in parallel using multiple HiPPI adapters, if available. (This is known as “striped HiPPI.”) Other characteristics include the use of an environment variable to explicitly specify the number of unexpected messages to buffer, and the up-front admission that the Array 2.0 implementation of MPI is not thread-safe, the intention of the MPI standard notwithstanding.
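
The hybrid dispatch can be sketched as follows; all names are hypothetical, and the actual Array 2.0 internals may differ.

    /* Hypothetical sketch of the hybrid dispatch: ranks on the same node use
       shared memory, ranks on different nodes use the HiPPI path. */
    extern int node_of(int rank);              /* which array node hosts a rank */
    extern int my_node(void);
    extern int shmem_deliver(int rank, const void *buf, int len);
    extern int hippi_deliver(int rank, const void *buf, int len);

    static int deliver(int dest_rank, const void *buf, int len)
    {
        if (node_of(dest_rank) == my_node())
            return shmem_deliver(dest_rank, buf, len);  /* intra-node: shared memory */
        return hippi_deliver(dest_rank, buf, len);      /* inter-node: HiPPI link    */
    }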

Another approach to achieving MPP-like message-passing performance on a network of workstations was taken by the University of Illinois at Urbana-Champaign (UIUC). MPICH was implemented on top of UIUC's Fast Messages (FM) low-level communication library, which runs entirely in user space and avoids the overhead of kernel context switches. The target platform was restricted to Sun workstations equipped with LANai interface processors and connected by a packet-switched Myrinet network. Although the limitations would appear to be severe (specialized physical and transport layers, homogeneous workstations), the results were outstanding: Extremely low latency was achieved, and the bandwidth for short and medium messages was comparable to that of MPI implementations on MPP-class machines.

Several technical hurdles were overcome to achieve this low latency. The LANai control program was kept simple because the coprocessor was quite slow in comparison to the host processor. Two semantic clashes between MPICH and FM were encountered. MPICH uses a split-phase message send, while FM is stateless; this was resolved by using sender and receiver queues to track state information. On the receive side, FM has no explicit receive operation but instead relies on user-defined receive handlers in the style of Active Messages; the ADI receive primitives were implemented in terms of such handlers, into which FM calls were carefully placed, as sketched below. Two optimizations to FM itself were developed to eliminate message copies: a "gather" mechanism on the send side to efficiently transmit arbitrarily large messages consisting of non-contiguous data, and an upcall mechanism to retrieve the receiver buffer address so that message reassembly could be done directly at the destination. The gather mechanism also provided the requisite optimizations for short and long messages. The two optimizations taken together were found to perform better than the sum of their parts, because they kept the message pipeline full.
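
The receive-handler arrangement can be sketched generically (this is not the actual FM API, whose calls are not reproduced here): data arrival runs a user-supplied handler, Active Messages style, which matches the incoming envelope against posted receives or queues the data as unexpected.

    /* Generic Active-Messages-style sketch (not the actual FM API): message
       arrival invokes a registered handler, which either copies the data into
       a matching posted receive or queues it as unexpected. */
    #include <string.h>

    struct envelope { int src, tag, context, len; };

    extern void *find_posted_recv(const struct envelope *env);     /* hypothetical */
    extern void queue_unexpected(const struct envelope *env,
                                 const void *data);                 /* hypothetical */

    /* Registered with the messaging layer; invoked for each arriving message. */
    static void mpi_recv_handler(const struct envelope *env, const void *data)
    {
        void *user_buf = find_posted_recv(env);

        if (user_buf != NULL)
            memcpy(user_buf, data, (size_t)env->len);  /* reassemble at destination */
        else
            queue_unexpected(env, data);               /* hold until a receive posts */
    }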

MPI is also available on uniprocessor Intel machines running variants of Microsoft Windows. The first such implementation was WinMPI, an early port of MPICH by a graduate student at the University of Nebraska at Omaha. WinMPI runs on a standalone Windows 3.1 PC, and represents each parallel process by a separate Windows task. The P4 channel device was used as the lower-level communication layer and implemented as a Dynamic Link Library (DLL). Global (i.e., shared) memory allocated by the DLL is used for message exchange. The purpose of WinMPI was ostensibly to serve as a testing and training platform for parallel programming in MPI.

As can be imagined, numerous technical problems had to be solved in order to implement MPI in such a restricted environment. Most involved the addition of extensions to the WinMPI API to compensate for the limitations, which had the unfortunate side effect of requiring minor source code changes to MPI programs. For example, a different program entry point (MPI_main) was needed; all integers were declared long to bring them up to the 32-bit level; memory allocation functions were added to the DLL to bypass the allocation limitations of the medium memory model required for the pseudo-parallel Windows tasks; several I/O calls were added to the DLL to provide Graphical User Interface (GUI) analogs to the standard UNIX console I/O functions; and a processor yield function was created because of the non-preemptive nature of the Windows multitasking model.
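
Based only on the description above (the actual WinMPI headers may differ), the kind of minor source change required might look like this:

    /* Based only on the description above (actual WinMPI declarations may
       differ): the entry point is MPI_main rather than main, and integer
       data is declared long to obtain 32-bit quantities under Windows 3.1. */
    #include <mpi.h>

    int MPI_main(int argc, char *argv[])
    {
        int  rank;
        long n = 1000L;               /* 32 bits even where int is 16 bits */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... parallel work; console I/O and yields go through the WinMPI
           DLL's GUI and yield functions rather than UNIX equivalents ... */

        MPI_Finalize();
        return 0;
    }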