Parlib: A Simple Interface to Parallel Middleware

The basic library was written as part of Mathematics 5490 at the University of Wyoming during the Spring semester, 2009 by Derrick Cerwinsky. It was greatly generalized by Prof. Douglas for the 2010 Winter Enhancement Program (KE 244) at the King Abdullah University of Science and Technology (KAUST) ( and courses since then.

The code is quite robust and believed to be bug free:

( parlib)

The code is open source. It may be redistributed freely and augmentedas long as the initial comments are maintained intact.

For any given communications middleware, there are many, many functions that are not given in parlib. You are free to add new functions. It is straightforward to modify parlib for parallel middleware systems other than MPI (Message Passing Interface).

Middleware has changed over the years and putting specific calls to a particular system (e.g., MPI) is a really bad idea. Using an interface to an arbitrary middleware system with a small number of assumptions about what is provided adds a small amount of overhead, but gives your codes flexibility in the future.

In particular, you do not have to change your code in the future, just relink your code with another version of parlib.

The MPI version has been tested on the following systems:

Mac OS X 10.8 (Mountain Lion)

Mac OS X 10.7 (Lion)

Mac OS X 10.6 (Snow Leopard)

Mac OS X 10.5 (Leopard)

Linux (Redhat Enterprise Linux 5.x and 6.x, Centos Linux)

with gcc and g++. It has been tested primarily on Linux and MacOS/X, however.

A native Windows+MPI version should not be hard to build and verify that it works correctly (using either Microsoft’s or Argonne National Labs’ MPI implementations). It works on Windows 7.

Building Parlib

You need to have the correct middleware (MPI in this case) for parallel communications installed first. Look at the Makefile and make certain the correct C compiler is defined in the CC statement near the beginning of the file.

Then type

make

in a terminal (command) window. This will make the library, the C examples, and the C++ examples. The Makefile is long, appears formidable, but is highly repetitive. If you can decipher a few lines, then you can decipher it all. When in doubt, ask me for help.

If you just want just the C or the C++ examples built, but not the other, then you type either

make cexes

or

make cxxexes

in a terminal (command) window instead of just make.

Augmenting or modifying Parlib

The source code is first copied into the top level directory from one of the following subdirectories:

Parlib / This directory contains the C code and headers that are used by parlib and any programs that use parlib.
ExamplesInC / This directory contains examples in the C language.
ExamplesInC++ / This directory contains examples in the C++ language.

The MPI version is in Parlib. Other communications systems can be added in this directory.

When you want to modify something,

  • You should modify the codes in the top level directory, not the default source directories.
  • You should create new programs in the top level directory, too.
  • Do not put anything in the source directories unless you really believe that the modified forms really work (same for new codes).

If you augment Parlib (e.g., with another interface to some function in MPI), you can send a copy back to me. I will be quite grateful.

Parlib examples

There are three examples in each of the ExamplesIn... directories. (In the C++ directory, the files have a ++ after the name.)

serial_ip / This computes an inner product of two vectors using a single processor core.
serial_ax / This computes a dense matrix-vector multiply using a single processor core.
inner_prod / This computes an inner product of two vectors using one or more processes (which may be on multiple cores).

The example inner_prod is incomplete: you will write two short pieces of code that complete the program using only functions and defined constants from parlib. Do not use anything from a parallel communications middleware since that would completely eliminate the whole point of parlib!

In each program, a random number generator is used. If you want to have reproducible results, comment out the line that sets the initial seed for the random number generator.

A useful exercise is to write the following fourth example:

matrixvec_prod / This should compute a dense matrix-vector multiply using one or more processes (which may be on multiple cores).

It should be based on serial_ax (like inner_prod and serial_ip). You need to take care on the local processors to use

  • local indexing, but remember what the global row indices really are
  • rectangular data (n wide, but n/p long), not nn or (np)(np)

You must make certain the parallel version gets the same result as the serial version without wasting memory by storing all of the data in each local memory!

Minimal Code Skeleton

#include <stdlib.h>

#include <iostream.h>

#include <mpi.h>

using namespace std;

#define PAR_MPI 1

#include “parlib.h”

int main( int argc, char* argv[] ) {

intpsize;// number of processes

intrank;// this process’ rank

par_start( &argc, &argv, &psize, &rank );// start parallelism

par_end( );// end parallelism and exit

}

Parlib constant definitions

Data types / Reduce operation
PAR_NULL / PAR_NOOP
PAR_INT / PAR_MAX
PAR_FLOAT / PAR_MIN
PAR_DOUBLE / PAR_SUM
PAR_CHAR / PAR_PROD
PAR_LAND
What else is / PAR_BAND
missing? / PAR_LOR
PAR_BOR
PAR_LXOR
PAR_BXOR
PAR_MINLOC

Parlib program headers

There are two types of data types defined in parlib:

Type / Size (in bytes) / What is it?
par_request* / par_request_size / A pointer to a request structure.
par_status* / par_status_size / A pointer to a status structure.

Parlib functions

void par_start( int argc, char **argv, int* psize, int* prank )

par_start initializes the middleware using the program arguments and returns the number of processors and the processor number in the last two arguments.

void par_end( )

par_end ends the middleware and does not return.

int par_rank( )

par_rank returns the process id, which is [0, par_size()-1].

int par_size( )

par_size returns the number of processes.

double par_walltime( )

par_walltime returns the wall clock time as a double.

int par_send( void *buff, int size, int type, int dest, int tag )

par_send sends a certain number of elements of a data type and blocks. The inputs expected are: buffer address, length, data type, destination, and tag (you pick the tag and pair it with a par_recv()).

int par_recv( void* buff, int size, int type, int source, int tag )

par_recv receives a certain number of elements of a data type and blocks. The inputs expected are: buffer address, length, data type, destination, and tag (same as in a par_send()).

Issues with a send and a receive

  • Exact behavior is determined by the specific implementation of the communications library behind Parlib (e.g., MPI).
  • par_send() may behave differently with regard to buffer size, cutoffs, and blocking.
  • par_recv() always blocks until a matching message is received from a send operation. Watch out for deadlock!
  • A receive can occur without your knowledge or knowing:
  • The amount of data in the message.
  • The sender of the message.
  • The tag of the message.
  • You may have something that let’s you query what a message is. If so, use it.
  • Preferably, never be in a position to not know what you will receive.

int par_isend( void* buff, int size, int type, int dest, int tag,

par_request *request )

par_isend sends a certain number of elements of a data type and does not block. The inputs expected are: buffer address, length, data type, destination, tag, and an address for a completion status variable for later queries.

int par_irecv( void* buff, int size, int type, int source, int tag,

par_request *request )

par_irecv receives a certain number of elements of a data type and does not block. The inputs expected are: buffer address, length, data type, destination, tag, and an address for a completion status variable for later queries.

int par_wait( par_request* request, par_status* status )

par_wait waits until a request completes. The input is the request (returned by par_isend/irecv) and the status is returned in the second argument.

int par_waitall( int count, par_request* request, par_status* status )

par_waitall waits until a set of requests all complete. The input is the set of requests (returned by par_isend/irecv) and the status is returned in the second argument.

int par_waitany( int count, par_request* requests, int* index,

par_status* status )

par_waitany waits until one from a set of requests completes. The inputs are the set of requests (returned by par_isend/irecv) and the number of requests and the status is returned in the third argument.

void par_barrier( )

par_barrier stops all processes until they all reach the barrier.

int par_bcast( void* buff, int size, int type, int source )

par_bast broadcasts data. The inputs expected are: buffer address, length, the data type, and source.

int par_reduce( void* inbuff, void* outbuff, int count, int datatype,

int reduce_op, int root )

par_reduce does a reduce operation on a set of parallel data. The inputs expected are: input buffer address, output buffer address, length of input buffer, the data type, and the processor id of the root process.

Useful additions:

  • par_sendrecv(…) swaps two sets of data in place.
  • par_getsize(…) to find out how much data has been received from a send operation.
  • User created data type (e.g., for new data structures of arbitrary size).
  • What else?

Not useful additions:

  • Anything that can only conceivably be done in MPI or another specific communications system (e.g., PVM).
  • What else?

I/O issues in distributed computing:

  • Many systems only allow process 0 in the overall set of processes access to input from stdin.
  • All other processes have to have input sent from process 0.
  • Since this is a common problem, never assume that processes 1, 2, … have access to stdin (even if they really do).
  • Multiple access to a single disk file may be serialized by the operating system, independent of the communications library. Process 0 can be used to serialize I/O by accident or design.
  • Disk file access is hard to make portable.
  • The disk system may or may not be local to the processes.
  • Nonlocal disk files take a long time to read/write to.
  • Specific I/O nodes may exist on a given parallel computer and your code has to adapt to this feature for performance reasons.
  • In most parallel programs, there is a startup or preprocessing stage that can
  • Take seconds or 1-10 wall clock minutes to accomplish.
  • The parallel code then runs for wall clock hours or days making preprocessing time of no consequence except that the parallel code runs much faster as a result of the preprocessing step.
  • The preprocessing step may be horribly inefficient from either a memory or a run time viewpoint and nobody really cares because it runs only once per job (or only once per new data set).
  • Almost no one runs a parallel code for seconds total since it is not usually worth the time to parallelize an application for so small of an overall run time (unless it runs infinitely often).

Partitioning techniques:

  • Block style
  • Assign blocks of consecutive components to each process.
  • Cyclic style
  • Assign components in a round robin fashion.
  • Block-cyclic style
  • Use a cyclic distribution of blocks of components.
  • The block-cyclic usually provides the best speedup and data communication savings at the expense of a one time extreme nuisance called programming.
  • When using ScaLapack, then investigate the source of its communications library (BLACS) to find the routines that will scatter your data in a block-cyclic manner.
  • As students, watch for professors who suggest either cyclic or block-cyclic data distributions.

Example: common_code.c

double inner_product( int vlen, double* x, double *y ) {
int i;
double sum;
// Compute a local inner product.
sum = 0.0;
for( i = 0; i < vlen; i++ )
sum += ( x[i] * y[i] );
return sum;
}
double* make_vector( int vlen ) {
int i;
double *p;
double *pset; / // Allocate space and then fill in with random numbers.
// Note: rand is not ideal as a random number generator.
if ( ( p = (double*)malloc( vlen * sizeof( double ) ) ) != NULL ) {
for( i = 0, pset = p; i < vlen; i++ )
*pset++ = ((double)rand() / (double)RAND_MAX) - 0.5;
}
return p;
}
void seed4rand( ) {
struct timeval t1; /* holder for time of day (seconds, microseconds) */
gettimeofday( &t1, NULL );
srand( t1.tv_usec * t1.tv_sec );
}

Example: serial_ip++.cpp (calculate inner product)

#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
using namespace std;
#include "common_code.c"
int main( int argc, char* argv[] ) {
int vlen; // requested vector length (global)
double* x; // left vector in inner product
double* y; // right vector in inner product
double linner_prod; // inner product result
// Get a good random number seed
// Comment this out to get repeatable results
#ifndef NO_SEED
seed4rand( );
#endif
// Get the vector length / cout < "--> Vector length (global): ";
cout.flush( );
cin > vlen;
if ( vlen <= 0 ) exit( 0 );
// Allocate space for the x and y vectors (only allocate for the local
// part of the vectors).
x = make_vector( vlen );
y = make_vector( vlen );
if ( x == NULL || y == NULL ) {
cout < "Error in memory allocation: exiting." < endl;
exit( 1 );
}
// Compute the the inner product and print it out.
linner_prod = inner_product( vlen, x, y );
cout < "Inner product = " < linner_prod < endl;
// End of program: return successful status 0
return 0;
}

Example: The incomplete inner_prod++.cpp

#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <mpi.h>
using namespace std;
#define PAR_MPI 1
#include "parlib.h"
#include "common_code.c"
int main( int argc, char* argv[] ) {
int rank; // my rank
int psize; // process total
int vlen; // requested vector length (global)
int avlen; // adjusted vector length (global)
int lvlen; // adjusted vector length (local)
double* x; // left vector in inner product
double* y; // right vector in inner product
double linner_prod; // local inner product result
double ginner_prod; // global inner product result
cout.flush( );
cin > vlen;
if ( vlen <= 0 ) exit( 0 );
lvlen = vlen / psize;
avlen = lvlen * psize;
cout < "Adjusted global vector length = " < avlen;
cout < ", local vector length = " < lvlen < endl;
}
// Fill in the parallel code here to broadcast lvlen
/* ... your code here (no MPI_... anything allowed) */
// Allocate space for the x and y vectors (only allocate for the local
// part of the vectors).
x = make_vector( lvlen );
y = make_vector( lvlen );
if ( x == NULL || y == NULL ) {
cout < "Error in memory allocation on process " < rank;
cout < " Terminate run." < endl;
par_end( );
} / // Get a good random number seed
// Comment this out to get repeatable results
#ifndef NO_SEED
seed4rand( );
#endif
// Start the parallel middleware.
par_start( argc, argv, &psize, &rank );
// Print out how many cores are being used
// and get the global vector length (which may be adjusted)
if ( rank == 0 ) {
cout < psize;
if ( psize > 1 )
cout < " processes will be used in this program run" < endl;
else
cout < " process will be used in this program run" < endl;
cout.flush( );
cout < "--> Vector length (global): ";
// Compute the local part of the inner product.
ginner_prod = linner_prod = inner_product( lvlen, x, y );
cout < "Rank " < rank <
", local inner product = " < ginner_prod < endl;
cout.flush( );
// Fill in the parallel code here to do a reduce or send/receive to
// get the global inner product
/* ... your code here (no MPI_... anything allowed) */
// Print final inner product
if ( rank == 0 )
cout < "Global inner product = " < ginner_prod < endl;
// End the parallel middleware (par_end does not return)
par_end( );
}