Block Mapping - A Compiler Driven I/O Management Study

Wendy Zhang

Department of Computer Information Systems

Southern University at New Orleans

New Orleans, LA, U.S.A.

Ernst L. Leiss

Computer Science Department

University of Houston

Houston, TX, U.S.A

Most scientific programs have large input and output data sets that require out-of-core programming or the use of virtual memory management (VMM). VMM is not an effective approach because it easily leads to substantial performance degradation. In contrast, compiler driven I/O management allows a program’s data sets to be retrieved in parts, called blocks; the out-of-core program defines the partitioning of the data sets and controls the I/O. This study shows that block mapping I/O management may improve the performance of out-of-core problems by an order of magnitude or more. The study also shows that the block size chosen to partition the data sets plays an important role in compiler driven I/O management.

KEY WORDS: I/O management, block mapping, VMM, out-of-core, compiler

  1. Introduction

Most significant scientific programs’ data sets exceed the available memory; thus if one wants to avoid virtual memory management (which often suffers a severe performance penalty), these programs must be executed out-of-core, an approach that requires the explicit movement of data between main memory and secondary storage.

In the last twenty years, processor speed has increased faster than memory and disk access speeds. It is therefore important to use the memory hierarchy efficiently and to construct I/O-minimal programs.

A Virtual Memory Management (VMM) system combines special-purpose memory management hardware and operating system software to provide the illusion of an unlimited amount of main memory [1]. A VMM system uses a static policy for storage management. As pointed out in [3], this automatic, uniform-page transfer method is convenient to use, but the programmer is no longer fully in control of the program. VMM systems perform poorly on many scientific applications [6].

COMANCHE (an acronym for COmpiler MANaged caCHE) is a compiler run-time I/O management system [6]. It combines a compiler with a user-level runtime system and replaces VMM for applications with out-of-core data sets. More details about COMANCHE are given in Section 2.

Blocking or tiling [4][8] is a well-known technique that improves the data locality of numerical algorithms. Blocking can be used to achieve locality at different levels of the memory hierarchy. [7][2] describe a locality optimization algorithm that applies a combination of loop interchange, skewing, reversal, and tiling to improve the data locality of loop nests. [3] describes an I/O-minimal algorithm, in which, for a given amount of space, the number of retrievals of blocks from external memory is minimal. [3] also points out that the notion of I/O minimality depends on the block size available in memory.
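To illustrate the general idea of tiling (a generic sketch, not one of the cited algorithms), the following C fragment blocks the classic triple loop so that each tile of the operands is reused while it resides in a faster level of the memory hierarchy; the tile size T = 64 is an assumed tuning parameter.

#define T 64   /* assumed tile size; tuned to the memory level of interest */

/* Tiled (blocked) matrix multiplication: C += A * B, all n x n, row major. */
void tiled_matmul(int n, const double *A, const double *B, double *C)
{
    int ii, jj, kk, i, j, k;

    for (ii = 0; ii < n; ii += T)
        for (kk = 0; kk < n; kk += T)
            for (jj = 0; jj < n; jj += T)
                /* multiply one tile; the bounds handle n not divisible by T */
                for (i = ii; i < ii + T && i < n; i++)
                    for (k = kk; k < kk + T && k < n; k++)
                        for (j = jj; j < jj + T && j < n; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}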

Out-of-core matrix multiplication is used in this paper to study the importance of blocking. We sketch an I/O-minimal algorithm for matrix multiplication that forms the basis of this work. In Section 3, we discuss the minimal amount of space required to carry out the computation as well as the I/O scheme used with the Comanche system.

In testing the proposed I/O minimal matrix multiplication algorithm, it has become clear that the size of the blocks for the sub-matrices plays a very important role. In Section 4, we will compare the performance for different block sizes. In Section 5, we draw conclusions from the results of our study.

2. COMANCHE

Comanche [6] is a runtime system. It is written in C and uses the standard I/O library calls fopen, fclose, fread, fwrite, and fseek. The current Comanche system runs under the Red Hat Linux 5.0 operating system on a PC with a PentiumPro (160 MHz or faster) processor and 64MB or more of RAM.

The standard entity in the Comanche runtime is a two-dimensional array. Higher dimensions can be supported, or else the accesses can be translated to two-dimensional accesses. Data are assumed to be in row major layout on disk. The ideal array is a square matrix.
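As an illustration of these conventions (a generic sketch, not part of the Comanche interface), a three-dimensional access can be translated into a two-dimensional one, and the row-major layout then fixes the element's position in the on-disk file:

/* Illustration only, not Comanche code: a 3-D access X[i][j][k] on an     */
/* n1 x n2 x n3 array can be treated as a 2-D access on an n1 x (n2*n3)    */
/* array; with the assumed row-major layout, the element then lives at the */
/* following byte offset in the on-disk file.                              */
long disk_offset_3d(long i, long j, long k, long n2, long n3)
{
    long row = i;             /* row of the equivalent 2-D array           */
    long col = j * n3 + k;    /* column of the equivalent 2-D array        */
    return (row * (n2 * n3) + col) * (long)sizeof(double);
}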

Comanche is based on a compiler-managed cache in which main memory serves as the cache. The compiler has unique information about the data needs of the program, the size and shape of its arrays, and its access patterns. Note that it is this information that typically is not available to the operating system, which is customarily charged with the management of I/O calls. The out-of-core program is restructured to move data automatically between main memory and disk.

There are two structures declared in the Comanche system. One is a block structure used in buffers to keep a block of data.

typedef struct block_s
{
    double *data;        /* the block's elements                       */
    int flag;            /* read-only / write-only / read-write flag   */
    int refcnt;          /* reference count of attaches                */
    int rowindex;        /* row position of the block in the matrix    */
    int columnindex;     /* column position of the block in the matrix */
    int nrow;            /* number of rows in the block                */
    int ncol;            /* number of columns in the block             */
    int blockno;         /* block number                               */
} Block;

The other structure is a matrix structure. This structure holds the information of the matrix on disk and has buffers to hold several blocks in memory.

typedef struct array_s
{
    int nblocks;         /* number of blocks in the matrix             */
    int nelems;          /* number of elements in the matrix           */
    int bsize;           /* block size                                 */
    int elesize;         /* size of one element in bytes               */
    FILE *fp;            /* file holding the matrix on disk            */
    long offset;         /* offset of the matrix data in the file      */
    char *name;          /* name of the file                           */
    int *map;            /* map from block number to buffer            */
    int nbuffers;        /* number of in-memory block buffers          */
    Block **buffers;     /* the in-memory block buffers                */
    int victim;          /* candidate buffer for replacement           */
    int flag;            /* access flag for the matrix                 */
    int total_mem;       /* total memory used by the buffers           */
} Matrix;

Besides these structures, Comanche provides several functions. The two major functions, block_attach and block_release, perform the block mapping.

When the data set is too large to fit into memory, VMM maps the array into a one-dimensional vector and cuts that vector up into uniform pieces (pages). Comanche instead takes a row of an array, or a sub-matrix, and cuts it up into blocks, so that a block corresponds to part of a single row or to a sub-matrix; which blocks reside in memory is under the control of the resulting out-of-core program. The block_attach and block_release functions are used to improve data locality and thus achieve better performance than VMM.

The block_attach function tells the runtime system that a block is to be mapped into memory and returns an address to the cached data. The runtime system will not reclaim memory for a block as long as it remains attached, no matter how much time has passed since the block was last referenced. The out-of-core program controls how long data remain mapped and ensures that the attached blocks do not over-fill the available memory before their subsequent release.

Read-only, write-only, and read-write flags are passed to these functions. If the flag is write-only, block_attach only allocates buffer space for the block; there is no need to read data from disk. If the flag permits writing, the runtime system must write the block back to disk when the block is released.
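To illustrate how an out-of-core program drives this interface, here is a minimal sketch that sums the elements of one block; it assumes the block_attach/block_release signature used by the code in Section 4.2 (matrix handle, block number, flag, starting row and column, block dimensions) and takes -1 to denote a read-only attach, as that code suggests.

/* Sketch only: sum the elements of one block of an out-of-core matrix.    */
/* Assumes the block_attach/block_release interface used in Section 4.2,   */
/* with -1 denoting a read-only attach.                                    */
double block_sum(Matrix *mptr, int bno, int row, int col, int bs)
{
    double *p, sum = 0.0;
    int i;

    p = block_attach(mptr, bno, -1, row, col, bs, bs);  /* map block in    */
    for (i = 0; i < bs * bs; i++)                       /* use cached data */
        sum += p[i];
    block_release(mptr, bno);                           /* allow reclaim   */
    return sum;
}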

3. Block matrix multiplication

The object of our study of the Comanche runtime system is matrix multiplication. Assume that there are three matrices A, B, and C of size (N, N) on disk. We multiply A by B and store the result in matrix C. Storing the matrices A, B, and C in memory requires 3N² space. If N is large, the total amount of main memory available is less than the required 3N². Since the data cannot be loaded entirely into memory, the problem becomes out-of-core.

The traditional way of coding this in C is something like this:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
    {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
    }

Note that although the same row of A is reused in the next iteration of the middle loop, a large volume of data used in the intervening iterations may be replaced. During processing, a virtual memory system would do much swapping, which is very time consuming.

One solution to this out-of-core problem is to split each matrix into several sub-matrices (blocks) of size (M, M). Specifically, if the dimensions of the matrix are divisible by the dimensions of the block, we can use an algorithm suggested by the following observations.

Assume M is one half of N; then each matrix can be split into four sub-matrices of size (N/2, N/2). Schematically we have:

[ A11  A12 ]   [ B11  B12 ]   [ C11  C12 ]
[ A21  A22 ] X [ B21  B22 ] = [ C21  C22 ]

The Cij’s are defined as follows:

C11 = A11 * B11 + A12 * B21

C12 = A11 * B12 + A12 * B22

C21 = A21 * B11 + A22 * B21

C22 = A21 * B12 + A22 * B22

Assuming all Ci,j are initialized to zero, the following instructions can be executed in any order:

C11 = C11 + A11 * B11

C11 = C11 + A12 * B21

C12 = C12 + A11 * B12

C12 = C12 + A12 * B22

C21 = C21 + A21 * B11

C21 = C21 + A22 * B21

C22 = C22 + A21 * B12

C22 = C22 + A22 * B22

A brute force approach might retrieve each of the twenty-four sub-matrix (block) operands appearing in these eight statements from disk, regardless of reusability.

In out-of-core programming, we can sequence the instructions in order to reuse sub-matrices as often as possible. We use the Comanche runtime function block_attach when we need a block and block_release when we no longer need that block.

Assume that the amount of available main memory is equal to four sub-matrices, i.e., 4 * N²/4 = N². This means that we are able to keep a whole row of blocks of the matrix A (two blocks), one block of the matrix B, and one block of the matrix C in memory during processing. The following sketches an efficient approach (see [3]):

C11 A11 B11   // Attach C11, A11, B11
C11 B21 A12   // Release B11; attach B21, A12
C12 A11 B12   // Release B21, C11; attach C12, B12
C12 B22 A12   // Release B12; attach B22
C21 A21 B11   // Release all; attach C21, A21, B11
C21 B21 A22   // Release B11; attach B21, A22
C22 A21 B12   // Release B21, C21; attach C22, B12
C22 B22 A22   // Release B12; attach B22, A22
              // Release all

In this scheme, we only retrieve eight blocks from disk and store four blocks to disk.

We use a similar, but more complicated scheme if N is not divisible by M.

4. Experiments

We have implemented the algorithm suggested in Section 3 as an out-of-core program written in C. The C compiler under Linux generates working code that uses the Comanche runtime system. The code was run in command mode under Red Hat Linux 5.0 on a single-processor PC with a 333 MHz Pentium II microprocessor. The system used for the performance measurements has 64MB of memory, of which about 50MB are available to the program.

We have run one set of experiments for 1024 x 1024 double precision matrix multiplication. The total data set space is 1024 x 1024 x 8 x 3=24MB. Another set of experiments was run for 2048 x 2048 double precision matrix multiplication. The total data set space is 2048 x 2048 x 8 x 3 = 96MB. Data files are initialized with random values in the interval (-1, +1).

The Linux implementation requires that we execute tests on an unloaded machine and use elapsed time. For each experiment, we have recorded the following:

  1. Block size: given as number of rows and columns.
  2. User time: time used by executing the instructions of the program, in seconds.
  3. System time: time used by system function calls, in seconds.
  4. Elapsed: time used by executing the whole program.
  5. Page fault: number of page faults in the virtual memory system.
  6. # of Seek: the number of seek function calls used in the process.
  7. # of Read: the number of read function calls used to map data from disk to memory.
  8. # of Write: the number of write function calls used to map data from memory to disk.
  9. Memory usage: the memory locations allocated to the data structure used to hold the blocks in memory, which is mainly one row of blocks plus two blocks.

During initial testing we found that block mapping large files has a significant impact on the system resources. We did additional tests to assess the impact of different block sizes on running time.

4.1 Matrix multiplication

In ordinary matrix multiplication (C = A x B), the data sets of the matrices A, B, and C are mapped into memory with the mmap system call; this exercises the virtual memory system by mapping the files into memory.
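The following is a minimal sketch of how such a mapping might be set up; the file name, the layout (A, B, and C stored consecutively in one file, as the matmul code below assumes), and the error handling are assumptions for illustration.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: map a file holding A, B, and C (each n x n doubles, stored      */
/* consecutively) into memory and return a pointer to A.                   */
double *map_matrices(const char *path, int n)
{
    size_t bytes = (size_t)3 * n * n * sizeof(double);
    int fd;
    void *p;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* mapping remains valid after close */
    return (p == MAP_FAILED) ? NULL : (double *)p;
}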

The original code for matrix multiplication C = A x B (VMM) is given as follows:

double matmul(int n, double *A)
{
    int i, j, k;
    double dval;
    double *a, *b, *c;
    double *B, *C;
    double sum;

    B = A + n*n;            /* B is stored right after A              */
    C = B + n*n;            /* C is stored right after B              */
    a = A;                  /* a -> row i of A                        */
    b = B;
    c = C;                  /* c -> row i of C                        */
    sum = 0.0;
    for (i = 0; i < n; i++)
    {
        for (j = 0; j < n; j++)
        {
            dval = 0.0;
            for (k = 0; k < n; k++)
                dval += a[k] * b[k*n + j];
            c[j] = dval;
            sum += dval;    /* checksum over all elements of C        */
        }
        a += n;             /* advance to the next row of A and C     */
        c += n;
    }
    return sum;
}

The performance values are shown in Figures 4.1 and 4.2. Compared with block mapping, this method causes many more major page faults, so the elapsed time is very large (over half an hour for the smaller problem and 4.5 hours for the larger one).

Regular matrix multiplication (VMM): C = A x B
User time: 1865.48 / System time: 0.23 / Elapsed: 31:05.7 / Page fault (major): 6229 / Page fault (minor): 11
Figure 4.1 Performance of matrix multiplication for matrix size 1024 x 1024

Regular matrix multiplication (VMM): C = A x B
User time: 15,993.07 / System time: 16.42 / Elapsed: 4:31:27 / Page fault (major): 180,538 / Page fault (minor): 11
Figure 4.2 Performance of matrix multiplication for matrix size 2048 x 2048

4.2 The dimensions of the matrix are divisible by the dimensions of the block

To implement block multiplication, we divide each matrix into blocks of size M x M. Then we perform the block multiplication described in Section 3. We map one row of blocks of matrix A into memory, as well as one block of B and one block of C. After each block multiplication, we store the block of matrix C and release the block of matrix B. We release all blocks after we finish one row of blocks and start over with the next row of blocks until the whole matrix multiplication is finished. The block multiplication code is given as follows:

/* Requires <assert.h> and the Comanche declarations of Matrix,            */
/* block_attach, block_release, and sub_matmul.                            */
double matmul(int n, Matrix *mptr, int bs)
{
    int i, j, k;
    int bindex = n, cindex = n + n;   /* starting rows of B and C in the   */
                                      /* combined on-disk data set          */
    double *a, *b, *c;
    double sum = 0.0;
    int bkrow = n/bs;                 /* number of blocks per row           */
    int bno, cno, ano, newbindex;

    for (i = 0; i < n; i += bs)
    {
        for (j = 0; j < n; j += bs)
        {
            /* attach the first A block of the row, the C block, and the   */
            /* first B block of the column                                 */
            ano = i/bs * bkrow;
            a = block_attach(mptr, ano, -1, i, 0, bs, bs);
            assert(a);
            cno = (cindex + i)/bs * bkrow + j/bs;
            c = block_attach(mptr, cno, 0, cindex + i, j, bs, bs);
            assert(c);
            bno = bindex/bs * bkrow + j/bs;
            b = block_attach(mptr, bno, -1, bindex, j, bs, bs);
            assert(b);
            sum += sub_matmul(a, b, c, bs, bs, bs);
            for (k = 1; k < bkrow; k++)
            {
                /* walk along the row of A and down the column of B        */
                block_release(mptr, bno);
                newbindex = bindex + k * bs;
                bno = newbindex/bs * bkrow + j/bs;
                b = block_attach(mptr, bno, -1, newbindex, j, bs, bs);
                assert(b);
                ++ano;
                a = block_attach(mptr, ano, -1, i, k*bs, bs, bs);
                assert(a);
                sum += sub_matmul(a, b, c, bs, bs, bs);
            }
            block_release(mptr, bno);
            block_release(mptr, cno);   /* write the finished C block back */
        }
        for (k = 0; k < bkrow; k++)     /* release the row of A blocks     */
            block_release(mptr, ano--);
    } /* end of row loop */
    return sum;
}
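The routine sub_matmul called above multiplies one bs x bs block of A by one bs x bs block of B and accumulates the result into the attached block of C; its code is not shown in this paper. The following is a minimal sketch of what such a routine might look like, assuming the blocks are stored contiguously in row-major order and that the routine returns the sum of the values it adds to C (so that the caller's checksum matches the VMM version).

/* Assumed sketch of sub_matmul, not code from the Comanche runtime:       */
/* c (m x p) += a (m x n) * b (n x p), blocks contiguous in row-major      */
/* order. Returns the sum of the values added to c.                        */
double sub_matmul(double *a, double *b, double *c, int m, int n, int p)
{
    int i, j, k;
    double dval, sum = 0.0;

    for (i = 0; i < m; i++)
        for (j = 0; j < p; j++)
        {
            dval = 0.0;
            for (k = 0; k < n; k++)
                dval += a[i*n + k] * b[k*p + j];
            c[i*p + j] += dval;
            sum += dval;
        }
    return sum;
}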

For comparison, we have measured the performance of the matrix multiplication for different block sizes. We did experiments on the block sizes 512 x 512, 256 x 256, 128 x 128, 64 x 64, 32 x 32, 16x16, and 8 x 8. We applied these different block sizes to the data sets of 24MB and 96MB. The performance values are listed in Figures 4.3 and 4.4.

From the experiments we see that block multiplication uses only 25% to 50% of the execution time of regular multiplication (VMM). The one exception is the extremely small block size (8 x 8).

4.3 The dimensions of the matrix are not divisible by the dimensions of the block

To implement the block multiplication, we divide the N x N matrix into blocks of size M x M. Since the dimensions of the matrix are not divisible by the dimensions of the block, the last block in a row of blocks is of size M x N%M, the blocks in the last row of blocks are of size N%M x M, and the last block of the matrix is of size N%M x N%M. Before we perform the block multiplication, we save the block dimensions in a table. Then we apply the block multiplication sketched in Section 3. We map one row of blocks of A into memory, as well as one block of B and one block of C. After each macro block multiplication, we store the block of C and release the block of B. We release all blocks after we finish one row of blocks and start over with the next row of blocks until the whole matrix multiplication is finished.

Since we have to maintain blocks of different dimensions and perform macro multiplications of different dimensions, the code is more complicated; it is not given here due to space limitations, but a sketch of the block-dimension bookkeeping follows.
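As an illustration (an assumed sketch, not the code used in the experiments), the dimensions of each block can be derived from N and M as follows; the block-dimension table mentioned above can be filled by calling such a helper for every block index.

/* Sketch: number of rows (or columns) of the block with index bi along    */
/* one dimension of an N x N matrix split into blocks of nominal size M.   */
/* Every block has M rows except possibly the last one, which has N % M.   */
int block_dim(int N, int M, int bi)
{
    int nblocks = (N + M - 1) / M;   /* number of blocks per dimension     */
    if (bi < nblocks - 1 || N % M == 0)
        return M;                    /* interior (full-size) block         */
    return N % M;                    /* last, partial block                */
}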

For comparison, we measured the performance of the matrix multiplication for different block sizes (Figures 4.5 and 4.6 below).

Case Number / 1 / 2 / 3 / 4 / 5 / 6 / 7
Block Size / 512 x 512 / 256 x 256 / 128 x 128 / 64 x 64 / 32 x 32 / 16 x 16 / 8 x 8
User time / 1023.82 / 587.61 / 438.03 / 446.15 / 340.15 / 428.33 / 963.67
System time / 12.09 / 4.52 / 11.75 / 32.26 / 105.97 / 400.82 / 1,589.48
Elapsed / 17:06.0 / 09:52.0 / 07:30.0 / 07:59.0 / 07:46.0 / 13:50.0 / 42:33.0
Page fault (major) / 97 / 97 / 97 / 97 / 97 / 97 / 97
Page fault (minor) / 2065 / 787 / 344 / 158 / 85 / 59 / 80
# of Seek / 40,960 / 122,880 / 409,600 / 1,474,560 / 5,570,560 / 21,626,880 / 85,196,800
# of Read / 30,720 / 102,400 / 368,640 / 1,392,640 / 5,406,720 / 21,299,200 / 84,541,440
# of Write / 10,240 / 20,480 / 40,960 / 81,920 / 163,840 / 327,680 / 655,360
Memory usage / 8,388,863 / 3,146,199 / 1,311,911 / 593,607 / 292,103 / 186,759 / 267,911
Figure 4.3 Performance for matrices of size 1024 x 1024, dimensions divisible by block size
Case Number / 1 / 2 / 3 / 4 / 5 / 6 / 7
Block Size / 512 x 512 / 256 x 256 / 128 x 128 / 64 x 64 / 32 x 32 / 16 x 16 / 8 x 8
User time / 8170.23 / 5815.59 / 3479.9 / 3545.58 / 2709.36 / 3589.8 / 10,257.57
System time / 33.99 / 55.65 / 110.1 / 271.85 / 872.67 / 3286.81 / 13,242.61
Elapsed / 2:30:42 / 1:40:08 / 1:09:51 / 1:32:20 / 1:10:02 / 2:02:11 / 6:41:08
Page fault (major) / 101,782 / 84,780 / 90,329 / 82,097 / 82,152 / 82,135 / 90,282
Page fault (minor) / 3,091 / 1,304 / 608 / 289 / 158 / 129 / 242
# of Seek / 245,760 / 819,200 / 2,949,120 / 11,141,120 / 43,253,760 / 170,393,600 / 676,331,520
# of Read / 204,800 / 737,280 / 2,785,280 / 10,813,440 / 42,598,400 / 169,082,880 / 673,710,080
# of Write / 40,960 / 81,920 / 163,840 / 327,680 / 655,360 / 1,310,720 / 2,621,440
Memory usage / 12,583,383 / 5,244,071 / 2,363,079 / 1,127,687 / 592,263 / 467,591 / 927,879
Figure 4.4 Performance for matrices of size 2048 x 2048, dimensions divisible by block size
Case number / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
Block size / 500 x 500 / 260 x 260 / 130 x 130 / 65 x 65 / 33 x 33 / 17 x 17 / 15 x 15 / 9 x 9
User time / 1389.07 / 1019.51 / 905.48 / 911.2 / 876.65 / 965.33 / 992.9 / 1323.09
System time / 3.85 / 5.21 / 12.35 / 33.98 / 107.8 / 367.94 / 475.55 / 1265.6
Elapsed / 23:28.0 / 17:05.0 / 15:18.0 / 15:45.0 / 16:25.0 / 22:35.0 / 24:28.0 / 43:09.0
Page fault (major) / 97 / 97 / 97 / 97 / 97 / 6,258 / 97 / 97
Page fault (minor) / 2,459 / 812 / 355 / 170 / 119 / 170 / 199 / 462
# of Seek / 76,800 / 122,880 / 409,600 / 1,474,560 / 5,570,560 / 19,676,160 / 25,082,880 / 67,706,880
# of Read / 61,440 / 102,400 / 368,640 / 1,392,640 / 5,406,720 / 19,363,840 / 24,729,600 / 67,123,200
# of Write / 15,360 / 20,480 / 40,960 / 81,920 / 163,840 / 312,320 / 353,280 / 583,680
Memory usage / 10,000,351 / 3,245,271 / 1,353,191 / 612,183 / 309,783 / 192,639 / 187,551 / 235,359
Figure 4.5 Performance for matrices of size 1024 x 1024, dimensions not divisible by block size
Case number / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
Block size / 500 x 500 / 260 x 260 / 130 x 130 / 65 x 65 / 33 x 33 / 17 x 17 / 15 x 15 / 9 x 9
User time / 11077.7 / 8178.1 / 7215.81 / 7256.11 / 6996.72 / 7784.24 / 8188.66 / 122248.54
System time / 77.03 / 67.32 / 127.15 / 299.93 / 884.59 / 3056.78 / 3789.09 / 10455.48
Elapsed / 3:21:37 / 2:25:38 / 2:32:20 / 2:26:00 / 2:18:40 / 3:08:57 / 3:28:05 / 6:28:04
Page fault (major) / 205,820 / 93,221 / 86,292 / 82,086 / 82,062 / 82,145 / 82,057 / 82,172
Page fault (minor) / 3,437 / 1,345 / 634 / 327 / 280 / 557 / 681 / 1,736
# of Seek / 358,400 / 819,200 / 2,949,120 / 11,141,120 / 41,932,800 / 152,401,920 / 195,000,320 / 536,985,600
# of Read / 307,200 / 737,280 / 2,785,280 / 10,813,440 / 41,287,680 / 151,162,880 / 193,597,440 / 534,650,880
# of Write / 51,200 / 81,920 / 163,840 / 327,680 / 645,120 / 1,239,040 / 1,402,880 / 2,334,720
Memory usage / 14,000,615 / 5,409,191 / 2,437,383 / 1,162,775 / 616,311 / 464,559 / 480,495 / 781,191
Figure 4.6 Performance for matrices of size 2048 x 2048, dimensions not divisible by block size

We did experiments for block sizes 500 x 500, 260 x 260, 130 x 130, 65 x 65, 33 x 33, 17 x 17, 15 x 15, and 9 x 9. We applied these block sizes to the data sets of 24MB and 96MB. The performance values are listed in Figures 4.5 and 4.6.

From the experiments, we observe that our block multiplication uses only 50% to 75% of the execution time of regular multiplication. The one exception is again a tiny block size (9 x 9). We also see that this block multiplication takes about double the time of the case where the dimensions of the matrix are divisible by the dimensions of the block.

5. Conclusion

Data sets arising in scientific and engineering programs are often too large to fit into the available memory. In this case, data must be transferred from and to remote storage, such as disk. These I/O transfers can quickly dominate the execution time of the program. It becomes critically important to minimize the I/O between main memory and disk.

Building on previous work on Comanche [6], we proposed an approach to improving the computation of out-of-core problems. Our extension of Comanche involves the use of block mapping to minimize I/O.

The results of our experiments on block matrix multiplication indicate that our approach outperforms regular matrix multiplication by a significant factor, except when the block size is extremely small.

We have tested our matrix multiplication using a wide variety of block sizes, both where the dimensions of the matrix are divisible by the dimensions of the block and where they are not. Not surprisingly, the results for the first group are significantly better than those for the second. It follows that one should choose block dimensions that divide the dimensions of the matrix.

Our results indicate that, using elapsed time as the primary measure, blocks of size 128 x 128 and 32 x 32 perform better than the others. Overall, we conclude that the block size plays a crucial role in block mapping I/O minimization.

References

[1] K. Hwang and F. A. Briggs, “Computer Architecture and Parallel Processing”, McGraw-Hill, Inc., New York, 1984.

[2] M. S. Lam, E. E. Rothberg, and M. E. Wolf, “The Cache Performance and Optimizations of Blocked Algorithms”, in 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 63-74, Santa Clara, Calif., USA, 1991.