
Sylvain EUDIER Union College

MSCS Candidate

- Graduate Seminar -

In-Memory Computing:

A Solution To The Von Neumann Bottleneck

Winter 2004

I. Introduction to a new architecture:

1. The von Neumann architecture:

Today’s computers are all based on the von Neumann architecture, and it is important to understand this concept before reading the rest of this paper.

A computer as we know it is actually a von Neumann machine, which consists of three parts:

-  a central processing unit, also called the CPU,

-  a store, usually the RAM,

-  a connecting tube that transmits a word of data, or an address, between the store and the CPU.

J. W. Backus proposed to call this tube the von Neumann bottleneck: 'The task of a program is to change the store in a major way; when one considers that this task must be accomplished entirely by pumping single words back and forth through the von Neumann bottleneck, the reason for its name becomes clear.'

We can naturally wonder: what is the influence of the von Neumann bottleneck on today’s architectures? We will shed light on this in the following paragraph.

2. A little bit of history:

For many years, processor performance doubled every 18 to 24 months, closely following Moore’s law [1]. This increase in performance was possible partly because die sizes increased as well. Increasing the die, however, also increases the maximal distance between two arbitrary points on the processor measured in clock cycles: signals need more and more cycles to cross the chip.

To mitigate this problem, pipelining was invented and is now used widely. However, such a solution has its drawbacks: it increases some latencies, such as cache access; it hurts branch prediction (if a prediction turns out to be false, all the instructions speculatively issued into the pipeline have to be discarded); and it obviously complicates the processor design. The small experiment below makes the misprediction cost visible.
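
To make this cost concrete, here is a small C experiment (my own illustration, not taken from any of the cited papers): the same loop runs much faster on sorted data, because the branch predictor almost always guesses right and the pipeline is rarely flushed.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    static int cmp(const void *a, const void *b) {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    /* Sum only the "large" bytes: the if() is unpredictable on random
       data, so the pipeline is flushed on roughly every second element. */
    static long branchy_sum(const unsigned char *v, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++)
            if (v[i] >= 128)
                sum += v[i];
        return sum;
    }

    int main(void) {
        unsigned char *v = malloc(N);
        for (long i = 0; i < N; i++) v[i] = rand() & 0xFF;

        clock_t t0 = clock();
        long s1 = branchy_sum(v, N);   /* random order: many mispredictions */
        clock_t t1 = clock();
        qsort(v, N, 1, cmp);           /* sorting is not timed             */
        clock_t t2 = clock();
        long s2 = branchy_sum(v, N);   /* sorted: predictor almost perfect */
        clock_t t3 = clock();

        printf("unsorted %ld ms, sorted %ld ms (sums %ld == %ld)\n",
               (long)((t1 - t0) * 1000 / CLOCKS_PER_SEC),
               (long)((t3 - t2) * 1000 / CLOCKS_PER_SEC), s1, s2);
        free(v);
        return 0;
    }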

A second major problem in today’s architectures is the growing gap between processor speed and memory speed. While the former has seen its performance increase by 66% a year, the latter has improved by only 7% a year (essentially in bandwidth); the ratio between the two therefore compounds by roughly 55% every year. This has created a serious problem for data-intensive applications.

3. Some solutions

To bridge these growing gaps, many methods have been proposed:

On the processor side:

-  Caching: Caching has been the most widely used technique to reduce this gap. The idea is to keep useful data (usually selected by the number of accesses to a location) in a very small, very fast memory with higher bandwidth, providing fast access to that particular information without having to look in the main memory, which is much slower but also much bigger. A toy model follows this list.

-  Prefetching: This is a kind of prediction in which data is fetched into the cache even before it has been requested; a short sketch follows this list.

-  Multithreading: This is an attempt to hide latencies by working on multiple processes at the same time. If a process has to wait for data before it can continue, the processor pauses it and switches to another one whose data is available. Intel recently refined this idea with its Hyper-Threading technology: the Pentium 4 pipeline is so long that it was hard to reach 100% CPU load with processing-intensive applications, so Hyper-Threading presents two logical processors that can share the workload and balance CPU utilization.
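
To illustrate the caching idea, here is a toy direct-mapped cache model in C (a minimal sketch with invented parameters, not a model of any real processor): addresses whose line already holds the right tag are served without touching the slow main memory.

    #include <stdio.h>
    #include <stdint.h>

    #define LINES      256   /* cache lines (invented size)       */
    #define LINE_WORDS 4     /* words per line: spatial locality  */

    static struct { uint32_t tag; int valid; } cache[LINES];
    static long hits, misses;

    static void access_word(uint32_t addr) {
        uint32_t line  = addr / LINE_WORDS;   /* line the word belongs to */
        uint32_t index = line % LINES;        /* where that line may live */
        uint32_t tag   = line / LINES;        /* identity check           */
        if (cache[index].valid && cache[index].tag == tag)
            hits++;                           /* fast path: no DRAM visit */
        else {
            misses++;                         /* slow path: go to DRAM    */
            cache[index].valid = 1;
            cache[index].tag   = tag;
        }
    }

    int main(void) {
        /* Scan 1024 words twice: the first pass misses once per line,
           the second pass hits everywhere because the data now fits. */
        for (int pass = 0; pass < 2; pass++)
            for (uint32_t a = 0; a < 1024; a++)
                access_word(a);
        printf("hits %ld, misses %ld\n", hits, misses); /* 1792 / 256 */
        return 0;
    }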
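
Prefetching can also be requested from software. The sketch below uses GCC's __builtin_prefetch intrinsic (the intrinsic itself is real; the prefetch distance of 16 elements is an arbitrary choice for illustration):

    /* While element i is being processed, element i+16 is already being
       pulled into the cache, hiding part of the memory latency. */
    void scale(float *v, long n, float k) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&v[i + 16], 0, 1); /* read, low reuse */
            v[i] *= k;
        }
    }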

Despite all these improvements, some of them are of limited use for data-intensive applications. We will see that memory designers have also tried to soften these gaps.

On the memory side:

-  Access times: This was the first target for improvement. When a processor wants to access a word in memory, it sends an address and gets back a word of data. However, the way most programs are written, a program that accesses data at address n is likely to access the data at n+1 next, so memories are optimized for sequential bursts. Memory specifications therefore quote access times as four numbers giving how many cycles are needed to access the first, second, third and fourth words of a burst (for example: 3-1-1-1). A worked example follows this list.

-  DDR-SDRAM: DDR-SDRAM (Double Data Rate SDRAM) is an evolution of classic SDRAM (Synchronous Dynamic Random Access Memory) in the way data is accessed: data is transferred on both the rising and the falling edge of the clock, thereby doubling the effective bandwidth.

-  RAMBUS: This is a different interface, based on a simplified bus that carries data at higher clock frequencies. When two independent channels are used, the bandwidth can be doubled.
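
As a rough worked example (the bus width and clock are chosen for illustration, not taken from any chart in this paper): with a 64-bit bus at 100 MHz, a 3-1-1-1 burst delivers 4 words × 8 bytes in 6 cycles = 60 ns, about 533 MB/s, whereas a memory needing 4 cycles for every word would reach only 8 bytes / 40 ns = 200 MB/s. Transferring data on both clock edges, as DDR does, doubles these figures for the same clock frequency.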

To summarize the impact of all these performance disparities and architectural differences, let us have a look at the following chart.

Figure 2: Memory bandwidth in a computer with 256 MB of memory built from 16 MB, 50 ns DRAM chips and a 100 MHz CPU

We clearly see that despite all these improvements, the bottleneck is still present in this architecture. And even though Rambus technology offers a wide bandwidth, it falls short of what is available at the sense amplifiers inside the memory by a factor of almost a thousand.

Can we avoid the bottleneck and take advantage of this so-important bandwidth?

4. A new architecture:

Instead of keeping the processor-centric architecture, researchers proposed a few years ago to switch to a memory-centric architecture, following the observation of D. Elliott [2] that memory chips have a huge internal bandwidth and that the memory pins are the cause of its degradation. The main idea is to fuse the storage and the processing elements together on a single chip, creating memories with processing capacity. Several denominations for this idea have been around, the most common being intelligent RAM (IRAM), processor in memory (PIM) and smart memories. As we will see later, the denomination is related to the application and design of these chips: main processor of the system, special purposes, and so on. To increase the amount of memory, current smart memories often use SDRAM instead of SRAM.

II. Different architectures

1. The IRAM architecture

The IRAM acronym stands for Intelligent RAM. It was developed at the University of California, Berkeley. IRAM is designed to be a stand-alone chip, allowing the memory to be the only processing element in any kind of computing machine, e.g. PDAs or cell phones.

A prototype was taped out by IBM in October 2002, with a total of 72 chips on the wafer. I e-mailed the testing group, and the only reply I received was that the testing is still in progress. However, some estimates can give us an idea of the power available from these memories: for a 200 MHz memory, the expected processing power is 200 MHz × 2 ALUs × 8 data items per clock cycle = 3.2 Gops on 32-bit data, which corresponds to 1.6 GFlops on 32-bit floating-point data.

Figure 3: The IRAM design

Figure 4: The IRAM block diagram


We can see on this block diagram the two ALUs of the IRAM and the vector architecture used for the processor. The vector functional units can be partitioned into several smaller units, depending on the precision required: for example, a 256-bit vector unit can be split into 4 units for 64-bit operations or 8 units for 32-bit operations.
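
The benefit of partitioning can be imitated in ordinary software, a trick known as SWAR (SIMD within a register). The sketch below is my own illustration, not IRAM's hardware: one 64-bit operation carries two independent 32-bit additions, just as one wide vector lane carries several narrow operations per cycle.

    #include <stdint.h>

    /* Two independent 32-bit additions in one 64-bit operation (SWAR).
       Masking each lane before and after the add keeps a carry from
       spilling into the neighbouring lane, mimicking a partitioned ALU. */
    uint64_t paired_add32(uint64_t a, uint64_t b) {
        const uint64_t LO = 0x00000000FFFFFFFFULL;  /* low  lane */
        const uint64_t HI = 0xFFFFFFFF00000000ULL;  /* high lane */
        return (((a & LO) + (b & LO)) & LO)
             | (((a & HI) + (b & HI)) & HI);
    }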

2. The RAW architecture

The RAW architecture was designed at MIT. The approach here is different, because the processing power relies on a mesh topology: several processor tiles are connected together by two kinds of networks. The aim is more a specialized parallel-computing memory with high scalability.

A typical tile contains a RISC (Reduced Instruction Set Computer) processor, 128 KB of SRAM, an FPU (Floating-Point Unit) and a communication processor.

Figure 5: The RAW design

These tiles communicate over two kinds of networks:

- 2 static networks: Communication on these networks is performed by a switch processor at each tile. This processor provides a throughput to the tile processor of one word per cycle, with an additional latency of three cycles between nearest neighbors.

- 2 dynamic networks: When these networks are used, the data is sent to another tile in a packet with a header and a payload (as in an ordinary network); a hypothetical sketch of such a packet follows.
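
For illustration, a dynamic-network message could be modeled as below; the field names and sizes are invented, since the exact RAW header format is not reproduced here. Static routes are fixed at compile time, which is why only the dynamic networks need a header carried with the data.

    #include <stdint.h>

    /* Invented field names and sizes, for illustration only. */
    struct dyn_packet {
        uint8_t  dest_x, dest_y;  /* mesh coordinates of the target tile */
        uint8_t  length;          /* number of payload words that follow */
        uint32_t payload[31];     /* the data, routed hop by hop         */
    };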

The memory is located on the periphery of the mesh and can be accessed through either kind of network.

The expected performance of this design, for a memory running at 300 MHz, is about 3.2 GFlops on 32-bit data.

3. The CRAM architecture

The CRAM (Computational RAM) architecture was designed at the University of Alberta by Professor Elliott, and its development continued at Carleton University in Ottawa. It is probably the longest-established of these designs: four prototypes have already been built since 1996, and the next one is currently under design.

For this architecture, the plan was to offer massively parallel processing power at a low cost, with the highest bandwidth available. The PEs (Processing Elements) are therefore implemented directly at the sense amplifiers, and are very simple 1-bit-serial elements designed to process basic operations. In this way, the CRAM is designed to be a general-purpose, highly parallel processing memory; a small software model of a 1-bit-serial addition follows.
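
To see what 1-bit-serial means, here is a small software model (an illustration of the principle, not the actual CRAM logic): an n-bit addition takes n steps of one full-adder operation each, which is why the time for an add grows linearly with the word width.

    #include <stdio.h>
    #include <stdint.h>

    /* Software model of a 1-bit-serial PE: add two numbers one bit per
       "cycle" using a full adder, keeping the carry between steps. */
    uint32_t serial_add(uint32_t a, uint32_t b, int nbits) {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < nbits; i++) {        /* one bit per cycle  */
            uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
            sum   |= (ai ^ bi ^ carry) << i;     /* full-adder sum bit */
            carry  = (ai & bi) | (carry & (ai ^ bi));
        }
        return sum;
    }

    int main(void) {
        printf("%u\n", serial_add(123456u, 654321u, 32)); /* 777777 */
        return 0;
    }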

Figure 6: The CRAM design

The PEs can all communicate with one another through a left/right shift register.

In the next section we will describe how the CRAM performs against existing computers, and in which applications this design shows its power.

III. The CRAM Architecture

In this part we will focus on the CRAM architecture, for two main reasons. First, it is one of the most advanced projects, with four working prototypes already built, which lends validity to the performance results. Secondly, the CRAM is designed first of all for general-purpose applications and is therefore the most likely to be widely used.

1. Applications

The CRAM is very efficient for parallel computation, especially if the algorithm is parallel-reducible: the greater the degree of parallelism of a computation, the better. Because of the way the PEs are combined with the memory, a bigger computation does not necessarily mean more time: if the problem fits in the memory, it is simply computed by more PEs. As a consequence, more memory equals more available processing power.

However, even when the problem does not lend itself to parallel computing, the fact that the PEs are implemented directly at the sense amplifiers still provides great bandwidth and softens the loss in performance.

In order to get an idea of how this PIM performs, we will compare results from different applications on different architectures.

For the tests and applications, we will focus on three fields: image processing, database searches and multimedia compression. These tests are very interesting for three reasons:

- for the different fields they represent, which demonstrates the general-purpose nature of the CRAM,

- for their different computation models and complexities,

- because they are based on practical problems.

Application details:

Image processing:

I decided to work on two tests, chosen for their different purposes and the different computation models they use.

- Brightness is the typical parallel-reducible algorithm, perfect for CRAM computation. The image is placed in the memory and all the PEs work together on the image's rows. This is pure CRAM computation, thanks to its fully parallel structure (see the sketch after this list).

- Average filter is a totally different algorithm, because it makes the PEs communicate information to their neighbors. The average filter computes the value of a middle pixel by averaging the values surrounding it. Thus, even though the PEs work in parallel, each one also has to pass the result of its computation to its neighbor (the computation is by nature done over 3 columns and requires the bit-shift register).
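
Scalar C reference versions of both benchmarks are shown below (my own sketches of the algorithms, not the CRAM instruction stream; on the CRAM, the loop over pixel positions is what the PEs execute in parallel):

    /* Brightness: each output pixel depends only on its own input pixel,
       so every PE can run the same saturating add with no communication. */
    void brightness(unsigned char *img, int n, int delta) {
        for (int i = 0; i < n; i++) {
            int v = img[i] + delta;
            img[i] = v > 255 ? 255 : (unsigned char)v;
        }
    }

    /* Average filter over three columns: each result needs the left and
       right neighbours, which on CRAM travel through the shift register. */
    void average3(const unsigned char *in, unsigned char *out, int n) {
        for (int i = 1; i < n - 1; i++)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
    }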

Database searches:

In this application we search, for example, for the maximum value in a randomly generated list of numbers. This test is interesting because both the conventional computer and the CRAM have to go through all the elements to decide which one is the biggest: the search takes linear time on both. A sketch of a CRAM-style maximum search follows.
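
For flavour, here is a software imitation of how an associative memory such as CRAM can select the maximum: bit positions are examined from the most significant down, and candidates lacking a 1 that some other candidate has are eliminated. This is the standard associative-search scheme written serially; the exact CRAM microcode may differ.

    #include <stdio.h>
    #include <stdint.h>

    /* Build the maximum bit by bit, from the most significant bit down:
       keep a candidate bit only if at least one element contains all the
       bits kept so far plus this one. The inner loop over elements models
       a single step performed by all PEs at once. */
    uint32_t parallel_max(const uint32_t *v, int n) {
        uint32_t pattern = 0;
        for (int bit = 31; bit >= 0; bit--) {
            uint32_t trial = pattern | (1u << bit);
            int any = 0;
            for (int i = 0; i < n && !any; i++)
                any = (v[i] & trial) == trial;
            if (any)
                pattern = trial;   /* some element supports this bit */
        }
        return pattern;
    }

    int main(void) {
        uint32_t v[] = {17, 42, 999999, 31, 2004};
        printf("%u\n", parallel_max(v, 5));   /* prints 999999 */
        return 0;
    }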

Multimedia compression:

This algorithm has to determine which parts of a movie are actually moving between two frames. As a result, it involves a lot of communication among the PEs, because of the image processing involved. However, the computation also contains many repeated operations, since we work on blocks of pixels, which gives it good parallelism properties. Such a test gives us an idea of how communication among the PEs influences performance when other advantages come into play. A sketch of the kernel follows.
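
The core of such block-motion estimation is the sum of absolute differences (SAD) between a block of the current frame and a shifted block of the previous one; the reference code below is my own illustration, not the CRAM implementation. The SAD is evaluated for many candidate shifts, and the shift with the smallest SAD becomes the motion vector; the candidates are mutually independent, which is where the parallelism comes from.

    #include <stdlib.h>

    /* Sum of absolute differences between an 8x8 block of the current
       frame and the block shifted by (dx, dy) in the previous frame.
       The caller must keep the shifted block inside the image. */
    long sad8x8(const unsigned char *cur, const unsigned char *prev,
                int stride, int dx, int dy) {
        long sad = 0;
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++)
                sad += labs((long)cur[y * stride + x]
                          - (long)prev[(y + dy) * stride + (x + dx)]);
        return sad;
    }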

2. Performances

a. Preliminary observations

Before going on to the tests, let us have a look at the processing complexity of basic operations on the CRAM.

Figure 7: Complexity of basic operations in CRAM

Because the PEs are designed as 1-bit-serial processing units, addition is performed in linear time, as are the logical operations. Multiplication and division, however, are of quadratic order, so for them the theoretical performance drops by a quadratic factor: a 32-bit multiplication costs on the order of 32 × 32 = 1024 bit-steps, against roughly 32 for an addition.

b. Test configurations

The tests were run on three different platforms:

- a Pentium at 133 MHz with 32 MB of RAM,

- a Sun SPARCstation at 167 MHz with 64 MB of RAM,

- 32 MB of CRAM at 200 MHz with 64K PEs, hosted by a Pentium at 133 MHz (simulated).

The CRAM runs were done in a simulator because of the lack of available chips for the tests. Nevertheless, the simulated results are close to reality, as comparisons with a real chip have confirmed.

c. Basic operations test

Before moving on to the application tests, let us have a look at how the CRAM performs on basic operations compared to these computers:

Figure 8: Basic operations comparison