Memory Management

by Glen Lancaster, edited by Dennis Mumaugh

1. Some Memory Management History

The process model of an operating system puts constraints and demands on the memory management scheme that it uses. Consider these two types:

Monoprogramming

Multiprogramming

Early systems required executable programs to be in "one piece" or contiguous when loaded in memory.

Example 1.

C-version        Assembly version     Machine code version
Main.c           Main.s          loc. Main.o
                                      ------
| ....           | ....           00 | ....
| while(...){    | loop: ...      08 | ...
|   ...          |   ...          16 | ...
| }              | jmp loop       24 | jmp 08
|                |                   |

Here we assume each machine instruction is 8 bytes long. Each time an instruction is fetched, the PC is incremented by 8 so that it contains the address of the next instruction. This simple fact about the CPU hardware meant that the instructions of a program HAD to be loaded into consecutive memory locations. You couldn't put part of a program's code in one place in memory and part in another, disconnected part of memory.

Example 2. Subroutine calls and linking.

Suppose Main.c calls a subroutine f in the file Sub.c.

C-version        Assembly version     Machine code version
Main.c           Main.s          loc. Main.o
                                      ------
| ....           | ....           00 | ....
| while(...){    | loop: ...      08 | ...
|   ...          |   ...          16 | ...
| }              | jmp loop       24 | jmp 08
| f();           | call _f        32 | call 0000

C-version        Assembly version     Machine code version
Sub.c            Sub.s           loc. Sub.o
                                      ------
| void f()       | _f: ...        00 | ...
| {              |   ...          08 | ...
|   while(){     | loop2: ...     16 | ...
|     ...        |   ...          24 | ...
|   }            | jmp loop2      32 | jmp 16
| }              | ret            40 | ret
|                |                   |

The link step first concatenates these two object files:

Machine code version before addresses are "FIXED"

loc. Main.o
      ------
  00 | ....
  08 | ...
  16 | ...
  24 | jmp 08
  32 | call 0000    ; This address must be RESOLVED
      ------
  40 | ...          ; Sub.o starts here
  48 | ...
  56 | ...
  64 | ...
  72 | jmp 16       ; This address must be RELOCATED
  80 | ret
     |

Machine code version after addresses are "FIXED"

loc. Main.o
      ------
  00 | ....
  08 | ...
  16 | ...
  24 | jmp 08
  32 | call 40      ; This address was RESOLVED
      ------
  40 | ...          ; Sub.o starts here
  48 | ...
  56 | ...
  64 | ...
  72 | jmp 56       ; This address was RELOCATED
  80 | ret
     |
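The resolve/relocate step above can be sketched in code. This is a toy model, not a real object-file format: the (opcode, target) tuples, INSTR_SIZE, and sub_symbols are all invented for illustration.

```python
INSTR_SIZE = 8   # each instruction occupies 8 bytes, as assumed above

# Each slot is (opcode, target); target is None for plain instructions,
# a number for a local address, or a symbol name for an external reference.
main_o = [("...", None), ("...", None), ("...", None),
          ("jmp", 8),          # jmp loop -> local address 08
          ("call", "_f")]      # unresolved external reference to _f
sub_o  = [("...", None), ("...", None), ("...", None), ("...", None),
          ("jmp", 16),         # jmp loop2 -> local address 16
          ("ret", None)]
sub_symbols = {"_f": 0}        # _f is at offset 0 within Sub.o

def link(main_obj, sub_obj, symbols):
    """Concatenate two object files, RESOLVING external symbols in the
    first and RELOCATING internal addresses in the second."""
    base = len(main_obj) * INSTR_SIZE   # Sub.o starts where Main.o ends
    out = []
    for op, tgt in main_obj:
        if isinstance(tgt, str):        # RESOLVE the external symbol
            tgt = base + symbols[tgt]
        out.append((op, tgt))
    for op, tgt in sub_obj:
        if tgt is not None:             # RELOCATE the internal address
            tgt = base + tgt
        out.append((op, tgt))
    return out

linked = link(main_o, sub_o, sub_symbols)
print(linked[4])   # ('call', 40) - the call was RESOLVED
print(linked[9])   # ('jmp', 56)  - Sub.o's jmp was RELOCATED
```

With 5 instructions in Main.o, Sub.o lands at base address 40, reproducing the "after" listing above.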

1a. Monoprogramming

Simple monoprogramming systems can load compiled/linked programs into the same location each time the program is executed. In the example above, the program would have to be loaded at location 0 each time. If it were instead loaded at location 100, we would get:

Machine code version after addresses are "FIXED"

loc. Main.o
       ------
  100 | ....
  108 | ...
  116 | ...
  124 | jmp 08      ; WRONG target now
  132 | call 40     ; This address was RESOLVED at link time
       ------
  140 | ...         ; Sub.o starts here
  148 | ...
  156 | ...
  164 | ...
  172 | jmp 56      ; This address was RELOCATED at link time
  180 | ret
      |

The jmp instructions and the call instruction will now go to the wrong addresses. In monoprogramming, this is not a big problem, since the system has only one program in memory to execute at a time. It can always be loaded at the same place. Memory protection is also not a big problem, at least as far as user programs are concerned, since by definition there is only ONE user program in memory at a time.

1b. Multiprogramming

In multiprogramming, two or more programs are loaded in memory to execute at the same time. Clearly, we can't load both of them starting at location 00. Therefore, multiprogramming may require some extra effort on the part of the loader to "FIX" addresses.

Memory protection is also a problem in a multiprogramming system. What is to keep the code of one program from containing an instruction that has a jmp to an address that is in another program's code in memory?

Addresses can be logical (labels) or numerical (but not yet relocated or resolved); at some point they are translated to physical numerical memory addresses. This process of associating a final physical memory address with a logical address in the program (C, assembler, or machine object file) is called "address binding".

1c. Address Binding

There are different times at which address binding can occur, ranging from very early to very late:

Compile/Link time.

Load time. (When the whole program is loaded.)

Execution time (Each time each instruction is executed.)

Compile/Link Time.

In a monoprogramming system, address binding could occur at compile/link time. All program addresses would be relocated or resolved by the linker. The loader would be a "simple" loader: it would always load any executable program from disk into the same starting location (00 in our example).

Load Time.

In a multiprogramming system, the linker could resolve calls at link time, but a "relocating loader" would be needed. It would relocate jmp and call target addresses by adding to each one the starting address of where the program was loaded. This starting address will, in general, be different each time the program is executed, as it depends on what memory is not being used at the time.
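A relocating loader's job can be sketched as follows. The function name, the (opcode, target) representation, and the list of relocatable slots are invented for illustration; the program data is the linked example from section 1, loaded this time at address 100.

```python
def relocating_load(code, reloc_indices, load_base):
    """Relocate jmp/call targets by adding the load address to each one.
    code: list of (opcode, target) slots; reloc_indices: which slots hold
    relocatable addresses; load_base: where the program lands in memory."""
    loaded = list(code)
    for i in reloc_indices:
        op, tgt = loaded[i]
        loaded[i] = (op, tgt + load_base)
    return loaded

# The linked program's control-transfer instructions, bound for address 0:
program = [("jmp", 8), ("call", 40), ("jmp", 56), ("ret", None)]

# Loading at 100 fixes every relocatable target by +100:
loaded = relocating_load(program, reloc_indices=[0, 1, 2], load_base=100)
print(loaded)   # [('jmp', 108), ('call', 140), ('jmp', 156), ('ret', None)]
```

Unlike the broken listing in section 1a, the targets now agree with where the code actually sits.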

Execution Time.

An alternative to binding at load time is to wait even longer to translate jmp and call target addresses to physical addresses. They could be translated just as the jmp or call instruction is about to be executed. The loader would just be a simple loader again. It would not resolve any jmp or call target addresses. So they would be "wrong" and still need to be fixed before the instruction is executed. How? Good question. This is what virtual memory systems (discussed below) do.

Position Independent Code [added by dlm]

Most modern computer architectures have an instruction set that supports Position Independent Code (PIC) and Position Independent Data (PID). This takes advantage of the design of the instructions, especially the addressing. The instruction address field is composed of an address and register combination. The instruction looks like:

opcode / mode / register1 / address1 / register2 / address2

where opcode is the instruction,
      mode is how the address is calculated,
      register is a general-purpose register, and
      address is the address of the operand.

One mode can be immediate, in which the address field is used directly as the value; another is PC-relative, in which the address field is used as a signed offset from the PC.

If the compiler generates PIC, then it is possible to link an executable that does not need to be loaded at any specific location in order to execute. Each chip designer tries to build this feature into the instruction set, and most compiler developers provide for this capability.

Sample instructions:

mov address, r0 => mov offset(pc), r0

mov #123, r0 => movi 123, r0

2. Best-Fit, Worst-Fit, First-Fit, etc.: the allocation problem for contiguous memory systems

We are given a list of memory requests. The problem is to find free memory that is big enough for each request and that is in one piece (contiguous). If there are several choices, is one choice better than another? There are at least three possible algorithms: best-fit, first-fit, and worst-fit. We will look at an example in class. None of the algorithms outperforms the others in all situations. The main problem is that memory gets "fragmented" into small contiguous pieces that cannot be allocated. Several free blocks of memory cannot be combined to satisfy a memory request, since they do not form ONE contiguous block as is required for program code.
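The three algorithms can be sketched as follows, assuming a free list of (start, size) blocks; the function names and the sample free list are illustrative, and real allocators must also split blocks and coalesce freed neighbors.

```python
def first_fit(free_list, request):
    """Return the first block big enough for the request, else None."""
    for block in free_list:
        if block[1] >= request:
            return block
    return None

def best_fit(free_list, request):
    """Return the smallest sufficient block (minimizes leftover)."""
    candidates = [b for b in free_list if b[1] >= request]
    return min(candidates, key=lambda b: b[1]) if candidates else None

def worst_fit(free_list, request):
    """Return the largest block (leaves the biggest usable remainder)."""
    candidates = [b for b in free_list if b[1] >= request]
    return max(candidates, key=lambda b: b[1]) if candidates else None

free = [(300, 100), (600, 300), (1000, 150)]   # (start, size) pairs
print(first_fit(free, 120))   # (600, 300): first block with size >= 120
print(best_fit(free, 120))    # (1000, 150): smallest sufficient block
print(worst_fit(free, 120))   # (600, 300): largest block
```

Note that a request of 400 bytes fails here even though 550 bytes are free in total: that is external fragmentation.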

Question: How does the process model of MS-DOS affect the memory requests and the memory releases? How does that affect the way in which you would allocate memory for processes?

3. Virtual Memory

3.1 Addresses bound at execution time

3.2 MMU sits between MAR and other CPU components

The Memory Management Unit (MMU) is a hardware component of the CPU (if present) that is responsible for translating logical (also called virtual) addresses to physical addresses.

3.3 Contiguity requirement dropped.

If the memory need not be contiguous, then it may consist of two or more contiguous pieces. There are then two major alternatives:

a) The pieces are allowed to be of varying sizes

or b) the pieces are all the same size.

In the first case, the pieces are called segments. In the second, pages. We will confine the discussion to the paged virtual memory version for the rest of this lecture.

So physical memory is divided into EQUAL-sized units called page frames. A typical size is 1K bytes. Executable program code files are then split into the same size units (1K bytes), which are called "virtual pages" or "logical pages" or simply "pages". So one page fits exactly into one page frame.

When using virtual memory, the operating system must keep track of the free page frames, instead of the list of free blocks maintained under contiguous memory allocation. It need not keep track of sizes, since all page frames are the same fixed size.

Picture for contiguous storage:

(logical)                    (physical)         size of free
address  program (code)      address  physical memory
   0  +------+                  0  +------+
      |      |                     |      |
      +      +                     +      +        300
      |      |                     | free |
      +      +                     +      +
      |      |                     |      |
      +      +                300  +------+
      |      |                     | used |
      +      +                400  +------+        100
      |      |                     | free |
 499  +------+                500  +------+
                                   | used |
                              600  +------+
                                   |      |        300
                                   +      +
                                   | free |
                                   +      +
                                   |      |
                              899  +------+

Picture for paged storage:

(logical)                    (physical)
address  program (code)      address  physical memory
   0  +------+                  0  +------+
      |      |                     | free |
 100  +------+                     +------+
      |      |                     | free |
 200  +------+                     +------+
      |      |                     | free |
 300  +------+                300  +------+
      |      |                     | used |
 400  +------+                400  +------+
      |      |                     | free |
 499  +------+                500  +------+
                                   | used |
                              600  +------+
                                   | free |
                                   +------+
                                   | free |
                                   +------+
                                   | free |
                              899  +------+

Assume the page size is 100 bytes for this example. We divide the program into pages; memory is already divided into page frames. Suppose that the free list of frames is (0,1,2,4,6,7,8). We can use the first five to hold our program:

page                 page frame  contents
   +------+              +--------+
 0 |      |            0 | page 0 |
   +------+              +--------+
 1 |      |            1 | page 1 |
   +------+              +--------+
 2 |      |            2 | page 2 |
   +------+              +--------+
 3 |      |            3 |        |
   +------+              +--------+
 4 |      |            4 | page 3 |
   +------+              +--------+
                       5 |        |
                         +--------+
                       6 | page 4 |
                         +--------+
                       7 |        |
                         +--------+
                       8 |        |
                         +--------+

3.4 Address translation

A virtual address like 425 can be thought of as consisting of two parts: the page number and the offset from the beginning of that page.

page number = floor(address/page_size)

offset = address mod page_size

So for virtual address 425,

page number = floor(425/100) = 4

offset = 425 mod 100 = 25

For each page of the program, the MMU keeps track of which page frame has been allocated to it. For the example above, page 4 has been allocated page frame 6. To find out what physical address has been allocated to a virtual address like 425 we just have to add the offset to the address of the beginning of the page frame. But the beginning page frame address is just the page frame number times the page size. So

physical address = page_frame * page_size + offset

E.g., page 4 is loaded in page frame 6. So an instruction at virtual address 425 is on page 4 and offset 25. Its physical address is then

physical address = 6 * 100 + 25 = 625.

3.5 MMU and page tables (abstract and concrete)

To do the translation above, an abstract table, PT - the page table - is used. Its contents are page frame numbers, and it is indexed by the virtual page number, so that PT[n] contains the page frame number that was allocated to virtual page n. The page table for the example above would be:

virtual page   page frame
        +------+
      0 |  0   |     PT[0] = 0
        +------+
      1 |  1   |     PT[1] = 1
        +------+
      2 |  2   |     PT[2] = 2
        +------+
      3 |  4   |     PT[3] = 4
        +------+
      4 |  6   |     PT[4] = 6
        +------+

So the address translation of virtual address va = 425 is calculated as follows

1. Extract the page number and offset from the virtual address:

n = page number = floor(va/page_size) = floor(425/100) = 4

off = offset = va mod page_size = 425 mod 100 = 25

2. Then use the page table to look up the page_frame which holds this page.

page_frame = PT[n] = PT[4] = 6

3. Calculate the physical address

physical address = page_frame * page_size + offset

= 6*100 + 25

= 625
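The three steps above can be sketched in code, using the page size and page table from this example (the function name is illustrative):

```python
PAGE_SIZE = 100
PT = {0: 0, 1: 1, 2: 2, 3: 4, 4: 6}   # virtual page -> page frame

def translate(va):
    """Translate a virtual address to a physical address."""
    n = va // PAGE_SIZE                # 1. page number
    off = va % PAGE_SIZE               #    and offset
    frame = PT[n]                      # 2. look up the page frame
    return frame * PAGE_SIZE + off     # 3. physical address

print(translate(425))   # 625, as calculated above
```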

The abstract page table above might be concretely implemented as a special internal memory in the MMU, say 8 locations starting at MR0. Suppose also that each location can hold a 4-byte integer.

           +------+
MR0        |  0   |
           +------+
MR0 + 1*4  |  1   |
           +------+
MR0 + 2*4  |  2   |
           +------+
MR0 + 3*4  |  4   |
           +------+
MR0 + 4*4  |  6   |
           +------+
MR0 + 5*4  |  ?   |
           +------+
MR0 + 6*4  |  ?   |
           +------+
MR0 + 7*4  |  ?   |
           +------+

So the page_frame lookup in step 2. above for virtual page 4 would be

page_frame = contents(MR0 + 4*4) = 6

and more generally, the page_frame that holds virtual page n would be

page_frame = contents(MR0 + n*4)

3.6 Benefits

Memory allocation is simple again, even in a multiprogramming environment. We no longer need complicated memory allocation algorithms like best-fit, first-fit, etc. If 20 pages are needed, then ANY free 20 page frames will do. They need not be consecutive page frames.

Fragmentation revisited

We still have a little wasted space: "internal fragmentation." On average, each program will not use half of its last page. So, just as with cluster sizes for files, the page size should not be too large, in order to avoid wasting disk and memory space.

3.7 Costs

3.7.1 MMU makes the CPU more complex

This may make it more expensive. It also may contribute to tradeoffs regarding other components in the CPU - fewer user registers, etc.

3.7.2 Problem if page table is too large to fit in MMU registers

In this case, the MMU is organized a little differently. It will contain a special associative memory or cache to hold recently used page table entries.

For example, suppose the actual page table can hold up to 50 entries, but the MMU associative translation cache can hold only 10 items. Each cache entry consists of a pair: a key and a value. The value is a page_frame number. The key is part of the virtual page number that is stored in that page_frame. Since the cache has only ten entries, there are not enough for each possible virtual page, n. So n is split into two parts, floor(n/10) and n mod 10. For a number between 00 and 49, these are just the left and right digits. The right digit is used as an index into the cache, C, and then the key is compared with the left digit. If it matches, this is the entry for n. If the key does not match, the cache does not contain an entry for n.

For example, to look for the entry for virtual page 04, the right digit is 4, so we look up C[4]. Now C[4].key = 0, which matches the left digit. So the

page_frame number for virtual page 04 = C[4].value = 6.

This is a cache "hit" since the entry for 04 was in the cache.

An example of a cache miss would occur if we look up the entry for virtual page 35. Since 35 has the right digit = 5, we first compare C[5].key (= 4) with the left digit (= 3). Since they are not the same, the cache does not contain the entry for virtual address 35. Instead, it does contain the entry for virtual address 45.

index   key   value   valid
      +-----+-------+-------+
   0  |  4  |   6   |   1   |
      +-----+-------+-------+
   1  |  4  |   7   |   1   |
      +-----+-------+-------+
   2  |  4  |  12   |   1   |
      +-----+-------+-------+
   3  |  1  |   8   |   1   |
      +-----+-------+-------+
   4  |  0  |   6   |   1   |
      +-----+-------+-------+
   5  |  4  |   9   |   1   |
      +-----+-------+-------+
   6  |  3  |  11   |   0   |
      +-----+-------+-------+
   7  |  5  |  22   |   0   |
      +-----+-------+-------+
   8  |  5  |   3   |   0   |
      +-----+-------+-------+
   9  |  5  |  13   |   0   |
      +-----+-------+-------+

Note that for this cache organization, only ONE of the virtual pages that share a given cache index can have an entry in the cache at a time; for index 5, these are pages 05, 15, 25, 35, and 45. If we need the entry for 35, the MMU would have to read the entry from the page table in memory. Suppose virtual page 35 is in page frame 15. Then after reading this entry, the MMU would change the entry in the cache by setting

C[5].key = 3 and C[5].value = 15
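The direct-mapped lookup and replacement just described might be sketched as follows. The Entry class and lookup function are invented for illustration; the cache contents mirror the table above, and a dict stands in for the in-memory page table.

```python
CACHE_SLOTS = 10

class Entry:
    """One translation cache slot: key, page_frame value, valid bit."""
    def __init__(self, key, value, valid):
        self.key, self.value, self.valid = key, value, valid

# The cache contents from the table above (indices 0..9):
C = [Entry(4, 6, 1), Entry(4, 7, 1), Entry(4, 12, 1), Entry(1, 8, 1),
     Entry(0, 6, 1), Entry(4, 9, 1), Entry(3, 11, 0), Entry(5, 22, 0),
     Entry(5, 3, 0), Entry(5, 13, 0)]

def lookup(n, page_table):
    """Return (page_frame, hit) for virtual page n."""
    slot, key = n % CACHE_SLOTS, n // CACHE_SLOTS   # right digit, left digit
    e = C[slot]
    if e.valid and e.key == key:
        return e.value, True                 # cache hit
    frame = page_table[n]                    # miss: read page table in memory
    C[slot] = Entry(key, frame, 1)           # replace the conflicting entry
    return frame, False

page_table = {35: 15}                        # page 35 lives in frame 15
print(lookup(4, page_table))                 # (6, True): hit, C[4].key == 0
print(lookup(35, page_table))                # (15, False): miss, replaces C[5]
print(lookup(35, page_table))                # (15, True): now a hit
```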

3.7.3 Effective Memory Access Time using an Associative Translation Cache

When the MMU hardware using an associative translation cache is presented with a virtual address to translate, it tries to get the page table entry from memory at the same time it searches the associative translation cache. If the entry is found in the cache, the MMU cancels the memory read. If not, it waits until it gets the page table entry from memory, puts that entry in the appropriate translation cache slot, and calculates the correct physical memory address as usual.

Suppose

a) the translation buffer "search" takes 20 ns

b) a memory read takes 100 ns

c) for 90% of translations the entry is found in the buffer.

What is the effective time to access data, given the virtual (not the physical) address? How much more time does the MMU add to each memory access? It's more than 20 ns.

Answer:

a) For a "hit" (the entry is in the MMU translation buffer cache) the time is

20 ns (search the cache for the entry for this virtual address and do the translation in 0 time)

100 ns fetch the item from the physical memory address just calculated.

total 120 ns

b) For a "miss" (the entry is not in the MMU translation buffer cache) the time is

100 ns (get the correct page table entry from memory)

0 ns translate the address

100 ns get the item from the physical memory address just calculated

total 200 ns.

The effective access time is the weighted average of these two, weighted according to how often each occurs:

effective access time =

hit_frequency * total_access_time_for_a_hit +

miss_frequency * total_access_time_for_a_miss

= .90 * 120 ns + .10 * 200 ns

= 108 ns + 20 ns

= 128 ns

Without the MMU and virtual addresses, a memory access would not need to translate an address, and the access time would just be 100 ns. Using virtual memory addresses and the MMU changes this to 128 ns, an increase of 28 ns for each virtual address access.

Even if the hit ratio is 100%, the effective access time will be 120 ns. At the other extreme, if there is no cache for holding page table entries in the MMU, the hit ratio will be 0 and the effective access time using virtual addresses will be 200 ns, twice the 100 ns required for a physical memory access. We want the hit ratio to be high so that the effective access time is close to the 120 ns minimum.
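The weighted-average calculation above can be sketched as a small function; the name and the whole-number-percentage convention are illustrative.

```python
def effective_access_time(hit_pct, search_ns=20, mem_ns=100):
    """Effective memory access time, with hit_pct the cache hit rate
    given as a whole-number percentage (0..100)."""
    hit_time = search_ns + mem_ns    # search the cache, then fetch the data
    miss_time = mem_ns + mem_ns      # fetch the PT entry, then the data
    return (hit_pct * hit_time + (100 - hit_pct) * miss_time) / 100

print(effective_access_time(90))    # 128.0 ns, as above
print(effective_access_time(100))   # 120.0 ns (every translation hits)
print(effective_access_time(0))     # 200.0 ns (no translation cache at all)
```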

3.8 Partial loading

3.8.1 Requirements to support partial loading

a) MMU page table must now indicate whether the page is loaded or not. Add a bit field to each entry, the valid bit. If the page is loaded in memory, the valid bit is 1, otherwise valid is 0.

b) Must have a fetch policy - how many and which pages to load when a page is referenced that has not been loaded.

c) Must have a replacement policy - if there is no room for the new page, one that is not being used must be replaced. Which one?

d) Efficient handling of references to pages that are not loaded: page faults (like interrupts, but a little different).

The modified page table now looks like this:

virtual  valid
page     bit    page frame
   +-----+------+
 0 |  1  |  0   |     PT[0].valid = 1, PT[0].frame = 0
   +-----+------+
 1 |  1  |  1   |     PT[1].valid = 1, PT[1].frame = 1
   +-----+------+
 2 |  0  |  ?   |     PT[2].valid = 0, page not loaded
   +-----+------+
 3 |  1  |  4   |     PT[3].valid = 1, PT[3].frame = 4
   +-----+------+
 4 |  1  |  6   |     PT[4].valid = 1, PT[4].frame = 6
   +-----+------+

3.8.2 Page faults

A page fault occurs when a page is referenced which is not currently loaded in memory.

This can happen in the fetch step of the CPU cycle when the next instruction is to be loaded from a memory address and the page containing that part of the code is not in memory.

It can also occur when an instruction's operands are being fetched just prior to execution of the instruction.

Or it can happen when an instruction is being executed in the CPU cycle - for example, an store instruction that moves the value in a CPU register to a (virtual) memory location, but that virtual memory location is not currently in memory.