Operating Systems

Lecture 8:

Memory Management

Physical memory                        Abstraction: virtual memory

No protection                          Each program isolated from all others and from the OS

Limited size                           Illusion of infinite memory

Sharing visible to programs            Transparent -- can't tell if memory is shared

Easy to share data between programs    Ability to share code, data

Hardware support for protection

How is protection implemented?

Hardware support: address translation, dual-mode operation

Address translation

Address space: literally, all the addresses a program can touch. All the state that a program can affect or be affected by.

Restrict what a program can do by restricting what it can touch!

Hardware translates every memory reference from virtual addresses to physical addresses; software sets up and manages the mapping in the translation box.

[Figure: in user mode, every virtual address the CPU issues passes through the translation box to become a physical address; in kernel mode, addresses can go to memory untranslated.]

Address Translation in Modern Architectures

Two views of memory:

view from the CPU -- what program sees, virtual memory

view from memory -- physical memory

Translation box converts between the two views.

Translation helps implement protection: a program has no way to even name another program's addresses, and no way to touch operating system code or data.

Translation can be implemented in any number of ways -- typically, by some form of table lookup (we'll discuss various options for implementing the translation box later). Separate table for each user address space.

An application cannot be allowed to modify its own translation table; if it could, it could gain access to all of physical memory. This has to be restricted somehow: dual-mode operation provides the control.

when in the OS, can do anything (kernel-mode)

when in a user program, restricted to only touching that program's memory (user-mode)

Hardware requires CPU to be in kernel-mode to modify address translation tables.

OS runs in kernel mode (untranslated)

User programs run in user mode (translated)

Want to isolate each address space so its behavior can't do any harm, except to itself.

How do the kernel and user interact?

Kernel -> user:

To run a user program,

create a process/thread to:

allocate and initialize address space control block

read program off disk and store in memory

allocate and initialize translation table (point to program memory)

run program (or to return to user level after calling the kernel):

set machine registers

set hardware pointer to translation table

set processor status word (user vs. kernel)

jump to start of program

User -> kernel:

How does the user program get back into the kernel?

Voluntarily user -> kernel: system call -- a special instruction to jump to a specific operating system handler. It is just like doing a procedure call into the operating system kernel -- the program asks the OS kernel to do something on its behalf.

Can the user program call any routine in the OS? No -- just the specific ones the OS says are OK. The handler always starts running at the same place; otherwise, problems!

Involuntarily user -> kernel: hardware interrupt, or a program exception such as a bus error, segmentation fault, or page fault.

On system call, interrupt, or exception: hardware atomically

sets processor status to kernel

changes execution stack to kernel

saves current program counter

jumps to handler in kernel

handler saves previous state of any registers it uses

How does the system call pass arguments?

a. Use registers

b. Write into user memory, kernel copies into its memory

One complication: user addresses are translated, while kernel addresses are untranslated, so the kernel must translate the user's addresses itself before copying.

PROTECTION

Base and Bounds

Base and bounds: each program is loaded into a contiguous region of physical memory, but with protection between programs. First built in the Cray-1.

[Figure: the logical address is compared against the BOUND register; if it is smaller, it is added to the BASE register to form the physical address; otherwise, trap with an addressing error.]

Hardware Implementation of Base and Bounds Translation

Program has illusion it is running on its own dedicated machine, with memory starting at 0 and going up to size = bounds. Like linker-loader, program gets contiguous region of memory. But unlike linker-loader, protection: program can only touch locations in physical memory between base and base + bounds.

[Figure: the logical address space runs from 0 up to bound; with Base = 4000, it occupies physical addresses 4000 through 4000 + bound.]

Virtual and Physical Memory Views in Base and Bounds System

Provides a level of indirection: the OS can move bits around behind the program's back -- for instance, if the program needs to grow beyond its bounds, or if the OS needs to coalesce fragments of memory.

Stop program, copy bits, change base and bounds registers, restart.

Only the OS gets to change the base and bounds! Clearly, the user program can't, or else protection is lost.

Hardware cost:

2 registers

adder, comparator

Plus, it slows down the hardware, because the add and compare take time on every memory reference.
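The compare-and-add check can be sketched in a few lines (a minimal Python model of the hardware; the base and bound values are made up):

```python
BASE = 4000     # physical start of this program's region (example value)
BOUND = 0x1000  # size of the region (example value)

def translate(vaddr):
    """Base and bounds: compare, then add, on every memory reference."""
    if vaddr >= BOUND:        # comparator
        raise MemoryError("trap: addressing error")
    return BASE + vaddr       # adder

print(translate(0x20))        # 4000 + 32 = 4032
```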

Base and bounds is simple and fast but it has the following disadvantages:

1.  Hard to share between programs. For example, suppose two copies of "vi" are running: we want to share the code, and only the data and stack need to be different. We can't do this with base and bounds!

2.  Hard to grow the address space. We want the stack and heap to grow toward each other; instead we have to allocate for maximum future needs.

3.  Needs complex memory allocation, such as first fit, best fit, or the buddy system. In the worst case, large chunks of memory must be shuffled to fit a new program.

Solution to 1 & 2: segmentation.

Solution to 1 & 3: paging.

Solution to 1 & 2 & 3: segmentation plus paging!

8.2.3.2 Segmentation

A segment is a region of logically contiguous memory. Idea is to generalize base and bounds, by allowing a table of base & bound pairs.

[Figure: the virtual address is split into segment # and offset; the segment table supplies base and bound, and an offset beyond the segment's bound raises an addressing error.]

For example, what does translation look like with this segment table, in virtual memory and physical memory? Assume a 2-bit segment ID and a 12-bit segment offset.

Virtual Segment #   Physical Segment Start   Segment Size

Code (0)            0x4000                   0x700

Data (1)            0x0000                   0x500

- (2)               0x0000                   0x0

Stack (3)           0x2000                   0x1000

Although the virtual address space appears to have gaps in it, each segment is mapped to contiguous locations in physical memory; there may be gaps between segments. A correct program will never address the gaps; if it does, trap to the kernel. Minor exception: the stack and heap can grow.

Segmentation is efficient for sparse address spaces, and it is easy to share whole segments (for example, the code segment). A protection mode can be added to each segment-table entry: the code segment would be read-only (only execution and loads allowed), while the data and stack segments would be read-write (stores allowed). But segmentation still needs complex memory allocation (first fit, best fit, etc.), and re-shuffling to coalesce free fragments if no single free space is big enough for a new segment.
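Translation under the example segment table above can be sketched as (a minimal Python model; the 2-bit segment ID selects a row, and the 12-bit offset is checked against the segment size):

```python
# Segment table from the example: (start, size), indexed by the 2-bit segment ID.
SEG_TABLE = [
    (0x4000, 0x700),   # 0: code
    (0x0000, 0x500),   # 1: data
    (0x0000, 0x000),   # 2: unused
    (0x2000, 0x1000),  # 3: stack
]

def translate(vaddr):
    seg = vaddr >> 12         # top 2 bits: segment ID
    offset = vaddr & 0xFFF    # low 12 bits: offset within the segment
    base, size = SEG_TABLE[seg]
    if offset >= size:        # reference into a gap: trap to kernel
        raise MemoryError("trap: segmentation fault")
    return base + offset

print(hex(translate(0x0240)))  # code segment:  0x4000 + 0x240 = 0x4240
print(hex(translate(0x3002)))  # stack segment: 0x2000 + 0x2   = 0x2002
```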

How do we make memory allocation simple and easy?

8.2.3.3 Paging

Allocate physical memory in terms of fixed size chunks of memory, or pages.

Simpler, because allows use of a bitmap. What's a bitmap?

001111100000001100

Each bit represents one page of physical memory -- 1 means allocated, 0 means unallocated. Much simpler than base and bounds or segmentation.
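Allocation against such a bitmap is a simple scan (a sketch; a real OS would also keep a hint of where to start searching):

```python
bitmap = list("001111100000001100")  # '1' = page frame allocated, '0' = free

def alloc_page():
    """Scan for a free frame, mark it allocated, return its frame number."""
    for frame, bit in enumerate(bitmap):
        if bit == "0":
            bitmap[frame] = "1"
            return frame
    raise MemoryError("out of physical memory")

def free_page(frame):
    bitmap[frame] = "0"

print(alloc_page())   # frame 0 is the first free frame in the bitmap above
```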

Operating system controls mapping: any page of virtual memory can go anywhere in physical memory.

Each address space has its own page table, in physical memory. Hardware needs two special registers -- pointer to physical location of page table, and page table size. Example: suppose page size is 4 bytes.

[Figure: page table translation -- the virtual address is split into a virtual page # and offset; the page # indexes the page table (an out-of-range page # is an error) to get a physical page frame, which is combined with the offset to form the physical address.]
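The page table lookup can be sketched as (a minimal Python model using the 4-byte pages of the example; the page table contents are made up):

```python
PAGE_SIZE = 4              # tiny pages, as in the example above

# Hypothetical page table contents: virtual page number -> physical frame.
PAGE_TABLE = [3, 0, 2]     # the page-table size register would hold 3

def translate(vaddr):
    vpn = vaddr // PAGE_SIZE        # virtual page number
    offset = vaddr % PAGE_SIZE
    if vpn >= len(PAGE_TABLE):      # past the page-table size: error
        raise MemoryError("trap: addressing error")
    return PAGE_TABLE[vpn] * PAGE_SIZE + offset

print(translate(6))   # vpn 1, offset 2 -> frame 0 -> physical address 2
```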

Questions:

1.  What if the page size is very small? For example, with 512-byte pages, lots of space is taken up by page table entries.

2.  What if the page size is really big? Why not an infinite page size? Unused space inside each page would be wasted -- an example of internal fragmentation.

With segmentation need to re-shuffle segments to avoid external fragmentation. Paging suffers from internal fragmentation.

3.  What if the address space is sparse? For example, on UNIX, code starts at 0 and the stack starts near 2^31 - 1. With 1 KB pages, that means about 2 million page table entries, because the table has to map the entire virtual address space.
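The arithmetic behind the 2 million estimate:

```python
address_space = 2**31   # the stack starts near 2^31 - 1
page_size = 1024        # 1 KB pages
entries = address_space // page_size
print(entries)          # 2097152 -- about 2 million page table entries
```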

Paging makes memory allocation simple, and sharing is easy, but it needs big page tables if the address space is sparse.

Is there a solution that allows simple memory allocation, easy to share memory, and is efficient for sparse address spaces?

Combining paging and segmentation?

8.2.3.4 Paged Segmentation (Multi-level translation)

Multi-level translation. Use tree of tables. Lowest level is page table, so that physical memory can be allocated using a bitmap. Higher levels are typically segmented. For example, 2-level translation:

[Figure: the virtual address is split into segment #, page #, and offset; the segment table entry holds a page table pointer and page table size, the page table entry holds the physical page, and the offset completes the physical address.]

Just like recursion -- could have any number of levels. Most architectures today do some flavor of this.

Questions:

1.  Where are the segment tables/page tables stored? Segment tables are usually in special CPU registers, because they are small. Page tables are usually in main memory.

2.  How do we share memory? Can share entire segment, or a single page.

Multilevel translation only needs to allocate as many page table entries as are actually used; in other words, sparse address spaces are easy, memory allocation is easy, and sharing can be done at the segment or page level. But it has some disadvantages as well: a pointer is needed per page (typically 4 KB - 16 KB pages today), page tables need to be contiguous, and two lookups are needed per memory reference.
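The two lookups can be sketched as (a Python model with hypothetical table contents; None marks an unmapped segment, which is why sparse address spaces are cheap):

```python
PAGE_SIZE = 4

# Segment table entries point at page tables; None = no page table allocated.
SEGMENT_TABLE = [
    [2, 5],   # segment 0: vpn 0 -> frame 2, vpn 1 -> frame 5
    None,     # segment 1: unmapped -- costs no page table at all
    [7],      # segment 2: a single page
]

def translate(seg, vpn, offset):
    page_table = SEGMENT_TABLE[seg]     # first lookup: segment table
    if page_table is None or vpn >= len(page_table):
        raise MemoryError("trap: invalid reference")
    return page_table[vpn] * PAGE_SIZE + offset   # second lookup: page table

print(translate(0, 1, 3))   # frame 5 -> physical 5*4 + 3 = 23
```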

TLB: Translation Lookaside Buffer (a kind of page table cache)

Generic Issues in Caching

Cache hit: the item is in the cache.

Cache miss: the item is not in the cache; have to do the full operation.

Effective access time = P(hit) * cost of hit + P(miss) * cost of miss
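A quick worked example of the formula (the hit rate and costs are made-up numbers):

```python
def effective_access_time(p_hit, cost_hit, cost_miss):
    return p_hit * cost_hit + (1 - p_hit) * cost_miss

# 98% hit rate, 10 ns on a hit, 200 ns for the full operation on a miss:
print(round(effective_access_time(0.98, 10, 200), 1))   # 13.8 ns
```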

1. How do you find whether item is in the cache (whether there is a cache hit)?

2. If it is not in cache (cache miss), how do you choose what to replace from cache to make room?

3. Consistency -- how do you keep cache copy consistent with real version?

Use caching at each level of the memory hierarchy to provide the illusion of a terabyte of storage with register access times. This works because programs aren't random.

Exploit locality: computers behave in the future like they have in the past.

Temporal locality: will reference the same locations as accessed in the recent past.

Spatial locality: will reference locations near those accessed in the recent past.

Caching applied to address translation

Often reference same page repeatedly, why go through entire translation each time?

Translation buffer, or translation lookaside buffer: a hardware table of frequently used translations, to avoid going through the page table lookup in the common case. Typically on chip, so access takes 5-10 ns instead of several hundred ns for main memory.

How do we tell if needed translation is in TLB?

1. Search table in sequential order

2. Direct mapped: restrict each virtual page to use specific slot in TLB
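Option 2 can be sketched as (a Python model of a tiny direct-mapped TLB; the sizes and contents are made up):

```python
TLB_SLOTS = 4
tlb = [None] * TLB_SLOTS   # each slot holds (virtual page #, physical frame)

def tlb_lookup(vpn):
    """Direct mapped: vpn may live only in slot vpn % TLB_SLOTS."""
    entry = tlb[vpn % TLB_SLOTS]
    if entry is not None and entry[0] == vpn:
        return entry[1]    # hit
    return None            # miss: fall back to the full page table lookup

def tlb_fill(vpn, frame):
    tlb[vpn % TLB_SLOTS] = (vpn, frame)   # may evict a conflicting entry

tlb_fill(6, 9)
print(tlb_lookup(6))   # hit:  9
print(tlb_lookup(2))   # miss: None (slot 2 is held by vpn 6)
```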

Consistency between TLB and page tables

What happens on context switch?

Have to invalidate entire TLB contents. When new program starts running, will bring in new translations. Alternatively, include process id tag in TLB comparator. Have to keep TLB consistent with whatever the full translation would give.

What if translation tables change? For example, to move page from memory to disk, or vice versa. Have to invalidate TLB entry.

Relationship between TLB and hardware memory caches

Can put a cache of memory values anywhere in this process. If between translation box and memory, called a "physically addressed cache". Could also put a cache between CPU and translation box: "virtually addressed cache".

Virtual memory is a kind of caching: we're going to talk about using the contents of main memory as a cache for disk.

Page Replacement Algorithms:

FIFO: First-In-First-Out (suffers from Belady's anomaly)

LRU: Least Recently Used (implemented with counters, stacks, etc.)

NRU: Not Recently Used

Optimal: replace the page that won't be referenced for the longest time (needs future knowledge, so it serves as a benchmark)

Clock algorithm: arrange physical pages in a circle, with a clock hand.

1. Hardware keeps use bit per physical page frame

2. Hardware sets the use bit on each reference. If the use bit isn't set, the page hasn't been referenced in a long time.

3. On page fault: Advance clock hand (not real time)

check use bit

1 -> clear, go on

0 -> replace page

Will it always find a page, or loop infinitely? Even if all use bits are set, the hand will eventually loop all the way around, clearing every use bit, and then replace the page it started at -> FIFO.
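The hand's loop can be sketched as (a Python model; in reality the use bits are set by hardware on each reference, and the initial values here are made up):

```python
N_FRAMES = 4
use_bit = [1, 1, 0, 1]   # set by "hardware" on each reference
hand = 0

def clock_replace():
    """Advance the hand until a frame with use bit 0 is found."""
    global hand
    while True:
        frame = hand
        hand = (hand + 1) % N_FRAMES
        if use_bit[frame]:
            use_bit[frame] = 0   # 1 -> clear, go on
        else:
            return frame         # 0 -> replace this page

print(clock_replace())   # frames 0 and 1 get cleared; frame 2 is replaced
```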

What if hand is moving slowly?

Not many page faults and/or find page quickly

What if hand is moving quickly?

Lots of page faults and/or lots of reference bits set.

Nth chance algorithm: don't throw a page out until the hand has swept by N times.

OS keeps counter per page -- # of sweeps

On page fault, OS checks use bit:

1 => clear use and also clear counter, go on

0 => increment counter, if < N, go on

else replace page
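The steps above extend the clock sketch with a per-page counter (a Python model; the initial state is made up):

```python
N = 2                  # number of sweeps a page may survive unreferenced
use_bit = [1, 0, 0]
sweeps  = [0, 1, 0]    # OS counter: sweeps survived with the use bit clear
hand = 0

def nth_chance_replace():
    global hand
    while True:
        frame = hand
        hand = (hand + 1) % len(use_bit)
        if use_bit[frame]:
            use_bit[frame] = 0    # 1 => clear use and also clear counter, go on
            sweeps[frame] = 0
        else:
            sweeps[frame] += 1    # 0 => increment counter
            if sweeps[frame] >= N:
                return frame      # survived N sweeps: replace this page

print(nth_chance_replace())   # frame 1 reaches N = 2 first
```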

How do we pick N?

Why pick large N? Better approx to LRU.

Why pick small N? More efficient; otherwise might have to look a long way to find free page.

Dirty pages have to be written back to disk when replaced. It takes extra overhead to replace a dirty page, so why not give dirty pages an extra chance before replacing them?

Common approach:

clean pages -- use N = 1

dirty pages -- use N = 2 (and write-back to disk when N=1)

To summarize, many machines maintain four bits per page table entry:

use: set when the page is referenced, cleared by the clock algorithm

modified: set when the page is modified, cleared when the page is written to disk

valid: ok for the program to reference this page

read-only: ok for the program to read the page, but not to modify it (for example, for catching modifications to code pages)

Inverted Page Table

Page tables map virtual page # -> physical page #

Do we need the reverse? physical page # -> virtual page #?

Yes. The clock algorithm runs through page frames. What if it ran through page tables instead?

(i) there are many more page table entries than page frames

(ii) what if there is sharing? -- a shared frame appears in several page tables