ECE7995-Fall-2007 Lab Exercise 1

(Deadline is 11:59pm of 10/10/07 (Wednesday)

In this lab exercise you will have an opportunity to look into how file pages are prefetched (or readahead as termed in Linux) by the Linux kernel to minimize process stall time due to slow I/O operations. We will learn how Linux does its job from two perspectives: one is to read and understand its code and the other is to observe how it behaves in response to various representative strings of read requests. We then try to correlate the observations with the code that makes them happen.

(1) Understand the Linux kernel code for prefetching.

First check out file “readahead.c” for version 2.6.18 on the LXR website: (using “file search” by typing the file name)

Second, find the description of the code in the following book:

“Understanding the Linux Kernel”O'Reilly Media; 3 edition (November 1, 2005) by Daniel Plerre Bovet and Marco Cesati

(

An on-line version is also available at:

To look for the description for readahead, go to Chapter 16: Accessing File. In the online version, search for “16.1.2. Read-Ahead of Files”.

Here are some notes about Linux readahead:

(1)Sequential access of disk data helps to improve I/O performance by taking advantage of high bandwidth of hard drive while minimize long latency incurred by random access.

(2)Prefetching improves I/O performance for two reasons

a) When it detects a sequential access pattern of applications, it sequentially reads a certain amount of data ahead of currently requested data in advance. This sequential access can be efficiently completed by disk.

b)It overlaps the data access with process computation to minimize I/O stall time, or number of page misses. This is based on an accurate prediction of what data are to be used and prefetching of them in a right timing.

(3) Therefore, the key issues involved in the prefetching policies include:

a)How to detect a sequential access pattern?

b)How much data to be prefetched?

c)At what time the prefetching requests should be issued?

Note that all data manipulationsare the in page units (4KB). To address the issues, we need appropriate data structures.

(4)To keep track of read and readahead behaviors of a file instance (corresponding to a file handler/pointer returned by a file open operation called in a process), Linux define a structure for each file instance: (using “identifier search” by typing “file_ra_state”)

653/*

654 * Track a single file's readahead state

655 */

656 struct file_ra_state {

657 unsigned long start; /* Current window */

658 unsigned long size;

659 unsigned long flags; /* ra flags RA_FLAG_xxx*/

660 unsigned long cache_hit; /* cache hit count*/

661 unsigned long prev_page; /* Cache last read() position */

662 unsigned long ahead_start; /* Ahead window */

663 unsigned long ahead_size;

664 unsigned long ra_pages; /* Maximum readahead window */

665 unsigned long mmap_hit; /* Cache hit stat for mmap accesses */

666 unsigned long mmap_miss; /* Cache miss stat for mmap accesses */

667 };

Field “prev_page” records the last read() position. To answer the first issue about detection of sequentiality, Linux code is as follows: (in readhead.c)

474 /* Note that prev_page == -1 if it is a first read */

475 sequential = (offset == ra->prev_page + 1);

“offset” is the current read position. The definition of “sequential” is either when the first read is at the beginning of a file or when the current read position is immediately after the last read position.

To answer the other two issues, two page windows are introduced: current window [start … start + size – 1] and ahead window [ahead_start … ahead_start + ahead_size -1]. Roughly speaking, current window contains pages that are being used by the process, including the pages directly accessed by processesand possibly pages that have been prefetched but not yet used. The ahead window consists of pages that are being read in advance by the kernel but not yet used. This implies that when current reads from the process walk into the ahead window, the ahead window turns into a new current window and the ahead window moves forward by prefetching data further ahead. This timing design is intended for a good overlapping of I/O and computation.

In the two-window readahead process there are two cases that special attentions need to be paid, which are reflected in the two possible values of “flags”: RA_FLAG_MISS and RA_FLAG_INCACHE. They are used to deal with two issues, respectively.

a)What should be done about readahead when prefecthed pages are evicted prematurely before they are actually use (aka thrashing)?

b)What should be done about readahead if many pages (>=256) that the kernel plans to prefetch are found to already in the memory? (the number of such pages are recorded in “cache_hit”)

In case a), next ahead window size is reduced, and in case b) the readahead is turned off, which is indicated by current window size of 0. (Does this make sense to you?)

(5)Functions about readahead are in readahead.c. The function to start readahead is:

460 unsigned long

461page_cache_readahead(struct address_space *mapping, struct file_ra_state *ra,

462 struct file *filp, pgoff_toffset, unsigned long req_size)

Usually it is called by do_generic_file_read() when we request regular file read operations. This function replenishes the current and ahead windows, updating their sizes according to the number of how successfully readahead was in the past accesses to the file.

page_cache_readahead() first checks if the requested read is sequential to the previous read, as we have mentioned. Then there are three cases: (i) If it is sequential and readahead is currently off (ra->size == 0), it calls get_init_ra_size() to get initial current window size, and if request size is sufficiently large, call make_ahead_window() to create ahead window. make_ahead_window() calls get_next_ra_size(ra) to get ahead_window size. (ii) If the requested read is random, turn off readahead, and only read the requested data. (iii) If it is sequential and readahead is currently on, we may need to create a new ahead window either when ahead window does not exist yet or when the requested read crosses into the existing ahead window. In the latter case, the existing ahead window becomes the current window.

Note that the flow diagram about page_cache_readahead() in book “Understanding the Linux Kernel” is about an older version of the function, and there are some mismatchings with the actual code.

(2) Your are required to write a sys_call and test it using user-level programs (75 points / 100 points)

Almost all readahead actions can be observed or derived from “file_ra_structure”. Please write a system call to pass the kernel data structure to the user space and print out values of its fields (you can omit “mmap_hit” and “mmap_miss” as we don’t study file mapped I/O in the lab)

asmlinkage int sys_lab_ra_state(void *usr_ra_state, unsigned int fd)

You should use the exactly same interface for the lab, including function name and parameters, in which “usr_ra_state” refers topointer to the user-level “file_ra_state”, and ‘fd’ is file handler retuned open(). You can use “struct file *filp = fget(fd);” so you can obtain the desired file_ra_state structure “flip->f_ra”. Remember to use “fput(filp)” to decrease use count of the file before you return from the function.

The code for implementation of the sys-call should be named as“lab1-fall07.c” and placed in a new directory named as “/home/accessID/UML/linux-2.6.18.3/my-source”.

The header file should be named as “lab-fall07.h” and be placed in a new directory named as “/home/accessID/UML/linux-2.6.18.3/include/my-linux”.

The instructions posted at the course website ( give you all the details about setting your kernel development environment and adding your sys-call. Please exactly follow them as told.

Once you have the sys-call ready, you can write user-level programs to observe the behaviors of the kernel readahead. To make your job easier, I provide you with these user-level programs. You may choose to use them or make whatever changes as you need, especially for formatting its output as you deem appropriate. BUT please carefully read these codes line-by-line and try to understand them as much as you can. To access the files of the user-level source code, first mount your “root_fs” and then:

cp /home/UML/Lab1-Fall-07/* /home/yourAccessID/UML/mnt/home/. (your user-level code should be in /home of UML Linux.)

Note that your “/home/yourAccessID/UML/mnt/home/” is currently owned by ‘root’ of your kernel and the host Linux doesn’t permit you to make the copy. So you need to start your UML Linux and login as root, and do “chmod 777 /home”. Then you can ‘halt’ the UML Linux and make the copy in the host Linux.

Here are descriptions of the files:

(i) create-file.c: format: create-file file-name file-size-in-page-units

It creates a file whose name is “file-name’ and has the “file-size-in-page-units” number of pages. In each page its page offset is stored. You can use the file to test read & readahead behaviors.

(ii) ra.h

This file defines a constant for your sys-call. Make sure the number is the one that is actually used in your kernel. Also it declares a user-level “file_ra_state”. Make sure it is exactly the same as the one you declare in your kernel.

(iii) trace-ra.c: format: trace-ra config-file-name

The program first reads a config file, from which it gets the name of a data file to and pages to read. It then reads the pages from the file. In the process it prints “file-ra-state” before and after a read operation to allow to observe read & readahead behaviors.

(iv) data-to-rae.txt:

A sample config file for trace-ra. The format of the file is as follows:

File-name, n1, n2, n3, …

In which, nk (k = 1, 2, 3, …) is an integer. If nk > 0, “trace-ra” would access nk pages starting from the current file access position. If nk <= 0, “trace-ra” would change file access position to “-nk” by seeking.

(3) Furthermore, you are required to observe/analyze the behaviors of Linux readahead (25 points /100 points)

Design multiple read sequences that produce various readahead behaviors, specifically variations of window sizes. Explain the variations by citing the codes in get_init_ra_size(), get_next_ra_size(), and/or check_ra_success(), etc.

To submit your work, send me an email before the deadline with subject as “Lab1 submission --- ECE7995-Fall-07”. Once you notify me your work is ready to be graded, I’ll enter your directory, build your kernel, and run your user-level programs. Also, each student should be independently complete the work and individually submit his/her work.