Developing a Mini Linux Shell and Performing Big-Data Processing with the Shell

1. Implement a mini Linux shell that can interpret and execute the commands that are typed in interactively or included in batch files. The shell is designed to be able to execute commands concurrently, in order to finish tasks more quickly. 2. With a batch file and the shell executing the batch file, combine a few tools in Linux to finish a big-data processing task --- finding out most frequently used words on Wikipedia pages.

Objectives

1. To further improve programing skills in C. 2. To learn how to handle processes (creation, communication/synchronization, and termination) in Linux. 3. To learn how to parse command lines. 4. To gain some knowledge on a few commonly-used tools in Linux and other UNIX systems. 5. To gain exposure to the skills of integrating multiple tools. 6. To gain some exposure to big-data processing.

Overview

The shell you implement will be similar to, but much simpler than, the ones you run every day in Linux (e.g., csh, bash, tcsh, etc). You can find out which shell you are running by typing echo $SHELL at a prompt. You may then wish to look at the man pages for sh or the shell you are running to learn more about all of the functionality that can be present. For this project, you do not need to implement much functionality; but you will need to be able to handle running commands in the background and running multiple commands simultaneously.

The shell can work in an interactive mode or a batch mode, depending on how the shell is launched.

1. Interactive mode: If the shell is launched without additional arguments, it works in an interactive mode. It will display a prompt (any string of your choosing) and the user of the shell can type in a command at the prompt. The shell creates a child process that executes the command the user entered. By default, when the child process has finished, the shell shows the prompt for more user input. However, the shell should be able to support executing a command in the background: if a command is ended with a ‘&’ sign, the shell does not need to wait for the finish of the corresponding command (i.e., the child process). Instead, it shows the prompt and starts to accept new commands after it has created the child process.

For example, if the executable file of your program is named shell, the prompt of the shell you are using after you log into Linux is $, and the prompt you choose for your

mini shell is prompt>, the following lines illustrate how your program should function in the interactive mode.

2. Batch mode: If the shell is launched with the pathname of a batch file specified in the command line (i.e., $>shell batchfile), it works in a batch mode. The batch file contains the list of commands that should be executed, one command on each line. In the batch mode, the shell should not display a prompt. However, to help you when you debug your shells and us when we test your program, in the batch mode, the shell should echo each command line it reads from the batch file back to the user on the screen before executing the command line. Despite the differences, the functionality of the shell in the batch mode is essentially the same as that in the interactive mode: read a command; spawn a child process to execute the command; wait for the finish of the child process if the command is not put in background; and repeat the above procedures for other commands.

In addition to normal command lines (e.g., those running existing Linux tools), the shell also responds to the following two special commands (in both interactive mode and batch mode):

1. quit: The shell stops accepting new commands when it sees the quit command on a line or reaches the end of the batch file. The shell should then exit after all running processes have terminated. 2. barrier: In the interactive mode, when the shell sees the barrier command, it stops accepting new commands until all running processes have terminated. It does not show the prompt either during the waiting. In the batch mode, when the shell sees the barrier command, it stops reading new commands until all running processes have been terminated. Command barrier cannot be executed in background. The ‘&’ sign in command ‘barrier&’ should be ignored.

$>shell prompt> grep “foo” file2 This is a line containing foo. prompt> cat file2 This is a line containing foo. This is a line without foo. prompt> sleep 10m& prompt> grep “foo” file2 This is a line containing foo. prompt> barrier prompt> echo “sleep finishes”

Launch the mini shell in interactive mode

Prompt is shown after grep finishes

Sleep is executed in background.

Prompt is shown and new command is accepted before sleep finishes.

The shell waits for the finish of sleep before it shows the prompt. Refer to the barrier command below.

Prompt is shown and new command is accepted when sleep finishes.

In addition to normal command lines (e.g., those running existing Linux tools), the shell also responds to the following two special commands (in both interactive mode and batch mode):

1. quit: The shell stops accepting new commands when it sees the quit command on a line or reaches the end of the batch file. The shell should then exit after all running processes have terminated. 2. barrier: In the interactive mode, when the shell sees the barrier command, it stops accepting new commands until all running processes have terminated. It does not show the prompt either during the waiting. In the batch mode, when the shell sees the barrier command, it stops reading new commands until all running processes have been terminated. Command barrier cannot be executed in background. The ‘&’ sign in command ‘barrier&’ should be ignored.

$>shell prompt> grep “foo” file2 This is a line containing foo. prompt> cat file2 This is a line containing foo. This is a line without foo. prompt> sleep 10m& prompt> grep “foo” file2 This is a line containing foo. prompt> barrier prompt> echo “sleep finishes”

Launch the mini shell in interactive mode

Prompt is shown after grep finishes

Sleep is executed in background.

Prompt is shown and new command is accepted before sleep finishes.

The shell waits for the finish of sleep before it shows the prompt. Refer to the barrier command below.

Prompt is shown and new command is accepted when sleep finishes.

Like other Linux shells, the mini shell should support redirection. If a command is ended with “> pathname”. The mini shell should open the corresponding file, truncate its length to 0, and redirect the output of the command into the file. Refer to the example program on redirection in the slides for other topics of process management.

Note that besides the quit and barrier commands, your mini shell does not need to support other commands without executing files, such as cd, bg, fg, eval, export, popd, pushd, etc.

After the mini shell has been implemented and tested with various commands, you need to generate a batch file and run the mini-shell to execute the batch file. The batch file contains the command required to analyze the web pages in Wikipedia and find out most frequently used words on the pages. The execution of the batch file generates a list of distinct words used in the web pages and the number of occurrences of each word on these web pages. The words are sorted by the number of occurrences in an ascending order. The following is a sample of output generated for 4 Wikipedia pages.

126 that 128 by 133 as 149 or 160 for 164 is 189 on 191 from 345 to 375 advertising 443 a 473 and 480 in 677 of 1080 the

Since there are a huge number of pages in Wikipedia, it is not realistic to analyze all of them in short time on one machine. In the project, you need to analyze all the pages for the entries with two capital letters. For example, the Wikipedia page for entry “AC” is . You can use the following command to saves the page in AC.html:

wget -O AC.html

A HTML page has HTML tags, which should be removed before the analysis. (Open a .html file using vi and a web browser, and you will find the differences.) You can use html2txt to extract the text content into a text file. For example, the following command extract the content for entry AC into AC.txt

html2txt AC.html

After the contents for all the required entries have been extracted, you need to find all the words using grep. Here, words can be any combinations of letters without white spaces or punctuations.

grep -oh "[a-zA-Z]*" AA.txt AB.txt AC.txt … ZZ.txt > allwords.txt

The above command will find all the words (i.e., occurrences of distinct words) in the text files and save them into allwords.txt, one word (i.e., one occurrence) on each line. You need to replace the ellipsis (…) with correct file names. Note that the words are not sorted. Also, if a word appears multiple times in these text files, it also appears the same number of times in allwords.txt. Read the man page of grep to understand this command.

In order to find most frequently used words, you need to find distinct words and count the number of times that each distinct word appears in allwords.txt (i.e., number of occurrence). This can be achieved by sorting the words in allwords.txt and using uniq to find unique words and count occurrences. Read the man pages of sort and uniq to understand how this can be achieved.

sort –o allwords_sorted.txt allwords.txt

uniq -ic allwords_sorted.txt count_uniqwords.txt

With the list of unique words and the counts of their occurrences, the final step is to sort the unique word list based on their occurrence counts.

sort -k 1,1n count_uniqwords.txt

Other Instructions and Hints on Programming

1. You are to do this project BY YOURSELF. 2. The program for the mini shell is basically a loop: it repeatedly prints a prompt (if in interactive mode), reads and parses the input from stdin (for the interactive mode) or the batch file (for the batch mode), executes the command specified on that line of input, and waits for the command to finish (if the command is in the foreground) or the commands in the background (if a “barrier” command is observed). This is repeated until the user types "quit" or ends their input. 3. Since there may be commands in the background and a command in the foreground, your program needs to wait for the finish of the commands smartly. For example, when there is a foreground command and a background command, your program should not finish waiting and show the prompt if the background command finishes and the foreground command has not. Your program should call appropriate functions for waiting child processes (e.g., wait() and waitpid()) and check the information returned from these functions. Your program should also keep track of how many commands are currently unfinished in the background. 4. You should structure your shell such that it creates a new process for each new command. There are two advantages of creating a new process. First, it protects the main shell process

from any errors that occur in the new command. Second, it allows easy concurrency; that is, multiple commands can be started and allowed to execute simultaneously (i.e., in parallel style). 5. To simplify things for you in this project, we will suggest a few functions you may want to use to make your coding easier. (Do not expect this detailed of advice for other assignments and project!) You are free to use these functions if you want or to disregard our suggestions. To find detailed information on these functions, look at the manual pages (using the Unix command man) and the slides if they have been introduced in the class. You will also find man pages useful for seeing which header files you should include. 5.1 Parsing: Functions strsep() and/or strtok() may be used for parsing a command line. But you may find that strsep() is more useful. You need to choose appropriate delimiters and decide how to move forward. When parsing a command line, your program also needs to determine whether the command will be executed in the background or not and whether the output should be redirected or not. 5.2 Executing Commands: Look into fork(), execvp(), and wait/waitpid(). You may note that there are a variety of commands in the exec family. For this project, execvp() is more suitable than other exec functions. Remember that if execvp() is successful, it will not return; if it does return, there was an error (e.g., the command does not exist). The most challenging part is getting the arguments correctly specified. The first argument specifies the program that should be executed; this is straight-forward. The second argument, char *argv[] is an array of strings, or an array of pointers to characters. For example, if you invoke a program with the following command: foo 205 535 thenargv[0] = "foo", argv[1] = "205" and argv[2] = "535". Important: the list of arguments must be terminated with a NULL pointer; that is, argv[3] = NULL. We strongly recommend that you carefully check that you are constructing this array correctly! 6. Given the large number of commands in the batch file for analyzing Wikipedia pages, you need to write a C program to generate the batch file. 7. To finish the work quickly, in the batch file for analyzing Wikipedia pages, all the wget commands and all the html2txt commands should be put in the background, so that they can execute concurrently. But you need to use a barrier to ensure that pages have been downloaded before they are converted to text files. 8. Your program generating the batch file only takes one argument --- the pathname of the batch file to be generated. Suppose the executable file of the program is batch_generator. The command line for generating the batch file is ./batch_generator pathname_of_batch_file Note, it is important to follow this instruction strictly. Otherwise, when your work is being tested, the batch file will not be generated, and you mini shell will not pass the corresponding tests. 9. Miscellaneous Hints 9.1 Remember to get the basic functionality of your mini shell working before worrying about other functionality. For example, first focus on interactive mode, and get a single

command running (probably first a command with no arguments, such as "ls"). Then, add in the functionality to support execution in the background. Next, work in batch mode. 9.2 We strongly recommend that you check the return codes of all system calls from the very beginning of your work. This will often catch errors in how you are invoking these new system calls. 9.3 Practice defensive programming --- your program needs to respond to different types of inputs in a reasonable manner. By "reasonable", we mean printing out an understandable error message before the program continues processing or exits, depending upon the situation. In the project, you can assume that no command lines will exceed 512-byte long, including ‘\n’. 9.3.1 Your shell should be able to handle the following scenarios, which are not errors. Your shell should not print an error message, or crash, or exit prematurely.  An empty command line.  Extra white spaces within a command line.  Batch file ends without a quit command. The shell should show a new prompt if an empty line or a line with only white spaces is observed, and should function normally if a command line contains extra white spaces or a batch file ends without a quit command. 9.3.2 For the following situation, you should print a message to the user (stderr) and continue processing:  A command does not exist or cannot be executed. 9.3.3 Your program should consider the following situations as errors; in each case, your shell should print a message (to stderr) and exit gracefully:  An incorrect number of command line arguments to your shell program.  The batch file does not exist or cannot be opened.

Instructions on Testing

Beat up your own code! You are the best (and in this case, probably the only) tester of this code before it is graded. Throw lots of junk at it and make sure the shell behaves well. Good code comes through testing -- you must run all sorts of different tests to make sure things work as desired. Don't be gentle -- other users certainly won't be. Break it now so we don't have to break it later.

Test your program on one of the afs servers (afsN.njit.edu, where N is an integer from 1 to 36). Compile your program there with gcc on the afs server. To facilitate grading, your program should compile without specifying the -std option in the gcc command line (on afs servers, gnu90 is the default version for C standard).