Abdullah Sheneamer 2012

DCSPM

Develop and Compile Subset of

PASCAL Language to MSIL

By

Abdullah Sheneamer

A project submitted to the Faculty of Graduate School of the

University of Colorado at Colorado Springs

in Partial Fulfillment of the Requirements

for the Degree of

Master of Science in Computer Science

Department of Computer Science

Fall 2012

© Copyright by Abdullah Sheneamer2012

All Rights Reserved

This project for the Master of Science degree by

Abdullah Sheneamer

has been approved for the

Department of Computer Science

By

______

Dr. Albert Glock, Advisor

______

Dr. C. Edward Chow, Committee member

______

Albert Brouillette, Committee member

______

Date

DCSPM

Develop and Compile Subset of

PASCAL Language to MSIL

Abstract

The focus of this project is designed the Intermediate language (IL or MSIL) for PASCAL Language. This project aims to design a compiler that can handle subset of PASCAL Language to compile to MSIL including, Assignment statement, Writelineinstructions, If statement, If/else statement, While statement,For statement, Switch statement, If logic statement, and One dimensional array. Also, it’s designed an algorithm that implements the lexical analysis and another algorithm for syntax analysis that the first phase of a compiler is called scanning too. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. The second phase of a compiler is called syntax analysis or parsing. The parser uses the first components of the tokens produced by lexical analysis. The compilation time is important so,there are some algorithms and evaluate these different algorithms for their speed performance in Lexical Analysis and Parser which can become bottleneck.

“ILAsm has the instruction set same as that the native assembly language has. You canwrite code for ILAsm in any text editor like notepad and then can use the command linecompiler (ILAsm.exe) provided by the .NET framework to compile that.ILAsm.exeis acommand line tool shipped with the .NET Framework and can be located atwindowsfolder>\Microsoft.NET\Framework\<version> folder. You can include thispath in your path environment variable. When you have finished compiling your .IL file,then it will output the exe with the same name as that of .IL file. You can specify theoutput file name using /OutPut=<filename> switch like ILAsm Test.il/output=MyFile.exe. To run the outputted exe file, just type the name of the exe and hitenter”[11].

Acknowledgements

I would never have been able to finish my dissertation without the guidance of my committee members, and support from my family and wife.

I would like to express my deepest gratitude to my advisor, Dr. Albert Glock, forhis excellent guidance, caring, patience, and providing me with an excellent atmospherefor doing research.

I offer my sincerest gratitude to Dr. Edward Chow, who let me experience the researchof practical issues beyond the textbooks, patiently corrected my writing research and giving important questionable ideas to me through his comments on my proposal.

I would also like to thankAlbert Brouillette for being interested in getting my proposal succeeded and giving important questionable ideas to me through his comments on my proposal.

List of Content

List of Figures

List of Tables

1. Introduction

In the computer world, techniques evolve rapidly from theories, algorithms, programming languages, software systems, and software engineering.

“Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.”[6]

Fortunately, compilers allow programmers to write at a high level, and automated processing takes care of creating the machine-specific instructions. My project designs and createsa compiler thattranslates PASCAL source code into Microsoft Intermediate Language (MSIL). When compiling the source code to managed code in .Net environment, the compiler translates the source into Microsoft Intermediate Language (MSIL).MSIL includes instructions for loading, storing, initializing, and calling methods on objects, as well as instructions for arithmetic and logical operations. There is currently no PASCAL compiler which compiles to MSIL. The Just-in-time (JIT) compiler will convert the MSIL to CPU- Specific code [1].

The advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures. The JIT compiler can also do aggressive optimizations specifically for the machine where the code is running.

“Before you can run Microsoft intermediate language (MSIL), it must be converted by a .NET Framework just-in-time (JIT) compiler to native code, which is CPU-specific code that runs on the same computer architecture as the JIT compiler. Because the common language runtime supplies a JIT compiler for each supported CPU architecture, developers can write a set of MSIL that can be JIT-compiled and run on computers with different architectures. However, your managed code will run only on a specific operating system if it calls platform-specific native APIs, or a platform-specific class library.

JIT compilation takes into account the fact that some code might never get called during execution. Rather than using time and memory to convert all the MSIL in a portable executable (PE) file to native code, it converts the MSIL as needed during execution and stores the resulting native code so that it is accessible for subsequent calls. The loader creates and attaches a stub to each of a type's methods when the type is loaded. On the initial call to the method, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to direct execution to the location of the native code. Subsequent calls of the JIT-compiled method proceed directly to the native code that was previously generated, reducing the time it takes to JIT-compile and run the code.

The runtime supplies another mode of compilation called install-time code generation. The install-time code generation mode converts MSIL to native code just as the regular JIT compiler does, but it converts larger units of code at a time, storing the resulting native code for use when the assembly is subsequently loaded and run. When using install-time code generation, the entire assembly that is being installed is converted into native code, taking into account what is known about other assemblies that are already installed. The resulting file loads and starts more quickly than it would have if it were being converted to native code by the standard JIT option.

As part of compiling MSIL to native code, code must pass a verification process unless an administrator has established a security policy that allows code to bypass verification. Verification examines MSIL and metadata to find out whether the code is type safe, which means that it only accesses the memory locations it is authorized to access. Type safety helps isolate objects from each other and therefore helps protect them from inadvertent or malicious corruption. It also provides assurance that security restrictions on code can be reliably enforced.

The runtime relies on the fact that the following statements are true for code that is verifiably type safe:

  • A reference to a type is strictly compatible with the type being referenced.
  • Only appropriately defined operations are invoked on an object.
  • Identities are what they claim to be.

During the verification process, MSIL code is examined in an attempt to confirm that the code can access memory locations and call methods only through properly defined types. For example, code cannot allow an object's fields to be accessed in a manner that allows memory locations to be overrun. Additionally, verification inspects code to determine whether the MSIL has been correctly generated, because incorrect MSIL can lead to a violation of the type safety rules. The verification process passes a well-defined set of type-safe code, and it passes only code that is type safe. However, some type-safe code might not pass verification because of limitations of the verification process, and some languages, by design, do not produce verifiably type-safe code. If type-safe code is required by security policy and the code does not pass verification, an exception is thrown when the code is run.” [12]

-Compilation process:takes PASCAL source code and produces MSIL. The PASCAL compiler includes lexical and syntax analysis, and the creation of the symbol table. MSIL is created when compiling to manage native code.MSIL is a CPU-independent set of instructions that can be efficiently converted to native code. Such as in Figure 2.

-Execution process:MSIL must be converted to CPU-specific code, usually by ajust-in-time (JIT) compiler. Native code is computer programming (code) that is compiled to run with a particular processor(such as an Intelx86-class processor) and its set ofinstructions.

1.1. Motivation:

“During compilation of MSIL, thesource codeis translated into MSIL code rather than platform or processor-specificobject code. MSIL is aCPU- and platform-independent instruction set that can be executed in any environment supporting the Common Language Infrastructure,such as the.NET runtimeonWindows, or thecross-platformMonoruntime. In theory, this eliminates the need to distribute different executable files for different platforms and CPU types. MSIL code is verified for safety during runtime, providing better security and reliability than natively compiled executable files. ” [13]

Since, there is currently no PASCAL compiler which compiles to MSIL so, I design MSIL of subset of PASCAL language which has the advantage in compiling to MSIL is that 1) legacy PASCAL can now be run on modern machines, 2) MSIL is platform independent and 3) JIT compilers can be optimized for specific machines and architectures.

2. Background

2.1. Overview of Compilation Process

A compiler is a program that can read a program in one language- the source language – and translate it into equivalent program in another language as Figure 2. An important role of the compiler is to report any errors in the source program that it detects during the translation process [6].

“Microsoft Intermediate Language (MSIL) is a language used as the output of a number of compilers (C#, VB, .NET, and so forth).The ILDasm(Intermediate Language Disassembler) tool that ships with the .NET Framework SDK (FrameworkSDK\Bin\ildasm.exe) allows the user to see MSIL code in human-readable format. By using this utility, we can open any .NET executable file (EXE or DLL) and see MSIL code.

TheILAsm (Intermediate Language Assembler) tool generates an executable file from the MSIL language. We can find this program in the WINNT\Microsoft.NET\Framework\vn.nn.nn directory.Any PASCAL programmer starting with .NET development is interested in what happens in the low level of the .NET Framework. Learning MSIL gives a user the chance to understand some things that are hidden from a programmer working with PASCAL or another language. Knowing MSIL gives more power to a .NET programmer. We never need to write programs in MSIL directly, but in some difficult cases it is very useful to open the MSIL code in ILDasm and see how things are done” [14].

2.2. History

“Pascalis an influential imperative and procedural programming language, designed in 1968–1969 and published in 1970 byNiklaus Wirtha small and efficient language intended to encourage good programming practices using structured programming and data structuring. A derivative known asObject Pascaldesigned for object-oriented programming was developed in 1985.

Pascal, named in honor of the French mathematician and philosopher Blaise Pascal, was developed byNiklaus Wirthand based on the ALGOLprogramming language

Prior to his work on Pascal, Wirth had developedEulerandALGOL Wand later went on to develop the Pascal-like languagesModula-2andOberon.

Initially, Pascal was largely, but not exclusively, intended to teach students structured programming.A generation of students used Pascal as an introductory language in undergraduate courses. Variants of Pascal have also frequently been used for everything from research projects to PC games and embedded systems.. Newer Pascal compilers exist which are widely used” [15].

Grace Murray Hopper coined the term compiler in the early 1950s. Translation was viewed as the “compilation” of a sequence of machinelanguage subprograms selected from a library. One of the first real compilers was the FORTRAN compiler of the late 1950s. It allowed a programmer to use a problem-oriented source language. Ambitious “optimizations” were used to produce efficient machine code, which was vital for early computers with quite limited capabilities. Efficient use of machine resources is still an essential requirement for modern compilers [16].

3. Design

3.1: Introduction to Symbol Table and Lexical Analysis

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier (information about storage allocation, type,…, etc.). When the lexical analyzer detects an identifier in the source, the identifier is entered into the symbol table. However, its attributes will be entered in the following phases. These attributes are also used later phases.

The lexical Analysis is the first phase of a compiler is called lexical analysis or scanning. The lexical Analysis reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.

3.1.A.Symbol Table Design

Every key word is a token and has a unique integer code as shown in the table:

Token Code / Keyword
300 / Begin
323 / If
302 / For
305 / Switch
376 / While

So, The identifier token has a code 256, the number token has a code 257, and every special character is a token and has an integer token code equals its ASCII number.Tokens of two characters have unique toCodes as shown in the below table:

Token Code / Tow – Characters Tokens
406 / !=
407 / ==
408 / <=
409 / >=

A token in an instance of the class as shown in the below figure:

3.1.B. Lexical Analysis Design

after reading next character from input stream ;

State 0 : identify the current token and decide thenext state ;

State 1 : Handle identifiers and keywords.

State 2: Handle Number .

State 3 : Handle one – character token or two –character token .

State 4,5 : Handle Comments “\\” or “\*”, skip the line start with “\\” or skip the data between “\*” and “*\”.

3.1.Parser and MSIL (Microsoft Intermediate Language) of PASCAL

Language Design

3.1.1: Introduction to Parser (Syntax Analysis)

The parser inputs the stream of tokens into a hierarchical structure represented by a syntax tree. A typical data structure for the syntax tree of this example “ position := initial + rate * 60 token stream is shown below:

Position

“Grammar is used throughout the parser to organize compiler front ends. A grammar naturally describes the hierarchical structure of most programming language constructs such as the Pascal Language. For example, an if- else statement in Pascal language can have the form

If (expression) statement else statement.

That is, an if-else statement is the concatenation of the keyword if , an opening parenthesis, an expression, a closing parenthesis, a statement, the keyword else, and another statement. Using the variable expr to denote an expression and variable stmt to denote a statement, this structuring rule can be expressed as:

stmtif (expr) stmtelsestmt

in which the arrow may be read as “ can have the form” Such a rule is called a production. In production, lexical elements like the keyword if and the parentheses are called terminals. Variables like expr and stmt represent sequence of terminals and called non-terminals” [6].

3.1.2. Parser (Syntax Analysis) Design

Parsing is the process of determining if a string of tokens can be generated by a grammar. To parse Pascal, it is sufficient to make a single left to right scan over the input, looking ahead one token at a time. Top-Down parsing constructs the nodes of a parse tree starting at the root and proceeding towards the leaves such as the simple example in Figure 8. To construct the parse tree, start at the root and repeatedly do the following two steps:

1-At the function “OneDimArray” construct children at “n”. For the symbols on the right side the production.

2-Find the next node at which a sub tree is to be constructed.

“Note: When starting with nonterminalOneDimArray at the root, we should use a production for OneDimArray that starts with lookahead symbol array. The lookahead symbol always contains the next token to be parsed in the input stream.” [6]