Advanced Compiler Design and Implementation

Introduction to Advanced Topics

Chapter 1

Eran Yahav

Course textbook:
Advanced Compiler Design and Implementation
Steven S. Muchnick
ISBN 1558603204
Two copies are available in the TAU library.
Purchase your own copy - send e-mail to

Outline

Review of compiler structure

Advanced issues in elementary topics

The importance of optimizations

Structure of optimizing compilers

Placement of optimizations

Tentative schedule

Course requirements

Compiler Structure


Figure 1 - General structure of a non-optimizing one-pass compiler

This section gives a short reminder of basic compiler structure and basic terminology. If you are not familiar with the terms, you can find a more elaborate description in [1] or [2].

Let's start by describing the structure of a non-optimizing one-pass compiler. The structure of such a compiler is shown in detail in Figure 1.

A simple non-optimizing one-pass compiler consists of the following components/phases:

  1. Scanner (also known as Lexical Analyzer) - the input to this phase is the stream of characters representing the source program. During scanning, the stream is read sequentially and characters are grouped into meaningful tokens. The output of this phase is a token stream, to be analyzed by the parser (a toy scanner sketch appears right after this list).
  2. Parser (also known as Syntax Analyzer) - the input to this phase is a stream of tokens (usually produced by the lexical analyzer). During parsing, tokens are grouped into grammatical phrases. The parsing phase produces an abstract syntax tree (AST) representation of the program.
  3. Semantic Analyzer - the input to this phase is the AST produced by the parser. The semantic analysis phase checks the program for semantic errors and collects type information later used by the code generator. Semantic analysis usually produces some form of intermediate representation (IR) of the program.
  4. Code Generator - the input to this phase is the intermediate representation (IR) produced by the semantic analyzer. The code generator produces target code from the intermediate representation.
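As a concrete illustration of the first phase, here is a toy scanner in C (a minimal sketch written for these notes, not taken from the textbook): it reads characters from standard input and groups them into identifier, number, and single-character tokens.

    #include <stdio.h>
    #include <ctype.h>

    /* Toy scanner: groups input characters into identifier, number,
       and single-character tokens. Real scanners also classify keywords,
       handle comments and string literals, and record source positions. */
    int main(void) {
        int c = getchar();
        while (c != EOF) {
            if (isspace(c)) {                /* white space separates tokens       */
                c = getchar();
            } else if (isalpha(c)) {         /* identifier: letter (letter|digit)* */
                printf("IDENT: ");
                do { putchar(c); c = getchar(); } while (isalnum(c));
                putchar('\n');
            } else if (isdigit(c)) {         /* number: digit+                     */
                printf("NUMBER: ");
                do { putchar(c); c = getchar(); } while (isdigit(c));
                putchar('\n');
            } else {                         /* any other character stands alone   */
                printf("OP: %c\n", c);
                c = getchar();
            }
        }
        return 0;
    }

On the input x = y1 + 42 it would report IDENT, OP, IDENT, OP and NUMBER tokens, which is exactly the token stream the parser would then consume.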

The remaining two components in Figure 1 - the symbol table and its access routines, and the OS interface - are used by most compiler phases. The symbol table and access routines record the identifiers used in the source program and collect information about various attributes of each identifier. A symbol table is a data structure containing a record for each identifier, with fields for the attributes of that identifier. A naive implementation of a symbol table is quite trivial but, as we shall see later, efficient symbol table management is not at all simple. Note that the symbol table (and its access routines) is usually used by all compilation phases; for example, the lexical analyzer detects the occurrence of an identifier in the source program and creates the corresponding record in the symbol table, but most of the identifier's attributes are only determined by later phases.

The OS interface is used by the compiler for file handling, OS-level error handling, and so on.

A more sophisticated compiler would probably contain additional components, such as a code optimizer.

Advanced Issues in Elementary Topics

Although the compiler structure described in the previous section is quite simple, there are some advanced issues to be investigated (the relevant textbook chapter is given in parentheses):

  1. Symbol table management (3)

Efficiency - the symbol table can be trivially implemented as a hash table with identifier names used as hash keys (a minimal sketch of such a table appears right after this list). Is that the best implementation? Can we use another data structure to achieve better performance?

Overloading - how do we handle overloading? (The problem is demonstrated in the following section.)

  2. Type Inference - all compilers use some degree of type inference. The problem of type inference can be described simply as: determine the type of an expression based on the types of the operands participating in it. Although many cases seem straightforward, in object-oriented languages type inference can get very complicated.
  3. Intermediate Language Selection (4) - selecting an intermediate representation can affect the quality of the generated target code and the quality of the optimizations performed.
  4. Run-Time Support (5) - the data structures used by the compiler to support the program at run time, the representation of objects at run time, and so on.
  5. Producing Code Generators (6)
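The following sketch shows one straightforward hashed symbol table in C. It is only meant to make the discussion concrete - the names (Symbol, sym_lookup, sym_insert) are invented for these notes, and chapter 3 of the textbook discusses considerably more refined organizations.

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 211                     /* small prime number of buckets  */

    typedef struct Symbol {
        char          *name;                 /* identifier spelling            */
        int            kind;                 /* e.g., variable, function, type */
        struct Symbol *next;                 /* chaining for collisions        */
    } Symbol;

    static Symbol *buckets[NBUCKETS];

    static unsigned hash(const char *s) {    /* simple string hash             */
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    Symbol *sym_lookup(const char *name) {   /* most recent entry is found first */
        for (Symbol *s = buckets[hash(name)]; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
        return NULL;                         /* not (yet) declared             */
    }

    Symbol *sym_insert(const char *name, int kind) {
        unsigned h = hash(name);
        Symbol *s = malloc(sizeof *s);
        s->name = malloc(strlen(name) + 1);
        strcpy(s->name, name);
        s->kind = kind;
        s->next = buckets[h];                /* insert at head of the chain    */
        buckets[h] = s;
        return s;
    }

Even this toy version hints at the harder issues: inserting at the head of a chain lets an inner declaration shadow an outer one during lookup, but nothing here handles leaving a scope, overloaded names that differ only in their signatures, or the alternative organizations (balanced trees, stacks of scopes) whose efficiency the textbook compares.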

Symbol table management

Figure 2 - sample program - is symbol table management trivial?

Symbol table management should take overloading into account. Figure 2 shows a sample program in which several identifiers named "f" are defined. At some program points f is a function; at others, it is a simple integer variable. However, at each program point the compiler must resolve f to a single entry in the symbol table according to the language semantics.

For example, let's examine the call site denoted "A1". Which f is invoked at "A1"? The function declared at "D1" or the one defined at "D2"?

Note that the put procedure is overloaded to handle types boolean, float and integer.

f[D1] is declared with a boolean return type. So is it f[D1] that will be invoked, together with the put(boolean) procedure? A careful look at f[D1] reveals that the boolean type it returns is a type defined inside the package x, rather than the "global" boolean defined by Ada.

Since f[D1] does not return a "global" boolean type, there is no implementation of put that can handle it. Therefore, the site "A1" will invoke put(float) with f[D2].

What about "A2"? The declaration at "D3" adds yet another symbol f to the symbol table, this time as an integer variable. According to Ada's semantics, a locally defined symbol "masks" other identifiers with identical names. In short, the site "A2" will invoke put(integer) on the integer variable f defined at "D3".

Finally, let's take a look at "A3". Now we have four different definitions of f in our program. Which f will be used at "A3"? Again, the language definition comes to our aid: locality of definition resolves f to f[D4] and put to put(integer).

As can be seen from this simple example, symbol table management can be quite tricky. The subject is covered in detail in chapter 3 of the textbook.

Type Inference

Type inference is the problem of determining the type of a language construct from the way it is used. The term is often applied to the problem of inferring the type of an expression from its arguments, or the type of a function from its body.

Figure 3 - type inference example - procedure mlist with procedure parameter p (dragon book)

Figure 3 shows an example of type inference in Pascal; the example is taken from [2].

The procedure mlist takes as a parameter another procedure p. By looking at mlist's header, we do not know the number or types of the parameters taken by procedure p. This kind of incomplete specification of a procedure parameter's type is allowed by C and by the Pascal reference manual. Since mlist invokes p, we can infer the type of p from its use inside mlist: p appears in the expression p(lptr), so the type of p must be link → void, i.e., p maps values of type link to type void (p is a procedure, and therefore its return type is void).
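As a toy illustration of inferring an expression's type from the types of its operands (an invented fragment, not the textbook's algorithm), consider the usual rule for arithmetic in a C-like language: if either operand is a floating-point type the result is floating point, otherwise it is an integer.

    /* Toy type inference for a binary arithmetic expression.
       Real inference engines work over whole expression trees and, for
       polymorphic languages, typically rely on unification. */
    typedef enum { TY_INT, TY_FLOAT, TY_ERROR } Type;

    Type infer_binop(Type left, Type right) {
        if (left == TY_ERROR || right == TY_ERROR)
            return TY_ERROR;              /* propagate earlier type errors */
        if (left == TY_FLOAT || right == TY_FLOAT)
            return TY_FLOAT;              /* int op float -> float         */
        return TY_INT;                    /* int op int   -> int           */
    }

In the mlist example the same idea runs in the opposite direction: from the call p(lptr) the compiler deduces that p must accept a link argument and, being a procedure, return nothing - that is, p : link → void.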

Intermediate Language Selection

Low- vs. high-level control-flow structures

Flat vs. hierarchical (tree)

Machine-level vs. high-level instructions

(Symbolic) registers vs. stack

Normal forms (SSA)

Intermediate forms: Control Flow Graph, Call Graph, Program Dependence Graph

Issues: Engineering, efficiency, portability, optimization level

IRs in the Book


Figure 4 - different levels of intermediate representation

Figure 4 shows an example of three different levels of intermediate representation, as described in the textbook. The high-level intermediate representation (HIR) preserves high-level structure such as loops and array subscripting. The medium-level representation (MIR) keeps the operations but lowers the control flow: the loop control structure is no longer explicit, and the array access is now expressed in terms of pointer offsets. The low-level representation (LIR) is lower still, assigning variables and temporaries to symbolic registers.
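To make the three levels concrete, here is an invented illustration (not the textbook's figure) of the same array-clearing loop written at roughly each level; the MIR-like version is rendered as C for readability.

    /* HIR-like: the loop and the array subscript are explicit constructs. */
    void clear_hir(int a[], int n) {
        for (int i = 0; i < n; i++)
            a[i] = 0;
    }

    /* MIR-like: the loop becomes explicit tests and jumps, and the
       subscript becomes address arithmetic (a pointer offset). */
    void clear_mir(int a[], int n) {
        int i = 0;
    loop:
        if (i >= n) goto done;
        *(a + i) = 0;              /* t1 <- addr a + 4*i ; store 0 at t1 */
        i = i + 1;
        goto loop;
    done:
        return;
    }

At the LIR level the same loop would additionally commit each variable and temporary to a symbolic register (s1, s2, ...), so that only explicit load and store instructions touch memory; register allocation later maps those symbolic registers to machine registers.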

Static Single Assignment Form (SSA)

Static single assignment form, as its name hints, is a representation in which every variable (or symbolic register) is assigned a value only once.

This representation is easy for the compiler to construct, and it makes it easy to determine, for each variable, the single site at which it was assigned its value.

The key idea in this form is to use the non-deterministic selection function φ at points where more than one value may reach a variable.


Figure 5 - Static single assignment form

The above figure shows an example of the SSA representation of a LIR program. The SSA form given in (b) represents the LIR given in (a). Note the use of the non-deterministic selection function φ to choose between s21, the value assigned at the loop header, and s23, the value assigned in the loop body.
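Since the figure itself is not reproduced here, the following invented fragment illustrates the idea on ordinary C code; the SSA names and the φ-functions are shown as comments.

    /* Each C variable below is assigned several times; the comments give
       the corresponding SSA names, each of which is assigned exactly once. */
    int sum_to(int n) {
        int s = 0;                 /* s1 <- 0                              */
        int i = 1;                 /* i1 <- 1                              */
        while (i <= n) {           /* loop header:                         */
                                   /*   s2 <- phi(s1, s3)                  */
                                   /*   i2 <- phi(i1, i3)                  */
            s = s + i;             /*   s3 <- s2 + i2                      */
            i = i + 1;             /*   i3 <- i2 + 1                       */
        }
        return s;                  /* the value reaching here is s2        */
    }

Because every SSA name has exactly one assignment site, finding the definition of a value is immediate, and the φ-functions at join points are the only places where several reaching definitions have to be merged.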

Run-Time Support

Data representation and instructions

Register usage

Stack frames (activation records)

Parameter passing disciplines

Symbolic and polymorphic language support

The Importance of Optimizations

Using a non-optimizing one-pass compiler as described in Figure 1 will usually result in inefficient code compared to what more sophisticated compilers produce.

Figure 6 shows the code generated for the simple C program on an expression-by-expression basis. As can easily be seen, the non-optimized SPARC code uses 7 instructions, 5 of which are memory accesses. The non-optimized code does not take advantage of previously computed values (the sub-expression for c) or of efficient register allocation.

The optimized code is much more efficient and uses only 2 instructions. In terms of run-time improvement, the non-optimized code takes 10 cycles and the optimized code only 2 cycles!


Figure 6 - optimization benefits

In some compilers (or programming languages), it is common for the programmer to aid the compiler by giving optimization "hints". Examples of such hints are the with statement in Pascal, or the fact that loop index variables are not guaranteed to exist after the loop (which allows the compiler to keep the index in a register without storing its value back to memory when the loop ends). Optimizations based on programmer hints are no longer popular: modern programming styles encourage simple, readable code rather than optimized, hard-to-read code full of compiler hints or constructs chosen to aid the compiler.

Furthermore, as the complexity of programs constantly increases, it is unlikely that programmers will keep optimizing their programs manually.

Letting the compiler (rather than the programmer) work harder and apply optimizations to a "readable" program opens the way to simpler machines and to a more efficient use of the target machine (RISC, VLIW).

The modularity of modern compilers provides another motivation for a separate optimization phase. Modern compilers often generate non-optimized code in order to allow reuse of compiler components: if one component applied an optimization on its own, it might discard information needed by the components that follow. Furthermore, since the compiler is built from distinct modules, each handling a specific task, code generation tends to be local (similar to the expression-by-expression code generation shown in Figure 6).

Application Dependent Optimizations

Different applications/programs may require different optimizations. Often, one wants to develop an application dependent optimization that is targeted towards a specific class of applications.

For example, functional programs can be optimized by replacing (expensive) recursion with loops (tail-call elimination), as sketched below. Another common optimization for functional programs is replacing heap allocation with stack allocation (heap allocation and access are typically more expensive than their stack counterparts).
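A tail-recursive function can be rewritten by the compiler as a plain loop, removing the per-call stack frame overhead. The following invented C pair shows the idea:

    /* Tail-recursive form: conceptually, each call needs a new stack frame. */
    int sum_rec(int n, int acc) {
        if (n == 0)
            return acc;
        return sum_rec(n - 1, acc + n);  /* tail call: nothing happens after it */
    }

    /* After tail-call elimination: the recursion has become a loop. */
    int sum_loop(int n, int acc) {
        while (n != 0) {
            acc = acc + n;
            n   = n - 1;
        }
        return acc;
    }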

The performance of object-oriented programs can be improved by dead-member elimination (see, for example, [3]) and by replacing virtual functions with static ones (see, for example, [?]).

In addition to the above, specific optimizations can be applied on specific types of applications such as numeric code or database access code.

Mixed vs. Low Level Optimizers

The location of the optimization phase is another interesting issue. Should we place the optimizer before code generation, after code generation, or both? There are two common approaches to placing the optimization phase(s):

  1. The optimizer can be placed after code generation (a low-level optimizer). Low-level optimization is used, for example, by the HP PA-RISC and IBM PowerPC compilers. It is considered to yield more efficient code and to be conceptually simpler (at least for RISC targets). Since optimization is done at the lowest level (after code generation), programming-language-specific details are gone, and the same optimizer can be used for multiple programming languages.
  2. The optimizer can be divided into two phases, one preceding and one following code generation (see Figure 7). Such an optimizer performs both high-level (usually architecture-independent) and low-level (architecture-dependent) optimization, and is therefore called a mixed optimizer. Mixed optimizers are used, for example, by the Sun SPARC, DEC Alpha, SGI MIPS, and Intel 386 compilers. They are considered easier to port, since at least part of the optimizer is architecture independent; and since part of the optimization is performed at a high level, it can take advantage of high-level information to achieve more efficient compilation. Mixed optimization also suits CISC targets.

The approaches described above are shown in Figure 7.

In both approaches, the optimizer analyzes the intermediate code and tries to take advantage of the specific cases it finds. For example, consider a loop invariant - an expression inside a loop whose result is independent of the loop iteration. Such an expression can be extracted from the loop and evaluated only once. In a loop with a large number of iterations, repeatedly evaluating any expression, simple as it may be, can have a considerable effect on performance. Eliminating such redundant recomputation is only one example of an optimization that is both simple and effective.
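As a minimal sketch of the loop-invariant case just described (an invented example, in C):

    /* Before: x*y does not depend on i, yet it is recomputed every iteration. */
    void scale_before(int a[], int n, int x, int y) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * (x * y);
    }

    /* After loop-invariant code motion: the invariant expression is
       evaluated once, outside the loop. */
    void scale_after(int a[], int n, int x, int y) {
        int t = x * y;                     /* hoisted loop-invariant computation */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * t;
    }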

Figure 7 - location of the optimization phase

Translation by Preprocessing

Source-to-source translation of one programming language into another can produce a "cheap" compilation solution: one writes a translator and reuses an existing compiler for the target language.

For example, some programming languages are translated into C to be later compiled by a standard C compiler. This allows the language developer to write a translator rather than a whole compiler package.

Real-world examples are the elimination of includes by the C preprocessor, the translation of C++ to C (cfront), Haskell implementations that compile via C, the translation of Fortran into a "vector" Fortran program, and so on.

C is very convenient as an "intermediate language" because it supports a degree of indirect source-level debugging (for example, via the #line directive). Indirect source-level debugging allows showing the original source program, rather than the source of the C program produced by translating it.

By using compiler directives such as #line, we maintain a relationship between the translation (the C program) and the original source files. This relationship can later be used to identify which source-program statements correspond to which statements of the target (C) program.
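For example, a translator emitting C might precede each generated statement with a #line directive naming the original source file and line, so that the C compiler's diagnostics and the debugger refer back to the user's program. The file name and line numbers below are made up for illustration:

    #include <stdio.h>

    /* Hypothetical output of a source-to-source translator: each #line
       directive maps the following generated statement back to the
       original source file. */
    int main(void) {
    #line 12 "matrix.src"                    /* diagnostics here cite matrix.src:12 */
        int rows = 3;
    #line 13 "matrix.src"                    /* diagnostics here cite matrix.src:13 */
        printf("rows = %d\n", rows);
        return 0;
    }

An error in one of the generated lines is then reported against matrix.src rather than against the machine-produced C file, which is exactly the "indirect source level debugging" mentioned above.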

Data-Cache Optimizations (20)

Processors are getting faster all the time, and memory is lagging behind; the difference between main-memory and cache access times is dramatic. By causing data (or instructions) to be read from the cache rather than from main memory, one can significantly improve performance. Data-cache optimizations are usually most effective when applied to high-level intermediate code (or source code). Figure 8 shows the phases of an optimizing compiler in which data-cache optimization is applied to high-level code. The IBM PowerPC compiler instead uses a de-compiler to translate its low-level intermediate representation back into a higher-level one, and then performs data-cache optimization on that higher-level IR. This allows IBM to place data-cache optimization after the low-level IR generator, where the rest of the compiler's optimizations are applied. The IBM PowerPC compiler is sketched in Figure 9.
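A classic example of such a high-level transformation (an invented illustration, not one of the textbook's figures) is loop interchange: traversing a C array row by row instead of column by column makes consecutive accesses fall into the same cache lines.

    #define N 1024

    /* Column-major traversal of a C (row-major) array: consecutive accesses
       are N*sizeof(double) bytes apart, so most of them miss the cache. */
    void init_cols(double a[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* After loop interchange: consecutive accesses are adjacent in memory,
       so every cache line fetched from main memory is fully used. */
    void init_rows(double a[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }

Note that the loop nest is visible only in the source or high-level IR; after lowering to three-address code, recognizing and interchanging the loops becomes much harder, which is why such optimizations favor a high-level representation.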

Figure 8 - data cache optimization applied on high-level IR


Figure 9 - IBM PowerPC compiler

Placement of optimizations

It is clear that some optimizations depend on each other. One optimization can create optimization opportunities for another, but it can just as well destroy them.

We therefore want an ordering of the applied optimizations in which they combine to yield better-optimized code and do not clash with each other.


Figure 10 - placement of optimizations example

Figure 10 gives an example of optimization placement in a compiler. Optimizations are applied starting with the high-level optimizations and going down to the lower level ones.

Note that some optimizations are applied more than once (such as sparse conditional constant propagation).

Optimizations in Figure 10 are labeled according to the type of code to which they apply:

A - high-level optimizations that are applied on high-level IR or source code. Usually applied during early compilation stages.

B - medium-level or low-level optimizations.

D - usually applied on low-level IR.

E - performed at link time. Applied to object code.

Figure 11 and Figure 12 demonstrate the significance of combining optimizations, and significance of optimization order.

Figure 11 (a) shows a simple program after constant propagation.

Note that the evaluation path (marked with thick edges) is determined by constant propagation - the constant assigned by a ← 1 is propagated to the condition a = 1. However, constant propagation (usually) does not work on aggregate types, so the value assigned by B.x ← 1.0 is not propagated to the following conditional (B.x = 1.0).


Part (b) of the figure shows the same program after scalar replacement (applied after the first constant-propagation pass). Scalar replacement replaces the term B.x by the scalar c. Since c is now a scalar value, it is subject to a second pass of constant propagation, and the condition c = 1.0 is evaluated to true using the value assigned by c ← 1.0 in the previous block.
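The same interplay can be pictured on a small invented C fragment: constant propagation alone cannot see through the aggregate field, but once scalar replacement introduces a scalar copy, a second propagation pass can fold the condition.

    struct point { double x, y; };

    /* (1) Original: B.x is a field of an aggregate, so constant propagation
           typically does not track the value 1.0 into the comparison. */
    int before(void) {
        struct point B;
        B.x = 1.0;
        if (B.x == 1.0)            /* not folded: B.x is not a scalar */
            return 1;
        return 0;
    }

    /* (2) After scalar replacement: the field now lives in the scalar c ... */
    int after_scalar_replacement(void) {
        double c = 1.0;
        if (c == 1.0)              /* ... so a second constant-propagation     */
            return 1;              /*     pass folds the test to "always true" */
        return 0;
    }

This is one reason the compiler applies constant propagation more than once, as noted above.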