A Decompiler Project

Undergraduate Thesis

By

Mohsen Hariri

1. Introduction

2. Boomerang Project

3. Binary Translation

3.1 Front-end

3.2 SLED

3.3 SSL and RTL

3.4 CFG

3.5 Backends

4. Static Library Detection

4.1 Application

4.2 Library Files

4.3 Matching Methods

4.3.1 Raw Object/Library Matching

4.3.2 Signature Matching

4.4 Symbols Demangling

5. Terminology

5.1 Terms

5.2 Acronyms

6. Project Discussions

1. Introduction

A compiler is a program that takes as input a program written in a high level language and produces as output an executable program for a target machine; in other words, the input is language dependent and the output is machine dependent. A decompiler, or reverse compiler, attempts to perform the inverse process: given an executable program, the aim is to produce a high level language program that performs the same function as the executable program. The input in this case is machine dependent, and the output is language dependent.

Decompilation techniques were initially used to aid in the migration of programs from one platform to another. Since then, decompilation techniques have been used to aid in many other fields, such as the recovery of lost source code, debugging of programs, comprehending programs, recovery of high level views of programs, and worm and virus analysis.

In general, the decompilation problem is equivalent to the halting problem.

A naive approach to decompilation attempts to enumerate all valid phrases of an arbitrary attribute grammar, and then to perform a reverse match of these phrases to their original source code. An algorithm to solve this problem has been proved to be halting problem equivalent. A more sensible approach is to try to determine which addresses contain data and which ones contain instructions in the given binary program. Given that in a Von Neumann machine, data and instructions are represented in the same way in the computer memory, an algorithm that solves this data/instruction problem would also solve the halting problem, and that is impossible. This means that the decompilation problem belongs to the class of non-computable problems; it is equivalent to the halting problem, and is therefore only partially computable. In other words, we can build a decompiler which produces the right output for some input programs, but not for all input programs in general.

(from “A Methodology for Decompilation”, Cristina Cifuentes and K. John Gough)

Today, there are many generic disassemblers available to the public, but decompilers, as they are much harder to design and implement, are very few, and almost all of them are designed and programmed for a specific programming language and compiler. A disassembler is a program that reads an executable program and translates it into an equivalent assembler program; a decompiler goes a step further by translating the program into an equivalent high level language (such as C or Pascal) program.

As with most reverse engineering tools, disassemblers and decompilers are semi-automated tools rather than fully automated tools. In effect, decompilation is merely an extension of disassembly. If you can produce assembly language source for a program, then you can produce high-level language source for that program with more effort.

Figure 1 – A Decompilation System

2. Boomerang Project

The Boomerang project is an attempt to develop a real decompiler for machine code programs through the open source community. A decompiler takes as input an executable file, and attempts to create a high level, compilable, possibly even maintainable source file that does the same thing. It is therefore the opposite of a compiler, which takes a source file and makes an executable. However, a general decompiler does not attempt to reverse every action of the compiler; rather it transforms the input program repeatedly until the result is high level source code. It therefore won't recreate the original source file; probably nothing like it. It does not matter if the executable file has symbols or not, or was compiled from any particular language. (However, declarative languages like ML are not considered.)
The intent is to create a retargetable decompiler (i.e. one that can decompile different types of machine code files with modest effort, e.g. X86-windows, sparc-solaris, etc). It was also intended to be highly modular, so that different parts of the decompiler can be replaced with experimental modules. It was intended to eventually become interactive, a la IDA Pro, because some things (not just variable names and comments, though these are obviously very important) require expert intervention. Whether the interactivity belongs in the decompiler or in a separate tool remains unclear.
By transforming the semantics of individual instructions, and using powerful techniques such as Static Single Assignment dataflow analysis, Boomerang should be (largely) independent of the exact behavior of the compiler that happened to be used. Optimization should not affect the results. Hence, the goal is a general decompiler.

(From the Boomerang project homepage)

The main theory behind this project is taken from research on binary translation, mainly the UQBT project, which has been under development since 1995 with the support of Sun Microsystems Laboratories.

The Boomerang project started in 2002; at the time of writing, it can decompile (very) simple and small programs, generating the equivalent C language code.

One major limitation is that there is as yet no recognition of statically linked library functions (a la FLIRT of IDA Pro). This is a severe limitation for many Windows programs, most of which have large amounts of library code statically linked (as well as plenty that are dynamically linked). That means that until this is implemented, then without significant manual guidance, Boomerang can't even decompile its own test/windows/hello.exe.

(From the Boomerang project homepage)

In this project, I am going to implement static library detection for Boomerang. But first, I need to understand the parts of the project and get an overview of the system. So I will start by describing the parts that are already implemented, and then focus on my own part of the whole project.

3. Binary Translation

Binary translation is the process of automatically translating a binary executable program from one machine to another. This process normally involves different machines, Mi, different operating systems, OSi, and different binary-file formats, BFFi, in order to translate programs from (M1,OS1,BFF1) to (M2,OS2,BFF2).

Like a compiler, a binary translator can be loosely divided into front end, analyzer and optimizer, and back end. The front end decodes a source-machine binary file, produces RTLs, and then lifts the level of abstraction to HRTL (the high-level intermediate representation) by using knowledge of the source-machine calling conventions and instruction set. The analyzer and optimizer map from source-machine locations to target-machine locations, and it may apply other machine-specific optimizations to prepare for the back end. The back end translates the intermediate HRTL to target-machine instructions, and it writes a binary file in the required format.

Compilers and other tools have traditionally been called retargetable when they can support multiple target machines at low cost. By extension, we call a binary-analysis tool resourceable if it can analyze binaries from multiple source machines at low cost. Retargetability is supported in the UQBT framework through specifications of properties of machines and operating system conventions.

(From the UQBT project homepage)

Figure 2 - UQBT Abstract Architecture

Figure 3 - UQBT Detailed Architecture (2001)

3.1 The Front-end

The front end module deals with machine dependent features and produces a machine independent representation. It takes as input a binary program for a specific machine, loads it into virtual memory, parses it, and produces an intermediate representation of the program.

The loader is an operating system program that loads an executable program into memory, sets up the segment registers and the stack, and transfers control to the program. Note that executable files do not contain much information about which segments are used as data and which ones are used as code; data segments can contain code and/or addresses. The parser decides the type of machine instruction at a given memory location, and determines its operands and any offsets involved. The parsing of machine instructions is not as easy as it might appear. First of all, there are addressing modes that depend on the value of variables or registers at runtime. Second, indexed and indirect accesses to memory locations are difficult to resolve. Third, the complex machine instruction sets in today's machines utilize almost all combinations of bytes, and therefore it is very hard to determine if a given byte is an instruction or is data. Fourth, there is no difference as to how data and instructions are stored in memory in a Von Neumann machine. Finally, idioms are used by compiler writers to perform a function in the minimal number of machine cycles, and therefore a group of instructions will make sense only in a logical way, but not individually.

In order to determine which bytes of information are instructions and which ones are data, we start at the unique entry point to the program, given by the loader. This entry point must be the first instruction of the program, in order to begin execution. From there on, instructions are parsed sequentially, until the flow of control changes due to a branch, a procedure call, etc. In this case, the target location is like a new entry point to part of the program, and from there onwards, instructions can be parsed in the previous way. Once there are no more instructions to parse, due to an end of procedure or end of program, we return to where the branch of control occurred and continue parsing at that level. This method traverses all possible instruction paths. At the same time, data references are placed in a global or local symbol table, depending on where the data is stored (i.e. as an offset on the stack, or at a definite memory location).
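The traversal described above can be sketched as follows. This is a toy model with an invented instruction set (the Kind enum and Insn struct are illustrative, not Boomerang's types): starting at the entry point, every reachable address is marked as code, so bytes that are never reached remain classified as data.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction: Next falls through, Jmp/Call transfer control, Ret ends a path.
enum class Kind { Next, Jmp, Call, Ret };
struct Insn { Kind kind; std::size_t target; }; // target used by Jmp/Call only

// Recursive-traversal marking: start at the entry point, follow fall-through
// and branch targets, and return the set of addresses reachable as code.
std::set<std::size_t> markCode(const std::vector<Insn>& mem, std::size_t entry) {
    std::set<std::size_t> code;
    std::vector<std::size_t> worklist{entry};
    while (!worklist.empty()) {
        std::size_t pc = worklist.back();
        worklist.pop_back();
        while (pc < mem.size() && !code.count(pc)) {
            code.insert(pc);
            const Insn& i = mem[pc];
            if (i.kind == Kind::Ret) break;                  // end of this path
            if (i.kind == Kind::Jmp) { pc = i.target; continue; }
            if (i.kind == Kind::Call) worklist.push_back(i.target);
            ++pc;                                            // fall through
        }
    }
    return code;
}
```

Addresses never inserted into the returned set are exactly those the traversal could not reach, and a real parser would treat them as candidate data.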

A major problem is introduced by indexed and indirect memory accesses. Another is posed by idioms: an idiom is a sequence of instructions which forms a logical entity and has a meaning that cannot be derived by considering the primary meanings of the individual instructions.

To handle these, heuristic methods need to be implemented to determine as much information as possible; analytic methods, such as emulation, cannot provide the whole range of solutions anyway. In general, it is impossible to solve these types of problems as they are equivalent to solving the halting problem, as previously mentioned.

Different problems are introduced by self-modifying code and virus tricks. A way to tackle these cases is to flag the sections of code involved, and comment them in the final program. Assembler code might be all that can be produced in these cases. Even more, a suggested optimal algorithm for parsing consists of finding the maximum number of trees that contain instructions; this is a combinatorial method that has been proved to be NP-complete. For dense machine instruction sets, this algorithm does not solve the problem of data residing in code segments.

The intermediate code generator produces an intermediate representation of the program. It works closely together with the parser, invoking it to get the next instruction. Each machine instruction gets translated into an intermediate code instruction, such representation being machine and language independent.

Defined/used (du) chains of registers are also attached to the intermediate instruction; these are used later in the data flow analysis phase.

The quality of the intermediate code can be improved by an optimization stage that eliminates any redundant instructions, finds probable idioms, and replaces them by an appropriate intermediate instruction. Many idioms are machine dependent and reveal some of the semantics associated with the program at hand. Such idioms represent low level functions that are normally provided by the compiler at a higher level (e.g. multiplication and division of integers by powers of 2). Other idioms are machine independent and they reflect a shortcut used by the compiler writer in order to get faster code (i.e. fewer machine cycles for a given function), such as the addition and subtraction of long numbers. Some of these idioms are widely known in the compiler community, and should be coded into the decompiler.
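As a toy illustration of such an idiom, consider signed division by a power of two: compilers often emit an add of 2^n - 1 followed by an arithmetic shift right by n instead of a division instruction. A simplified recognizer over an invented three-address representation (not Boomerang's actual IR) might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy IR statement: an operation ("add", "sar", "div") applied to a register
// with a constant operand.
struct Stmt { std::string op; std::string reg; long k; };

// Simplified idiom pass: "add r, 2^n - 1" followed by "sar r, n" is the
// classic rounding sequence for signed division by 2^n; replace the pair
// with a single division statement.
std::vector<Stmt> foldDivIdiom(const std::vector<Stmt>& in) {
    std::vector<Stmt> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 1 < in.size() && in[i].op == "add" && in[i + 1].op == "sar"
            && in[i].reg == in[i + 1].reg
            && in[i].k == (1L << in[i + 1].k) - 1) {
            out.push_back({"div", in[i].reg, 1L << in[i + 1].k});
            ++i; // consume both instructions of the idiom
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

A real idiom analyzer works over the machine-independent intermediate representation, but the principle is the same: pattern-match a known instruction sequence and replace it with its higher-level meaning.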

(from “A Methodology for Decompilation”, Cristina Cifuentes and K.John Gough)

3.2 SLED

SLED (Specification Language for Encoding and Decoding) defines the mapping between symbolic, assembly-language, and binary representations of machine instructions.

The New Jersey Machine Code (NJMC) toolkit allows users to write machine descriptions of assembly instructions and their associated binary representations using the Specification Language for Encoding and Decoding (SLED). SLED provides for compact specifications of RISC and CISC machines; with 127, 193 and 460 lines of specification for the MIPS, SPARC and Pentium respectively. The toolkit also provides extra support for encoding of assembly to binary code, and decoding of binary to assembly code. For decoding purposes, the toolkit provides a matching statement which resembles the C switch statement. The toolkit generates C and Modula-3 code from matching statements, hence automating part of the disassembly process. The generated code can be integrated as a module of a binary-decoding application.

(From UQBT project documents)
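To picture what the generated decoding code looks like, here is a hand-written sketch in the style of the C code the toolkit emits from a matching statement. The 8-bit toy encoding (two opcode bits, six immediate bits) is invented purely for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 8-bit toy encoding: the top 2 bits select the operation,
// the remaining 6 bits are an immediate operand. A SLED matching statement
// compiles down to a similar cascade of bit-field tests.
std::string decode(std::uint8_t word) {
    unsigned op  = (word >> 6) & 0x3;  // bits 7..6: opcode field
    unsigned imm = word & 0x3F;        // bits 5..0: immediate field
    switch (op) {
        case 0:  return "add " + std::to_string(imm);
        case 1:  return "sub " + std::to_string(imm);
        case 2:  return "jmp " + std::to_string(imm);
        default: return "halt";
    }
}
```

The point of SLED is that such bit-field extraction and dispatch code is generated automatically from a declarative description, instead of being written by hand for each of the hundreds of instructions of a real machine.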

3.3 SSL and RTL

Register transfer lists (RTL) is an intermediate language that describes transfers of information between register-based instructions. RTL assumes an infinite number of registers, hence it is not constrained to a particular machine representation.

More recently, RTL has been used as an intermediate representation in different system tools such as the link-time optimizer OM, GNU's compilers, and the editing library EEL. In all these tools, RTL stands for register transfer language, and the representations vary widely.

(From UQBT project documents)

SSL (Syntax/Semantic Language, or Semantic Specification Language) has been developed in order to describe the semantics of machine instructions. In UQBT, SSL is defined in terms of RTLs. The syntax of SSL is defined in Extended Backus-Naur Form (EBNF).

HRTL is the result of applying transformations to RTLs in order to build a more abstract representation.
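As an illustration (this is informal notation, not actual SSL syntax), an integer add instruction such as SPARC's add %g1, %g2, %g3 might be described as a register transfer list along these lines:

```
ADD rs1, rs2, rd
    r[rd] := r[rs1] + r[rs2]
```

A flag-setting variant would additionally assign the condition-code registers, e.g. %ZF := (r[rd] = 0). HRTL then abstracts away such machine-level detail, for instance replacing explicit flag assignments and flag tests with high-level conditional expressions.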

3.4 CFG

A control flow graph (CFG) is a directed graph that represents the flow of control of a program; thus, it only represents the flow of instructions (code) of the program and excludes data information. The nodes of a CFG represent basic blocks of the program, and the edges represent the flow of control between nodes. A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end.
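The standard way to partition an instruction stream into basic blocks is to find the "leaders": the first instruction, every jump target, and every instruction following a jump each start a new block. This can be sketched as follows; the representation (one jump-target entry per instruction) is a toy stand-in for a real decoder's output:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy instruction stream: each entry holds the index that a jump at that
// position targets, or -1 if the instruction does not branch.
// Returns the set of leader indices; each leader begins a basic block that
// extends up to (but not including) the next leader.
std::set<std::size_t> findLeaders(const std::vector<long>& jumpTarget) {
    std::set<std::size_t> leaders;
    if (!jumpTarget.empty()) leaders.insert(0);       // entry instruction
    for (std::size_t i = 0; i < jumpTarget.size(); ++i) {
        if (jumpTarget[i] >= 0) {
            leaders.insert(static_cast<std::size_t>(jumpTarget[i])); // target
            if (i + 1 < jumpTarget.size()) leaders.insert(i + 1);    // fall-through
        }
    }
    return leaders;
}
```

Once the leaders are known, the CFG edges follow directly: each block gets an edge to its jump target and, for conditional branches, to the block that starts at the fall-through leader.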

Figure 4 - Data Structures to Represent a Binary Program

3.5 Backends

The back end generates the final output of the system. It can be an executable file for another machine architecture, or a high-level language representation of the input executable file.

4. Static Library Detection

4.1 Application

Executable files contain both user code and library code. Being able to distinguish them lets us generate simpler and smaller source files which are also more readable. Furthermore, many library functions are written in assembler and contain instructions and structures that are difficult or impossible to decompile. Programs are also more readable when calls to library functions use their symbolic names (e.g. "strcmp" rather than "proc003").

4.2 Library Files

There are many object and library file formats available, and there is some free software to read and use them. The most commonly used object file formats today are:

1. COFF

The Common Object File Format is used by Microsoft compilers. The standards for this format are not completely documented, but there are some free tools to read them.

2. OMF

Designed by Intel, and used by Microsoft before switching to COFF. Borland compilers used this format long after Microsoft. This format is also documented to some extent, but I have not seen any free tools to read it.

3. ELF

Used by GNU gcc, both for object files and executable files. The standards for this format are available, and freely accessible.

There are other formats, such as the relocatable a.out format used on BSD UNIX systems and IBM 360 objects, which I have not covered here.

The BFD library is the GNU library for dealing with object, library, and executable files. It supports a wide range of object file formats, such as COFF and ELF, but it does not support the OMF format. I am using the BFD library to read object files.

BFD provides a uniform API for working with object files, so using it will allow Boomerang to read any object file format supported by BFD. On the other hand, this method prevents us from using format-specific extra information which could be extracted if we were not using BFD.

4.3 Matching Methods

Matching a function from an object file consists of these steps:

  1. Reading the function's op-codes from the object file
  2. Flagging out the relocatable bytes
  3. Matching the remaining byte pattern against the executable sections
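The core of step 3 is a masked byte comparison: positions patched by the linker (relocations) are skipped, and the remaining bytes must match exactly. A simplified sketch, not Boomerang's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Byte pattern extracted from a library object file: mask[i] is false at
// positions that the linker patches (relocations), which must be ignored
// during comparison.
struct Pattern {
    std::vector<std::uint8_t> bytes;
    std::vector<bool> mask;
};

// Scan an executable section for the pattern, comparing only the bytes that
// are not subject to relocation; returns the match offset, or -1 if no match.
long findLibFunc(const std::vector<std::uint8_t>& section, const Pattern& p) {
    if (p.bytes.empty() || p.bytes.size() > section.size()) return -1;
    for (std::size_t off = 0; off + p.bytes.size() <= section.size(); ++off) {
        bool match = true;
        for (std::size_t i = 0; i < p.bytes.size(); ++i) {
            if (p.mask[i] && section[off + i] != p.bytes[i]) {
                match = false;
                break;
            }
        }
        if (match) return static_cast<long>(off);
    }
    return -1;
}
```

A production implementation would index patterns by their first non-relocated bytes to avoid rescanning the whole section once per library function, but the masked comparison above is the essential operation.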

When a match occurs, we will execute the following steps:

  1. Create a virtual procedure in Boomerang using Prog::newProc.
  2. Recover the prototype where possible: in C++, the parameters and return value are encoded and appended to the procedure name, so if we know the mangling scheme for the compiler, we can even extract the prototype of the function. For more information see section 4.4.
  3. Attach names to any symbols referenced by this procedure using BinaryFile::AddSymbol. For example: