Notes on Types in Programming Languages

Notes on Types in Programming Languages

by Geoffrey Smith and Jai Navlakha

Basic Notions

Atypeis a set ofvaluestogether withoperations.Some combinations don't make sense!

3+true

How should a programming language deal with such erroneous combinations?

Type System

Informally, a type system consists of (1) a mechanism to define types and associate them with certain language constructs, and (2) a set of rules for type equivalence, type compatibility, and type inference.

The constructs that must have types are precisely those that have values, or that can refer to objects that have values. These constructs include named constants, variables, record fields, parameters, literal constants, and more complicated expressions containing these. Sometimes, they may also include subroutines.

Type Equivalencerules determine when the types of two values are the same.

Type Compatibilityrules determine when a value of given type can be used in a given context.

Type Inference rules define the type of an expression based on the types of its constituent parts, or sometimes, the surrounding context.

In a language with polymorphic variables or parameters, it may be important to distinguish between the type of a reference or pointer, and the type of the object to which it refers; a given name may refer to objects of different types at different times.

Type-safe Language

A language is type-safe if it does not allow operations or conversions that violate the rules of the type system.

A well-behaved program is one which is written in a type-safe manner.

Untyped Languages

Example: assembly language

Everything is bits. Operations are applied to the bits of their operands, whether or not this makes any sense

Strongly Typed Languages

A language is said to be strongly typed if it prohibits, in a way that the language implementation can enforce, the application of any operation to any object that is not intended to support that operation. For example, Ada and Pascal are two (mostly) strongly typed languages. The loophole in Pascal is untagged variant records.

Weakly Typed Languages

If simple operations do not behave in a way that one would expect, a programming language can be said to be "weakly typed". For example, consider the following program:

x ="5"+6

Different languages will assign a different value to 'x':

One language might convert 6 to a string, and concatenate the two arguments to produce the string "56" (e.g.Javascript, VBscript)
Another language might convert "5" to a number, and add the two arguments to produce the number 11 (e.g.Perl, PHP)
Yet another language might convert the string "5" to a pointer representing where the string is stored within memory, and add 6 to that value to produce an address in memory (e.g.C)
And yet another language might simply fail to compile this program or run the code, saying that the two operands have incompatible type (e.g.Ruby, BASIC)

The main difference, roughly speaking, between a strongly typed language and a weakly typed one is that a weakly typed one makes conversions between unrelated types implicitly, while a strongly typed one typically disallows implicit conversions between unrelated types.C is a weakly typed static language. F# is a strongly typed static language.

Statically Typed Languages

A language is said to be statically typed if it is strongly typed and type checking can be performed at compile time. In the strictest sense of the term, few languages are statically types. In practice, the term is often applied to languages in which most type checking can be performed at compile time, and the rest can be performed at run time. For example, if an expression refers to a[i], then the type of a[i] cannot be determined at compile time because the value of i may be out-of-array-bounds. Ada and Pascal are mostly statically typed languages. Implementations of C rarely check anything at run time. C# is a statically typed language.

Dynamically Typed Languages

A language is called dynamically typed if most of the type checking is performed at run time. Examples: Lisp, Scheme, Smalltalk, PHP, your PCF interpreters.

Static Type Checking

Associate types with eachphraseof a program.

Havetyping rulesto specify which combinations are sensible.

Check programs at compile time to see whether they conform to the typing rules:

If programs includetype annotationsspecifying the types of variables and parameters, the process is calledtype checking.
If programs are written without such type annotations, the process is calledtype inference.

Examples: C++, Java, SML, F#

Advantages of Static Type Checking

Guaranteed type correctness, up to the strength of the typing rules.
No runtime overhead.

Disadvantages of Static Type Checking

Loss of flexibility: typing rules may forceoverspecializationof code.

For this reason, it is useful to havepolymorphismin the type system, so that programs can be used more flexibly.

Two major flavors of polymorphism:parametric polymorphism, as in SML and F#, andsubtype polymorphism, as in object-oriented languages.

Dynamic Type Checking

Tag eachvaluewith its type; check the tags at runtime.

Note that variables and programs do not have types!

Advantages of Dynamic Type Checking

Flexibility

For example, in a PCF functionfun x -> e, the value that gets passed forxcan have any type whatsoever, so long as no erroneous operations arise at runtime.

Disadvantages of Dynamic Type Checking

No guarantees of type correctness: a program can work successfully on many inputs, but may encounter a runtime type error on other inputs.

Example:fun b -> if b then 17 else pred true

Runtime overhead of checking tags.

An Example Static Type System

We consider the typing ofvariables,arrays, andpointersin C.

Here is an example of the sort of C code that we wish to type-check:

{ int a;

int *p;

int q[10];

a = 5;

p = &a;

*p = *p + 1;

p = q;

q = p;

}

There is one type error in this code; can you find it?

Thetypesthat we will use are defined as follows:

Data Types

t ::= int | float | char | void | t*

Phrase Types

p ::= t | t var

We need to type programs with respect to atype environmentE, which maps identifiers to phrase types.

Hence the form of atyping judgmentis

E |= e : p

which can be read as follows: "From type environmentE, it follows that programehas typep".

We now present ourtyping rulesfor C. As in the rules for natural semantics, many rules havehypothesesneeded to derive theconclusion.

(LIT) E |=n :int

E |=r : float

In these rules,nmust be an integer literal (like17) andrmust be a float literal (like3.14).

(ID) E(x) = p// Read this as: “If the type of an identifier x is p in the current type

------// environment E, then we can conclude that in E, x has type p.

E |=x : p

Next we have two rules for typingblocksthat declare new identifiers. The notationE[x : t1 var]denotes anupdated environmentwhich is the same asE, except that the type ofxist1 var.

(LETVAR) E[x : t1 var] |= e : t2// “In the environment E with type of x as t1 var, if you can infer

------// program e to be of type t2, then {t1 x; e} is “well-typed.”

E |= {t1 x; e} : void// “void” or “unit” means that the program is “well-typed.”

(LETARR) E |=e1 :int E[x : t1*] |= e2 : t2

------

E |= {t1 x[e1]; e2} : void

Notice that in rule(LETVAR),xis avariable, while in rule(LETARR),xis aconstant.

Thus in the example program above we get the following typing results:

a :intvar

p :int* var

q :int*

Now we consider the typing rules for manipulating variables. We first remark that a variablexactually hastwo different values---one value is theaddressof the memory cell wherexis stored, and the other value is thecontentsof that memory cell. These two values are known as theL-valueand theR-value, respectively. The reason for these names is shown by an assignment command likex = x+1. Whenxis used on theleft-hand sideof an assignment statement, the value that is needed is (usually) theL-value. And whenxis used on theright-hand sideof an assignment statement, the value that is needed is (usually) theR-value. (These rules hold only "usually", because C's pointer operations lead to exceptions. For instance, in the assignment*p = &q, notice that we actually need theR-valueofpand theL-valueofq!)

Here are the typing rules for variables:

(R-VAL) E |=e : t var

------

E |=e : t

(ASSIGN) E |=e1 : t var E |= e2 : t

------

E |= e1 = e2 : t

(ADDRESS) E |=e : t var

------

E |=e : t*

(L-VAL) E |=e : t*

------

E |= *e : t var

Next we have rules for ordinary addition, and the (dangerous!) rule for pointer arithmetic:

(ADD) E |=e1 :int E |= e2 : int

------

E |= e1+e2 :int

(ARITH) E |=e1 : t* E |= e2 : int

------

E |= e1+e2 : t*

Now comes the rule for array subscripting:

(SUBSCRIPT) E |=e1 : t* E |= e2 : int

------

E |=e1[e2] : t var

Recall that in C,e1[e2]is actuallysyntactic sugarfor*(e1+e2). Notice then that rule (SUBSCRIPT) can actually be derived from the other rules:

E |=e1 : t* E |= e2 : int

------(ARITH)

E |= e1+e2 : t*

------(L-VAL)

E |= *(e1+e2) : t var

Finally, we give rules for typing sequences of statements and while-loops:

(COMPOSE) E |=e1 : t1 E |= e2 : t2

------

E |= e1;e2 : void

(WHILE) E |=e1 :int E |= e2 : t

------

E |= while (e1) e2 : void

Using these typing rules, we can typecheck the example C code above. For example, here is the derivation for*p = *p + 1:

E(p) = int* var

------(ID)

E |=p :int* var

------(R-VAL)

E(p) = int* var E |= p : int*

------(ID) ------(L-VAL)

E |=p :int* var E |= *p : intvar

------(R-VAL) ------(R-VAL)

E |=p :int* E |= *p : int E |= 1 : int (LIT)

------(L-VAL) ------(ADD)

E |= *p :intvar E |= *p + 1 : int

------(ASSIGN)

E |= *p = *p + 1 :int

We find that the only illegal instruction isq = p.

Type Soundness

In any static type system, a fundamental question is this: what is guaranteed about the behavior of well-typed programs?

In C, we don't get good guarantees, since a well-typed program can still crash (or worse!) due to pointer errors.

In SML, in contrast, well-typed programs are guaranteed not to "go wrong".

We'll say a bit more about type soundness in Homework 6.

Subtyping and Coercion

In C, we have the rule

int <= float

together with the rule

(SUBSUMP) E |=e : p1 p1 <= p2

------

E |=e : p2

which lets uscoercefrom a subtype to a supertype.

The combination ofsubtypingandoverloadingcan lead to a tricky issue known ascoherence.

In Java, for example, we haveint <= Stringand we also have+overloaded on bothintandString:

(ADD) E |=e1 :int E |= e2 : int

------

E |= e1+e2 :int

(CONCAT) E |=e1 : String E |= e2 : String

------

E |= e1+e2 : String

But now consider the expression1+2in a context where an expression of typeStringis required. We can do the typing in two different ways:

1 :int 2 : int

------(ADD)

1+2 :int

------(SUBSUMP)

1+2 : String

and also

1 :int 2 : int

------(SUBSUMP)

1 : String 2 : String

------(CONCAT)

1+2 : String

The trouble is that these two typing derivations givedifferent results: the first gives"3"and the second gives"12".

A type system is said to satisfycoherenceif different typing derivations lead to the same semantic behavior. Without coherence, we are forced to know exactly when coercions will occur in order to predict the behavior of a program:

classCoher {

public static void main(String[] args) {

int x = 1;

int y = 2;

System.out.println(x + y + " is the sum");

System.out.println("The sum is " + x + y);

System.out.println("The sum is " + (x + y));

}

In contrast, C's overloading of+onintandfloatsatisfies coherence. Adding twoints and coercing the sum to afloatgives the same result as first coercing theints tofloats and then adding them. (Because of round-off errors, the two results might not beexactlyequal, however.)

More Sophisticated Type Systems

In the programming languages research community, there are continuing efforts to develop ever more sophisticated type systems, which aim to guarantee important correctness and security properties in software. We mention a couple of examples here.

Units of Measurement

In September 1999, NASA's Mars Climate Orbiter was lost while attempting to establish orbit around Mars. Subsequent investigation revealed that the root cause was a failure to convert thruster performance data fromEnglishunits tometricunits.

F#'s September 2008 CTP introduced support forunits of measurementchecking and inference. The basic idea is to introduce typing distinctions among differentfloatvalues. For example, we can declare

[<Measure>] type s

[<Measure>] type m

to represent seconds and meters. Now we can annotate values with units:

> let gravity = 9.81<m/s^2>;;

val gravity : float<m/s ^ 2>

Using a little calculus, we can calculate the impact speed if we jump from a certain height:

> let impactspeed (height : float<m>) = sqrt (2.0 * gravity * height);;

valimpactspeed : float<m> -> float<m/s>

> impactspeed 10.0<m>;;

val it : float<m/s> = 14.0

Now suppose that we don't like the metric system. We can declare

[<Measure>] type f

[<Measure>] type h

[<Measure>] type mile

letf_per_m = 3.28084<f/m>

lets_per_h = 3600.0<s/h>

letf_per_mile = 5280.0<f/mile>

Now we can define

> let englishspeed (height : float<f>) =

impactspeed (height / f_per_m) * f_per_m * s_per_h / f_per_mile;;

valenglishspeed : float<f> -> float<mile/h>

> englishspeed (10.0<m> * f_per_m);;

val it : float<mile/h> = 31.31710909

Notice what happens if we botch the definition:

> let englishspeed (height : float<f>) =

impactspeed (height / f_per_m) * f_per_m * s_per_h * f_per_mile;;

valenglishspeed : float<f> -> float<f ^ 2/(h mile)>

Secure Information Flow

Another example is the use of type systems to enforcesecure information flow. The idea is that it would be useful to classify the variables in a program into differentsecurity levelsbased on the confidentiality (or integrity) of the information they contain. For instance, we could let levelHbe high-security, private information (such as a patient's medical history) and levelLbe low-security, public information (such as a patient's address). Now the concern is that information shouldflowonly in ways that respect these security levels. Specifically, we should allow informationto flow fromLtoL, fromHtoH, or fromLtoH; but we must not allow information to flow fromHtoL, as this would be aleakof confidential information.

For example, ifsecretisHandleakisL, then of course the assignment

leak = secret;

is illegal. But there are more subtle leaks possible, as in the following program, which demonstrates what is called animplicit flow:

if ((secret % 2) == 0)

leak = 0;

else

leak = 1;

This program copies the last bit ofsecrettoleak.

It turns out that it is possible to devise type systems with the property that every well-typed program is guaranteed to leaknoinformation fromHtoL. If you wish to learn more about such type systems, you might want to look at my tutorial paper,Principles of Secure Information Flow Analysis.