Chapter -6: Data Types

Introduction

•A data type defines a collection of data objects and a set of predefined operations on those objects

•A descriptor is the collection of the attributes of a variable. In an implementation, a descriptor is a collection of memory cells that store variable attributes. If the attributes are all static, descriptors are required only at compile time.

•These descriptors are built by the compiler, as a part of the symbol table, and are used during compilation.

•For dynamic attributes, part or all of the descriptor must be maintained during execution. In this case, descriptor is used by the run-time system.

•In all cases, descriptors are used for type checking and to build the code for the allocation and deallocation operations.

•An object represents an instance of a user-defined (abstract data) type

•One design issue for all data types: What operations are defined and how are they specified?

Primitive Data Types

•Almost all programming languages provide a set of primitive data types

•Primitive data types: Those not defined in terms of other data types

•Some primitive data types are merely reflections of the hardware. (E.g. integer types)

•Others require little non-hardware support for their implementation.

•The primitive data types are used with one or more type constructors, to provide the structured types.

Primitive Data Types: Integer

•Almost always an exact reflection of the hardware so the mapping is trivial

•There may be as many as eight different integer types in a language

•Java’s signed integer sizes: byte, short, int, long

•C++, C#, include unsigned integer types ( without sign)

•A signed integer, value is represented in a computer by a string of bits, the left most one represents sign.

•A negative integer could be stored in sign-magnitude notation, in which the sign bit is set to indicate negative and the reminder bits represent the absolute value of the number.

•Most computers now use a twos complement notation to store negative integers (take logical complement of the positive version of the number and adding one.

•E.g. -2 11111110  2= 00000010

Com=11111101

1

11111110

L most bit = -128

Add the rest of bits= 2+4+8+16+32+64=126

Subtract= 128-126=-2

Primitive Data Types: Floating Point

•Model real numbers, but only as approximations. Floating-point values are represented as fractions and exponents.

•Languages for scientific use supports at least two floating-point types (e.g., float(4 bytes) and double(8 bytes).

•The collection of values that can be represented as a floating-point type is defined in terms of precision and range.

•Precision is the accuracy of the fractional part of a value measured as number of bits.

•Range is a combination of the range of fractions and range of exponents.

•Usually exactly like the hardware (i.e. language implementers use whatever representation is supported by the hardware), but not always

•Most newer machines use the IEEE Floating-Point

Standard 754 format

Primitive Data Types: Decimal

•Most large computers that are designed for business applications (money) have hardware support for decimal data types

–Essential to COBOL

–C# offers a decimal data type

•Store a fixed number of decimal digits with the decimal point at a fixed position in the value

•Decimal types are stored using binary codes for the decimal digits (B CD).

•In some cases, they are stored one digit per byte, but in others they are packed two digits per byte. Either way, they take more storage than binary representations. It takes at least 4 bits to code a decimal digit. E.g. 7=0111, 9=1001, 3=0011 etc. to store 6-digit coded decimal number requires 24 bits of memory. Operations on decimal values are done in hardware on machines that have such capabilities; otherwise they are simulated in software.

•Advantage: accuracy (being able to precisely store decimal values)

•Disadvantages: limited range (exponents are not allowed), wastes memory

Primitive Data Types: Boolean

•Simplest of all types

•Range of values: two elements, one for “true” and one for “false”

•C98, exceptions in which numeric expressions are used as conditionals. All operands with non-zero values are considered true, and zero is false.

•C99, C++ have Boolean type. They also allow numeric expression to be used as if they were Boolean.

•Java, C# not allowed.

•Boolean types are used to represent switches or flags in programs.

•Could be implemented as bits, but often as bytes

–Advantage: readability (more readable than using integer)

Primitive Data Types: Character

•Stored as numeric codings

•Most commonly used coding: ASCII (8-bit code): 128 characters, Extended ASCII: 256 characters. Ada uses Extended

•An alternative, 16-bit coding: Unicode

–Includes characters from most natural languages

–Originally used in Java

–C# and JavaScript also support Unicode

Character String Types

•Values are sequences of characters

•Design issues:

–Is it a primitive type or just a special kind of array?

–Should the length of strings be static or dynamic?

Character String Types Operations

•If strings are not defined as a primitive type, string data is usually stored in arrays of single characters, and referenced as such in the language (C, C++).

•C, C++ use char arrays to store character strings and provide a collection of string operations through standard library whose leader file is string.h

•Character strings are determined with a special character, null, represent zero.

•Char *str=”apples”; : str is a char pointer set to point at the string of characters, apples0 where 0 is the null character. This initialization of str is legal because character string literals are represented by char pointers, rather than the string itself.

•Typical operations:

–Assignment and copying. In C, C++

–Comparison (=, >, etc.)

–Catenation

–Substring reference

–Pattern matching

Character String Type in Certain Languages

•C and C++ (C++ supports strings through its standard class library String, also support array of characters)

–Not primitive

–Use char arrays and a library of functions that provide operations header file string.h

–Most commonly used library functions for character strings C, C++ are (stecpy, strcmp, strlen, strcat)

•SNOBOL4 (a string manipulation language)

–Primitive

–Many operations, including elaborate pattern matching

•Java

–Primitive via the String class

•Fortran95: treats strings as a primitive type and provides assignment, relational operators, catenation, and substring reference operations of them (slices).

Character String Length Options

•There are several design choices regarding the length of string values.

•Static length string: the length can be static and set when the string is created. COBOL, Java’s String class. Another Java class called stringBufferclassof changeable values.

•Limited Dynamic Length: allowing strings to have varying length up to a declared and fixed maximum set by the variable’s definition. Such string variables can store any number of characters between zero and the maximum. C and C++

–In C-based language, a special character is used to indicate the end of a string’s characters, rather than maintaining the length

•Dynamic length strings: allow strings to have varying length with no maximum. This option requires the overhead of dynamic storage allocation and deallocation but provides maximum flexibility. SNOBOL4, Perl, JavaScript

•Ada supports all three string length options

Character String Type Evaluation

•Aid to writability. Dealing with strings as arrays is more difficult than dealing with primitive string type.

•As a primitive type with static length, they are inexpensive to provide. Providing strings through a standard library is nearly as convenient as having them as a primitive type.

•Dynamic length is nice and flexible, but is it worth the expense? The overhead of their implementation must be weighted against that additional flexibility.

Character String Implementation

•Static length: compile-time descriptor need only during compilation. Has 3 fields. See figure-1

•Limited dynamic length: may need a run-time descriptor for length to store both the fixed maximum length and the current length. See figure-2 (but not in C and C++ because the end of string is marked with null character. Do not need max length, because index values n array references are not range-checked in these languages). Static and dynamic length strings require no special dynamic storage allocation.

•Dynamic length: need run-time descriptor; allocation/deallocation is the biggest implementation problem which requires a complex storage management. Length and storage to which it is bound grow and shrink dynamically.

•Type approaches to support dynamic allocation/ deallocation

-Strings can be stored in a linked list: drawbacks are extra storage occupied by the links, and the complexity of string operations.

-Or store complete strings in adjacent storage cells. Problem: when a string grows and the adjacent space is not available. Solution is to find another hole that fits the new string, and deallocate the previous hole.

-Although linked-list method requires more storage, the dynamic allocation process is simple, but some string operations are slow due to pointer chasing (sequential access)

-Using adjacent memory for complete strings results in faster string operations and required significantly less storage. But the allocation/ deallocation process is slower.

Compile- and Run-Time Descriptors

User-Defined Ordinal Types

•An ordinal type is one in which the range of possible values can be easily associated with the set of positive integers

•Examples of primitive ordinal types in Java

–Integer, char, Boolean

•In some languages, users can define two kinds of ordinal types: (enumerated and subrange).

Enumeration Types

•All possible values, which are named constants, are provided in the definition

•C# example

enum days {mon, tue, wed, thu, fri, sat, sun};

the enumeration constants are typically implicitly assigned the integer values, 0, 1,… etc.

•Design issues

–Is an enumeration constant allowed to appear in more than one type definition, and if so, how is the type of an occurrence of that constant checked?

–Are enumeration values coerced to integer?

–Are any other type coerced to an enumeration type?

–All these design issues are related to type checking.

•If an enumeration variable is coerced to a numeric type, there is little control over the range of legal operations or its range of values.

•If an int type value is coerced to an enumeration type, an enumeration type variable could be assigned any integer value, whether it represented an enumeration constant or not.

•Design

-If a language does not have enumeration types, we could simulated it

  • e.g. Fortran77: INTEGER RED, BLUE

DATA RED, BLUE /0, 1/

The problem here: since we did not define a type for our colors, there is no type checking when they are used. E.g. it would be legal to add the two together.

Also they could be combined with any other numeric type operand with any arithmetic operator. Also, because they are just variables, they could be assigned any integer value, destroying the relationship with the colors, although to solve this latter issue we could make them named constants.

  • C, Pascal: 1st include enumeration data type. C++ includes C’s enumeration type.

C++ could have enum colors {red, blue, green, yellow, black};

Colors mycolor= blue; youcolor=red;

Enumeration values are coerced to int when they are put in integer context. E.g. if current value of mycolor is blue, the statement mycolor++ would assign green to mycolor.

  • C++ allows enumeration constants to be assigned to variables of any numeric type, though that would most often be an error

- No other type value is coerced to an enumeration type in C++, mycolor=4; is legal, R.H.S sould be cast to C++ enumeration constants can appear in only one enumeration type in the same referencing environment.

  • Ada, enumeration literals are allowed to appear in more than one declaration in the same referencing environment. These are called overloaded literals.

- The rules for solving overloading must be determined from the context. E.g. if an overloaded literal and an enumeration variable are compared, the literal’s type is resolved to be that of the variable.

- Because neither the enumeration literals nor the enumeration variables in Ada are coerced to integer, both the range of operations and the range of values of enumeration types are restricted, allowing many errors to be compiler-detected.

Evaluation of Enumerated Type

•Aid to readability, e.g., no need to code a color as a number. Named values are easily recognized.

•Aid to reliability, e.g., compiler can check: in C#, Ada, Java0.5

–Operations (don’t allow colors to be added). No arithmetic operations are legal on enumeration types.

–No enumeration variable can be assigned a value outside its defined range

–Ada, C#, and Java 5.0 provide better support for enumeration than C++ because enumeration type variables in these languages are not coerced into integer types

•C treats enumeration variables like integer variables.

•C++ numeric values can be assigned to enumeration type variables only if they are cast to type of the assigned variable.

Subrange Types

•An ordered contiguous subsequence of an ordinal type

–Example: 12..18 is a subrange of integer type. Introduced in Pascal, included in Ada.

•Ada’s design

-In Ada,subranges are included in the category of types called subtypes.

-In Pascal

Type

strIndex=0..mastrLength;

var

I: strIndex;

type Days is (mon, tue, wed, thu, fri, sat, sun);

subtype Weekdays is Days range mon..fri;

subtype Index is Integer range 1..100;

- all operations defined for the parent type are also defined for the subtype, except assignment of values outside the specified range. E.g. in the following:

Day1: Days;

Day2: Weekday;

Day2 := Day1;

-The assignment is legal unless the value of Day1 is sat to sun.

-The compiler must generate range-checking code for every assignment to subrange variable sub ranges require run-time range checking.

Subrange Evaluation

•Aid to readability

–Make it clear to the readers that variables of subrange can store only certain range of values

•Reliability

–Assigning a value to a subrange variable that is outside the specified range is detected as an error

Implementation of User-Defined Ordinal Types

•Enumeration types are implemented as integers

•Subrange types are implemented like the parent types with code inserted (by the compiler) to restrict assignments to subrange variables

Array Types

•An array is an aggregate of homogeneous data elements in which an individual element is identified by its position in the aggregate, relative to the first element.

•A reference of an array element in a program includes one or more non-constant subscripts. Such references require a run-time calculation to determine the memory location being referenced.

Array Design Issues

•What types are legal for subscripts?

•Are subscripting expressions in element references range checked?

•When are subscript ranges bound?

•When does allocation take place?

•What is the maximum number of subscripts?

•Can array objects be initialized?

•Are any kind of slices allowed?

Array Indexing

•Specific element of an array is referenced by aggregate name, and subscripts or indexes.

•Indexing (or subscripting) is a mapping from indices to elements

array_name (index_value_list)  an element

•Index Syntax

–FORTRAN, PL/I, Ada use parentheses

•Ada explicitly uses parentheses to show uniformity between array references and function calls because both are mapping. E.g. sum:=sum+B(I);

•When need another information to determine whether B(I) is a function call or an array reference. Reduce readability.

–Most other languages use brackets

–Two district types are involved in an array type: element type, and type of subscripts.

Arrays Index (Subscript) Types

•FORTRAN, C: integer only

•Pascal: any ordinal type (integer, Boolean, char, enumeration)

•Ada: integer or enumeration (includes Boolean and char)

•Java: integer types only

•C, C++, Perl, and Fortran do not specify range checking of subscripts

•Java, ML, C# specify range checking

•Ada checks the range of all subscripts, but this feature can be disabled by the programmer.

Subscript Binding and Array Categories

•The binding of the subscript type to an array is usually static, but the subscript value ranges are sometimes dynamically bound.

•Lower bound of the subscription range, in some languages, is implicit. E.g. C-based fixed to zero, Fortran it default to 1, Pascal subscript ranges must be specified by the programmer.

•There are five categories of arrays, based on the binding to subscript value ranges and the binding to storage.

•Static: subscript ranges are statically bound and storage allocation is static (before run-time)

–Advantage: efficiency (no dynamic allocation)

•Fixed stack-dynamic: subscript ranges are statically bound, but the allocation is done at declaration elaboration time during execution

–Advantage: space efficiency. A large array in one subprogram can use the same space as alarge array in a different subprogram, as long as both subprograms are not at the same time.

•Stack-dynamic: subscript ranges are dynamically bound and the storage allocation is dynamic (done at run-time). Once the subscript range is bound and the storage is allocated, they remain fixed during the lifetime of the variable.

–Advantage: flexibility (the size of an array need not be known until the array is to be used)

•Fixed heap-dynamic: similar to fixed stack-dynamic: subscript range and the storage binding are dynamic but fixed after allocation. The differences are that the bindings are done when the user program requests them, rather than at elaboration time, and the storage s allocated from the heap, rather than the stack.

•Heap-dynamic: binding of subscript ranges and storage allocation is dynamic and can change any number of times during the array’s lifetime.

–Advantage: flexibility (arrays can grow or shrink during program execution)

•Examples of the categories:

  • C and C++ arrays that include static modifier are static
  • C and C++ arrays without static modifier are fixed stack-dynamic
  • Ada arrays can be stack-dynamic. E.g.: