Raquel Prototype (2009)

Implementation Documentation

July 2009

Contents

Contents

Figures

1) System Overview

2) Raquel Module Definitions

3) Input Layer Implementation

3.1) Tokenisation Implementation Details

3.2) Compactor Implementation Details

3.3) Recursion

3.4) Parser Implementation Details

4) Raquel Tokens

4.1) Scalar Data Types

5) Execution of Raquel Parse Trees

5.1) Internal Data Representation

5.2) Execution

5.3) Execution Example

6) Source Code Layout

References

Figures

Figure 1 – Arrangement of the Prototype Raquel System

Figure 2 – LexTokens from Tokeniser (excluding Initialiser and Terminator as stripped off by Compactor)

Figure 3 – Tokens after they are chained together by the Compactor

Figure 4 – Creating a Raquel Token from a Chain List containing One LexToken

Figure 5 – Creating a Raquel Token from a Chain List with multiple LexTokens

Figure 6 – Two Main Token types (and the base RaquelToken)

Figure 7 – Hierarchy of Process Tokens

Figure 8 – Parse Tree for a Raquel Statement

Figure 9 – Raquel Source Code Hierarchy for the 2009 Prototype

Figure 10 – Raquel File List – Top Level

Figure 10.1 – External Interfaces Directory

Figure 10.1.1 – STX (Stack) Directory

Figure 10.1.2 – TabDelimitedStack Directory

Figure 10.2 – Hasher Directory

Figure 10.3 – Relations Directory

Figure 10.4 – Scalars Directory

Figure 10.4.1 – Numeric Types Directory

Figure 10.5 – Tokens Directory

Figure 10.5.1 – Data Tokens Directory

Figure 10.5.2 – Process Tokens Directory

Figure 10.5.2.1 – Assignment Tokens Directory

Figure 10.5.2.1.1 – Operator Tokens Directory

Figure 10.5.2.1.1.1 – Join Tokens Directory

1) System Overview

Implementation of the current system was based on the RAQUEL design principle that it must be of a flexible, modular construction. The aim of the modular approach is to allow DBMS administrators to configure RAQUEL with only the layers and modules required for specific applications. Additionally, the modularity needs to allow new types to be added to the Raquel system easily, without the need to change existing code. Keeping the system modular in this manner allows multiple developers to work on the Raquel system, each developing their own Raquel “types” for use within the DBMS without encroaching upon any other developer's work.

The aim of this current prototype system (2009) was to develop and implement the following layers of the Raquel system:

  • Input Layer

The Input layer is responsible for taking a Raquel query as an input string and processing it into a parse tree for use by the rest of the Raquel layers and system. The parse tree consists of a number of Raquel Tokens that describe the data or the instructions to be performed. The input layer is also responsible for a first-pass syntactic check of the incoming statement, to ensure that it meets some basic rules (some information cannot be determined at this level and is checked later in the execution process).

  • Execution Layer

This layer is responsible for taking a Raquel Parse Tree (the query) and “executing” it: that is, taking the parse tree and using the tokens stored within it to get data, make comparisons and so forth, in order to satisfy the query.

  • Storage Layer

This layer acts as an interface between the Raquel language and the physical stores of data. The Raquel language deals with data manipulation (at a high level of abstraction) in the form of inserts, retrieves, deletes and modifications. This level of abstraction needs to be converted down to the concrete commands that perform these operations on a given storage stack.

  • Storage Stacks

The Storage Stacks are the physical stores of data used by the Raquel system. These are external to the DBMS application and are accessed via a specific Storage Stack Interface. The Storage Stacks used by Raquel could be any combination of stores, including other databases, STX, text files, binary data and so forth.

  • Meta DB (Scaffold)

The Meta DB is itself a database that describes the databases on the system. For this prototype an example Meta DB was developed in scaffolding / placeholder form to test some of the features of the Raquel language. It is effectively a mirror of the storage layer and storage stacks, and in this example uses a tab-delimited text file as its physical storage mechanism.

  • Driver Layer / Interface

This layer is external to the DBMS system and has been coined the Driver Layer. It is responsible for passing the Raquel query string to the DBMS Application for processing, and also handles any error messages that are returned from the DBMS Application. Errors are described in numeric terms: an identifier for the layer / module that generated the error, plus a specific error number. The Driver Layer is responsible for converting this error sequence into a human-readable form for the user.

Three demo examples were developed to test Raquel; these were:

  • Linux – Teaching Tool
  • Win32 – Technical Development / Debugging Tool
  • Win32 – Financial Application Test

2) Raquel Module Definitions

As described above, the Raquel system consists of a number of layers that define the DBMS Application. The logical arrangement of these layers for the prototype is shown below in Figure 1.

Figure 1 – Arrangement of the Prototype Raquel System

As can be seen in Figure 1, there are some additional helper modules that are not layer-specific but are utilised by the entire DBMS Application. These are described below:

  • Hasher

All strings / text within the Raquel system are converted into integer hash values to allow fast comparison of string types (a minimal sketch of this idea follows this list).

  • Scalar Module

Responsible for handling the Raquel Types (float, integer, truth, date, time, and so forth)

  • Token Module

Responsible for handling the Raquel Tokens within the system
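
Returning to the Hasher: the actual HashStringNC implementation (in Hasher.h, see Section 3.2) is not reproduced in this document, so the case-insensitive FNV-1a hash below is only an assumption of how such a string-to-ID conversion could look:

#include <cctype>
#include <string>

// Hypothetical stand-in for the Hasher module's HashStringNC:
// a case-insensitive FNV-1a hash (assuming 32-bit unsigned int),
// so that e.g. "Insert" and "INSERT" map to the same numeric ID.
unsigned int HashStringNC( const std::string& szString )
{
    unsigned int uiHash = 2166136261u;                 // FNV offset basis
    for( unsigned int i = 0; i < szString.size(); i++ )
    {
        unsigned char c = (unsigned char)std::tolower( (unsigned char)szString[ i ] );
        uiHash = ( uiHash ^ c ) * 16777619u;           // FNV-1a step
    }
    return uiHash;
}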

3) Input Layer Implementation

To convert an input Raquel Query String (that has come from the Driver / Interface layer) into a parse tree, a number of sub-steps must be performed. These sub-steps are:

  • Tokenisation

The Tokeniser breaks the Raquel query string (linear text) into a list of Word Tokens (list of individual words), removes white space and also performs a basic balance check on brackets within the statements [Livingstone 09].

  • Compaction

The compactor takes the list of word tokens from the Tokeniser and attempts to match them into meaningful Raquel tokens.

  • Parsing

The list of Raquel tokens from the compactor is then parsed and formed into a Raquel Parse Tree. This is the internal form that Raquel queries take within the system.

The input layer also operates recursively to expand more complex statements with bracketed parameter or literal blocks. When a set of brackets (or braces) is found within a Raquel query, the entire content of the bracketed block is recursively passed through the Tokeniser, Compactor and Parser; the result (a sub-parse tree, literal or parameters) is then handled accordingly by the token that will act upon that data.

3.1) Tokenisation Implementation Details

For this implementation of the Raquel prototype the Tokeniser was based on an existing Tokeniser implementation [Livingstone 09b]. Slight changes have been made to this model, as it was discovered that information determined during the tokenisation phase was being “thrown away” and then recomputed during compaction.

When the linear text query is decomposed into individual word tokens, the Tokeniser already has some idea of the meaning of each token; for example, it knows if a specific word is a “string” (denoted in a Raquel statement by matching single quotes) or a number. To make tokenisation more efficient, instead of throwing this information away, the Tokeniser was modified to produce a std::vector of structures (defined as LexTokens) instead of a list of raw strings. A std::vector was used in place of the std::list as it is more efficient here, given that items are only ever appended to the end of the sequence. The structure for a LexToken is defined below:

struct SLexToken
{
    // Lex Token String
    std::string   szToken;

    // Token Type
    ELexTokenType eType;

    // Token Start Position
    int           iTokenStart;

    // Token End Position
    int           iTokenEnd;
};

The LexToken contains the original word token string as before, but additional known information is recorded about the token: its type (if known during tokenisation), its start position in the string and its end position in the string. The different token types (at a high level) can be seen below:

enum ELexTokenType
{
    eLEXTOKEN_NONE = -1,
    eLEXTOKEN_STRING,
    eLEXTOKEN_KEYWORD,
    eLEXTOKEN_NUMERIC,
    eLEXTOKEN_OPERATOR,
    eLEXTOKEN_PARAMETER,
    eLEXTOKEN_LITERAL,
    eLEXTOKEN_SQUAREBRACE_OPEN,
    eLEXTOKEN_SQUAREBRACE_CLOSE,
    eLEXTOKEN_PARENTHESIS_LEFT,
    eLEXTOKEN_PARENTHESIS_RIGHT,
    eLEXTOKEN_INITIALISER,        // tab, tab
    eLEXTOKEN_TERMINATOR          // RR
};

It should be noted, though, that not all of these types are assigned at the tokenisation stage; some are used later, as more detailed information about the Raquel query is determined. During tokenisation the token types that can be set are:

  • Initialiser
  • Terminator
  • Keyword
  • Operator
  • String
  • Numeric
  • Literal
  • Parenthesis (left and right)
  • SquareBraces (left and right)

The remaining types are used later in the Raquel system to fill in more information as it is determined. The result of tokenisation is a std::vector of LexTokens. If the vector is empty then an error has occurred during tokenisation, and the Driver Layer will interrogate the error codes to relay this information to the user. If tokenisation is successful then the vector will contain n LexTokens that represent the original linear string Raquel query. This vector of LexTokens is then passed to the Compactor for compaction.

The Tokeniser also stores the start and end positions of words in the string as this information is used by the compactor to help determine syntactical breaks.
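
As an illustration, tokenising the fragment Sno <-- 666 might yield a vector along the following lines (this relies on the SLexToken and ELexTokenType definitions above; the exact types and positions assigned are assumptions, since Figure 2 is not reproduced in this text):

#include <vector>

std::vector<SLexToken> ExampleTokenise()
{
    std::vector<SLexToken> LexTokens;

    // "Sno" occupies characters 0..2 of the query string; it is
    // treated as a keyword here and resolved further by the Compactor
    SLexToken tName = { "Sno", eLEXTOKEN_KEYWORD, 0, 2 };
    LexTokens.push_back( tName );

    // "<--" is the assignment operator at characters 4..6
    SLexToken tOp = { "<--", eLEXTOKEN_OPERATOR, 4, 6 };
    LexTokens.push_back( tOp );

    // "666" is a numeric token at characters 8..10
    SLexToken tNum = { "666", eLEXTOKEN_NUMERIC, 8, 10 };
    LexTokens.push_back( tNum );

    return LexTokens;
}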

3.2) Compactor Implementation Details

The compactor has the purpose of taking the results from the Tokeniser (in our case, the LexTokens) and converting them into Raquel Tokens [Livingstone 09c]. The Raquel Tokens are used as input by the Parser. The compactor also carries out further syntactical checking of the Raquel statement.

The first pass of the Compactor implementation attempts to determine more information about the LexTokens that have been passed to it from the Tokeniser. It builds up a “query cache” for use by the Compactor, as shown below:

for( unsigned int i = 0; i < LexTokens.size(); i++ )
{
    // Try Cache Statement
    CRaquelDBMS::Get()->ResolveRelationType( LexTokens[ i ].szToken );
}

The Compaction function looks at the LexTokens and tries to determine if they are one of the following types:

  • Attribute Name
  • Scalar Type
  • Scalar Function
  • Source
  • Sink
  • Stored

If the type can be resolved then the LexToken is updated. After this stage the compactor attempts to recognise specific Raquel keywords by hashing the LexToken name into a numerical ID.

The compactor examines each LexToken in turn and checks its status. If it is a standalone token then it will create a new Raquel Token (of its specified type) and add it to a std::vector of Raquel Tokens. If it is not standalone and it meets one of the following requirements, then the compactor will chain the tokens together (see the sketch after this list):

  • Start of next LexToken is adjacent to end of current LexToken
  • Current LexToken is a parameter or Literal
  • Current LexToken is a square open or close brace
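
A minimal sketch of these chaining conditions, using the iTokenStart / iTokenEnd fields recorded by the Tokeniser (the helper name ShouldChain is hypothetical):

// Hypothetical helper: decide whether the next LexToken should be
// chained onto the current one.
bool ShouldChain( const SLexToken& tCurrent, const SLexToken& tNext )
{
    // Next token starts immediately after the current one ends
    // (no whitespace between them in the original query string)
    if( tNext.iTokenStart == tCurrent.iTokenEnd + 1 )
        return true;

    // Parameters and literals are chained onto the token they follow
    if( tCurrent.eType == eLEXTOKEN_PARAMETER ||
        tCurrent.eType == eLEXTOKEN_LITERAL )
        return true;

    // Square braces are likewise chained
    if( tCurrent.eType == eLEXTOKEN_SQUAREBRACE_OPEN ||
        tCurrent.eType == eLEXTOKEN_SQUAREBRACE_CLOSE )
        return true;

    return false;
}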

When it chains multiple tokens together, a string is formed from the token names, which is then re-hashed into a numeric ID. This ID is checked against the hashed IDs of all valid Raquel Tokens included in the system. If a match is made then a new Raquel Token is generated and added to the Raquel Token vector. The used tokens are removed from the LexTokens and searching continues until all the LexTokens have been consumed.

There is utility code included (an application in Win32 and source in Linux) in the distribution to allow the generation of Hash Codes for use within the Raquel system.

An example would be the Raquel Statement:

Supplier <--Insert{ Price <-- 10.0; SupplierName <-- 'Widget'; Sno <-- 666 }

The Tokeniser would have generated a vector of tokens (the Initialiser and Terminator tokens are stripped off by the Compactor), as shown below in Figure 2:

Figure 2 – LexTokens from Tokeniser (excluding Initialiser and Terminator as stripped off by Compactor)

The compactor would then have chained the tokens into separate std::vectors, as shown below in Figure 3:

Figure 3 – Tokens after they are chained together by the Compactor

These individual chain lists are then examined by the Token Handler to see if they match any of the Raquel Token definitions (see Section 4). To do this, each chain list is passed into the CreateChainedToken function, which is defined in the Input Layer (InputStack.cpp).

Chain List 0 contains only one LexToken, therefore the function CreateSingleToken is called in InputStack.cpp. The function takes a single LexToken reference as input and uses its szToken property as input to the Hasher module's (Hasher.h) HashStringNC function, which returns a numeric ID for that string; in this case, 1696897498. That numeric ID is then passed to the Token Module and CreateRaquelToken is called, to see if it can match the ID against any valid registered Tokens within the Raquel system.
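
A sketch of this matching step, under the assumption that registered tokens are kept in a map from hash ID to a factory function (the registry and factory below are hypothetical; only HashStringNC and CreateRaquelToken are named above):

#include <map>

class CRaquelToken;                                // base token type (Section 4)
typedef CRaquelToken* (*TokenFactory)();

// Hypothetical registry populated at start-up: each valid Raquel
// Token registers the hash ID of its match string here.
std::map<unsigned int, TokenFactory> g_RegisteredTokens;

CRaquelToken* CreateRaquelToken( unsigned int uiHashID )
{
    std::map<unsigned int, TokenFactory>::const_iterator it =
        g_RegisteredTokens.find( uiHashID );

    if( it == g_RegisteredTokens.end() )
        return 0;                                  // no registered token matches

    return it->second();                           // create a new token instance
}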

Figure 4 – Creating a Raquel Token from a Chain List containing One LexToken

If it can find a match then it creates a new Raquel Token, otherwise it tries to further define what the LexToken represents as one of the following:

  • Stored Relation
  • Source
  • Sink
  • Attribute Name
  • Scalar Type
  • Scalar Function
  • Parenthesis Left or Right
  • Square Brace Left or Right
  • Initialiser or Terminator
  • Numeric
  • Literal
  • Parameter

If the type cannot be determined then an error is produced and compaction stops; otherwise a new Raquel Token is created and added to the list of Raquel Tokens.

For Chain List 1, because there are multiple LexTokens in the vector, the vector is iterated and each token name (szToken on the LexToken) is concatenated onto a new local string called szToken. If the token type is a Literal or Parameter then the LexToken text is replaced in the concatenated string by “{Literal}” or “[Parameter]” respectively. This allows the Raquel query to be matched against registered tokens, as each registered token contains its own definition of how it should match. For instance, the Insert token has an internal definition of “Insert{Literal}”, stored as a string and a hash value. It is this hash value that we attempt to pattern match.
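
A minimal sketch of that string-building step, assuming the chain list is a std::vector of LexTokens (the function name BuildChainName is hypothetical):

#include <string>
#include <vector>

// Hypothetical helper: build the match string for a chain list,
// e.g. the chain for Insert{ ... } becomes "Insert{Literal}",
// which is then hashed and compared against registered tokens.
std::string BuildChainName( const std::vector<SLexToken>& ChainList )
{
    std::string szToken;
    for( unsigned int i = 0; i < ChainList.size(); i++ )
    {
        if( ChainList[ i ].eType == eLEXTOKEN_LITERAL )
            szToken += "{Literal}";
        else if( ChainList[ i ].eType == eLEXTOKEN_PARAMETER )
            szToken += "[Parameter]";
        else
            szToken += ChainList[ i ].szToken;
    }
    return szToken;
}

The resulting string is then hashed with HashStringNC and matched via CreateRaquelToken, exactly as in the single-token case.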

Figure 5 – Creating a Raquel Token from a Chain List with multiple LexTokens

In this example the hash value matches the Insert{Literal} token and a new Insert token is created (with its associated data) and added to the list of Raquel Tokens.

When the compactor has made a match on a chain list it then removes the matched tokens from the query vector and continues the same process until the query vector is empty (all LexTokens have been processed). Any errors during this phase are added to the error vector.
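
Combining the sketches above, the overall shape of the compaction loop might look as follows (a deliberate simplification: the real Compactor also handles single tokens, type resolution and error reporting, and the Compact function below is hypothetical):

#include <vector>

// Hypothetical outline: chain adjacent LexTokens, match each chain
// against the registered Raquel Tokens, and remove consumed
// LexTokens until the query vector is empty.
void Compact( std::vector<SLexToken>& LexTokens,
              std::vector<CRaquelToken*>& RaquelTokens )
{
    while( !LexTokens.empty() )
    {
        // Gather the next chain list from the front of the vector
        unsigned int uiUsed = 1;
        while( uiUsed < LexTokens.size() &&
               ShouldChain( LexTokens[ uiUsed - 1 ], LexTokens[ uiUsed ] ) )
            uiUsed++;

        std::vector<SLexToken> ChainList( LexTokens.begin(),
                                          LexTokens.begin() + uiUsed );

        // Hash the chain's match string and look for a registered token
        CRaquelToken* pToken =
            CreateRaquelToken( HashStringNC( BuildChainName( ChainList ) ) );
        if( !pToken )
            return;                   // record an error and stop compaction

        RaquelTokens.push_back( pToken );

        // Remove the consumed LexTokens and continue
        LexTokens.erase( LexTokens.begin(), LexTokens.begin() + uiUsed );
    }
}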

3.3) Recursion

It should also be noted that the above example is relatively simple; there are cases where the Raquel query contains sub-statements that need to be recursively expanded. A more complex recursive query is shown below:

s <--Retrieve Part Join[Sno] Supplier Restrict[Sno < 1 Or ((Sno >1) And (Price < 10))]

It can be seen from this query that the Restrict keyword has a parameter block containing a complex logical statement. When the compactor matches the “Restrict[Parameter]” string and creates a token, it looks at the parameter block and searches for any braces, brackets or square brackets. If it finds any, it knows that the statement needs expanding, and therefore takes the entire contents of the parameter block, in this case “Sno < 1 Or ((Sno >1) And (Price < 10))”, and calls the Tokeniser, Compactor and Parser again. This creates a sub-tree which is attached to the Restrict token. Recursion can happen as many times as required for a complex statement, although this concept still needs to be fully tested.
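
A sketch of that brace test and the recursive call (NeedsExpansion and ProcessStatement are hypothetical names; the real logic lives in the Input Layer):

#include <string>

// Hypothetical helper: a parameter or literal block needs recursive
// expansion if it contains any nested brackets, braces or square braces.
bool NeedsExpansion( const std::string& szBlock )
{
    return szBlock.find_first_of( "(){}[]" ) != std::string::npos;
}

// On matching Restrict[Parameter], the Compactor would then do
// something along these lines, with ProcessStatement standing in
// for a fresh Tokenise / Compact / Parse pass over the block:
//
//     if( NeedsExpansion( szParameterBlock ) )
//         pRestrictToken->AttachSubTree( ProcessStatement( szParameterBlock ) );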

The same situation occurs for Literals too, as the following example shows:

Supplier <--Insert{{Price <-- 10.0; SupplierName <-- 'Widget'; Sno <-- 666} {'widget++'; 12; 9.99}}

Once the Compactor has matched “Insert{Literal}” and created a new Raquel Insert Token, and before it adds the literal data to the token, it examines the entire contents of the literal statement for extra braces ({ or }). If it finds extra braces then it knows that the entire contents of the literal require recursive calling of the Tokeniser, Compactor and Parser. The resulting data is then attached to the new Insert token as appropriate.

3.4) Parser Implementation Details

The parser takes the list of Raquel Tokens from the Compactor and converts those into a Parse Tree as per the logical design [Livingstone et al. 08]. The parser is part of the Input Stack and is defined in InputStack.cpp as CreateParseTree. The parser uses the vector of Raquel Tokens produced by the compactor to generate the Raquel Parse Tree (RPT).

The parser first checks that the input vector of Raquel Tokens is of valid size, that is, greater than zero (if not, it reports an error and ceases processing). This check applies to the highest level of input statements and also to literal values; however, some Raquel commands can have empty parameter blocks. Once the vector has passed the size test, it is iterated over to look for brackets in the statement (indicating that some recursion is required). If brackets are found then the matching open and close brackets are located, and the entire statement within them is passed again to the CreateParseTree function. The contents of the brackets are then replaced with the root of a sub-parse tree.

When all sub-statements have been recursively examined and sub-parse trees generated and inserted into the vector, the parser performs its tree-building process, as per the logical design. Each Raquel Token has a pointer to its parent token and also to its left and right children, and these are set by the parser to create the tree. The function returns the root node of the tree.
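
A minimal sketch of the pointer layout described above (the member and method names are assumptions; only the parent / left child / right child structure is given by the design):

class CRaquelToken
{
public:
    CRaquelToken() : m_pParent( 0 ), m_pLeft( 0 ), m_pRight( 0 ) {}

    // Attach a child on the left, keeping the parent link consistent
    void SetLeft( CRaquelToken* pChild )
    {
        m_pLeft = pChild;
        if( pChild )
            pChild->m_pParent = this;
    }

    // Attach a child on the right, keeping the parent link consistent
    void SetRight( CRaquelToken* pChild )
    {
        m_pRight = pChild;
        if( pChild )
            pChild->m_pParent = this;
    }

private:
    CRaquelToken* m_pParent;   // parent token in the parse tree
    CRaquelToken* m_pLeft;     // left child (e.g. the assignment target)
    CRaquelToken* m_pRight;    // right child (e.g. the expression source)
};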