Structuring Data Bases from UML to Code

SQL Lightweight Tutoring Module – Semantic Analysis of SQL Queries based on XML Representation and LINQ

Robert Dollinger

Computing and New Media Technologies Department,

University of Wisconsin Stevens Point,

Stevens Point, Wisconsin, USA,

Abstract: The SQL Lightweight Tutoring Module (SQL-LTM) is capable to provide semantic feedback on otherwise syntactically correct SQL statements pointing out their logic flaws. The target SQL statement provided by the learning student is converted to an intermediate XML representation, which is analyzed and compared to a representation of a test SQL statement by using LINQ to XML. Leveraging the power of LINQ allows a simple and flexible implementation of the analyzer, though powerful and functionally robust. Two main strategies are used in the analyses process: (1) apply transformation patterns to bring the target queries to equivalent representations that can then be compared to the test query representation (2) use the LINQ to XML query capabilities to extract and compare specific query metrics/invariants (list of tables, list of join connections, etc) from the target and test SQL statements. Most conceptual errors SQL learners typically make can be detected in this way.

1. Background and Related Work

Intelligent tutoring systems (ITS) have been an area of active research for quite a long time, according to some sources as early as 1960s (VanLehn 2006). The objective of this research was to produce software artifacts capable to complement or even replace the human expert in the act of teaching. Numerous problem domains have been targeted (mathematics, algorithms, physics, programming languages, etc), and systems of various complexities have been built. Some of the classic research efforts and systems pioneering ITS research include: Algebra Cognitive Tutor (Anderson, Corbett, Koedinger & Pelletier, 1995), Andes (VanLehn et al., 2002; VanLehn et al., 2005), and SQL-Tutor (Mitrovic, 1998a; Mitrovic, 1998b; Mitrovic, Ohlsson, 1999; Martin, Mitrovic, 2002; Mitrovic 2003). Ideally, an ITS is expected to provide a quite challenging range of functionality: provide intelligent feedback to student input, capability to evaluate student progress and adapt instruction to the student‘s needs. Building an ITS is an extremely complex task involving many aspects and requiring solutions from several areas of Artificial Intelligence, most of which are still under research (knowledge representation, natural language processing, belief systems, etc.). Therefore, having robust and flexible ITSs that satisfy most user expectations may still be years away. In spite of all these difficulties there are many ITSs that are quite effective in practice even when only a part of the functionality of an ideal ITS is supported.

With the SQL Lightweight Tutoring Module (SQL-LTM), described in this paper, we take a practical approach based on the recognition of the fact that one can build a quite flexible and surprisingly effective tool with a reasonable design and implementation effort. SQL-LTM is capable to provide semantic feedback on otherwise syntactically correct SQL statements, detecting semantic errors in the query and making recommendations of how to rewrite the query correctly. The approach we take is quite simple: SQL-LTM generates an intermediate XML representation of the target SQL statement provided by a student, which is then analyzed and compared with the representation of a test SQL statement by using Microsoft’s LINQ to XML. Leveraging the power of LINQ allows for a simple and flexible implementation of the analyzer, though powerful and functionally robust.

The SQL-LTM is integrated with an earlier developed Web based AJAX universal query tool called AEQ (AJAX Enabled Query) (Ford, Dollinger 2008; Dollinger, Ford, Helf, Reimer 2008; Dollinger, Bernau, Nadeau, 2009; Dollinger, Ford, Helf, Reimer 2009). AEQ provides basic SQL client functionality, as well as a set of functions for classroom support, course content management and delivery, cooperative work and other.

The remainder of this paper is organized as follows: section 2 places SQL-LTM in the context by comparing its objectives and features with the characteristics of two well known ITS systems for SQL: SQL-Tutor and Acharya; section 3 presents the architecture of SQL-LTM and details the main concepts upon which the implementation of the system is based; section 4 provides some highlights about the context in which the system is used, its integration with the AEQ system, as well as ideas for further developments.

2. SQL Tutor, Acharya, and SQL-LTM

The two classic examples of systems most cited in the literature that seem to be closest to our SQL-LTM in terms what it aims to offer are SQL Tutor (Mitrovic, Ohlsson, 1999; Martin, Mitrovic, 2002; Mitrovic 2003) and Acharya (Bhagat et al., 2002). There are a few similarities between the two systems and ours, but there are also significant differences.

First, all three systems, SQL Tutor, Acharya and SQL-LTM, target the same problem domain, and have the same purpose, that is, assist students learning SQL by detecting the errors in their queries and provide intelligent feedback that guides them in the learning process. Second, all systems use a version, or a representation of, a “correct” solution for the given query as a reference, which is provided by a human expert (instructor). Although, this is in contrast with the typical architecture of an ITS, it appears to be an obvious choice in this scenario since it eliminates the need for a domain module and goes around the likely insurmountable difficulties of building an SQL problem solver, not to mention the problems related to a viable Natural Language Processing (NLP) component that would handle the queries expressed in natural language. In our case this decision makes even more sense considering the integration of the SLQ-LTM with the AEQ tool where part of the standard functionality instructors have access to is to prepare and manage their course materials, thus providing the correct solutions for their exercise queries is considered a normal expectation.

On the other hand there are several differences between the three systems. SQL Tutor is an independent, stand-alone system with no database in the background. SQL Tutor detects both syntactic and semantic errors and provides detailed feedback based on a Constraint Based Reasoning engine where the domain knowledge is represented as a set of constraints. SQL Tutor has come a long way over the years and evolved from an initial system with 199 rules to a powerful and sophisticated tool with about 700 rules as of 2009 (Mitrovic,1998b; Mitrovic et all, 2008). Acharya and SQL-LTM are based on a back-end database that actually processes the queries and provide the return data to be viewed by the student when the input query is syntactically and semantically correct. Otherwise, the appropriate feedback is provided so that the student can fix his/her query. With SQL-LTM we take this approach even further in the sense that the tutoring module activates only if the input query is validated as syntactically correct by the back-end database itself, in an attempt to identify the possible semantic errors in the query. This decision significantly curves down the computational effort in the reasoning engine, and leads to an interactive process where the student is perfecting and refining the query in an incremental manner. In the first phase the syntax errors are eliminated based on the feedback that comes directly from the database. In the second phase, the student receives a record-set as the result to his/her queries, but since syntactically correct queries can still be wrong from a logical point of view, SQL-LTM will return its own elaborate semantic feedback pointing out the errors in the logic of the query, if any.

In terms of the reasoning engine, Acharya uses a three step approach: (1) pre-processing – that applies certain simple transformations on the input queries, (2) atom processing – to process variations within atomic predicates, and (3) truth table processing – to determine if the predicates in the WHERE conditions of the target and the test queries are equivalent. SQL-LTM uses a two step reasoning approach based on the XML representation of the SQL queries and the power of the LINQ to XML technology. The first step is to leverage the LINQ to XML capabilities to apply transformation patterns on the input query to bring it to a “normalized” version, e.g. nested queries that can be transformed into basic join queries. The second step compares the test and the target query representations based on specific query metrics that LINQ to XML queries extract from their XML representations.

As far as the user interfaces are concerned, both SQL Tutor and Acharya have Web based interfaces that allow SQL queries to be input in a constrained format with separate input boxes for each clause (SELECT-FROM-WHERE-GROUP BY-HAVING-ORDER BY) of the SQL query. This seems to be limiting the types of queries the student is allowed to input; e.g. it is unclear if sub-query syntax is supported. By contrast, SQL-LTM is conceived as a plug-in to the AEQ (Ford, Dollinger 2008; Dollinger, Ford, Helf, Reimer 2008; Dollinger, Bernau, Nadeau, 2009; Dollinger, Ford, Helf, Reimer 2009) environment, which supports arbitrary query and database script input in a way similar to those of the vendor provided database clients.

3. SQL-LTM Architecture

Essentially, SQL-LTM consists of two modules: a query parser and an analyzer. The two modules are centered and leverage two important technologies to achieve their goal: XML (eXtended Markup Language) for the internal representation of the SQL queries to be analyzed, and LINQ (Language INtegrated Query), essentially LINQ to XML, as the tool to transform and analyze the XML representation of the SQL queries.

3.1. The Parser

The parser converts the SQL query into an XML representation that captures as much as possible of the information on the semantics of the initial query: structure, result columns, tables, join connections etc. On occasion additional information may be added in the representation to help guide the analyzer; e.g. the category in which the query can be included like: standard join query, negation query, correlated query, division query or other.

Providing a representation that captures all the information the analyzer needs to correctly interpret the query is critical. The XML notation is general enough and a semantically rich representation that can capture the essentials of the SQL queries, and provides a structured representation that can be processed effectively through a variety of languages, tools and APIs: XPath, XQuery, XSLT, LINQ and other.

Query: “Consumers with name Joe having at least one request”
SQL Statement / XML representation
SELECT DISTINCT C.Name,
City
FROM Tb_Consumer C,
Tb_Requests R
WHERE C.Con_ID = R.Con_ID
AND Name=’Joe’ / <Query>
<Select Distinct =”Yes”>
<Columns>
<Column name="Name"
prefix="C" />
<Column name="City"
prefix="C" />
</Columns>
</Select>
<From>
<Tables>
<Table name="Tb_Consumer"
alias="C" />
<Table name="Tb_Requests"
alias="R" />
</Tables>
</From>
<Where> / <And>
<Equal>
<Column name="Con_ID"
prefix="C" />
<Column name="Con_ID"
prefix="R" />
</Equal>
</And>
<And>
<Equal>
<Column name="Name"
prefix="C" />
<String>'Joe'</String>
</Equal>
</And>
</Where>
</Query>
Table 1. Sample SQL Query and Equivalent XML Representation

Both the target or the reference SQL query provided by the instructor, and the test SQL query, the students are inputting, are converted to an equivalent XML representation the analyzer module can then effectively process.The root node for an SQL statement representing a query is <Query>, and contains one child node for each and every SQL clause that appears in the query: <Select>, <From>, <Where>, etc. In addition to that, the <Query> element at the root of the representation can be attributed (annotated) with information that is external to the query itself, but indicates the type of the query or the context in which a solution to it is to be given. The value of this attribute would typically be provided by the instructor while preparing the teaching materials, and represents a hint about the kind of query the XML represents, and the expected SQL technique for solving it. This is because the same query can be solved in different ways by using different techniques presented in different chapters of a course. The instructor may choose the same sample query to be solved on different occasions by using different approaches and SQL techniques. For example the query: “Provide the names of the suppliers not having an offer”, can be solved with set operators by using EXCEPT or MINUS, or with an approach that uses a sub-query introduced by NOT IN. The additional query annotation, indicating the approach to take, will help guide the analyses process. The <Query> opening tag may contain an optional Type attribute with values like: Join – for simple join queries, Nested – for queries containing simple sub-queries, Correlated – for correlated sub-queries, Division - for constructs specific to division queries, and other. This type information is useful for guiding the analyzer in the process of selecting and sequencing the transformation patterns in order to bring the representation of the student’s SQL statement to a form that is equivalent to the one expected by the instructor.

Table 1 shows a sample SQL query and its XML representation.Please note that the parser recognizes the two aliases C and R, associated to the Tb_Consumer and Tb_Requests tables, then it infers from the database catalog that the City and Name (in the WHERE clause) columns belong to the Tb_Consumer table and associates the correct prefixes to the these columns.

3.2. The Analyzer

The analyzer performs a comparative analysis of the test query provided by the student against the reference query created by the instructor by using their XML representation. The analysis is based on a set of heuristics that are applied in two processing phases:

-phase one applies a number of transformation patterns on the test query’s XML representation in order to bring it to an equivalent form that is as close as possible to the reference query representation;

-phase two is where the test and the reference query representations are actually compared based on a set of specific metrics, and semantic errors are detected and reported.

The analyzer uses the power and flexibility of Microsoft’s new LINQ technology (Rattz, 2007), specifically LINQ to XML, to perform its tasks in both phases of the analyses process. LINQ to XML has powerful query and construction features for XML content by which, using a powerful and compact syntax, one can express complex XML transformations, and create sophisticated queries to extract and evaluate relevant information from the XML data. This information is extracted in the form of generic sequences (IEnumerable<T>) of various types. The nature of the generic sequences and the way the LINQ operators work helps abstracting out a wide range of variations across equivalent SQL queries due to reordering of tables, columns, predicates and other query features.

3.2.1. Transformation Patterns

One of the appealing features of XML is the ease with which one can transform one representation into another one while preserving some sort of semantic equivalence. Each XML transformation can be viewed as a combination of query and construction and there are several tools and technologies by which such a task can be performed: XSLT, XQuery, LINQ to XML just to name a few. In our case, the transformation patterns are used to detect query equivalences. Same query may have different SQL solutions when using different resolution techniques, as illustrated in table 2.

Query: “Cities with at least two suppliers”
Self-Join approach / Sub-Query approach / Group By approach
SELECT DISTINCT City
FROM Tb_Supplier S1,
Tb_Supplier S2
WHERE S1.City=S2.City
AND S1.Supp_ID>S2.Supp_ID / SELECT S.City
FROM Tb_Supplier S
WHERE S.City IN
(SELECT City
FROM Tb_Supplier
WHERE Supp_ID>S.Supp_ID / SELECT City
FROM Tb_Supplier
GROUP BY City
HAVING COUNT(*)>=2
Table 2. Equivalent query solutions

Students may often provide a syntactically and semantically correct SQL solution which is structured differently from the one provided by the instructor, due for example to the fact that the student used a different technique to solve the query then the one expected by the instructor for the current lesson. In such cases the analyzer will recognize that the query is correct, but will also provide some recommendation to the student on how to come up with the expected solution.

Recognizing the semantic equivalence of queries that have different structures is something that has not been attempted in SQL-Tutor or Acharya, or in any other system we are aware of, due to the elusive nature and complexity of the task. Our approach is favored by 3 key factors: (1) the XML representation capable to retain semantics while transformed from one form to the other, (2) the availability of effective XML transformation tools and languages like LINQ, and (3) the use of instructor provided annotation that guide the transformation process. Guiding the transformation process refers to what transformation patterns to apply and in what order. The transformations to apply and their sequencing are decided based on a set of heuristics and expert problem domain knowledge. We use expert domain knowledge, result of almost 2 decades of teaching SQL, to create the transformation patterns (essentially LINQ code), and heuristics to indentify which pattern to apply in a given situation. These heuristics are using the instructor provided annotations relative to the query type. In general, several successive transformation patterns may be applied, or the same pattern can be applied recursively before moving to the comparative analysis phase.