CSE333 Final Project Report 12/16/2005

Final Report

for

CSE 333 – Distributed Component Systems

Instructor

Prof. Steven Demurjian

Department of Computer Science and Engineering

University of Connecticut

October 2005


TABLE OF CONTENTS

LIST OF FIGURES

CHAPTER

1 Introduction
1.1 Background
1.2 Scope
1.3 The Common Thread
2 Semantic Web
2.1 Background
2.2 Semantic Networks
2.3 RDF: the Resource Description Framework
2.3.1 Practical RDF
2.3.2 The Dublin Core
2.4 Building a Semantic Web Server
2.5 Conclusions
3 Software Component Retrieval
3.1 Background
3.2 Overview of Architecture
3.3 XML-based Software Component Specification
3.4 Match Definitions
3.4.1 Exact Match
3.4.2 Generalization Match
3.4.3 Specialization Match
3.4.4 Partial Match
3.4.5 Reference Match
3.5 Similarity Measurement
3.5.1 Component Similarity
3.5.2 Method Similarity
3.5.3 Pre-condition Similarity
3.5.4 Post-condition Similarity
3.5.5 Input Similarity
3.5.6 Output Similarity
3.6 Illustration
3.7 Adoption of RDF
3.8 Future Work
4 OpenDocument Exchange
4.1 Introduction
4.2 Format Specifications
4.2.1 Meta.xml
4.2.2 Settings.xml
4.2.3 Styles.xml
4.2.4 Content.xml
4.2.5 Pictures folders
4.2.6 Thumbnail.png
4.2.7 Configurations2
4.2.8 Manifest.xml
4.3 Testing Portability
4.3.1 Simple Text Document
4.3.2 Advanced Text Document
4.3.3 Simple Spreadsheet Document
4.3.4 Advanced Spreadsheet Document
4.4 Conclusions
5 Distributed Warehouse Systems
5.1 Data Warehouse Systems
5.1.1 Overview
5.1.2 Metadata Interoperability in the Data Warehouse Environment
5.1.3 An Intelligent Search by applying the Semantic Web
5.2 Persons of Interest Tracking
5.3 Conclusion
6 References

LIST OF FIGURES

FIGURE

1 Semantic Networks
2 RDF Representation
3 RDF Gateway Tool
4 RDFQL
4b Ontology cardinality
5 Overview architecture of the software component retrieval system
6 XML-based software component specification with individual conditions
7 XML-based software component specification with Boolean conditions
8 XML-based software component specification with nested Boolean conditions
9 XML-based software component specification with nested conditions
10 XML-based software component specification with an “isa” relation
11 Example query and result from the retrieval process
12 Example meta.xml file
13 Example settings.xml file
14 Example styles.xml file
15 Example content.xml file
16 Text document 1 opened in OpenOffice
17 Text document 1 opened in StarOffice
18 Text document 1 opened in KOffice
19 Text document 2 opened in OpenOffice
20 Text document 2 opened in StarOffice
21 Text document 2 opened in KOffice
22 Simple spreadsheet in OpenOffice
23 Simple spreadsheet in StarOffice
24 Simple spreadsheet in KOffice
25 Advanced spreadsheet in OpenOffice
26 Advanced spreadsheet in StarOffice
27 Advanced spreadsheet in KOffice
28 Metadata bridge problem
29 A central proprietary metadata repository
30 Common Warehouse Metamodel construction
31 A central CWM metadata repository
32 Full data searching in a data warehouse
33 Proprietary metadata searching
34 Searching an RDF repository mapped from a CWM repository
35 Persons of interest search


1 Introduction

1.1 Background

The emergence of the Internet and distributed computing has brought new challenges to the software development community. Today, a team of developers can work together on a project while located in various parts of the world. In addition to being scattered, they may also use different types of computers, operating systems, or software. This has created a pressing need for program and component interoperability.

This interoperability needs to be supported on many levels. This includes creating a standard model that defines how to generate data, how to organize it, and also a standard way to interchange this data between components. Such a format would enable programs to save files using a strict structure, allowing for simple interchange and remote interpretation of the file among a range of machines and programs.

A standardized format called XML has won over the software community and has gained the recognition of the World Wide Web Consortium (W3C). Unlike HTML, XML does not concern itself with the display and formatting of text; instead it describes the text and how it is structured within the XML file. With this simple standard, many different types of programs are able to exchange data across programs and machines in an organized and structured fashion. XMI, or XML Metadata Interchange, uses an XML document for the interchange of metadata. The content of the file is the metadata itself, and the tags are meta-metadata, which are defined by the meta-meta-model MOF, or Meta Object Facility.

1.2 Scope

We have investigated four areas that rely on XML, and which explore a common problem with XML-represented data: that there must also be agreement on the semantic content and markup mechanism before data may be exchanged (what we will refer to as the “Meta Problem”). These areas are:

  • Semantic Web: a next-generation World Wide Web, allowing data and services to be located and used by programs as well as people;
  • Software reuse, facilitating the locating, retrieval, and integration of software components based on requirements;
  • OpenDocument exchange, allowing the creation and exchange of office documents across multiple document editors;
  • Business data warehouse systems, for a more efficient interchange and interoperability of data across distributed data warehouses.

These distinct areas all use XML (and often XMI) to provide interoperability between disparate components, which may evolve independently at different points in time. In each of these areas the challenge of agreeing on XML structure arises, specifically in determining the detailed structure and tags that are to be used in the XML, or in providing some way to negotiate these as needed. In the case of the Semantic Web, there are techniques for exchanging information about the XML representations used; in the office document area, an agreed-upon standard is used (OpenDocument). Business database systems may use either technique.

1.3 The Common Thread

The one application of XML that addresses the Meta Problem head-on is the Semantic Web. This is intended to be the next generation of the World Wide Web, containing information marked up in XML so that meaning may be used as a search criterion. As with the World Wide Web, it is intended that this will be done in a decentralized manner by many authors, so no centrally-imposed rigid set of XML guidelines may be relied on for consistency; rather, this is a runtime problem for those searching for information.

Semantic Web work has led to an approach to this problem which we will use as the underlying thread unifying our work. In short, this approach involves representing semantic information in the Resource Description Framework (RDF), which is rich enough to allow runtime resolution of the Meta Problem.

2 Semantic Web

By Bryan Bentz

2.1 Background

The Semantic Web is an extension of the current web, in which information is given well-defined meaning, allowing richer interactions with both humans and machines [Bryan-1]. The purpose is twofold: to be a web of distributed knowledge bases, accessible by agents (as well as humans); and to be a repository of web services, allowing agents to locate, select, employ, compose, reuse, and monitor services automatically [Bryan-2].

Distributed knowledge bases use ontologies to define their structures, allowing interoperability of web resources containing related content. Ontologies provide the framework upon which the XML for a given knowledge base is constructed. The subject of ontologies spans ontology representation languages, ontology development, ontology learning approaches, and ontology library systems.

Web services use one of several techniques to advertise their capabilities and allow invocation. Software agents must be able to find, communicate with, monitor, control, and handle the output of such services; indeed, services may be composed into new web services [Bryan-3]. To do this there are several evolving standards, such as UDDI (Universal Description, Discovery, and Integration); WSDL (Web Services Description Language), proposed by IBM and Microsoft; DAML (DARPA Agent Markup Language); and ebXML (Electronic Business using XML), proposed by the UN. At this point in time, composition of services is largely an ongoing research topic, though some sorts of composition (analogous to UNIX pipes, that is, chaining together existing services) are practical today.

All of these applications of the Semantic Web require solving the Meta Problem; some do so by formalizing a set of XML tags (such as ebXML), others by providing some means of dynamically negotiating a common XML tag vocabulary. The difficulty is that data of a given type may be marked up by different people in different ways, e.g.

<Telephone>(860) 536-1477</Telephone>

<PhoneNumber>8605361477</PhoneNumber>

How do agents, or users performing a semantic search, recognize these as containing the same information?

One answer is to use a richer model of the tags, one that will be adequate to allow an automated determination that the above two lines represent the same information. A bedrock foundation of the Semantic Web is RDF [Bryan-4], which encodes a type of knowledge representation known as a semantic network.
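As a rough illustration of the problem above, the following Python sketch treats the two phone-number markups as equivalent by consulting a lookup table. The table `TAG_TO_CONCEPT`, the concept name `phone_number`, and the helper `normalize` are all hypothetical names introduced here; a hand-written dictionary stands in for the ontology-based reasoning that the Semantic Web would actually use.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical equivalence table: each tag name maps to a shared concept.
# In a real system this knowledge would come from an ontology, not a dict.
TAG_TO_CONCEPT = {
    "Telephone": "phone_number",
    "PhoneNumber": "phone_number",
}

def normalize(xml_fragment):
    """Return (concept, canonical_value) for a single marked-up element."""
    element = ET.fromstring(xml_fragment)
    concept = TAG_TO_CONCEPT[element.tag]
    # Keep only the digits so that formatting differences disappear.
    digits = re.sub(r"\D", "", element.text)
    return concept, digits

a = normalize("<Telephone>(860) 536-1477</Telephone>")
b = normalize("<PhoneNumber>8605361477</PhoneNumber>")
print(a == b)  # True when both markups carry the same information
```

The weakness of this sketch is exactly the Meta Problem: someone has to write the table in advance. The richer tag model described above is meant to let that equivalence be determined automatically.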

2.2 Semantic Networks

In concept a semantic network is quite simple: it is a directed graph, in which the nodes represent semantic tokens and the edges represent relationships. It was initially developed by Richard Richens of the Cambridge Language Research Unit in 1956 [Bryan-5] as a way of representing the underlying meaning of natural language; for example, the phrase “my kettle and cup of coffee” might be represented as this semantic network:


Figure 1: A Semantic Network

In this network, the labeled nodes represent ‘concepts’ in a sense – and in a complete representation, would consist of semantic networks themselves that fully represent those concepts (that is, consider this to be a network fragment, though in practice it may be all that needs to be represented for a particular application). The links represent relationships, typically subclassing (“ako” = “a kind of”), instances (“inst.”), or other relationships, either ad hoc or defined elsewhere in the network (“contains” in the above example).
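A semantic network of this kind can be sketched in a few lines of Python as a directed graph with labeled edges. The class `SemanticNetwork` and the node names below are illustrative assumptions, not a transcription of Figure 1; only the relation labels ("ako", "inst", "contains") come from the text.

```python
from collections import defaultdict

class SemanticNetwork:
    """A directed graph whose edges carry relationship labels."""

    def __init__(self):
        # Maps a node to a list of (relation, target node) pairs.
        self.edges = defaultdict(list)

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def related(self, source, relation):
        """Nodes reachable from `source` over one edge labeled `relation`."""
        return [t for r, t in self.edges[source] if r == relation]

# Node names are illustrative guesses, not taken from Figure 1.
net = SemanticNetwork()
net.add("kettle-1", "inst", "kettle")      # kettle-1 is an instance of kettle
net.add("cup-1", "inst", "cup")            # cup-1 is an instance of cup
net.add("cup-1", "contains", "coffee")     # an ad hoc relationship
net.add("coffee", "ako", "beverage")       # coffee is a kind of beverage

print(net.related("cup-1", "contains"))
```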

Note how useful this representation would be, for example, in performing machine translation of (say) English to French. The English would be used to construct the network from the phrase – and at that point, the network fully represents the phrase, and it has no English grammatical structure embedded in it. From this representation, a description in French may be built, using the appropriate French grammatical structures to describe this network. Semantic networks are indeed very powerful representational tools and may be used in a number of contexts.

In the Semantic Web use of semantic networks, this sort of translation is what is used to reconcile one set of XML tags with another – rather than translating from English to French, the idea is to translate from my set of tags to your set of tags. In practice we may have to map my semantic network representation to yours as well, but this is now a well-defined operation of identifying matching nodes and topology in our semantic networks.

2.3 RDF: the Resource Description Framework

RDF originated with the work of R.V. Guha at Apple on what was known as the Meta Content Framework. An RDF representation consists of a set of triples, each one of the form:

Subject    Predicate    Object


This might seem like a poor way to begin building a sophisticated knowledge representation, but in reality it is an encoding of a semantic network, representing one link at a time:

Figure 2: RDF Representation

Each RDF triple contains two objects and a relationship between them; these two objects are the subject and object of the RDF triple, and the relationship is the predicate.

Provided that each object is represented with a unique name, RDF triples may unambiguously encode an arbitrary semantic network. Furthermore, if the names are globally unique (across all systems), different systems may combine their local semantic networks with that of other systems. This is a powerful idea, indeed a surprising one: the Semantic Web at its core involves the construction of one large semantic network, distributed across many, many machines.
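The merging property can be shown with a minimal Python sketch, assuming each local network is simply a set of (subject, predicate, object) tuples. The namespace `http://example.org/terms/` and all node names are hypothetical placeholders chosen for illustration.

```python
# Each triple is (subject, predicate, object); a network is a set of triples.
# The URIs below are hypothetical placeholders.
EX = "http://example.org/terms/"

site_a = {
    (EX + "cup-1", EX + "inst", EX + "cup"),
    (EX + "cup-1", EX + "contains", EX + "coffee"),
}
site_b = {
    (EX + "coffee", EX + "ako", EX + "beverage"),
    (EX + "cup-1", EX + "contains", EX + "coffee"),  # also known to site A
}

# Because names are globally unique, merging is plain set union:
# identical triples collapse, and shared nodes link the two networks.
merged = site_a | site_b

def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the network."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(len(merged))
print(objects(merged, EX + "coffee", EX + "ako"))
```

After the union, a query starting at `cup-1` (known only to site A in part) can reach `beverage` (known only to site B), which is the sense in which the local networks become one larger network.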

2.3.1 Practical RDF

In practice, Uniform Resource Identifiers (URIs) are used to name nodes; these are unique, and furthermore may point to URLs that contain further information about the object being represented. Ontologies, which are published on the web, consist of semantic network fragments, generally covering some particular domain. Ontology tools let designers work on the semantic network representation from a global perspective. The resulting RDF descriptions may then be exchanged (in an XML format for RDF), merged, or analyzed by semantic web components.

2.3.2 The Dublin Core

One well-known and well-used ontology is known as the Dublin Core [Bryan-6], and defines terms that are used to indicate information about publications; items represented include [Bryan-7]:

Title / Format
Creator / Identifier
Subject / Source
Description / Language
Publisher / Relation
Contributor / Coverage
Date / Rights
Type

This is a fairly simple set, though the relationships between these elements, and the nature of the data that may exist for each attribute, mean that this is actually a relatively large ontology. Because so much of what is currently on the web, and of what is likely to be on the web, may be considered to be a ‘document’, the Dublin Core is widely used.

2.4 Building a Semantic Web Server

Much of the literature about the Semantic Web is hypothetical, or theoretical; to develop an understanding of the pragmatics involved, we implemented an experimental semantic web server, using the RDF Gateway tool from Intellidimension. As a domain area, we chose to represent software components and the relationships between them; for this domain we built an RDF description covering those relationships.

It should be noted that usually the inference power that draws upon an RDF representation is used to draw conclusions about types, for instance about the equivalence of XML tags: this is meta-information about data to be marked up in XML. We felt this would be confusing in the context of an example, as the RDF would then encode meta-information about types, which is meta-meta-information about data to be marked up with tags. While this was just as feasible as what we did, it would not be as clear – it seemed to have too many levels of indirection to be intuitive to the reader. By representing software components (rather than tags used to mark up software component descriptions) we have an example that is more concrete without any loss of generality.

We did spend some time trying to locate published ontologies about software, but found nothing of note. We felt that this was surprising, as it might be a very useful representation to have, as it would allow Semantic Web techniques to be used to identify and search for software components and tools.

The RDF Gateway tool we chose is quite a useful and powerful package, able to interact with existing databases, external data sources, and COM objects.


Figure 3: RDF Gateway Tool

We interacted with RDF Gateway using a browser, and used RDFQL, the RDF Query Language, to establish as well as query the semantic model, which is maintained within an RDF Gateway database.

We represented several components:

  • FFT code in C++
  • FIR filter code in C
  • Hidden Markov model code in C
  • Wavelet transforms in Java

For each component we had the source code (often in multiple files), documentation, and auxiliary files (e.g. Makefiles). We used this to write RDF describing each component; for example, for the HMM, we had:

hmm.c        The basic code file
hmm.h        The header file
hmmrand.c    Platform-independent random nums.
hmmutils.c   File I/O, matrix code
hmmtut.ps    Postscript documentation

We used the RDFQL query language to instantiate these dependencies one triple at a time, that is, walking through each such list and building the network representation. A fragment of the semantic network we were representing looks like this:

Figure 4: Our Example Semantic Network

Some of these links represent ‘Requires’; for example, hmm.c requires hmmrand.c. Some represent the implementation language, and some represent the documentation. The white boxes represent the abstract algorithm being implemented.

RDF for these links looked like this:

“source/hmm.c” requires “source/hmmutils.c”
“source/hmmtut.ps” documents “source/hmm.c”
“source/hmmutils.c” requires “sysutils.h”

We could then declare an inference rule, e.g.:

INFER {?A 'requires' ?C}

FROM {?A 'requires' ?B}

AND {?B 'requires' ?C};

Here the “?” denotes a variable that will be filled by matching against an RDF triple. This rule says that you may infer that A requires C from having A requires B and B requires C; the inference engine that applies this to RDF does so recursively, so if either B or C requires other components, they too will be inferred to be required by A.
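The effect of this rule can be approximated in Python with naive forward chaining: apply the rule to every pair of matching triples and repeat until nothing new is produced. This is only a sketch of the rule's meaning; it is not a description of how the RDF Gateway inference engine works internally, and the function name `infer_requires` is introduced here for illustration. The three starting facts are the triples shown earlier.

```python
def infer_requires(triples):
    """Apply: infer {A requires C} from {A requires B} and {B requires C},
    repeating until no new triple appears (a fixed point)."""
    known = set(triples)
    while True:
        requires = [(s, o) for s, p, o in known if p == "requires"]
        new = {
            (a, "requires", c)
            for a, b in requires
            for b2, c in requires
            if b == b2
        } - known
        if not new:
            return known
        known |= new

facts = {
    ("source/hmm.c", "requires", "source/hmmutils.c"),
    ("source/hmmtut.ps", "documents", "source/hmm.c"),
    ("source/hmmutils.c", "requires", "sysutils.h"),
}
closure = infer_requires(facts)
print(("source/hmm.c", "requires", "sysutils.h") in closure)
```

Because the set of possible triples over a finite set of nodes is finite, the loop terminates even if the dependency graph contains cycles. The 'documents' triple is untouched, since the rule only matches 'requires'.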

We were successful at requesting the set of dependencies that we’d set up in the semantic network, both for ‘requires’ and ‘documents’; these were illustrative link types, and could trivially be extended to represent the full language environment (compiler, compiler version, etc.) and machine appropriate to each component, or to condition the returned list of source files based upon a given machine type and environment. The inference engine may be looked upon as a basic expert system, which may take input from a semantic web request and operate on that input and the local (and potentially more global) semantic network to compute and return an answer.

We built an interface to the inference engine using the RDF Gateway’s ability to generate HTML as output; it can of course equally output XML for use by other tools.

This experiment let us get into the inner workings of a Semantic Web server and see concretely how RDF could be used to represent useful information. The power of this representation is sufficient to allow the kind of inferences that are necessary in mapping one set of XML tags into another: where one application might wish to represent software dependencies via a directed graph, another might wish to use a list of required files for each component, and we demonstrated that we can convert from one to the other.
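The two representations mentioned here, a directed graph of dependencies and a per-component list of required files, can be converted into one another mechanically. The Python sketch below assumes the graph is stored as 'requires' triples; the function names are hypothetical, and the `source/` prefix on `hmmrand.c` is assumed by analogy with the other paths.

```python
from collections import defaultdict

def to_requirement_lists(triples):
    """Group 'requires' edges by subject: component -> sorted list of files."""
    lists = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "requires":
            lists[subject].append(obj)
    return {component: sorted(files) for component, files in lists.items()}

def to_edges(requirement_lists):
    """Inverse direction: expand each list back into 'requires' triples."""
    return {
        (component, "requires", required)
        for component, files in requirement_lists.items()
        for required in files
    }

edges = {
    ("source/hmm.c", "requires", "source/hmmutils.c"),
    ("source/hmm.c", "requires", "source/hmmrand.c"),  # path prefix assumed
    ("source/hmmutils.c", "requires", "sysutils.h"),
}
lists = to_requirement_lists(edges)
print(lists["source/hmm.c"])
```

Converting the lists back with `to_edges(lists)` reproduces the original edge set, so for 'requires' links no information is lost in either direction.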

We felt that our choice of RDF as the unifying representational approach running through all of our areas was justified.

2.5 Conclusions

One nagging doubt arose as I began to understand the nature of the Semantic Web. As I remarked in our midterm presentation, I feel I have seen this problem before.

In the 1980s, Artificial Intelligence was all the rage, and it seemed like answers were just around the corner. After all, the tools (such as semantic networks) were there; it seemed that all that needed to be done was to assemble them appropriately for given domains. It didn’t turn out that way: working through the details uncovered problems of progressively greater depth. This didn’t mean that AI was a bad idea or wouldn’t eventually work in the way envisioned, but that it would take a lot more time to get there.