9/5/2007
A New Enterprise Data Management Strategy for the US EPA
Part 3: Integration of Data Tables
Brand L. Niemann, Senior Enterprise Architect,
US EPA Enterprise Architecture Team, and
Co-Chair, Federal Semantic Interoperability Community of Practice (SICoP)
Summary
Part 1 (1) outlined “A New Enterprise Data Management Strategy for the US EPA” based on:
The premise of reusing the data and information, rather than changing the data systems themselves, by putting the business and technical rules, logic, etc. into the data itself using markup languages; and
The concepts and standards of the Semantic Web (also called the Data Web or Web 3.0), in which the most important tenets of reuse are:
Bring the data and the metadata back together.
Bring the structured and unstructured data and information back together.
Bring the data and information description and context back together.
Part 2 (2), also for the Metatopia 2007 Conference (3), shows how high-quality content based on considerable multi-disciplinary subject matter expertise can be reused to build a knowledgebase that contains both an ontology and a database of instances. In this case the knowledgebase contains an inventory of the data assets that shows the critical importance of standardized metadata and cross-agency data sharing to the mission of the US EPA, and how implementing the data asset inventory supports the four functionalities of DRM 3.0 and Web 3.0: Integration of Data and Metadata, Harmonization, Enhanced Search, and Mashups.
Part 3 shows how to integrate data tables within and across categories in the inventory of the data assets. Systems for exporting relational data to RDF have existed since the beginning of the Semantic Web (4, 5). Recently, the Semantic Web developers have focused on SPARQL query-rewriters and interpreters to access relational data directly. Both of these approaches share an expression of relational data in RDF. Access to this structured data can increase the size and utility of the Semantic Web many times over (6). This position paper for the upcoming W3C Workshop on RDF Access to Relational Databases (6) makes Government data tables and relational databases (e.g. LandView 6 and 7 on DVD) (1) readily accessible for Semantic Web pilots.
Table of Contents
1. Introduction
2. Data Asset Inventory
3. Data Table Integration
4. Recommendations
5. References
1. Introduction
A New Enterprise Data Management Strategy for the US EPA exists in two parts (1, 2) so far for presentation at the upcoming Metatopia 2007 Conference (3). At the recent W3C/WSRI Workshop entitled “Toward More Transparent Government on eGovernment and the Web” (4), SICoP suggested that a clear message about the role of RDF in data exchange and a series of pilots using government data sources would help educate and demonstrate the value of the Semantic Web (aka the Data Web) to the Federal Government. The W3C has a new Semantic Web Layer Cake (5) in which RDF has moved into the XML space and has been expanded with query and rules! The W3C also has an upcoming Workshop on RDF Access to Relational Databases (6).
RDF has generated renewed discussion among the Data Management Community (7), and the new Semantic Web Layer Cake has engendered considerable discussion among the Ontology Community (8). The upcoming Workshop on RDF Access to Relational Databases will draw members of the Semantic Web and relational database communities together to examine commonalities, distinctions and next steps for expressing relational data in RDF. Consumers and potential consumers of RDF data (e.g. SICoP) will provide use cases and goals.
The ubiquity of relational data makes it an attractive next target for the Semantic Web. Much of the data that is used in automation is stored in relational databases. RDF's grounding in universal terms makes RDF attractive to the relational database community. Expressing relational data in RDF allows them to join relational data with data in other databases or in other forms. Despite the ubiquity and utility of relational data, connecting data between databases remains problematic and resource-intensive. Joining data between independently-developed relational databases requires tedious scripting, data warehousing, or tailored integration systems. RDF queries accessing multiple relational databases have shown that RDF can be used to unify independent relational databases and link in external sources, e.g. documents and data from the Web. The provenance for relational data in the RDF can also be expressed in RDF. All of this data will be available for access by query and rules languages (6).
In several deployed systems, the tuples in a relation are identified by a URL composed of the table name and the primary key attribute/value pairs. This provides the subject of a set of triples, each expressing one attribute of that tuple. For these, the predicate is composed of the table name and the attribute name. The objects are literals to express simple relational attributes and URI references to express foreign key relationships to other tuples. Foreign keys to multiple other tuples are simply expressed as repeated attributes (6).
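The mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular deployed system: the base URI `http://example.org/db/`, the helper names `row_uri` and `row_to_triples`, the URI layout, and the sample `indicator` row are all hypothetical and do not come from the EPA inventory or the workshop call.

```python
from urllib.parse import quote

BASE = "http://example.org/db/"  # hypothetical base URI (assumption)


def row_uri(table, key):
    """Build a tuple identifier from the table name and primary key attribute/value pairs."""
    pairs = ";".join(
        f"{quote(str(k))}={quote(str(v))}" for k, v in sorted(key.items())
    )
    return f"{BASE}{quote(table)}/{pairs}"


def row_to_triples(table, row, primary_key, foreign_keys=None):
    """Return (subject, predicate, object) triples for one relational tuple.

    foreign_keys maps an attribute name to (target_table, target_key_attribute).
    """
    foreign_keys = foreign_keys or {}
    subject = row_uri(table, {k: row[k] for k in primary_key})
    triples = []
    for attr, value in row.items():
        if value is None:
            continue  # in this sketch a NULL simply yields no triple
        # Predicate: table name plus attribute name.
        predicate = f"{BASE}{quote(table)}#{quote(attr)}"
        if attr in foreign_keys:
            # Foreign key: the object is a URI reference to the other tuple.
            target_table, target_attr = foreign_keys[attr]
            obj = ("uri", row_uri(target_table, {target_attr: value}))
        else:
            # Simple attribute: the object is a literal.
            obj = ("literal", value)
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical sample row; the values are placeholders, not inventory data.
indicator = {"id": 12, "name": "Example indicator", "topic_id": 1}
triples = row_to_triples(
    "indicator",
    indicator,
    primary_key=["id"],
    foreign_keys={"topic_id": ("topic", "id")},
)
for s, p, o in triples:
    print(s, p, o)
```

Because the foreign key object is itself the URI of a tuple in another table, triples produced from independently exported tables link up wherever they share such a URI, which is what allows the data to be joined without a purpose-built integration script.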
2. Data Asset Inventory
The statistics of the data asset database are summarized in the table below, from Part 2 (2) of A New Enterprise Data Management Strategy for the US EPA.
Category (Topic) / Concept (Question) / Indicator (Name) / Metadata (Data Source/Quality) / Instance (Exhibit Titles) / Data Tables (Elements/Attributes) / Provenance (Agency: EPA) / Provenance (Agency: Non-EPA)
Air / 3 / 27 / 27 / 58 / 71 / 23 / 4
Water / 7* / 18 / 18 / 41 / 48 / 9 / 9
Land / 5 / 12 / 12 / 25 / 31 / 5 / 7
Human Health / 3 / 18 / 18 / 39 / 49 / 0 / 18
Ecological Condition / 5* / 11 / 11 / 22 / 22 / 3 / 8
Total (5 topics) / 23 / 86 / 86 / 185 / 221 / 40 / 46
* One question without an indicator.
It is significant that more than half of the indicators (46 of 86) come from non-EPA agencies, so data sharing and reuse are critical to EPA’s mission of reporting on the State of the Environment.
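The share of non-EPA indicators can be checked directly from the table. The short Python sketch below assumes that the last two columns of the table are the EPA and non-EPA provenance counts per topic; the variable names are illustrative only.

```python
# Indicator counts by provenance, copied from the table above: (EPA, non-EPA).
# Assumption: the last two table columns are the EPA and non-EPA counts.
counts = {
    "Air": (23, 4),
    "Water": (9, 9),
    "Land": (5, 7),
    "Human Health": (0, 18),
    "Ecological Condition": (3, 8),
}

epa_total = sum(epa for epa, _ in counts.values())
non_epa_total = sum(non_epa for _, non_epa in counts.values())
total = epa_total + non_epa_total
non_epa_share = non_epa_total / total

print(f"EPA: {epa_total}, non-EPA: {non_epa_total}, total: {total}")
print(f"Non-EPA share of indicators: {non_epa_share:.1%}")
# prints:
# EPA: 40, non-EPA: 46, total: 86
# Non-EPA share of indicators: 53.5%
```

The per-topic sums also match the indicator column (for example 23 + 4 = 27 for Air), which supports reading the two provenance columns as a split of the indicators.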
3. Data Table Integration
The individual data tables with their elements and attributes (recall Section 2) were compiled into 5 multi-sheet spreadsheets (Microsoft Excel), one for each of the 5 topics in the 2007 EPA Report on the Environment. The multi-sheet spreadsheet for “water” is shown below for the index (table of contents) and the Exhibit 5-2 indicator data tables.
4. Recommendations
Part 1 (1) outlined “A New Enterprise Data Management Strategy for the US EPA”, Part 2 (2) showed a knowledgebase that contains an inventory of the data assets, and Part 3 shows how to integrate data tables within and across categories in that inventory. This position paper for the upcoming W3C Workshop on RDF Access to Relational Databases (6) makes Government data tables and relational databases (e.g. LandView 6 and 7 on DVD) (1) readily accessible for Semantic Web pilots.
5. References
(1) A New Enterprise Data Management Strategy for the US EPA, August 15, 2007.
Word: http://colab.cim3.net/file/work/SICoP/EPADRM3.0/BNiemann08152007.doc
PowerPoint: http://colab.cim3.net/file/work/SICoP/2007-11-06/BNiemann11062007.ppt
LandView 6 and 7 (in process): http://landview.census.gov
(2) A New Enterprise Data Management Strategy for the US EPA - Part 2: Inventory of Data Assets, August 29, 2007.
Word: http://colab.cim3.net/file/work/SICoP/EPADRM3.0/BNiemann08292007.doc
(3) Metatopia 2007, November 5-7, 2007, Hosted by Data Management Association of the National Capital Region.
Home Page: http://www.wilshireconferences.com/metatopia/index.html
Agenda: http://www.wilshireconferences.com/metatopia/agenda.html
Authors Abstract: http://www.wilshireconferences.com/metatopia/Sessions/e1.html
(4) Toward More Transparent Government Workshop on eGovernment and the Web, United States National Academy of Sciences, Washington DC, USA, June 18-19, 2007, W3C and Web Science Research Initiative.
http://www.w3.org/2007/06/eGov-dc/agenda.html
(5) Current Semantic Web Layer Cake.
http://www.w3.org/2007/03/layerCake.png
http://ontolog.cim3.net/forum/ontolog-forum/2007-07/msg00256.html
(6) W3C Workshop on RDF Access to Relational Databases, October 25-26, 2007, Boston, MA, USA. http://www.w3.org/2007/03/RdfRDB/cfp
(7) Discussion of RDF at the Data Management Discuss Discussion Board, August 22, 2007. Some highlights: RDF is an old idea in a new web form, it has more flexibility than a table model, it has a lot of practical problems that the industry is working out, and vendors that have adopted it will do just fine if it does turn out that “RDF eats tables for lunch”.
http://tech.groups.yahoo.com/group/dm-discuss/
(8) Ontolog Forum Discussion of the Current Semantic Web Layer Cake, July 28, 2007, to the present
http://ontolog.cim3.net/forum/ontolog-forum/2007-07/index.html
http://ontolog.cim3.net/forum/ontolog-forum/2007-08/index.html