© 2012 PCA. All rights reserved.

JORD PROTOTYPE TRIPLE-STORE & ENDPOINT
Implementation Report & Documentation


JORD (Joint Operational Reference Data) Project

enhancing the

PCA Reference Data Service (RDS) Operation
in partnership with FIATECH

Rev / Date / Description / By / Check
V1 / 23rd Nov 2011 / Initial contractor draft for technical review. / MRS / ISG
V1.1 / 2nd Jan 2012 / First version circulated to JORD members. / MRS / ISG / RDC
V2 / 18th Jan 2012 / JORD deliverable for project & external use. / ISG


Executive Summary

ISO 15926, the standard for lifecycle integration and interoperability, is based on highly generic information modeling principles and depends heavily on shared reference data. To maximize the flexibility and availability of reference data in ISO 15926 compliant forms across distributed business users, ISO 15926 adopts “Triples” as the most generic representation of all semantic content, where each element is represented by a URI (web address) that browsers and queries can resolve through an “EndPoint”.

This document reports and documents the implementation of the Triple Store and Endpoint created as the JORD prototype for reference data publishing to be supported by the enhanced PCA RDS being created in partnership with FIATECH.

The work was performed by DNV under their existing services contract with PCA and the content of this report and documentation is taken in its entirety from
DNV Report No: PP022471-13O34N0-2

The resultant triple-store & endpoint with user instructions are found here:

http://posccaesar.org/endpoint/

The PCA RDS support services are described here:

https://www.posccaesar.org/wiki/Rds

Acknowledgements

JORD Charter Member organizations contributed funding and direction to the project:

Full sponsors

BP,

EPIM,

RosEnergoAtom,

Black & Veatch,

CCC,

Hatch

and VNIIAES

Supplementary subscribers

Woodside

and Bechtel


Table of Contents

Conclusive Summary
1 Introduction
1.1 Abbreviations and terms
2 Report
2.1 Overall System Layout
2.1.1 Note on 32 vs 64 bit
2.2 Hardware
2.3 HTTP Server (Apache)
2.3.1 Redirect posccaesar.org to the wiki pages
2.3.2 Mounting servlets
2.3.3 Configuration
2.3.4 Technical Summary
2.4 Cygwin
2.5 Servlet Container (Apache Tomcat)
2.5.1 Technical Summary
2.6 Triple Store
2.6.1 RDF Graphs, and Graphs in the triple store
2.6.2 Loading the triple store
2.6.3 Updating the JORD triple store
2.6.4 Technical Summary
2.7 Joseki (SPARQL Endpoint)
2.8 The SPARQL HTML frontend
2.8.1 Queries
2.8.2 Technical Summary
2.9 Pubby (Linked Data Server)
2.9.1 The reason we need this software
2.9.2 How Pubby works
2.10 Generating the RDS-WIP to PCA Map
3 Conclusions
4 References
4.1 Software download locations
4.1.1 Pubby
4.2 Apache HTTP server
4.3 Apache Tomcat
4.4 Cygwin
4.5 Joseki
4.6 TDB

APPENDIX 1 END POINT USER GUIDE
APPENDIX 2 LINKED DATA USER GUIDE


Conclusive Summary

This report constitutes the final delivery in the project:
JORD Phase 1 End-Point & Triple Store – Prototype.

It documents the various system design choices made,
and also includes a user guide for the two end-user applications.

1  Introduction

This is a technical overview of the Phase 1 Prototype endpoint. It describes the software the prototype incorporates and how the different pieces of the system fit together. The intended audience is project managers, technical staff and, to some degree, end users. As a result, the text may be too verbose for some. To make the text easier to read, each component description includes a technical summary, which gives the technical audience a concise description of the nuts and bolts.

A number of third party applications have been used in this project. These are all open source programs with well written documentation, including installation and usage instructions. Duplicating these instructions in this document was deemed unnecessary. The configurations of the applications, however, are relevant and are included where applicable.

In the section concerning Pubby, the Linked Data servlet, some general notes concerning Linked Data have also been included.

For sections regarding software exposed to the end-user there is a usage description.

1.1  Abbreviations and terms

·  PCA – The POSC Caesar Association

·  PCA RDL – The PCA Reference Data Library

·  DNV – Det Norske Veritas, the company responsible for the project execution.

·  The Phase 1 Prototype, the prototype – The entire solution described in this document, including the endpoint and triple stores.

·  The prototype server – The server that holds the Phase 1 Prototype.

·  LD – Linked Data

·  URI – Uniform Resource Identifier

·  URL – Uniform Resource Locator

·  Sogndal server – The server that holds the PCA Wiki. http://www.posccaesar.org

·  Resolve, resolvable – If a URI can be entered in a web browser and returns a resource such as an HTML page, the URI is said to resolve or be resolvable.

2  Report

2.1  Overall System Layout

[Figure 1: Overall system layout]

The various boxes outside the PCA & DNV and iRING Endpoint boxes are described in further detail in this document.

2.1.1  Note on 32 vs 64 bit

64 bit software is generally considered preferable to 32 bit, chiefly because it can address more memory, although the exact advantages for a given component are sometimes hard to pin down. Instead of weighing the pros and cons for each software choice, the project has chosen 64 bit over 32 bit unless there is a compelling reason to do otherwise.

2.2  Hardware

The server is a virtual server running 64 bit Windows Server 2008 Enterprise. It has the equivalent of three 2.27 GHz processors and 6 GB of RAM. A Windows platform was chosen because DNV uses Microsoft as its main software provider. However, all the software used is open source, with binaries available for many platforms, including GNU/Linux. Apache was chosen over Internet Information Services (IIS, the Microsoft web server) partly to keep the prototype portable to other platforms.

Aside from the triple store and the particular Java programs in the prototype, the prototype rests on a software environment based on various open source software described below.

2.3  HTTP Server (Apache)

The Apache HTTP server is an open source HTTP server with wide support and use. It is sponsored by many big corporations, among them Google, Microsoft and IBM. It is said to be used by over half of all the web servers on the internet.

Although the development server is a 64 bit server, the Apache installation is 32 bit. 32 bit was chosen because the mod_jk Apache module required for communication with Tomcat was not available as a 64 bit build. This HTTP server will not be used for any heavy lifting, and we believe there are no practical advantages of 64 bit over 32 bit here. Tomcat, on the other hand, will do heavy lifting, which is why its 64 bit version is installed.

It should also be noted that the version of Apache used is 2.0, not 2.2, which is the latest release. The reason is the same as for 32 bit vs 64 bit: we were unable to find certain modules pre-compiled for 2.2. While compiling them ourselves is certainly possible, we could not justify the extra development time this would require.

The web server is used to serve all the content of the prototype, directly or indirectly. It therefore controls all access restrictions, and also the redirection of requests to the correct Java servlets, such as the Joseki endpoint and the linked data pages.

It should be noted that the servlet container could have handled most, if not all, of the web server's tasks, as it contains its own built-in HTTP server. However, the infrastructure for delegating certain tasks to a dedicated web server was already in place on the development server, so these capabilities of Tomcat were not used.

2.3.1  Redirect posccaesar.org to the wiki pages

The domain name posccaesar.org points to the IP address of the prototype server, while the domain name www.posccaesar.org points to the Sogndal server, which hosts the PCA Wiki pages. Previously there was a so-called WWW redirect from posccaesar.org to www.posccaesar.org, ensuring that users reached the wiki pages whenever they navigated to posccaesar.org. This redirection was handled by the domain registrar, but is now handled by Apache on the development server.

The redirect was achieved using a RedirectMatch directive, as can be seen from the configuration snippet below.

2.3.2  Mounting servlets

The servlet container is connected to the web server through an Apache module called mod_jk. This module forwards requests from Apache to Tomcat over the AJP protocol, without encryption.

The JkMount lines in the configuration file tell the server to map certain URL paths under posccaesar.org to the Tomcat worker named jord.

2.3.3  Configuration

What follows is a snippet from the Apache httpd.conf configuration file. Note that the relevant modules must be enabled for the directives to take effect. For instance, mod_rewrite must be enabled for the RewriteEngine directive, mod_alias for RedirectMatch, and mod_jk for JkMount.

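<VirtualHost *:80>
    ServerName posccaesar.org
    DocumentRoot "D:/htdocs-posccaesar"

    # Forward the /endpoint and /rdl paths to the Tomcat worker "jord" (mod_jk)
    JkMount /endpoint* jord
    JkMount /rdl* jord

    # Send wiki requests to the Sogndal server (RedirectMatch is a mod_alias directive)
    <Location "/wiki">
        RewriteEngine On
        RedirectMatch ^/(wiki.*)$ https://www.posccaesar.org/$1
    </Location>

    # Send requests for the bare domain to the wiki front page
    <Directory />
        Options FollowSymLinks
        AllowOverride None
        RewriteEngine On
        RedirectMatch ^/$ http://www.posccaesar.org
    </Directory>
</VirtualHost>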
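
The routing that this virtual host sets up can be illustrated with a small model. The following Python sketch is not how Apache evaluates its configuration; it is a simplified illustration that assumes the JkMount patterns behave as plain path prefixes and that both redirects target www.posccaesar.org, as described in the section on redirecting to the wiki pages. The function name route and the result labels are hypothetical.

```python
import re

# Simplified, hypothetical model of the virtual host; not Apache's own matching code.
# JkMount patterns such as "/endpoint*" are treated here as plain path prefixes.
JK_MOUNTS = {"/endpoint": "jord", "/rdl": "jord"}

# (pattern, target) pairs mirroring the two RedirectMatch rules.
# Apache writes the first capture group as $1; Python's re module writes it as \1.
REDIRECTS = [
    (re.compile(r"^/(wiki.*)$"), r"https://www.posccaesar.org/\1"),
    (re.compile(r"^/$"), "http://www.posccaesar.org"),
]


def route(path):
    """Return ('tomcat', worker), ('redirect', url) or ('static', path)."""
    for prefix, worker in JK_MOUNTS.items():
        if path.startswith(prefix):
            return ("tomcat", worker)
    for pattern, target in REDIRECTS:
        match = pattern.match(path)
        if match:
            return ("redirect", match.expand(target))
    # Anything else would be served from the DocumentRoot.
    return ("static", path)
```

For example, route("/wiki/Rds") yields a redirect to https://www.posccaesar.org/wiki/Rds, while route("/endpoint/") is handed to the jord worker.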

2.3.4  Technical Summary

For the prototype, the Apache server simply forwards requests to the Tomcat server and redirects requests intended for the PCA wiki. Aside from setting up this functionality, no configuration is necessary.

All servlets must be set up individually, which means that, for example, access restrictions can be applied to individual servlets. It also means that the Apache server must be restarted whenever a new servlet is introduced. Servlets are mounted using the JkMount directive, as shown in the configuration snippet above.

Redirection to the PCA wiki pages is done with the RedirectMatch directive, also shown above.

2.4  Cygwin

The TDB software ships with a number of BASH scripts that require a UNIX environment. In addition, several BASH helper scripts on the server machine handle automated downloading of files and updating of the triple stores.

Cygwin provides a UNIX environment for Windows machines, and is installed on the test server. The installation is straightforward. In addition to BASH, two other GNU/Linux programs are used by the prototype.

·  wget – A program for downloading content from the internet. In our case it’s the PCA RDL OWL file.

·  unzip – A program to unzip zipped files. We use it to unzip the packaged PCA RDL OWL file.

These two programs are available from the Cygwin installation program, and those versions were used in the prototype.
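
The download and unpack step that wget and unzip perform can be sketched as follows. This is a minimal illustration only: the prototype uses BASH scripts under Cygwin, whereas this sketch swaps in Python's standard library for the same two roles. The function names and any URL or path passed to them are hypothetical, as the actual download location is not given here.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url, target):
    """Save the resource at `url` to the file `target` (the role wget plays)."""
    target = Path(target)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


def unpack(archive, dest_dir):
    """Extract all members of a zip archive (the role unzip plays).

    Returns the sorted list of member names that were extracted.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return sorted(zf.namelist())
```

A helper script would call download with the address of the packaged PCA RDL OWL file and then pass the saved archive to unpack before loading the result into the triple store.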

The choice of installing Cygwin is really one of convenience. The prototype could have been designed without it, but it made a number of things slightly easier, not least because of prior experience with these programs and existing scripts written for other similar projects.

2.5  Servlet Container (Apache Tomcat)

Apache Tomcat is the Apache Software Foundation's implementation of the Java Servlet and JavaServer Pages technologies. It is widely used, though not as widely as its web server counterpart.

While the Apache web server installed is 32 bit, the Tomcat server is 64 bit as that makes TDB perform better. In particular TDB is able to push the burden of caching files between RAM and disk to the operating system, instead of using the JVM memory. Furthermore it is the general consensus among the users and the developers of TDB that 64 bit is preferred over 32 bit.

2.5.1  Technical Summary

Apache Tomcat can be downloaded free of charge from Apache's web pages. Links and other resources are listed in the References section.

No particular setup is needed for Tomcat, although it may be advisable to increase the memory it is allowed to use. On Windows, the installation package comes with a Tomcat configuration tool that makes it easy to set the required parameters (Initial Memory Pool and Maximum Memory Pool). From the /bin directory of the Tomcat installation, run the following command in a console:

tomcat7w.exe //ES//TomcatServiceName

Where “TomcatServiceName” is the name of your Tomcat service.

2.6  Triple Store

The TDB triple store is a persistent storage layer for Jena, which in turn is a suite of libraries for building Semantic Web applications in Java. TDB is different from a conventional SQL server in that it doesn’t require a running process to be used. When a new TDB triple store is created, all information is stored in a folder on disk. The triple store will persist in that folder, regardless of any processes running. Another program using the TDB library may then grab the files, and perform queries or update the triple store.

This is the typical scenario when a triple store is updated. One process performs the batch load into the triple store, and a query front-end grabs the triple store for queries after the update procedure is done. We have not tested whether it is possible for two Java processes to use the same TDB triple store at the same time. What we do know is that it is impossible to delete or replace a triple store when a process has a lock on it. This leads to downtime for the triple store when the database is updated. We will get back to this downtime issue in 2.6.3.

2.6.1  RDF Graphs, and Graphs in the triple store

Like most triple stores, TDB is able to handle multiple graphs. A graph is a set of RDF triples, much in the same way an RDF file is a set of RDF triples. In the prototype, each graph originates from a distinct file, though this is not required by TDB.

Each graph has a URI, which is used to restrict a query to certain graphs. As with an RDF file, there is no connection between the namespaces of the resources in a graph and the name of the graph. The graph names are metadata in its purest form: required for the inner workings of the triple store, but without any intrinsic meaning. That being said, queries that target specific graphs in the endpoint need to have the graph name hard coded, so a comfortable naming scheme should be agreed on by JORD in future projects. In the prototype, if no graph is specified, the union of all graphs is queried. This is suitable for most queries, so most users will not have to care what the graph names are.
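
The difference between querying the union of all graphs and querying a single named graph can be sketched as follows. This Python sketch only builds query strings and request URLs; it assumes the endpoint accepts the standard SPARQL Protocol "query" parameter over HTTP GET. The graph URI, the endpoint address and the function names are hypothetical placeholders, since the prototype's graph naming scheme is not fixed here.

```python
from urllib.parse import urlencode


def graph_query(graph_uri=None, limit=10):
    """Build a SPARQL SELECT query, optionally restricted to one named graph."""
    pattern = "?s ?p ?o"
    if graph_uri is not None:
        # The GRAPH keyword restricts the pattern to the named graph.
        pattern = "GRAPH <%s> { %s }" % (graph_uri, pattern)
    return "SELECT ?s ?p ?o WHERE { %s } LIMIT %d" % (pattern, limit)


def query_url(endpoint, query):
    """Build a GET request URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical graph name and endpoint address, for illustration only.
EXAMPLE_GRAPH = "http://example.org/graph/rdl"
EXAMPLE_ENDPOINT = "http://example.org/sparql"

union_query = graph_query()               # no graph given: query as-is
named_query = graph_query(EXAMPLE_GRAPH)  # restricted to one graph
```

Only the second form requires the user to know a graph name, which is why an agreed naming scheme matters for queries of that kind.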