Title / VLO analysis
Version / 1
Author(s) / Twan Goosen (CLARIN ERIC)
Date / 2014-01-17
Status / Final
Distribution / public
ID / CE-2014-0263 (summary)

This document summarises the conclusions of an analysis of the CLARIN Virtual Language Observatory (VLO), which was carried out in January 2014 and described in detail in the document “Virtual Language Observatory Analysis”. The analysis is based on VLO version 2.18, which was under development at the time of analysis.

1 Functional analysis

The current version of the VLO in principle meets all functional requirements expressed in the 2012 LREC paper[1] on the VLO. However, a number of functional improvements are desirable. To a large extent, the degree to which the VLO can adequately function as a resource portal depends on the quality of the metadata it is based on. Another factor is the exact definition of the mapping, which leaves many degrees of freedom. It should therefore be noted that this analysis focuses on the ‘core’ quality of the VLO as it stands and the potential limits of its functionality.

The currently available options for defining a mapping from metadata to facet values are considered insufficient. The challenge is to achieve a scalable and maintainable mapping strategy that is generic yet flexible enough to cater for specific data sets. Various options can be considered. In any case, it is likely that some relatively fundamental changes or additions to the VLO importer component will be required in the near future.
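One direction that could be explored, sketched here purely as an illustration, is a layered mapping: a generic normalisation table complemented by per-collection overrides. All class and method names below are hypothetical and do not reflect the actual importer code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Hypothetical sketch of a configurable facet-mapping step: a generic
 * normalisation table per facet, with per-collection overrides layered
 * on top. Illustrative only; not the actual VLO importer API.
 */
public class FacetMapper {

    private final Map<String, String> genericMap = new LinkedHashMap<>();
    private final Map<String, Map<String, String>> collectionOverrides = new LinkedHashMap<>();

    /** Adds a rule that applies to records from any collection. */
    public void addGenericRule(String rawValue, String facetValue) {
        genericMap.put(normalise(rawValue), facetValue);
    }

    /** Adds a rule that applies only to records from the given collection. */
    public void addCollectionRule(String collection, String rawValue, String facetValue) {
        collectionOverrides
                .computeIfAbsent(collection, c -> new LinkedHashMap<>())
                .put(normalise(rawValue), facetValue);
    }

    /** Collection-specific rules take precedence over the generic table. */
    public Optional<String> map(String collection, String rawValue) {
        String key = normalise(rawValue);
        Map<String, String> overrides = collectionOverrides.get(collection);
        if (overrides != null && overrides.containsKey(key)) {
            return Optional.of(overrides.get(key));
        }
        return Optional.ofNullable(genericMap.get(key));
    }

    /** Case- and whitespace-insensitive matching of raw metadata values. */
    private static String normalise(String value) {
        return value == null ? "" : value.trim().toLowerCase();
    }
}
```

The layering keeps the generic table small and maintainable while still allowing data-set-specific corrections where the generic rules fall short.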

The usability of the VLO web application is acceptable but could be improved. End-user documentation and in-application guidance, by means of instructions and a layout that reflects the workflow, would help novice users find their way. Alternative ways of presenting some of the facets and their values could also improve the user experience. Cross-platform accessibility could be improved by migrating to a more flexible, CSS-based layout (ideally with alternative styles for different form factors) and by providing fallbacks for all JavaScript-based functionality.

2 End-product analysis

The VLO’s documentation is somewhat scattered and some sources could easily be merged with others. The technical documentation would benefit from some high-level diagrams. The Trac page could serve as a ‘hub’ for technical documentation. A reference to the end-user documentation from the web application itself would make it much easier to find for the targeted audience.

The performance of the different software components of the VLO seems quite good. The SOLR database, in which the indexes to the metadata records are stored, has been able to cope with the increasing number of records the VLO has faced over the last year. The importer component, which processes harvested CMDI records and is responsible for the mapping from the fields in these records to facet values, takes roughly 2 to 4 hours to process the entire set, which at the time of writing consists of over 600,000 records. The web application presents these facets and records in a reasonably responsive manner, with a number of acceptable but noticeable bottlenecks on the page showing individual records. Further analysis of whether the SOLR database and the importer are scalable, and whether alternative querying strategies could improve the performance of the web application when presenting an individual record, would be desirable.
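For context, a faceted request of the kind the web application issues can be expressed with standard SOLR request parameters (q, rows, facet, facet.field); the field names in this sketch are illustrative and not necessarily those of the actual VLO schema:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/**
 * Sketch of a faceted SOLR select request as a URL. The request
 * parameters (q, rows, facet, facet.field) are standard SOLR
 * parameters; the base URL and field names are illustrative.
 */
public class SolrFacetQuery {

    /** Builds a SOLR select URL with facet counts enabled for the given fields. */
    public static String build(String solrBase, String keyword, String... facetFields) {
        StringBuilder sb = new StringBuilder(solrBase)
                .append("/select?q=").append(URLEncoder.encode(keyword, StandardCharsets.UTF_8))
                .append("&rows=10&facet=true");
        for (String field : facetFields) {
            // One facet.field parameter per facet for which counts are wanted
            sb.append("&facet.field=").append(URLEncoder.encode(field, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```

Profiling which of these parameters (number of facets requested, rows fetched per record page) dominates response time would be a natural first step in the suggested analysis of alternative querying strategies.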

There are no known issues with the stability of the SOLR database or the importer. However, the web application has recently suffered from stability issues: it has regularly stopped working when heap space usage reached critical levels, causing it to become unresponsive and requiring a restart of the Tomcat servlet container to become functional again. The exact cause has not been determined with certainty, but the inability of Wicket’s caching mechanism to deal with heavy traffic from web crawlers (such as the Googlebot) is assumed to be a major factor. An upgrade to the latest version of the web application framework Wicket, which gives more control over the caching parameters, is expected to remedy this. This upgrade has been applied to the current development version and will be tested over the coming weeks.

The exploitation of access statistics could be extended. At the moment, some relatively crude statistics are available to a limited group of people. More advanced analysis methods (e.g. path analysis and clustering of accessed resources) could help gain more insight into the way the VLO is used, which in turn could help optimise the application’s usability and performance. There is also potential for providing the metadata providers with (aggregated) statistics on their collections.

The application could also benefit from more advanced methods of performance analysis. Performance bottlenecks could be detected with relatively little effort by using a number of existing tools. Real-time performance monitoring of the production environment is already quite sophisticated, but could be improved, for example, by probing status values of the SOLR database.

The recently added support for theming has made it easier to integrate the VLO in different areas of the CLARIN ecosystem. At the moment the themes are built in and there are only two, but this could be refactored into a more flexible solution. The same instance can be rendered using any of the available themes, but it is also technically possible to run separate instances of the VLO, each with its own set of records, mapping, configuration, theming, etc. However, this is not explicitly supported at the moment, so some attention would be required in terms of documentation and configuration should this become common practice within CLARIN.

In terms of infrastructure integration, the VLO can serve both as a portal and as an endpoint. It already integrates with the Federated Content Search common search interface of CLARIN-D, allowing the user to perform a content search in the context of metadata records annotated with content search endpoint information. Furthermore, it is ready for integration into a broader CLARIN portal that has been envisioned within the community: keyword search and/or facet selection can be performed through query parameters in the URL of the VLO main page, allowing external pages to trigger arbitrary searches within the VLO.
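As an illustration of such deep linking, an external page could construct a VLO search URL along the following lines; the parameter names (q, fq) and the facet-value syntax are assumptions made for the purpose of the example, not taken from the VLO documentation:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/**
 * Sketch of how an external page could deep-link into the VLO, passing
 * a keyword and facet selections as URL query parameters. The parameter
 * names (q, fq) and the "facet:value" syntax are hypothetical.
 */
public class VloLink {

    /** Builds a VLO search URL for the given keyword and facet selections. */
    public static String searchLink(String vloBase, String keyword, Map<String, String> facets) {
        StringBuilder sb = new StringBuilder(vloBase)
                .append("?q=").append(URLEncoder.encode(keyword, StandardCharsets.UTF_8));
        for (Map.Entry<String, String> facet : facets.entrySet()) {
            // One fq parameter per selected facet value, URL-encoded
            sb.append("&fq=").append(
                    URLEncoder.encode(facet.getKey() + ":" + facet.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```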

3 Code base analysis

The components of the VLO are organised into Maven projects, which makes it relatively easy to manage the build and packaging process, internal dependencies and external libraries. Moreover, it makes the code base highly portable. The structuring of these projects could be slightly improved.

The code itself is mostly Java-based (complemented by some HTML, CSS and JavaScript for the views of the web application) and of reasonable quality. However, best practices are not followed consistently throughout the code, which compromises readability and maintainability to some degree. Some of the frameworks are not used entirely as intended or to their full extent. The code base would also benefit from further and stricter modularisation.

An important aspect of any code base of reasonable size is automated testing. Both components with custom code, i.e. the importer and the web application, contain unit tests. The proportion of code covered by these tests is 65% and 17% respectively, which leaves room for improvement. Increasing the degree of modularity and loosening the coupling between modules (by means of dependency injection) would improve the testability of the code, in particular by increasing the granularity at which tests can be performed.
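The point about dependency injection can be illustrated with a minimal sketch: when a collaborator is injected through the constructor rather than instantiated internally, a unit test can substitute a trivial fake. All names below are hypothetical and not taken from the VLO code base:

```java
/**
 * Illustration of constructor-based dependency injection: the mapping
 * step is hidden behind an interface, so a unit test can substitute a
 * fake implementation without touching SOLR or the real mapping tables.
 */
interface LanguageMapping {
    /** Maps a raw metadata value to a facet value, or null if unknown. */
    String map(String rawValue);
}

public class RecordImporter {

    private final LanguageMapping mapping;

    // The collaborator is injected rather than created internally,
    // which is what makes this class testable in isolation.
    public RecordImporter(LanguageMapping mapping) {
        this.mapping = mapping;
    }

    /** Returns the mapped facet value, falling back to "unknown". */
    public String languageFacet(String rawValue) {
        String mapped = mapping.map(rawValue);
        return mapped == null ? "unknown" : mapped;
    }
}
```

A test can then exercise the importer logic with a one-line fake mapping, at a much finer granularity than an end-to-end import run.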

A plan for carrying out manual acceptance tests is available but has not been maintained since mid-2012. Updating this plan and reintroducing the formalised acceptance test as part of the release process is desirable.

Both Javadoc and documentation by means of inline comments are present in both components and cover the code base to a reasonable degree. However, the former is out of date or incomplete in a number of places, and the latter could be denser in complicated sections of the code.

The (future) maintainability of the code base depends to a large degree on the expertise level required to work on it. This in turn depends largely on the libraries and frameworks on which the code depends. The most important frameworks used in the components of the VLO that require some degree of expertise beyond standard Java are the web application framework Apache Wicket, the search platform Apache SOLR, the streaming XML parser VTD-XML and the XML serialisation framework Simple. The last of these could relatively easily be replaced with the more standardised JAXB technology. The other three all have a relatively steep learning curve, but for none of them is there an obvious alternative that clearly requires a lower level of expertise.

A number of the libraries on which the VLO depends have newer versions available (mostly minor, but in some cases, such as Wicket and SOLR, there are new major releases). Upgrading increases the need for testing and in some cases implies fairly fundamental refactoring, but in general it is advisable to stay up to date with the latest versions of libraries, as newer versions generally bring improved performance, stability and security.

[1] Van Uytvanck, D., Stehouwer, H., & Lampen, L. (2012).