Competitiveness and Innovation Framework

CIP-ICT PSP-2009-3-250430 - Annex 1 - Description of Work - GALATEAS

COMPETITIVENESS AND INNOVATION FRAMEWORK PROGRAMME

ICT Policy Support Programme (ICT PSP)

Multilingual Web

ICT PSP call identifier CIP-ICT-PSP-2009-3

ICT PSP Theme/objective identifier: Theme 5.1

Grant agreement for: PILOT TYPE B

Annex I - “Description of Work”

Project acronym: GALATEAS

Project full title: Generalized Analysis of Logs for Automatic Translation and Episodic Analysis of Searches

Grant agreement no.: 250430

Date of preparation of Annex I : February 5, 2010

Date of approval of Annex I by Commission:

Part A

A1 Project summary

A2 List of beneficiaries

A3 Overall budget breakdown for the project

Part B

B1 Project description and Objectives

B.1.1 Project objectives

B.1.2 EU dimension

B.1.3 Maturity of the technical solution

B.1.3.a The overall picture

B.1.3.b The Components

B2 Impact

B.2.1 Target outcome and expected impact

B.2.2 Long term viability

B.2.2.a Market Size

B.2.2.b Market Approach

B.2.2.c Business Projections

B.2.3 Wider deployment and use

Open Source Dissemination

B3 Implementation

B3.1 Consortium and key personnel

B.3.1.a The Consortium

B.3.1.b The Partners

B.3.1.c Preliminary Allocations

B.3.2.a Chosen approach

B.3.2.b Work Plan

B.3.2.b.1 WT1: Work package list

B.3.2.b.2 WT2: Deliverables list

B.3.2.b.3 WT3 Work package descriptions

B.3.2.b.4 WT4 List of Milestones

B.3.2.b.5 WT5 List of tentative reviews

B.3.2.b.6 WT6 Summary of staff effort

B.3.3 Project management

B.3.3.a Project Management Committee

B.3.3.b External Liaison and Quality Manager

B.3.3.c Technical Coordinator

B.3.3.d Project Coordinator

B.3.3.e Work Package Leaders

B.3.3.f Conflict Resolution

B.3.3.g Organizational and change management

B.3.4 Security, privacy, inclusiveness, interoperability; standards and open source

B.3.5 Resources to be committed

B.3.6 Dissemination / Use of results

B.3.6.a General Dissemination activities

B.3.6.b Assessment of Dissemination activities

Part A

A1 Project summary

With the growth of digital libraries and digital library federations (as well as partially unstructured collections of documents such as web sites), a large number of vendors offer engines for retrieving content and metadata via search requests issued by end users (queries). In most cases these queries are just unstructured fragments of text in a specific language.

The first service offered by GALATEAS (LangLog) focuses on extracting meaning from these lists of queries and is addressed to library, federation and site managers. Unlike mainstream services in this field, GALATEAS services will not consider the standard structured information in web logs (e.g. click rate, visited pages, users’ paths inside the document tree) but the information contained in queries from the point of view of language interpretation. By subscribing to LangLog, federation administrators and managers will be able to answer questions such as: “Which topics are most commonly searched in my collection, in a given language?”; “How do these topics relate to my catalogue?”; “Which named entities (people, places) are most popular among my users?”.

The second problem addressed by GALATEAS is Cross Language Information Retrieval (CLIR), i.e. the capability of typing a query in one language and retrieving documents that are available in other languages. The CACAO consortium is already successfully providing services for indexing and searching over digital libraries and metadata repositories. During the commercial exploration carried out to market CACAO, it emerged that certain institutions prefer to keep indexing and searching on their own premises (using their own preferred search engine) and would be perfectly satisfied with a plain query translation service. The second service offered by GALATEAS (QueryTrans) has the ambitious and innovative goal of providing the first web translation service specifically tailored to query translation.

The languages addressed by both LangLog and QueryTrans are: Italian, French, English, German, Dutch, Modern Arabic and Polish.

A2 List of beneficiaries

# / SHORT NAME / LONG NAME / COUNTRY
1 / XEROX / XEROX SAS / FR
2 / CELI / CELI SRL / IT
3 / UVA / UNIVERSITEIT VAN AMSTERDAM / NL
4 / UNITN / UNIVERSITA DEGLI STUDI DI TRENTO / IT
5 / OD / OBJET DIRECT SAS / FR
6 / GONETWORK / GONETWORK SRL / IT
7 / BAL / THE BRIDGEMAN ART LIBRARY LIMITED / UK
8 / UBER / HUMBOLDT-UNIVERSITAT ZU BERLIN / DE

A3 Overall budget breakdown for the project

Part B

Project profile

Description of the issue and proposed service/solution
With the growth of digital libraries and digital library federations (as well as partially unstructured collections of documents such as web sites), a large number of vendors offer engines for retrieving content and metadata via search requests issued by end users (queries). In most cases these queries are just unstructured fragments of text in a specific language.
The first service offered by GALATEAS (LangLog) focuses on extracting meaning from these lists of queries and is addressed to library, federation and site managers. Unlike mainstream services in this field, GALATEAS services will not consider the standard structured information in web logs (e.g. click rate, visited pages, users’ paths inside the document tree) but the information contained in queries from the point of view of language interpretation. By subscribing to the LangLog service, federation administrators and managers of content-providing web sites will be able to answer questions such as: “Which topics are most commonly searched in my collection, in a given language?”; “How do these topics relate to my catalogue?”; “Which named entities (people, places) are most popular among my users?”. LangLog will be available in at least 7 languages, namely Italian, French, English, German, Dutch, Modern Arabic and Polish.
The second problem addressed by GALATEAS is Cross Language Information Retrieval (CLIR), i.e. the capability of typing a query in one language and retrieving documents that are available in other languages. The CACAO consortium (an EU project funded under eContentPlus) is already successfully providing services for indexing and searching over digital libraries and metadata repositories. During the commercial exploration carried out to market CACAO, it emerged that certain institutions prefer to keep indexing and searching on their own premises (using their own favourite search engine) and would be perfectly satisfied with a plain query translation service.
The second service offered by GALATEAS (QueryTrans) has the ambitious and innovative goal of providing the first translation service specifically tailored to query translation.
The two services are tightly connected: it is only by virtue of a successful launch of LangLog that the consortium will gather enough multilingual queries to train the Statistical Machine Translation system adopted by QueryTrans.
Target users and their needs
Indirect “users” of the service are information seekers, who would benefit from improved, possibly cross-lingual, search services. However, the GALATEAS services are provided not directly to end users but to administrators and managers of federations and search engine installations. GALATEAS thus targets a high-end B2B market in which customers will mostly be organizations running medium and large-sized content federations. Principally, it will answer the following needs:

Need for a federation manager to understand what users are looking for, irrespective of the contents they actually access.
Need for a content provider to understand in which directions the collection should be extended.
Need for a library administrator to understand which categories in the catalogue fit the desiderata of final users best and which fit them least.
Need for a library manager to understand the behaviour of the library's users.
Need to provide cross-language information retrieval in a seamless way, without changing anything in the way in which documents are indexed and managed.

Usage
LangLog: In a typical context the customer will sign a contract for the provision of a log analysis service with one of the commercially active GALATEAS partners. At the agreed frequency LangLog will harvest log files from the federation and start processing them. After processing (whose duration is almost linearly dependent on the number of queries), it will return either a report describing all the information needs that have been detected or extended log files to be integrated into the federation’s preferred log reporting system.
QueryTrans: The customer will negotiate with one of the commercially active GALATEAS partners concerning the level of the service. Example parameters for such a negotiation are:

Number of covered languages;
Domain of the translation (whether generic or in a specific domain);
Possible training of a specific query translation system;
…

After the negotiation phase (which should be kept as light as possible), the customer will benefit from a service that, for any query in a given language, returns its translations in n other languages (targeted languages are Italian, French, English, German, Dutch, Modern Arabic and Polish). It is up to the content provider to intercept the user query as well as her need for cross-linguality. It is also the responsibility of the content provider to match the translations of the query against its indexes and/or databases.
Technology
Both LangLog and QueryTrans will be based on standard web service technology, adopting, at least in a first phase, Axis2 as an application server. In this phase we will also adopt a hosting strategy with a single processing unit serving on average 6 customers. The main technological value, however, lies not in the modalities of service provision but in the underlying components, which represent a commercial take-up of the most innovative available software for machine translation, corpus alignment, automatic classification and clustering.
Content
The LangLog service will need access to specific software and resources such as NLP components, thesauri, classification schemas etc. These are all available to partners of the consortium. The training of the machine translation system to be used by QueryTrans will initially require access to massive quantities of possibly multilingual log data. As detailed elsewhere, we will acquire this data in exchange for the royalty-free provision of the LangLog service for a certain period. Moreover, already at time t0 the consortium can rely on a substantial amount of query logs derived from Excite, Yahoo!, Europeana, TEL, and all libraries federated in the course of the CACAO project.
Sustainability
Commercial partners of the project will jointly or separately exploit the GALATEAS technology under a fee-based model for services (where the underlying technology for the services will be available either under open source licenses or proprietary licenses). GALATEAS exploitation will therefore be supported by on-line content providers, which in 2010 are expected to produce revenues of 8.3 billion in Europe alone (not counting the 170 billion Euros generated yearly by eCommerce sites, which generally rely on a resident search engine). Under the provisional business plan described in the relevant section, it is expected that already in the first year after project termination GALATEAS technology will produce a yearly revenue stream of 2 million Euros, with an NPV (Net Present Value) of about 3 million Euros (considering project costs) 3 years after project termination.
Ownership
The ownership of technology produced under GALATEAS is based on a mixed model of open source and proprietary software development and fee-based access to proprietary services. In short, all software developed in the project using programs originally available under an open source license will continue to be licensed under an open source modality. This use of open source software relies, however, on proprietary software and web services (mostly NLP and dictionary-based translation services), which will remain proprietary and will be accessed under standard commercial conditions once commercial exploitation starts. By decoupling GALATEAS from its ancillary software and services, we will open service provision to competition: it is not unreasonable to think that in the future a partner company providing e.g. NLP for a certain language will be replaced by another company which provides better software or services at reduced fees.
Concerning log file ownership, title will remain with the content provider. However the content provider must authorize GALATEAS to produce (and use) derivative works on the basis of such logs: most crucially query logs (once queries have been aligned) will constitute the training set for the query centred machine translation system.

B1 Project description and Objectives

B.1.1 Project objectives

Every day, millions of search queries are issued to content providers ranging from all-purpose web information sites (such as Google and Yahoo!) to digital library sites (such as Europeana and TEL) to merchant sites (such as Kelkoo and PriceGrabber). These queries are a precious resource for understanding user behaviour with respect to a collection of documents. From a careful analysis of such queries, content providers can understand what information users are really looking for, which strategies are most commonly used to retrieve digital objects, and how well the content offered by the web site matches the needs of end users. It is thus astonishing that no provider of log analysis services or products has ever attempted to go beyond the type of support offered by services such as Google Analytics, just to mention the most popular. These services provide tools to segment user queries into words and provide statistics about the occurrences of single words, but this is far from satisfying content providers' needs:

Existing analysis tools only consider words as chains of characters. Therefore any generalization concerning searches about, for instance, “sea” and “ocean” is simply lost. Moreover, the ambiguities intrinsic to language are not resolved, so that a word such as “bank” will be ranked irrespective of its meaning.
They do not perform any match between user searches and the informative backbone of the content aggregation, be it a standard classification system, subject headings, product types, or a plain list of categories.
They do not provide any hint about search episodes. Each search is seen as an isolated event and there is no attempt to determine sequential patterns of search activities.
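The first limitation above can be illustrated with a minimal sketch. It is not the LangLog implementation: the two-entry thesaurus, the function names and the sample queries are all assumptions made for illustration, and the sketch deliberately ignores word-sense disambiguation (the “bank” case).

```python
from collections import Counter

# Hypothetical two-entry thesaurus, for illustration only; the resources
# actually used by LangLog are not specified here.
TOY_THESAURUS = {"sea": "body_of_water", "ocean": "body_of_water"}


def word_counts(queries):
    """Count surface words, as a purely string-based log analyser would."""
    return Counter(word for query in queries for word in query.lower().split())


def concept_counts(queries, thesaurus):
    """Count concepts: words found in the thesaurus are replaced by their concept."""
    return Counter(
        thesaurus.get(word, word)
        for query in queries
        for word in query.lower().split()
    )


# Invented sample queries.
queries = ["sea storm", "ocean storm", "Ocean waves"]
by_word = word_counts(queries)
by_concept = concept_counts(queries, TOY_THESAURUS)
```

In this toy data the string-based count reports “sea” and “ocean” as two separate, weaker signals, whereas the concept-based count merges them into a single entry.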

Queries as recorded in log files, besides being a precious resource for understanding users’ behaviour, could also become a key resource for achieving cross-lingual access to information, if appropriate alignment algorithms could be applied to them. Indeed, if a sufficiently large number of query pairs that are translation equivalents were available, it would be relatively easy to derive a machine translation system specially adapted to deal with queries. It would consequently be possible to set up a service for plain query translation: such a service could be accessed by any kind of monolingual search engine in order to acquire cross-language functionalities.

The objective of GALATEAS is to fill these gaps by setting up two web services:

LangLog: It will analyze transaction logs containing queries to search engines for a given content provider. By applying statistical technologies coupled with language-oriented services, it will produce reports concerning the informational needs of the users accessing that particular aggregation. In other words, in the same way in which standard log analysis systems provide generalizations of the paths of users inside a web site, LangLog will provide generalizations of the actions that information seekers perform in order to find contents inside a searchable collection of digital objects.
QueryTrans: It will translate queries coming from an external search engine into several target languages: the external search engine will use these translations to return to the user results in languages different from the one in which the query was formulated. To be clear, QueryTrans is not a cross-language information retrieval system performing indexing and search, but just a query translation service.

These two web services are tightly linked. Apart from the fact that they access the same range of NLP-based web services, LangLog is essential to allow the continuous acquisition of large quantities of queries in different languages: it is only on the basis of such an acquisition that the machine translation systems that constitute the backbone of QueryTrans can be trained and thus the service itself can be provided.

The LangLog service will provide query log analysis for at least 7 languages, namely Italian, French, English, German, Dutch, Modern Arabic and Polish. It is therefore conceivable that a query-oriented machine translation system could be made available for each combination of these languages (42 pairs). In practice, it is unlikely that during the project duration the LangLog service will be able to gather enough translationally equivalent pairs to cover each language combination. The project will therefore focus on Italian, French, German and English combinations (all pairs). For Dutch, Modern Arabic and Polish it will rather concentrate on finding translation equivalents with respect to English and training the respective machine translation systems: in this way the missing pairs could be covered by transitive translation using English as a pivot language.
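The pair arithmetic and the pivot strategy described above can be sketched as follows. This is an illustrative sketch only: the language codes, the function names and the assumption that each Dutch, Modern Arabic and Polish system is trained in both directions against English are ours, not statements of the project plan.

```python
from itertools import permutations

# Two-letter codes are an illustrative convention.
LANGUAGES = ["it", "fr", "en", "de", "nl", "ar", "pl"]
CORE = ["it", "fr", "de", "en"]      # all pairs among these are trained directly
PIVOT = "en"
PIVOT_ONLY = ["nl", "ar", "pl"]      # assumed to be trained only against English


def direct_pairs():
    """Return the ordered (source, target) pairs covered by a dedicated system."""
    pairs = set(permutations(CORE, 2))
    for lang in PIVOT_ONLY:
        # Assumption: both directions are trained against the pivot language.
        pairs.add((lang, PIVOT))
        pairs.add((PIVOT, lang))
    return pairs


def route(source, target, direct):
    """Return the chain of translation steps needed to go from source to target."""
    if source == target:
        raise ValueError("source and target languages must differ")
    if (source, target) in direct:
        return [(source, target)]
    # Transitive translation through the pivot language.
    return [(source, PIVOT), (PIVOT, target)]


all_pairs = set(permutations(LANGUAGES, 2))
direct = direct_pairs()
pivoted = all_pairs - direct
```

Under these assumptions, 18 of the 42 ordered pairs are served by a dedicated system and the remaining 24 are routed through English; for example, `route("nl", "pl", direct)` yields the two steps Dutch to English and English to Polish.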

In summary, from a technological point of view, the main steps through which the objectives above are to be reached are the following:

Taking advantage of the involvement of several partners in different DL initiatives, start gathering constantly increasing amounts of search episodes;
Use session and user-id information together with semantic query analysis in order to determine information paths within search episodes. These information paths will be delivered to information providers, which will be able to tailor both resource acquisition and search strategies on the basis of observed patterns (LangLog service).
Improve and apply the technologies delivered by the CACAO project (three GALATEAS partners were also involved in CACAO) in order to identify queries which can be mutually considered translation pairs. This step will produce a continuously evolving parallel corpus.
Provide tools to integrate data extracted from the parallel corpus into resources for cross-language information retrieval. In particular, we plan to improve the quality of the bilingual translation dictionaries available to at least three partners with query-derived translational equivalents.
Use the corpus to train a hybrid statistical MT system. This phase will deliver an MT system specially designed to translate queries: it would be the first one on the market (QueryTrans service).
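As an illustration of the second step above, the sketch below groups log records into search episodes using only session information and timestamps. It is a simplified assumption, not the project's method: the record layout (session id, timestamp, query), the 30-minute inactivity threshold and the function name are hypothetical, and the semantic query analysis mentioned in the step is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_into_episodes(records, max_gap=timedelta(minutes=30)):
    """Group (session_id, timestamp, query) records into search episodes.

    A new episode starts whenever two consecutive queries of the same
    session are separated by more than max_gap (an assumed threshold).
    """
    by_session = defaultdict(list)
    for session_id, timestamp, query in records:
        by_session[session_id].append((timestamp, query))

    episodes = []
    for session_id, events in by_session.items():
        events.sort(key=lambda event: event[0])   # chronological order
        current = [events[0][1]]
        last_seen = events[0][0]
        for timestamp, query in events[1:]:
            if timestamp - last_seen > max_gap:
                episodes.append((session_id, current))
                current = []
            current.append(query)
            last_seen = timestamp
        episodes.append((session_id, current))
    return episodes


# Invented sample records.
records = [
    ("s1", datetime(2010, 2, 5, 10, 0), "monet"),
    ("s2", datetime(2010, 2, 5, 10, 1), "bank of england"),
    ("s1", datetime(2010, 2, 5, 10, 2), "monet water lilies"),
    ("s1", datetime(2010, 2, 5, 14, 0), "sea paintings"),
]
episodes = split_into_episodes(records)
```

Each resulting episode is an ordered list of queries, which is the kind of unit over which sequential patterns such as query reformulation could then be studied.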

Customers of GALATEAS will be organizations running content-delivering web sites powered by a search engine (Digital Libraries, Content Aggregators, Merchant Sites). We believe that for these organizations the need to understand what their users are looking for, and how they could better match their expectations, is a matter of survival. For many of them, achieving cross-linguality in content access is also a crucial competitive factor.