Data Portal for Machine Translation Data-Collection

Supervisor: Prof. Andy Way

Research Fellow: Dr. Teresa Lynn

Successful data collection is a fundamental requirement for building accurate statistical machine translation (MT) systems. Such data includes parallel bilingual texts, machine-readable dictionaries and glossaries, which come in various file formats.

However, collecting, cleaning and validating data is often a tedious, time-consuming manual task. When data collection is outsourced, semi-automated approaches to data curation are desirable, as they allow third-party, sometimes non-technical, users to contribute to the collection effort by sharing their own data resources.

To this end, the ADAPT Centre's MT group requires the development of a web-based data portal where contributors can upload their data through a simple, easy-to-use process. The portal's functions will include validation checks for the file formats and user-input options for collecting metadata.

The portal will be web-based and password-protected (user accounts), and should allow users to specify what licence, if any, applies to their data.

The portal can be developed in the programming language of choice, but it will need to work in concert with existing data validation and preprocessing software. Additionally, the portal will need to accept a wide variety of file formats: XML, TMX, XLIFF, MS Word and HTML, amongst others. Statistics about the number and nature of submissions, as well as the associated results of validation and preprocessing, will be logged and made available both to the user (regarding his/her own submissions) and to the administrators (system-level overview).
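
As a rough illustration of the kind of format check the portal might run on upload, the sketch below accepts a file by extension and, for XML-based formats, tests that it is well-formed. The function name `check_upload` and the extension lists are assumptions made for this example; the group's existing validation software would do the real work.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical allow-lists, chosen for illustration only.
XML_BASED = {".xml", ".tmx", ".xlf", ".xliff"}
OTHER_ACCEPTED = {".doc", ".docx", ".html", ".htm"}


def check_upload(filename: str, content: bytes) -> tuple[bool, str]:
    """Return (accepted, message) for an uploaded file."""
    suffix = Path(filename).suffix.lower()
    if suffix in XML_BASED:
        try:
            # Well-formedness only; this does not validate against a schema.
            # For untrusted uploads a hardened parser (e.g. defusedxml)
            # would be a safer choice than the standard library one.
            ET.fromstring(content)
        except ET.ParseError as err:
            return False, f"not well-formed XML: {err}"
        return True, "well-formed XML"
    if suffix in OTHER_ACCEPTED:
        return True, "accepted without structural check"
    return False, f"unsupported file type: {suffix or '(none)'}"
```

The returned message could be stored alongside each submission so that it appears in the user's and the administrators' statistics.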

Automatic Error Correction (3rd or 4th Year)

Supervisor: Prof. Andy Way

Over the years I have collected a series of 'typical mistakes' made by students when they attempt to write good scientific English. See: https://docs.google.com/document/d/1yD1JR6IjuddeVTxXQRrfybHHs6HiVjzl3u1FO0rozXw/edit

It would appear that many of these 'errors' can be fixed automatically. Accordingly, the student will write a program that takes in a document containing examples of these errors and outputs a document in the same format with the errors corrected. The project can be made more difficult by extending the range of input files: the program should be able to handle .txt and .tex (LaTeX) files, but conceivably this could be done for .doc/.docx files too. Further difficulty can be added by demonstrating that the formatting (bold, italic, underline, or any other inline tags) is maintained in the conversion process.
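
A minimal sketch of how such a corrector might be structured for plain-text input is given below: an ordered list of pattern/replacement rules applied in turn. The three rules shown are illustrative assumptions and are not taken from the list of mistakes linked above; the name `correct` is likewise hypothetical.

```python
import re

# Illustrative rules only; they are NOT drawn from the supervisor's list.
RULES = [
    # Immediately repeated word: "is is" -> "is"
    (re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE), r"\1"),
    # Stray space before punctuation: "test ," -> "test,"
    (re.compile(r"[ \t]+([,.;:])"), r"\1"),
    # Runs of spaces inside a line (leading indentation is left alone)
    (re.compile(r"(?<=\S)[ \t]{2,}"), " "),
]


def correct(text: str) -> str:
    """Apply every rule in order and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(correct("This is is a test , really."))
# prints: This is a test, really.
```

Keeping the rules in a table makes it straightforward to add one rule per 'typical mistake'. Handling .tex or .docx input would additionally require the rules to skip over markup so that inline formatting survives.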

While this project could be done as a series of commands collected in a Linux script, further marks could be obtained by adding a front end, allowing file input/output, file display etc.

Machine Translation for Food Ingredients (3rd or 4th Year)

Supervisor: Prof. Andy Way

Many people today live with allergies. Ingesting foods containing allergenic ingredients can be very traumatic, with potentially life-threatening consequences. This project envisages (i) the ability to scan a product on a supermarket shelf and immediately have the possible allergens displayed on one's hand-held device; and (ii) translation of these allergens, which could be a potential life-saver when travelling abroad. For the purposes of this project, we'll have to assume that this utility is provided for a foreign visitor to these shores (i.e. we'd do, say, English-to-French translation, rather than the other way round).

The interested student would have to contact a major food supplier (e.g. Kelloggs, Nestle etc., as opposed to Tesco, SuperValu etc.) and see whether the ingredients of their products can indeed be retrieved via a barcode scanner. I have contacted the Manager of the DCU Spar shop, who is keen to help in any way possible. Assuming the ingredients can be retrieved in this way, one could imagine that either the user is alerted to all potential allergens, or a restricted (personalised) set of allergens is displayed depending on the user's settings (perhaps different for various members of the family). Secondly, a machine translation engine would need to be connected to the tool via an API (e.g. Google Translate) so that the allergens can be translated and displayed on-screen.
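
The personalised allergen display could be sketched along the following lines, assuming the ingredients arrive as a single text string once the barcode lookup succeeds. The keyword table, the category names and the function `find_allergens` are all hypothetical; the translation step is left out, since it depends on whichever MT API is chosen.

```python
from typing import Optional

# Hypothetical keyword table: ingredient term -> allergen category.
ALLERGEN_TERMS = {
    "wheat": "gluten",
    "barley": "gluten",
    "milk": "milk",
    "whey": "milk",
    "peanut": "peanuts",
    "egg": "eggs",
    "soy": "soy",
}


def find_allergens(ingredients: str,
                   watch_list: Optional[set] = None) -> set:
    """Return allergen categories mentioned in an ingredients string.

    If watch_list is given, only categories the user has asked to be
    warned about are returned.
    """
    text = ingredients.lower()
    # Naive substring matching: a real tool would need proper tokenisation
    # (e.g. "egg" would also match "eggplant").
    found = {cat for term, cat in ALLERGEN_TERMS.items() if term in text}
    return found if watch_list is None else found & watch_list
```

A per-person `watch_list` would give each family member their own view of the same product.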

Link: Fostering internal communication and collaboration in a large research workplace

Supervisor: Dr. Lamia Tounsi

This project aims to adapt a matching algorithm based upon data collected from researchers' profiles and publications in order to effectively identify potential partners who are compatible for long-term collaboration.

Bag-of-Words Game

Supervisor: Prof. Qun Liu

Given a bag of words (a word can occur more than once in the bag), the user is asked to find the most probable sentence composed of exactly the words in the bag. Every word must be used exactly as many times as it occurs in the bag, and no additional words can be used.
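
The rule that an answer must use each word exactly as often as it occurs in the bag can be checked by comparing word counts. The sketch below assumes whitespace-separated, already-normalised words; the function name `uses_exact_bag` is made up for this example.

```python
from collections import Counter


def uses_exact_bag(bag: list, sentence: str) -> bool:
    """True if the sentence uses every bag word exactly as often as it
    occurs in the bag, and no other words."""
    return Counter(sentence.split()) == Counter(bag)


bag = ["the", "cat", "the", "dog", "chased"]
print(uses_exact_bag(bag, "the dog chased the cat"))  # prints: True
print(uses_exact_bag(bag, "the dog chased cat"))      # prints: False
```

This only verifies that an answer is legal; judging which legal sentence is best would need a separate scoring component, such as a language model.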

Collect and Memorize

Supervisor: Prof. Qun Liu

A user can collect any piece of information (words, idioms, pictures, etc.), save it on the server, and will then be reminded of it from time to time until he/she has memorized it.
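
One possible way to decide when to remind the user is a simple spaced-repetition schedule, in which the gap between reminders grows each time the item is recalled. This is an assumption about how the reminders could work, not something the project description prescribes; the interval values and all names below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical gaps (in days) between successive reminders.
INTERVALS_DAYS = [1, 2, 4, 8, 16]


@dataclass
class Item:
    content: str
    due: date
    level: int = 0
    memorized: bool = False


def review(item: Item, recalled: bool, today: date) -> Item:
    """Update an item after the user has been reminded of it."""
    if recalled:
        item.level += 1
        if item.level >= len(INTERVALS_DAYS):
            item.memorized = True  # no further reminders
            return item
    else:
        item.level = 0  # forgotten: start again with the shortest gap
    item.due = today + timedelta(days=INTERVALS_DAYS[item.level])
    return item


def due_items(items: list, today: date) -> list:
    """Items the user should be reminded of today."""
    return [i for i in items if not i.memorized and i.due <= today]
```

With these intervals an item counts as memorized after five successful recalls in a row; a different threshold or a user-driven 'I know this' button would work equally well.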

Adaptive Neural Machine Translation

Supervisor: Prof. Andy Way

Research Fellow: Dr. Jinhua Du

In this project, we will investigate how to build an adaptive, incrementally retrained neural machine translation system that can quickly learn from a new domain and achieve reasonable translation quality.

Neural Translation Recommendation System

Supervisor: Prof. Andy Way

Research Fellow: Dr. Jinhua Du

In this project, we will develop a high-quality neural-network-based translation recommendation system that chooses among several translation outputs, which might come from translation memory systems, statistical machine translation systems or neural machine translation systems. The purpose of the system is to identify the best translation output and recommend it to translators, saving them post-editing time.