PROTEUS PROJECT

NEW YORK UNIVERSITY

Ali Argyle

Darren Jahnel

Jon Liebowitz

Sachiko Omatoi

Jeremy Shapiro

Graig Warner

Dan Melamed

A Bitext Harvesting and Distribution System


PROTEUS PROJECT AT NEW YORK UNIVERSITY

Bitext Harvester Project Guide

 NYU

Proteus Project

715 Broadway 7th floor

New York NY, 10003


Table of Contents

CHAPTER 1
What is a Bitext Harvester?

CHAPTER 2
Bitext Harvester Database
    Database Requirements
    Database Installation

CHAPTER 3
Bitext Harvester Spider
    Spider Introduction
    Spider Installation
    Spiderman Menu Command Summary

CHAPTER 4
Bitext Harvester Filters
    Filter Setup
    Running a Filter
    Terminating the Filter
    A Sample Scenario
    Notes on Filters

CHAPTER 5
Web Based User Interface
    Installation Steps

CHAPTER 6
Developer Wish List for Future Improvement

APPENDIX A
Requirements Document

APPENDIX B
Database Spec

Chapter 1

What is a Bitext Harvester?

We must acknowledge the "fact" of bilingualism and build upon it.
-Maurice Beaudin

The Bitext Harvester Suite of applications was created to exploit the ever-expanding resource of online parallel text data. A parallel text, or 'bitext', is a pair of documents that are identical in content but written in different languages. These bitext resources are extremely useful in the development of natural language processing techniques, and can serve as both training and testing data for the NLP community.
The Bitext Harvester is a system for collecting, processing, and distributing parallel texts retrieved from the Web.

The system works as follows:

Spider - A spider constantly trawls the Web looking for pairs of documents that might be parallel texts, downloads them locally, and places key management information into a database.

Filter - The resulting documents are processed by filtering programs, which help decide whether a pair is indeed a parallel text worth saving. For example, one filter might determine which language each of the documents is written in. The results are recorded in the database.

Web Interface - A web site enables people to investigate the progress of the spidering and filtering; for example, someone could specify two languages and ask for all parallel texts in those particular languages. The value of this resource is significantly increased by the harvester and by the tools that have been developed to tag and keep track of the candidate parallel texts that have been gathered.

The Bitext Harvester Application consists of four main components: the spider, the user interface, the filter capability, and the database backend. Each of the first three may be used independently along with the database to perform its specific function. The web spider downloads pages and deposits data into the database, the filters analyze and tag the data in the database, and the user interface allows a zipfile summary of the data to be downloaded via a web interface. The installation instructions below pertain to all four components.

Chapter 2

Bitext Harvester Database

The Database works in conjunction with all parts of the Bitext Harvester Suite, and is the first thing you will need to install. The installation instructions below will help you set up your MySQL database with the appropriate tables and fields.

Database Requirements

The first step to getting any of the three main components working is to have a working database for them to talk to. The server used for development and testing was MySQL version 4.1, which can be obtained from the MySQL web site.
Note: As of December 15, 2003, this was the only version of MySQL with the subquery capabilities needed by the project. Hopefully a future version of the Bitext Harvester will be able to use stored procedures, which will be supported in MySQL version 5.

Database Installation

  1. Download and install MySQL >= 4.1 from the MySQL web site.
  2. Run the mysql database server daemon on your machine.
  3. Create a database named bitextharvester as root: mysqladmin create bitextharvester
  4. Create the tables: mysql -u root -p[password] bitextharvester < [TableName].tbl
  5. Load the static data: mysql -u root -p[password] bitextharvester < [TableName].dat (for Languages and Topics.)
  6. Add rows to the Filters and FilterInstances tables, as suggested in the filter user guide.
  7. Details of the MySQL commands can be found in the online MySQL manual.
  8. See Table.xls for table details such as data types and field descriptions.
  9. CreateBitextHarvester.sh [database password] will perform steps 3-5 for you.
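
Once the tables are created and loaded, you can verify that the database is reachable over JDBC, which is how the Java components talk to it. The following is only a minimal sketch, assuming the MySQL Connector/J driver jar is on your classpath and that you connect as root on localhost; adjust the URL and credentials to match your installation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbCheck {
    public static void main(String[] args) throws Exception {
        // Load the MySQL Connector/J driver (assumed to be on the classpath).
        Class.forName("com.mysql.jdbc.Driver");

        // Connect to the bitextharvester database created in step 3.
        // The root password is passed as the first command-line argument.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/bitextharvester", "root", args[0]);

        // The Languages table was loaded with static data in step 5,
        // so a non-zero count indicates the installation worked.
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM Languages");
        if (rs.next()) {
            System.out.println("Languages loaded: " + rs.getInt(1));
        }
        conn.close();
    }
}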

Developer Notes
All Ids are auto-incremented. (MySQL assigns the Id automatically. Deleting an entire table will not reset the row count; re-install the table or use the truncate command.)
At this point, there are no specific users or table access permissions for the different processes (Spider, Filter, and Web file download).

Chapter 3

Bitext Harvester Spider

Not all keys hang from one girdle.
-Anonymous

The spider's job is to trawl the web looking for possible parallel text candidate pairs. Each spider has some basic configuration information, such as where to start, how to choose the pages that are most likely to be bitext pairs, what to call itself, and so on. For your first spider we recommend that you try the default SimpleSpider. Once you have a better idea of how the spider works, you will be able to build one that does exactly what you need and grabs the type of documents that you want.

Spider Introduction

Now that the database is in place, you can install the spider and start putting data into it. There are two executables that you will need: the spider.sh executable is used to instantiate a new spider, and the spiderman.sh executable is used to manage any spiders you may have running. Two different spider package tars are included in the distribution, one for developers with all of the source included and one for run-only purposes. If you are following this demonstration, just use the simpler run-only version, called spiderman_bin.tar.

Once the SpiderMan has been started, it will run until manually shut down. When the SpiderMan starts up, it also starts a background thread. This thread monitors the SPIDER database table for spider entries. Each entry in the table gives the SpiderMan the spider's name, along with the RMI registry where it has registered. [RMI is a Java mechanism that allows remote processes to call methods on one another; a more detailed explanation is beyond the scope of this document, but is easily found on the web.] For every spider found in the DB, the SpiderMan attempts to look it up and validate that it is alive. If it is, the spider is stored in a HashMap; if it is not alive, the row is deleted from the DB by the SpiderMan. Once the spider is in the map, the user is able to call various actions on the spider.
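
The monitoring thread can be pictured roughly as follows. This is only an illustrative sketch; the real code lives in the edu.nyu.bitext.server packages. The SPIDER table columns (name, host, port) come from the example row shown later in this chapter, while the rest of the structure is an assumption made for the example.

import java.rmi.Naming;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;

// Illustrative sketch of the SpiderMan's registry-monitoring pass.
public class MonitorSketch implements Runnable {
    private final HashMap registeredSpiders = new HashMap();  // spider name -> RMI stub
    private final Connection db;

    public MonitorSketch(Connection db) { this.db = db; }

    public void run() {
        // In the real SpiderMan this presumably repeats on an interval.
        try {
            Statement stmt = db.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT name, host, port FROM SPIDER");
            while (rs.next()) {
                String name = rs.getString("name");
                String url = "rmi://" + rs.getString("host") + ":"
                        + rs.getInt("port") + "/" + name;
                try {
                    // Look the spider up in its RMI registry; keep it if it responds.
                    Object spider = Naming.lookup(url);
                    registeredSpiders.put(name, spider);
                } catch (Exception notAlive) {
                    // The spider is unreachable: remove its row from the SPIDER table.
                    Statement del = db.createStatement();
                    del.executeUpdate("DELETE FROM SPIDER WHERE name = '" + name + "'");
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}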

The SimpleSpider implementation is a web crawler that iterates through a list of queries and calls a particular search engine with each query. The resulting HTML page is scraped for all links. For each link found, that page is retrieved. That page is then scraped looking for the potential link that caused the page to be returned by the engine in the first place. If this link exists, both files are retrieved and stored (with duplicate checking using a hash scheme to avoid unnecessary duplicate files). An example of a query might be 'click for French', in which case the page found and the page that it points to may be retrieved as a potential bitext. That being said, this demo spider does a lousy job of finding accurate bitext pairs; it is set up to be greedy and bring in many texts, so that additional filters may decide to keep or discard them. Why is it lousy? When a page is found by using a search engine to query on 'click for french', the page that may be the French version is very seldom linked with the words 'click for french'. What this 'greedy' implementation does is simply search for any link that contains any word in the query (and hope for the best). Other problems occur in how these links are formatted (how their paths are structured, etc.), so a high percentage produce malformed URLs. More precise scraping methods could be used, but were beyond the scope of this project.
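
The 'greedy' matching step described above amounts to a simple word-containment test. The sketch below illustrates the idea; the method name and the use of anchor text are assumptions made for this example, and the actual logic is in the SimpleSpider source under edu.nyu.bitext.spider.

// Illustrative sketch of SimpleSpider's greedy matching: keep any link whose
// anchor text contains any word from the query (e.g. 'click for French').
public static boolean looksLikeCandidate(String anchorText, String query) {
    String text = anchorText.toLowerCase();
    String[] words = query.toLowerCase().split("\\s+");
    for (int i = 0; i < words.length; i++) {
        if (text.indexOf(words[i]) >= 0) {
            return true;   // greedy: a single matching word is enough
        }
    }
    return false;
}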

Spider Installation

1. Make certain that you have J2SE 1.4.1 or greater installed

2. Follow the database installation instructions above if you haven't already done so

3. Choose which tar file (run-only or source) you would like to use and unpack it.

Unpack the run only tar file: > tar xzf spiderman_bin.tar.gz

The following structure will be created:

startrmi: start script for RMI
spiderman
    spiderman.sh: start script
    config
    lib
    log
spider
    spider.sh: start script
    config
    lib
    log

Or unpack the developer tar file: > tar xzf spiderman_dev.tar.gz

The following structure will be created:

dist: (distribution of spiderman and simplespider)
lib: (all needed jars)
config: (all config files)
src: (java files)
    edu.nyu.bitext.server
    edu.nyu.bitext.server.db
    edu.nyu.bitext.spider
    edu.nyu.bitext.spider.db
build: (class files)
    edu.nyu.bitext.server
    edu.nyu.bitext.server.db
    edu.nyu.bitext.spider
    edu.nyu.bitext.spider.db

5. Starting the RMIRegistry
The rmiregistry will need to be running for the apps to work correctly. Start the RMIRegistry by typing: > ./startrmi

6. Starting a Spider with Spider.sh:
The spider.sh executable is responsible for instantiating a spider. Before running the spider manager, let's go over an example of starting a spider so we will have something to manage. By default, ./spider.sh will use ./config/simplespider.cfg for its configuration. If you look inside this file, you will notice that the name of the spider is specified as 'SimpleSpider'; this is the name you will use when referring to the spider in the spider manager utility, so make sure it is unique and descriptive for each spider that you start.
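
The configuration is a plain properties file whose first line sets the spider's name via the SPIDERID key (see the spider_french example in the SpiderMan section below). A minimal sketch of the top of simplespider.cfg, showing only the name line; the remaining keys are specific to the spider implementation:

SPIDERID=SimpleSpider
# ... remaining properties are specific to the spider implementation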

Figure 1 shows steps 5 and 6 in a terminal window.

7. Using the Spider Manager ('SpiderMan'):
The spider manager ('SpiderMan') system is a mechanism built to allow for central control of multiple spiders. A spider may be loosely defined as a process that scans the web for possible bitexts (a bitext is two 'identical' texts in different languages). Technically, a spider can be anything, as long as it implements the edu.nyu.bitext.shared.Spider interface. For instance, a spider could potentially scan repository files instead of web crawling, and still be controlled via the SpiderMan.
To run the SpiderMan, go to the install directory and type: > ./spiderman.sh

A usage menu will be shown [NOTE: items in brackets are optional parameters]:
Implementations of these commands are spider specific, and it is up to the individual spider to implement them however it chooses. The SpiderMan provides a centralized and convenient way to run these commands against any registered spiders. The spider developer should do their best to implement these commands in a reasonable manner. The summary below describes how these commands should behave in a generic way.
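
Concretely, 'implementing a spider' means implementing edu.nyu.bitext.shared.Spider. The authoritative definition is in the source distribution; the sketch below is only an illustration, with hypothetical method names and signatures inferred from the SpiderMan menu commands summarized later in this chapter.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Illustrative sketch only: the real interface is edu.nyu.bitext.shared.Spider.
// Method names and signatures here are guesses based on the SpiderMan commands.
public interface SpiderSketch extends Remote {
    void run(String configFile, String queryFile) throws RemoteException;
    void halt() throws RemoteException;                // pause fetching
    void resume() throws RemoteException;              // the 'continue' command
    void setConfig(String configFile) throws RemoteException;
    void setQuery(String queryFile) throws RemoteException;
    String status() throws RemoteException;
    String showConfig() throws RemoteException;
    String showQuery() throws RemoteException;
}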

Figure 2 – the spiderman in action

In the SpiderMan menu, 'spider' refers to the spider name that you would like to run the command against. The spider name is specified by the user in the spider config file, and all spiders must have a unique ID. If you look at the spider_french/config/spider.cfg example:

The SpiderID is set in the first line: SPIDERID=SimpleSpider_FRENCH

This value is inserted into the 'name' field of the SPIDER table in the database. The 'id' field is an auto-increment field in MySQL, so a row like the following would be generated by MySQL:

id | name         | host      | port | status | created
 4 | SimpleSpider | localhost | 1099 | 0      | 2003-12-22 14:24:56

And yes (as I'm sure you're wondering), unfortunately it is possible for another spider to overwrite an existing spider by using the same name, in which case the SpiderMan uses the most recently registered one. In some ways this makes it easier to redeploy without worrying about running an old version, but it also creates the risk of accidentally (or unknowingly) reusing a name. In future versions this might change.
[Note that the SpiderMan keeps a HashMap keyed by spider name (not id), so if you changed the lookup to use the spider id, the map would have to change as well.]

Spiderman Menu Command Summary

run spider [configfile] [queryfile]
Tells the spider to begin crawling the web (or file systems) in order to locate bitexts. This command may take two parameters:

  • configFile - This is a property file that can be set for a spider. Individual spiders will most likely need specific property files, which should be documented by the spider developer. A configFile for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.
  • queryFile - This is also spider specific, but should most likely be a list of queries for a spider to send to a search engine. A queries.txt for the included 'SimpleSpider' implementation is given as an example in the spiderman/config folder.

setconfig spider configFile

This command sends a property file to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).
setquery spider queryFile
This command sends a query list to the spider. The spider may be implemented to act on it while running (this is up to the spider implementation).
halt spider
Halts the spider, pausing it from fetching bitexts.
continue spider
Wakes the spider up from the halted state.
throttle spider
An optional feature that one could build into a spider: a way to automatically halt the spider if the number of bitexts retrieved gets ahead of the filtering, giving the filters a chance to catch up.
status [spider]
Shows the status of all registered spiders, or of the named spider if one is given. It is up to the spider implementation to provide status information when this command is called.
showquery spider
showconfig spider
Dumps the spider's queries or properties to the console.
quit
Exits the SpiderMan.
help
Displays the full menu of options.
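
For example, a session that starts the SimpleSpider from the demonstration above, checks on it, pauses and resumes it, and then exits might look like the following. The file names assume the sample config and query files in the spiderman/config folder are called simplespider.cfg and queries.txt; adjust them to match your installation.

run SimpleSpider config/simplespider.cfg config/queries.txt
status SimpleSpider
halt SimpleSpider
continue SimpleSpider
quit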

Chapter 4

Bitext Harvester Filters

All roads do not lead to Rome.
-Slovenian Proverb

Now that you have gathered some bitext candidate documents, the next step is to perform any additional processing that can help you decide which of the text pairs are worth keeping. Some basic filters are included in the distribution and should be used as a guide to help you build filters of your own. The filtering system provides a means of automating the process of filtering the documents and bitexts downloaded by the spider. It accomplishes this by storing information about the filtering results in the database and judging, based on user-defined criteria, whether the filtered entity should remain valid for further testing or be marked as invalid.

There are three varieties of filters:

  1. Text filters (update TestResults)
  2. Bitext filters (update TestResults)
  3. Column value filters (update the Texts or Bitexts table)

Filtering is run from the command line (the available options are shown under 'Running the Filter' below). The user needs to provide the following: a configuration file, a properties file, and an executable file. The configuration and executable file locations can be loaded from the database for filters that have already been inserted into the database.

Purpose of the various files:

Properties file – environmental settings (database driver / location / username / password, directory where logs are written to)

Executable file – Performs the testing on one or more files. At minimum, read permission should exist for the executable. Execute permission needs to exist if this file is run without prefixing its name with a shell or interpreter language name (e.g. perl, sh, awk, and so on). It's recommended that the minimum permissions set for this file be 755.
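
For example, to give a filter executable the recommended permissions:

> chmod 755 [EXECUTABLE_FILE]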

Configuration file – settings specific to a filter executable, authored in XML. This communicates the following to the filter system (details included later; an illustrative sketch follows this list):

  • How many files are to be tested by the script
  • What is the syntax used by this script at the command line
  • Of the texts / bitexts kept in the database, which ones should be tested by this filter
  • What results, when printed to standard output, imply that the test has passed or failed
  • Which tables and columns in the database need to be updated
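
To give a feel for the shape of such a file, here is a purely illustrative sketch. The element names and the sample script below are invented for this example and are not the real schema; the authoritative structure is defined by the DTD shipped in the filters config directory.

<!-- Illustrative sketch only: element names here are hypothetical.
     The real structure is defined by the DTD in the config directory. -->
<filter>
  <!-- how many files the script tests at a time (1 for a text filter, 2 for a bitext filter) -->
  <inputCount>2</inputCount>
  <!-- the command-line syntax used to invoke the script -->
  <commandLine>perl language_check.pl %FILE1% %FILE2%</commandLine>
  <!-- which of the texts/bitexts kept in the database should be tested by this filter -->
  <selection>valid bitexts not yet tested by this filter</selection>
  <!-- which results on standard output mean the test passed or failed -->
  <pass>OK</pass>
  <fail>MISMATCH</fail>
  <!-- which table and column in the database to update with the result -->
  <update table="TestResults" column="result"/>
</filter>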

Filter Setup
1. For both developers and users, unpack filters.zip:
>unzip filters.zip
The following directory structure will be created:

  • filters
      • filter_launcher.sh: the launch script
      • properties: the default properties file. Logging is disabled by default.
      • edu/nyu/bitext/filters: source and class files
      • edu/nyu/bitext/filters/utils: source and class files
      • lib: all needed jars
      • logs: default directory where logs will be stored
      • scripts: sample filters
      • config: DTD and XML configuration files for the above scripts

2. If you are running on a Linux machine, you may need to convert the end-of-line characters in filter_launcher.sh to Linux format. You may also need to change some of the paths in filter_launcher.sh to correspond to your install locations.
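
For example, if the dos2unix utility is available on your machine, the line endings can be converted in place:

> dos2unix filter_launcher.sh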

Running the Filter

Execution from the command line takes one of three forms:

  1. To add a new filter to the database and execute it on the bitexts or texts in the database:

>filter_launcher.sh -n FILTER_NAME -c CONFIG_FILE -EXECUTABLE [-d DESCRIPTION] -o PROPERTIES_FILE

where the description is optional and the other flags are mandatory.

  2. To launch a filter with information in the Filters table of the database:

>filter_launcher.sh -i FILTER_ID -o PROPERTIES_FILE

where FILTER_ID is the primary key value from the Filters table.

  3. To query the database for information about available filters:

>filter_launcher.sh -q [FILTER_ID] -o PROPERTIES_FILE