Text Alignment Based on the Correlation of
Surface Lexical Sequences

IASON DEMIROS1,2 , CHRISTOS MALAVAZOS2,3 , IOANNIS TRIANTAFYLLOU1,2

1Institute for Language and Speech Processing, Artemidos & Epidavrou, 151 25, Athens

2National Technical University of Athens

3Hypertech S.A., 125-127 Kifissias Ave., 11524, Athens

GREECE

Abstract: - This paper addresses the problem of accurate and robust text alignment across different languages. The proposed method is inspired by biomolecular sequence comparison and string matching algorithms. Lexical information produced by a text handler and sentence recognizer is transformed, in both languages, into a lexical sequence encoded in a specific alphabet. The algorithm uses a quantification of the similarity between two sequences at the sentence level, together with a set of constraints, in order to identify isolated regions of high similarity that yield robust alignments. Residues are subsequently aligned in a dynamic programming framework. The method is fast, accurate and language independent, and achieves a success rate of approximately 95%.

Keywords: - alignment, lexical sequences, similarity metrics, sequence correlation, dynamic programming.

1  Introduction

Real texts reflect the phenomena, usage and tendencies of a language at a particular place and time. Recent years have seen a surge of interest in bilingual and multilingual corpora, i.e. corpora composed of a source text along with translations of that text in different languages. One very useful organization of bilingual corpora requires that different versions of the same text be aligned. Given a text and its translation, an alignment is a segmentation of the two texts such that the n-th segment of one text is the translation of the n-th segment of the other. As a special case, empty segments are allowed and correspond to translator’s omissions or additions.

The deployment of learning and matching techniques in machine translation, first advocated in the early 1980s as “Translation by Analogy” (Nagao 1984), and the return of statistical methods in the early 1990s (Brown 1993), have given rise to much discussion as to the architecture and constituency of modern machine translation systems. Bilingual text processing, and in particular text alignment together with the exploitation of information extracted from the resulting examples, created a new wave in machine translation (MT).

In this paper, we describe a text alignment algorithm inspired by the recognition of relationships and the detection of similarities in protein and nucleic acid sequences, which are crucial methods in the biologist’s armamentarium. A preprocessing stage transforms series of surface linguistic phenomena, such as dates, numbers, punctuation, enumerations and abbreviations, into lexical sequences. The algorithm performs pairwise sequence alignment based on a local similarity measure, a sequence block correlation measure and a set of constraints. Residues of the algorithm are subsequently aligned via dynamic programming. The Aligner has been tested on various text types in several languages, with exceptionally good results.

2  Background

Several approaches have been proposed for tackling the alignment problem at various levels. Catizone's technique (Catizone 1989) was to link regions of text according to the regularity of word co-occurrences across texts. (Brown 1991) described a method based on the number of words that sentences contain.

(Gale 1991) proposed a method that relies on a simple statistical model of character lengths. The model is based on the observation that the lengths of corresponding sentences in two languages are highly correlated. Although the efficacy of the Gale-Church algorithm is undeniable and has been validated on several language pairs (English-German-French-Czech-Italian), it handles complex alignments poorly.
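To make the length-correlation observation concrete, the following Python sketch scores a candidate sentence pair by how well the target length fits a normal model of the source length. This is an illustration of the idea only: the constants C and S2 are the estimates reported by Gale and Church, and the cost function is a simplified rendering of theirs.

import math

# Length-correlation cost in the Gale-Church style. C is the expected
# target/source character ratio and S2 the variance, as reported by
# Gale and Church; the exact formulation here is a simplification.
C, S2 = 1.0, 6.8

def length_cost(len_source: int, len_target: int) -> float:
    # Low when the two lengths are consistent with the model, high otherwise.
    if len_source == 0 and len_target == 0:
        return 0.0
    mean = (len_source + len_target / C) / 2.0
    delta = (len_target - len_source * C) / math.sqrt(max(mean * S2, 1e-9))
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    prob = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(prob, 1e-12))

print(length_cost(100, 104))  # low cost: lengths correlate
print(length_cost(100, 250))  # high cost: lengths diverge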

Given the availability in electronic form of texts translated into many languages, an application of potential interest is the automatic extraction of word equivalencies from these texts. (Kay 1991) has presented an algorithm for aligning bilingual texts on the basis of internal evidence only. This algorithm can be used to produce both sentence alignments and word alignments.

(Simard 1992) argues that a small amount of linguistic information is necessary in order to overcome the inherent weaknesses of the Gale-Church method. He proposed using cognates, pairs of tokens across different languages that share "obvious" phonological or orthographic and semantic properties, since such pairs are likely to be mutual translations.

(Papageorgiou 1994) proposed a generic alignment scheme that invokes surface linguistic information coupled with information about possible unit delimiters, depending on the level at which alignment is sought.

Many algorithms have been developed for sequence alignment in molecular biology. The Needleman-Wunsch method (Needleman 1970) is an exhaustive approach that performs global alignment on a dynamic programming basis. Wilbur and Lipman (Wilbur 1983) introduced the concept of k-tuples in order to focus only on areas of identity between two sequences. (Smith 1981) describes a variation of the Needleman-Wunsch algorithm that yields local alignments and is known as the Smith-Waterman algorithm. Both FASTA (Pearson 1988) and BLAST (Altschul 1990) place restrictions on the Smith-Waterman model by employing a set of heuristics, and both are widely used in BioInformatics.
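For reference, the following Python sketch shows global alignment in the Needleman-Wunsch style over short symbol strings. The scoring values (match = 1, mismatch = -1, gap = -1) are illustrative assumptions, not parameters of any cited system.

# A minimal Needleman-Wunsch global alignment sketch.
# Scores (match = 1, mismatch = -1, gap = -1) are illustrative assumptions.

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    n, m = len(a), len(b)
    # score[i][j] holds the best score aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

print(needleman_wunsch("eduut", "edut"))  # 3: four matches, one gap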

3  Alignment Framework

3.1  Alignment, the Basis for Sequence Comparison

Aligning two sequences is the cornerstone of BioInformatics. It may fairly be said that sequence alignment is the operation upon which everything else is built. Sequence alignments are the starting points for predicting the secondary structure of proteins, for estimating the total number of different types of protein folds, and for inferring phylogenetic trees and resolving questions of ancestry between species. This power of sequence alignments stems from the empirical finding that, if two biological sequences are sufficiently similar, they almost invariably have similar biological functions and are descended from a common ancestor. This is a non-trivial statement. It implies two important facts about the syntax and the semantics of the encoding of function in sequences: (i) function is encoded into sequence, which means that the sequence provides the syntax; and (ii) there is redundancy in the encoding: many positions in the sequence may change without perceptible changes in the function, so the semantics of the encoding is robust.

3.2  Lexical Information, Alphabets and Sequences

Generally speaking, an alphabet is a set of symbols or characters from which sequences are composed. The Central Dogma of Molecular Biology describes how the genetic information we inherit from our parents is stored in DNA, and how that information is used to make identical copies of that DNA and is also transferred from DNA to RNA to protein. DNA is a linear polymer of 4 nucleotides, and the DNA alphabet consists of their initials, {A, C, G, T}.

The text alignment architecture that we propose quantifies the similarity between two sequences, with the model of nucleotide sequences used in molecular biology replaced by a model of lexical sequences derived from the identification of surface linguistic phenomena in parallel texts.

Inspired by the facts presented in 3.1 about the encoding of function in sequences, we convert the text alignment problem into a problem of aligning two sequences that encode the surface lexical information characterizing the underlying parallel data. Yet, although Cartesian reductionism has proved remarkably successful in molecular biology, we recognize that global, complete solutions to pairwise sequence alignment are, by and large, not available, and we therefore constrain the problem with information that can be inferred from the nature of the text alignment task.

3.3  Handling Texts

Recognizing and labelling surface phenomena in the text is a necessary prerequisite for most Natural Language Processing (NLP) systems. At this stage, texts are rendered into an internal representation that facilitates further processing. Basic text handling is performed by a MULTEXT-like tokenizer (Di Christo 1995) that identifies word boundaries, sentence boundaries, abbreviations, numbers, dates, enumerations and punctuation. The tokenizer makes use of a regular-expression-based definition of words, coupled with precompiled abbreviation lists for many languages and simple heuristics. This proves quite successful in recognizing sentences and words, with accuracy up to 96%. The text handler is responsible for transforming the parallel texts from the original form in which they are found into a form suitable for the manipulation required by the application. Although its role is often considered trivial, it is in fact crucial for the subsequent steps, which rely heavily on correct word boundary identification and sentence recognition. The lexical resources of the handler are constantly enriched with abbreviations, compounding lists, and date, number and enumeration formats for new languages.

A certain level of interactivity with the user is supported through the indirect manipulation of predefined sentence-splitting patterns, such as a newline character followed by a word starting with an uppercase letter. The text handler has been ported to the Unicode standard, and the number of treated and tested languages has reached 14.
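As an illustration of this stage, the following Python sketch splits text into sentences using a sentence-final-punctuation rule guarded by an abbreviation list. It is a minimal sketch only: the pattern and the abbreviation list are hypothetical stand-ins for the handler's much richer, per-language resources.

import re

# Illustrative abbreviation subset; the real handler uses precompiled,
# per-language lists that are constantly enriched.
ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "etc.", "Ave."}

def split_sentences(text: str) -> list[str]:
    # Split on sentence-final punctuation, unless the token carrying
    # the period is a known abbreviation.
    sentences, current = [], []
    for token in re.findall(r"\S+", text):
        current.append(token)
        if token.endswith((".", "!", "?", ";")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He left early; we stayed."))
# ['Dr. Smith arrived.', 'He left early;', 'we stayed.']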

3.4  Sequences of Lexical Symbols

Streams of characters are read into the handler from the parallel texts and successively processed through the different components to produce streams of higher-level objects that we call tokens. Streams of tokens are converted into sequences according to the mapping rules presented in Table 1.

Origin              Symbol
abbreviation        a
number              d
date                h
enumeration         e
initials            i
right parenthesis   k
left parenthesis    m
punctuation         u
delimiters          t
period              p

Table 1: Summary of the sequence alphabet

Figure 1 details an extract of 5 sentence sequences created by converting lexical information according to the rules presented in Table 1. For instance, the fourth sentence of the extract contains an enumeration symbol (e), a number symbol (d), four punctuation symbols (u) and, finally, a delimiter symbol (t) corresponding to a semicolon functioning as the sentence delimiter. The corresponding sentence index precedes each sentence sequence.

1 : d
2 :
3 : e u k d m u u u u u t
4 : e d u u u u t
5 : u u u t

Figure 1: Example of lexical sequences
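The conversion step can be sketched in a few lines of Python. The mapping follows Table 1; the representation of a sentence as a list of token types is a hypothetical simplification of the handler's actual output.

# Mapping from token type to alphabet symbol, following Table 1.
SYMBOLS = {
    "abbreviation": "a", "number": "d", "date": "h", "enumeration": "e",
    "initials": "i", "right_paren": "k", "left_paren": "m",
    "punctuation": "u", "delimiter": "t", "period": "p",
}

def to_sequence(token_types: list[str]) -> str:
    # Token types outside the alphabet (plain words, etc.) are dropped,
    # which is why sentence 2 in Figure 1 maps to the empty sequence.
    return "".join(SYMBOLS[t] for t in token_types if t in SYMBOLS)

# Reproducing sentence 4 of Figure 1:
print(to_sequence(["enumeration", "number", "punctuation", "punctuation",
                   "punctuation", "punctuation", "delimiter"]))  # eduuuut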

4  Algorithm Description

4.1  Problem Formulation

A = {a, d, h, e, i, k, m, u, t, p} is the symbol alphabet.

Each sentence of the parallel texts is transformed into one sequence, according to Table 1. Empty sequences are valid sequences.

S is the set of all finite sequences of characters from A, having their origin in the Source text.

T is the set of all finite sequences of characters from A, having their origin in the Target text.

S_i denotes the i-th sequence in S, corresponding to the i-th sentence of the Source text.

T_j denotes the j-th sequence in T, corresponding to the j-th sentence of the Target text.

4.2  Methodology

The basic presupposition is that parallel files present local structural similarity. A certain degree of structural divergence is expected at the level of sentence sequences, due to translational divergences, as well as at the level of blocks of sentence sequences, due to asymmetric m:n alignments between source and target language blocks.

We argue that parallel files contain a significant number of source and target language blocks of surface lexical sequences, of variable length, that display a high degree of correlation. These blocks can be captured in a highly reliable manner, based solely on internal evidence, and local alignments can be established that play the role of anchors in any typical alignment process.

To this end, similarity is sought along two different dimensions (Figure 2): (a) horizontal or intra-sentential, where corresponding source and target language sentence sequences are compared on the basis of the elements they contain, and (b) vertical or inter-sentential, where corresponding source and target language blocks of sentence sequences, of variable length, are associated on the basis of the similarity of their constituent sentence sequences.

The overall alignment process flow is depicted in Figure 3. At each iteration, candidate blocks of sequences of variable size (N), starting at source file point n, are examined against the target file blocks of sequences through a cross-correlation process described in detail in the following section. Cross-correlation is a mathematical operation, widely used in discrete systems, that measures the degree to which two sequences of symbols or two signals are similar. As shown in Figure 2, only pairs whose maximum normalized deviation is below a threshold originally introduced by (Kay 1991) are considered; the threshold is computed from file_length, the mean number of sentences contained in the parallel texts (Compute Target Search Window).
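The block-matching loop can be sketched as follows in Python. The similarity function sim() stands in for the SIM measure described below, and the fixed window radius is a hypothetical stand-in for the Kay-style deviation threshold.

# Sliding a block of N source sequences (starting at n) over candidate
# target start positions near the diagonal, returning the position whose
# block correlates best. sim() stands in for the paper's SIM measure.

def best_target_block(source, target, n, N, sim, window=25):
    # Expected target position under a roughly linear alignment.
    center = round(n * len(target) / max(len(source), 1))
    best_j, best_score = None, float("-inf")
    lo = max(0, center - window)
    hi = min(len(target) - N, center + window)
    for j in range(lo, hi + 1):
        # Correlation at lag j: mean pairwise similarity of the two blocks.
        score = sum(sim(source[n + k], target[j + k]) for k in range(N)) / N
        if score > best_score:
            best_j, best_score = j, score
    return best_j, best_score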

At each iteration, the search space of potential source anchors is reduced (Search Space Reduction) by examining only those source points n that could constitute the starting sequence of an anchor block. In order for a point n_s to be considered as such, it should meet the following criteria:

∃ target file point n_t such that:

SIM(S_{n_s}, T_{n_t}) ≥ CCS, (1)

SIM(S_{n_s}, T_{n_t}) - SIM(S_{n_s-1}, T_{n_t}) ≥ CCD, (2)

SIM(S_{n_s}, T_{n_t}) - SIM(S_{n_s}, T_{n_t-1}) ≥ CCD (3)

where SIM(·) is the intra-sentential similarity measure, and CCS and CCD are the similarity threshold and the deviation threshold respectively, described in detail in the following section.

At the end of each iteration, the best matching anchor area is stored in the global set of anchor points.
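Criteria (1)-(3) transcribe directly into code. In the following Python sketch, sim(), CCS and CCD are placeholders; the actual SIM measure and threshold values are those described in the next section.

# Checking whether source point ns can start an anchor block, per (1)-(3).
def is_candidate_anchor(source, target, ns, nt, sim, CCS=0.8, CCD=0.2):
    s_cur = sim(source[ns], target[nt])
    if s_cur < CCS:                                               # criterion (1)
        return False
    if ns > 0 and s_cur - sim(source[ns - 1], target[nt]) < CCD:  # criterion (2)
        return False
    if nt > 0 and s_cur - sim(source[ns], target[nt - 1]) < CCD:  # criterion (3)
        return False
    return True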

4.3  Intra-sentential Comparison

Sentences are encoded into one-dimensional sequences based on their respective surface lexical elements. During translation, however, certain elements behave as optional, introducing a significant amount of noise into the corresponding target language sequences. In order to reduce the effect of such distortion on the pattern matching process, sequences are normalized (k and m optional; d+ → d; u+ → u; dud → dd) and the comparison is also performed on the normalized forms. The final similarity measure is produced by averaging the two individual scores. We have experimented with two alternative methods of intra-sentential comparison, edit distance and shared n-gram matching, described below. The latter proved to be more efficient, since it performs equally well while requiring 30 times less processing time (Willman 1994).
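The normalization rules and a shared-n-gram comparison can be sketched as follows in Python. The rules follow the text above; the Dice-style formulation of the n-gram score is an assumption about the exact form of the measure.

import re

def normalize(seq: str) -> str:
    # Normalization rules from the text: k and m optional; d+ -> d;
    # u+ -> u; dud -> dd.
    seq = seq.replace("k", "").replace("m", "")
    seq = re.sub(r"d+", "d", seq)
    seq = re.sub(r"u+", "u", seq)
    seq = seq.replace("dud", "dd")
    return seq

def ngram_sim(s: str, t: str, n: int = 2) -> float:
    # Dice coefficient over shared character n-grams (illustrative form).
    grams = lambda x: {x[i:i + n] for i in range(len(x) - n + 1)}
    gs, gt = grams(s), grams(t)
    if not gs and not gt:
        return 1.0
    return 2 * len(gs & gt) / (len(gs) + len(gt))

def similarity(s: str, t: str) -> float:
    # Average of the raw and normalized comparison scores, as described above.
    return (ngram_sim(s, t) + ngram_sim(normalize(s), normalize(t))) / 2

print(similarity("ekdmut", "edut"))  # normalization lifts the score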