Documentation for the Alignment Set Toolkit
Patrik Lambert (lamb[email protected]c.es)
May 17, 2006
Version 1.1
Contents
1 Description
  1.1 The Alignment Set
  1.2 Description of storage formats
  1.3 Visualisation representations and formats
  1.4 Evaluation
    1.4.1 Unlinked words representation
    1.4.2 Link weights
    1.4.3 Implementation
  1.5 Symmetrisation
2 Reference
  2.1 Command-line Tools
  2.2 Alignment Set definition
      new function
  2.3 Conversion between formats
      chFormat method
  2.4 Visualisation
      visualise method
  2.5 Evaluation
      evaluate method
      The AlignmentEval class
  2.6 Alignment processing
      processAlignment method
      orderAsBilCorpus method
      adaptToBilCorpus method
      symmetrize method
  2.7 Miscellaneous
      chooseSubsets method
3 Known problems
4 To Do List
5 Acknowledgements
1 Description
1.1 The Alignment Set
An Alignment Set is a set of pairs of sentences aligned at the word (or phrase) level.
We use the term file set for the files containing the alignment information in a given
format (only one file in GIZA format, three separate files in NAACL format; cf. section 1.2).
In theory an Alignment Set could be physically spread over one or more file sets, and could
cover only a part of each file set. In the current implementation it can only be contained
in one file set; however, it may be just a part of that file set.
The attributes of an Alignment Set are (for exact details see the reference section: 2.2):
file sets The array of all the file sets where the Alignment Set is physically stored. In the
current implementation this array can only contain one element. The attributes of a file
set are:
location A hash containing the paths of all the files and directories where the alignment
information is stored. In some formats each component of the file set is stored in
a separate file. The way the location hash has to be specified for each format is
explained in section 2.2.
format The format of the file(s) where the Alignment Set is stored (cf section 1.2).
range The first and last sentence pairs to be included in the Alignment Set.
In this toolkit an Alignment Set is a perl object, whose reference is passed to the methods
that use or process this Alignment Set.
1.2 Description of storage formats
In this section we describe the most widely used formats to store Alignment Sets in files.
TALP This is the default format in version 1.1 (but not in version 1.0). The Alignment Set
is stored in three separate files: one for the source sentences, one for the target sentences
and one for the alignments. In each file, each line corresponds to one sentence pair
and the sentence pair order is the same in all files. In this way the source sentence, target
sentence and links of the same sentence pair share the same line number. Source and
target sentences are stored as plain text, and links as a sequence of position pairs
(positions start from 1, as 0 is reserved for the NULL word), each pair joined by an “s”
or a dash (sure links), or by a “p” (possible links). Example:
source words file
I can not say anything at this stage .
We will consider the matter .
target words file
Así , de momento , no puedo pronunciarme .
Deberemos examinar la cuestión .
source-target alignment file
1-7 2-7 3-6 4-8 5-8 7-1 8-4 9-9
1-2 2-2 3-2 4-3 5-4 6-5
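A minimal parsing sketch for this link syntax (the subroutine below is illustrative and not part of the toolkit; it only assumes the separators described above):
# Hypothetical helper: parse one TALP alignment line into sure and possible links.
sub parse_talp_line {
    my ($line) = @_;
    my (@sure, @possible);
    foreach my $link (split ' ', $line) {
        if ($link =~ /^(\d+)[-s](\d+)$/) {       # "1-7" or "1s7": sure link
            push @sure, [$1, $2];
        } elsif ($link =~ /^(\d+)p(\d+)$/) {     # "1p7": possible link
            push @possible, [$1, $2];
        }
    }
    return (\@sure, \@possible);                 # position 0 stands for NULL
}
my ($sure, $possible) = parse_talp_line("1-7 2-7 3-6 4-8 5-8 7-1 8-4 9-9");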
GIZA The format used by the Giza toolkit (Al-Onaizan et al., 1999; Och, 2000). All in-
formation is contained in only one file, but since the alignment produced by the toolkit
is not symmetric, up to two files can be specified: source-to-target alignment file and
target-to-source alignment file. The alignment of each sentence pair in Giza occupies
three lines of the file:
# Sentence pair (1) source length 15 target length 17 alignment score : 9.53025e-19
es que el día dieciocho , francamente es del todo imposible , no le puedo encontrar .
NULL ({ 13 }) it’s ({ 1 }) that ({ 2 }) the ({ 3 }) eighteenth ({ 4 5 }) ,
({ 6 }) frankly ({ 7 }) that’s ({ 8 }) totally ({ 9 10 }) impossible ({ 11
}) , ({ 12 }) i ({ 14 }) can’t ({ 15 }) find ({ 16 }) anything ({ }) . ({
17 })
The same sentence with source and target reversed could be:
# Sentence pair (1) source length 17 target length 15 alignment score : 1.12222e-22
it’s that the eighteenth , frankly that’s totally impossible , i can’t find
anything .
NULL ({ }) es ({ 1 }) que ({ 2 }) el ({ 3 }) día ({ }) dieciocho ({ 4 }) ,
({ 5 }) francamente ({ 6 7 8 }) es ({ }) del ({ }) todo ({ }) imposible ({
9 12 }) , ({ 10 }) no ({ }) le ({ }) puedo ({ 11 }) encontrar ({ 13 14 }) .
({ 15 })
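For illustration, such an entry could be parsed as sketched below (hypothetical helper, not part of the toolkit; it assumes the third line of the entry has already been joined into a single string):
# Extract (word, [linked positions]) pairs from the aligned line of a GIZA
# entry, e.g. "NULL ({ 13 }) it's ({ 1 }) that ({ 2 }) ...".
sub parse_giza_aligned_line {
    my ($aligned) = @_;
    my @tokens;
    while ($aligned =~ /(\S+)\s+\(\{\s*([\d\s]*)\}\)/g) {
        push @tokens, [$1, [split ' ', $2]];
    }
    return \@tokens;    # the first entry is the NULL word
}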
NAACL This format is described in the summary paper of the HLT-NAACL 2003 Workshop
on Building and Using Parallel Texts (Mihalcea and Pedersen, 2003).
The Alignment Set is stored in three separate files: one for the source sentences, one for
the target sentences and one for the alignments.
source sentences file One line per sentence. The sentence number is marked as well
as the end of the sentence:
<s snum=0008> hear , hear ! </s>
<s snum=0009> Mr. Speaker , my question is directed to the Minister of
Transport . </s>
target sentences file The format is the same as for the other sentences file:
<s snum=0008> bravo ! </s>
<s snum=0009> monsieur le Orateur , ma question se adresse à le ministre
chargé de les transports . </s>
alignments file There is one line per link. First the sentence number is indicated,
then the number of the source token, then the number of the corresponding target
token. Two optional fields are the S(sure)/P(possible) mark and the confidence of
the link (not present in this example):
0008 4 2 S
0008 1 1 P
0008 2 1 P
0008 3 1 P
0009 1 1 S
0009 2 3 S
0009 3 4 S
0009 4 5 S
0009 5 6 S
0009 8 9 S
0009 9 10 S
0009 10 11 S
0009 11 13 S
0009 12 15 S
0009 13 16 S
0009 2 2 P
0009 6 7 P
0009 6 8 P
0009 7 7 P
0009 7 8 P
0009 11 14 P
0009 12 14 P
0009 0 12 P
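One such line could be parsed as follows (hypothetical helper, not part of the toolkit; the field order is the one described above):
# Fields: sentence number, source position, target position,
# optional S/P mark, optional confidence.
sub parse_naacl_link {
    my ($line) = @_;
    my ($snum, $src, $tgt, $mark, $conf) = split ' ', $line;
    return { sentence => $snum, source => $src, target => $tgt,
             mark => $mark, confidence => $conf };
}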
BLINKER This format was introduced in the Blinker project (Melamed, 1998a) and
is used by the Alpaco alignment editor (Pedersen and Rassier, 2003). This strict syntax
was conceived for the manual annotation of a corpus of sentence pairs. There is
one file for the source sentences, one for the target sentences and a directory for each
annotator, containing one alignment file per sentence pair. The two sentence files
are named EN.sample.<snb> and FR.sample.<snb>, where <snb> is the sample
number (letter case must be respected). The annotators’ directories
are named A<anb> (<anb> is the annotator number) and are located in the same directory
as the sentence files. The alignment files, in each annotator’s directory, must be named
samp<snb>.SentPair<pnb>, where <pnb> is the sentence pair number. In the alignment
file there is one line per link, containing the source token number and the corresponding
target token number.
NOTE: in this library, only the alignment file name syntax (samp<snb>.SentPair<pnb>)
must be strictly respected (see section 2.2).
The best way to describe it is with an example:
directory tree A typical directory tree of a Blinker Alignment Set would be:
(project name)/
A1/
samp1.SentPair0
samp1.SentPair1
samp1.SentPair2
A2/
samp1.SentPair0
samp1.SentPair1
samp1.SentPair2
EN.sample.1
FR.sample.1
FR.sample.<snb> file It contains one line per sentence.
no , yo estaba pensando más hacia el seis , siete .
de acuerdo , déjame que mire .
EN.sample.<snb> file Same as the FR.sample.<snb> file.
no it isn’t , i was thinking more for about the sixth or the seventh .
right , let me take a look .
samp<snb>.SentPair<pnb> file The file corresponding to the first sentence pair of the
previous example could be:
1 1
2 1
3 1
4 2
5 3
6 4
7 5
8 6
10 7
11 8
12 9
13 10
15 11
14 11
16 12
9 0
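The naming rules above can be summarised in a small path-building sketch (illustrative only; $anb, $snb and $pnb hold the annotator, sample and sentence pair numbers):
my ($anb, $snb, $pnb) = (1, 1, 0);
my $alignmentFile = "A$anb/samp$snb.SentPair$pnb";  # this syntax is compulsory
my $sourceFile    = "EN.sample.$snb";               # Blinker sentence file names
my $targetFile    = "FR.sample.$snb";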
1.3 Visualisation representations and formats
The most popular ways of visually representing an alignment between two sentences are
drawing lines between linked words (figure 3) or marking the intersection of two linked words
in a matrix (figure 2). Another possibility is simply to enumerate the links (figure 1). In any
case, you can always convert the Alignment Set files to Blinker format and visualise
them with the Alpaco editor (see section 1.2).
The visualise function outputs a file in one of these representations:
enumLinks This representation is available in two formats:
text The output is a text file that contains undirected correspondences for each sentence
pair, displayed as a succession of lines of the form: (source word ↔ aligned target
words).
latex Writes a (LaTeX) file where each sentence pair is represented as in figure 1: every
link is enumerated, and links are directed (source-to-target, target-to-source or
both). The alignment is shown in a (source token → target token) form, where
↔ corresponds to “+”, → corresponds to “–” and ← corresponds to “|” in the
matrix representation with “cross” marks (see below).
[Figure 1 (not reproduced in this text version) enumerates, one directed link per line, the
links of sentence pair 4: “NULL ¿ cuántas personas van ?” / “NULL how many people are
travelling ?”.]
Figure 1: enumLinks representation, latex format
matrix Writes a file with, for each sentence pair, its number, the two sentences and a (source
× target) matrix representation of the alignment, as in figure 2. The first column contains
the source tokens and the last row the target tokens. Two parameters determine the
maximum number of rows and columns that can be displayed in one matrix. If the
number of columns (target words of the alignment) is greater than the maximum, the
matrix is split into several matrices (each matrix having all rows). If the number of rows is
greater than the maximum, the alignment is displayed in the enumLinks representation.
The available formats so far are:
latex Note that some features of the graphics package used in the files cannot be
displayed by dvi viewers; the solution is to create .ps files and view them with a
postscript viewer.
Depending on the type of information you want to observe, you may prefer one character
or another to mark the links:
cross This mark is appropriate to highlight the non-symmetry of alignments. In each
row (corresponding to a source token) a horizontal dash “–” is written in each
column corresponding to a target token that is aligned with it. Conversely, in each
column (corresponding to a target token) a vertical dash “|” is written in each row
corresponding to a source token aligned with it. A matrix point with both “|” and
“–” is marked as “+”. GIZA note: if a file comes from a GIZA training, there cannot
be more than one vertical dash in each line or more than one horizontal dash in each
column (the dashes tend to be oriented parallel to the lines and not perpendicular
to them).
ambiguity Sure links are marked with “S” and possible (i.e. ambiguous) links are marked
with “P”. If a link has no ambiguity information, cross marks are used instead.
confidence Links are marked with their confidence (or probability) between 0 and 1. If
a link has no confidence information, cross marks are used instead.
personalised mark Links are marked with the mark you pass as argument (remember
it will be inserted in a LaTeX file).
NOTE: if there exists a targetToSource alignment, its links will be marked with the
same type of mark as the sourceToTarget alignment, but rotated 90 degrees to
the left.
[Figure 2 (not reproduced in this text version) shows the alignment matrix of sentence pair 4,
“NULL ¿ cuántas personas van ?” × “NULL how many people are travelling ?”, with cross
marks at the linked intersections.]
Figure 2: matrix representation, latex format, cross mark
drawLines (Not implemented yet) The idea is to produce a file with, for each sentence
pair, its number, the two sentences and a picture with the tokens aligned horizontally or
vertically, with lines drawn between linked tokens.
Figure 3: drawLines representation
1.4 Evaluation
A consensus on word alignment evaluation methods has started to emerge. These methods
are described in (Mihalcea and Pedersen, 2003). Submitted alignments are compared to a
manually aligned reference corpus (gold standard) and scored with respect to precision, recall,
F-measure and Alignment Error Rate (AER). An inherent problem of the evaluation is the
ambiguity of the manual alignment task: the annotation criteria depend on each annotator.
Therefore, (Och and Ney, 2003) introduced a reference corpus with explicit ambiguous (called
P or Possible) links and unambiguous (called S or Sure) links. Given an alignment A and
a gold standard alignment G, we can define sets $A_S$, $A_P$ and $G_S$, $G_P$, corresponding to the
sets of Sure and Possible links of each alignment. The set of Possible links is also the union
of S and P links, or equivalently $A_S \subseteq A_P$ and $G_S \subseteq G_P$. The following measures are defined
(where $T$ is the alignment type, and can be set to either S or P):

\[ P_T = \frac{|A_T \cap G_T|}{|A_T|}, \qquad R_T = \frac{|A_T \cap G_T|}{|G_T|}, \qquad F_T = \frac{2 P_T R_T}{P_T + R_T} \]

\[ \mathrm{AER} = 1 - \frac{|A_P \cap G_S| + |A_P \cap G_P|}{|A_P| + |G_S|} \]
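For concreteness, here is a minimal sketch of these measures (not the toolkit's implementation; each alignment is assumed to be a non-empty hash ref whose keys are links such as "4-2"):
# Count the links common to two alignments (hash refs keyed by link).
sub count_common {
    my ($sub, $gold) = @_;
    return scalar grep { exists $gold->{$_} } keys %$sub;
}
# $A_S, $A_P, $G_S, $G_P: sure/possible link sets, with A_S included in A_P.
sub alignment_scores {
    my ($A_S, $A_P, $G_S, $G_P) = @_;
    my %m;
    foreach my $t (['S', $A_S, $G_S], ['P', $A_P, $G_P]) {
        my ($T, $A, $G) = @$t;
        my $common = count_common($A, $G);
        $m{"P$T"} = $common / keys(%$A);                       # precision
        $m{"R$T"} = $common / keys(%$G);                       # recall
        $m{"F$T"} = 2 * $m{"P$T"} * $m{"R$T"} / ($m{"P$T"} + $m{"R$T"});
    }
    $m{AER} = 1 - (count_common($A_P, $G_S) + count_common($A_P, $G_P))
                  / (keys(%$A_P) + keys(%$G_S));
    return \%m;    # the seven measures defined above
}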
1.4.1 Unlinked words representation
The scores are greatly affected by the representation of NULL links (links between a word and
no other word): whether such words are assigned an explicit link to NULL or removed from the
alignments. Explicit NULL links contribute to a higher error rate because in this case the
errors are penalised twice: once for the incorrect link to NULL and once for the missing link
to the correct word. Thus both submitted and answer alignments must use the same alignment
mode, which can be one of the following (a rough sketch of null-align is given after the list):
null-align, where each word is required to belong to at least one alignment; if a word
doesn't belong to any alignment, a NULL Possible link is assigned by default.
no-null-align, where all NULL links are removed from both submission and gold standard
alignments.
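A rough sketch of the null-align convention (assumed link representation, not the toolkit's code; shown for the source side only):
# $links is an array ref of [sourcePos, targetPos, mark] triples.
sub null_align_source {
    my ($links, $sourceLength) = @_;
    my %linked = map { $_->[0] => 1 } @$links;
    foreach my $i (1 .. $sourceLength) {
        # unlinked word: assign a Possible link to NULL (position 0)
        push @$links, [$i, 0, 'P'] unless $linked{$i};
    }
    return $links;
}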
1.4.2 Link weights
In the evaluations of (Och and Ney, 2000; Mihalcea and Pedersen, 2003), each link con-
tributes with the same weight to the count of the various sets. This tends to give more
importance to the words aligned in groups than to the words linked only once. To correct
this effect, (Melamed, 1998b) proposes to attach a weight to each link. The weight w(x, y)
of a link between two words x and y would be inversely proportional to the number of links
(num_links) in which x and y are involved:

\[ w(x, y) = \frac{1}{2} \left( \frac{1}{\mathrm{num\_links}(x)} + \frac{1}{\mathrm{num\_links}(y)} \right) \qquad (1) \]
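Equation (1) translates directly into code (illustrative snippet, not the toolkit's implementation):
# Weight of a link given the number of links of each of its two words.
sub link_weight {
    my ($num_links_x, $num_links_y) = @_;
    return 0.5 * (1 / $num_links_x + 1 / $num_links_y);
}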
1.4.3 Implementation
The evaluate function can force both alignments into “null-align” or “no-null-align” mode.
The seven measures of the result set (precision, recall and F-measure for sure alignments and
for possible alignments, and the AER) are saved in an object of the class AlignmentEval.
You can call its methods to display a single measure set or a table comparing various result sets.
The evaluate function calculates weighted links if its 'weighted' argument is true. The
weights must be calculated with respect to the union of the submitted and reference sets. For
the measures involving only Sure alignments ($P_S$ and $R_S$), they are calculated with respect to
the union of the Sure sets: $A_S \cup G_S$. For the measures mixing Sure and Possible alignments
(AER, $P_P$ and $R_P$), the weights are calculated based on the union of the Possible sets:
$A_P \cup G_P = A \cup G$.
1.5 Symmetrisation
This section concerns Alignment Sets containing asymmetric source-to-target and target-to-
source alignments. Combining the source-target and target-source information of the align-
ments, we can obtain a high-precision, low-recall alignment (taking the intersection), a
low-precision, high-recall alignment (taking the union), or intermediate combinations.
Such intermediate symmetrisation algorithms have been proposed by (Och and Ney, 2003;
Koehn et al., 2003; Lambert and Castell, 2004a).
So far the symmetrize function only implements the algorithm described in (Lambert
and Castell, 2004a). The algorithm combines two single-word based alignments to produce a
symmetric, phrase-based alignment. It exploits the asymmetries in the superposition of the
two word alignments to detect the phrases that must be aligned as a whole. The central idea
is that if an asymmetry is caused by a language feature such as an idiomatic expression, it
will be repeated several times in the corpus; otherwise it will occur only once. The training
is therefore done in two stages: first, building the asymmetries memory; second, correcting
the alignment using this memory.
2 Reference
2.1 Command-line Tools
Important: a series of command-line tools is available, with internal documentation
(displayed by adding the -man option). These tools encapsulate calls to the perl library and do
the same as the subroutines described here, without the need to read the following reference
or to write perl code.
2.2 Alignment Set definition
new function
INPUT parameters:
  package (required).
  file sets (reference string; required): ref to an array containing references to each
  file set (cf. below the attributes of a file set). In this version there can only be one
  file set.
OUTPUT: reference to an Alignment Set object.
The input of the new function is a reference to an array of the type (refToFileSet1,refToFileSet2,...).
In the present version it is a reference to the array (refToUniqueFileSet). A file set is
represented by an array of the type (location,format,range), where:
Values:
  location (reference string (see below), or string; required): ref to a hash containing
  the path and name of the files (or directory for BLINKER). If ‘‘sourceToTarget’’
  is the only entry of the hash, the (string) path can be passed instead of the hash ref.
  format ({'BLINKER','GIZA','NAACL','TALP'}; optional, default 'TALP'): format in
  which the Alignment Set is stored.
  range ('a-b', where a and b are a number or the empty string; optional, default '1-'):
  first and last sentence pairs to be included in the Alignment Set.
The parameters required by each format for the location hash are:
GIZA: location=(
‘‘sourceToTarget’’ source-to-target file path,
‘‘targetToSource’’ target-to-source file path (optional) )
NAACL and TALP: location=(
‘‘source’’ source sentences file path (optional for some functions),
‘‘target’’ target sentences file path (optional for some functions),
‘‘sourceToTarget’’ source-to-target alignment file path,
‘‘targetToSource’’ target-to-source alignment file path (optional) )
BLINKER: location=(
‘‘source’’ source sentences file path (optional for some functions),
‘‘target’’ target sentences file path (optional for some functions),
‘‘sourceToTarget’’
directory of source-to-target samp<snb>.SentPair<pnb>
files,
‘‘targetToSource’’
directory of target-to-source samp<snb>.SentPair<pnb>
files (optional) )
‘‘source’’ and ‘‘target’’ file names don’t need to comply with the Blinker
syntax. However, if the source sentence file does comply with it, the sample number
is deduced from the name (otherwise it is assumed to be 1).
‘‘sourceToTarget’’ and ‘‘targetToSource’’ directory names don’t need to respect
the Blinker notation, nor to be situated in the same directory as the sentence
files. However the samp<snb>.SentPair<pnb> syntax is compulsory.
Note: if ‘‘sourceToTarget’’ is the only entry of the hash, the (string) path can be
passed instead of the hash reference.
Code samples:
# alternative 1
$location1 = {"source"=>$ENV{ALDIR}."/spanish.naacl",
"target"=>$ENV{ALDIR}."/english.naacl",
"sourceToTarget"=>$ENV{ALDIR}."/spanish-english.naacl"};
$fileSet1 = [$location1,"NAACL","1-10"];
$fileSets = [$fileSet1];
$alSet = Lingua::AlignmentSet->new($fileSets);
# alternative 2
$alSet = Lingua::AlignmentSet->new([[$location1,"NAACL","1-10"]]);
# alternative 3
$alSet = Lingua::AlignmentSet->new([[$ENV{ALDIR}."/spanish-english.naacl",
"NAACL","1-10"]]);
$alSet->setWordFiles($ENV{ALDIR}."/spanish.naacl",$ENV{ALDIR}."/english.naacl");
copy method: creates a new AlignmentSet containing the same data as an existing
one, without copying the addresses.
my $newAlSet = $alSet->copy;
setWordFiles method: sets the sentence files. The two arguments are (sourceFileName,
targetFileName). If you first created an AlignmentSet with only the alignment files, use
this function before calling subroutines that require the sentence files, like the visualise
sub.
setSourceFile, setTargetFile, setTargetToSourceFile methods: same thing for the
source and target word files and the targetToSource alignment file, respectively. They take
as argument the corresponding file name.
2.3 Conversion between formats
chFormat method
Converts an Alignment Set to the specified format: creates, at the specified location, the
necessary file(s) and directory and stores there the Alignment Set in the specified format.
It does not delete the old format files. It starts counting the new Alignment Set sentence
pairs from 1, even if in the original Alignment Set they start from a different number.
If the sentence files ('source' and 'target' entries of the location hash) are present in the
old Alignment Set but omitted in the new one, these entries are copied anyway to the
new location hash, unless the format is different or the original Alignment Set
range doesn't start from the first sentence pair. In the latter case, a gap would indeed be
introduced between the numbering of the sentence and alignment files.
INPUT parameters:
  alSet (reference string; required): reference to the input Alignment Set.
  location (reference string, cf. section 2.2, or string; required): ref to a hash containing
  the path and name of the files in the NEW format. If “sourceToTarget” is the only
  entry of the hash, the (string) path can be passed instead of the hash ref.
  format ({'BLINKER','GIZA','NAACL','TALP'}; optional, default 'TALP'): required new
  format.
  alignMode ({'as-is','null-align','no-null-align'}; optional, default 'as-is'): take
  alignment “as is” or force NULL alignment or NO-NULL alignment (see section 1.4).
# convert from GIZA to NAACL format
my $spa2eng = Lingua::AlignmentSet->new([["$ENV{ALDIR}/al.giza.eng2spa","GIZA","1-15"]]);
my $newLocation = {"source"=>$ENV{ALDIR}."/naacl-source.1.1-15",
"target"=>$ENV{ALDIR}."/naacl-target.1.1-15",
"sourceToTarget"=>$ENV{ALDIR}."/naacl-al-st.1.1-15"};
$spa2eng->chFormat($newLocation,"NAACL");
2.4 Visualisation
visualise method
This function requires the sentence files ('source' and 'target' keys of the location hash). You
can also use the setWordFiles function (see section 2.2) to give the path for these files.
INPUT parameters:
  alSet (reference string; required): reference to the input Alignment Set.
  representation ({'enumLinks','matrix','drawLines'}; required): type of visual
  representation required.
  format ({'text','latex'}; required): format of the output file.
  filehandle (reference string; required): reference to the filehandle of the file where
  the output should be written.
  mark ({'cross','ambiguity','confidence', your mark}; optional, default 'cross'): how
  a link is marked in the matrix. "your mark" is the mark of your choice (but it must be
  compatible with latex).
  alignMode ({'as-is','null-align','no-null-align'}; optional, default 'as-is'): take
  alignment “as is” or force NULL alignment or NO-NULL alignment (see section 1.4).
  maxRows (integer; optional, default 53): maximum number of rows (source words)
  allowed in a matrix. If there are more, the alignment is displayed in the
  ''enumLinks'' representation, ''latex'' format.
  maxCols (integer; optional, default 35): maximum number of columns (target words)
  allowed in a matrix. If there are more, the matrix is continued below.
# Example. Creating a latex file with alignment matrices:
open(MAT,">".$ENV{ALDIR}.'/alignments/test.eng-french.tex');
$alSet->visualise("matrix","latex",*MAT);
# or, with a personalised mark in the matrix:
$alSet->visualise("matrix","latex",*MAT,'$\blacksquare$');
2.5 Evaluation
evaluate method This function integrates the code of Rada Mihalcea's wa_eval_align.pl
routine (http://www.cs.unt.edu/~rada/wpt/code/). If the reference Alignment Set (Gold
Standard) and the submission Alignment Set are both in NAACL format and both follow
the same alignment strategy for NULL alignments (null-align or no-null-align), you can
use your files “as-is”; this is more efficient. Otherwise you can choose the null-align or
no-null-align alignment mode to make sure both Alignment Sets are treated in the same
way.
The sentence files ('source' and 'target' entries of the location hash) are not taken into account
and can be omitted.
INPUT parameters:
  submissionAlSet (reference string; required): ref to the Alignment Set to be evaluated.
  answerAlSet (reference string; required): ref to the reference Alignment Set (Gold
  Standard).
  alignMode ({'as-is','null-align','no-null-align'}; optional, default 'as-is'): take
  alignment “as is” or force NULL alignment or NO-NULL alignment (see section 1.4).
  weighted ({0,1}; optional, default false): if true, weights the links according to the
  method of (Melamed, 1998b).
OUTPUT: reference to an evaluation result object.
AlignmentEval class This is a hash with the names of the seven measures ([sure|possible]
[Precision|Recall|FMeasure] and AER) and their values. You use its methods to display
and compare evaluation results:
display sub Prints the measures in a table. The arguments are:
* evaluationMeasure: ref to the evaluationResult object
* fileHandle: where you want to print it (optional, default: STDOUT)
* format: for now only “text” is available (optional)
compare sub Prints the results of various evaluations in a comparative table. The
arguments are:
* results: ref to an array of arrays. The first array has one entry for each result
set, the second one has 2 entries (the evaluationResult object reference and a string
describing the experiment), i.e.
[[$refToAlignmentEval1,'description1'], [$refToAlignmentEval2,'description2'],...]
* title: a title which appears above the table
* fileHandle: same as in display
* format: “text” or “latex” (optional, default “text”)
Code samples for evaluation routines:
# Creating an evaluationResult object and pushing it into the evaluation array:
$evaluationResult=$s2e->evaluate($goldStandard,"no-null-align");
push @evaluation,[$evaluationResult,"Non weighted"];
# Doing the same in one step:
push @evaluation,[$s2e->evaluate($goldStandard,"no-null-align",1),"Weighted"];
# Displaying in a table the two result lines:
Lingua::AlignmentEval::compare(\@evaluation,"My experiments",\*STDOUT,"latex");
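A single result set can be printed with the display sub. The call below is a sketch assuming the same calling convention as compare, with the arguments listed above:
# Print one measure set as a text table on STDOUT:
Lingua::AlignmentEval::display($evaluationResult,\*STDOUT,"text");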
2.6 Alignment processing
processAlignment method
Processes the AlignmentSet by applying a function to the alignment of each sentence
pair of the set. The Alignment.pm module contains such functions:
General subroutines:
forceGroupConsistency: prohibits situations of the type: linked(e,f),
linked(e',f) and linked(e',f') but not linked(e,f'). In this case the function links
e and f'.
swapSourceTarget: swaps source and target in the alignments (transforms a link
(6 3) into (3 6)).
regexpReplace: substitutes, in one side of the corpus, a string (defined by a regular
expression) with another and updates the links accordingly. There are 3 arguments:
the regular expressions (pattern and replacement) and the side (source or target)
(see the man examples). Notes:
In case various words are deleted, all added words are linked to all positions
to which the deleted words were linked. $al->sourceLinks information can be
lost for replaced words.
The regexp is applied to the side of the corpus, and the smallest set of additions
and deletions necessary to turn the original word sequence into the modified
one is computed using Algorithm::Diff. In practice, this set is not always mini-
mal, and in these cases various words are replaced by various words, so links may be
changed. To avoid this problem use the replaceWords subroutine.
It is more efficient on the "source" side than on the "target" side.
replaceWords: substitutes, in one side of the corpus, a string (of words separated
by white space) with another and updates the links accordingly. There are 3
arguments: the string of words to be replaced, the string of replacement words
and the side (source or target) (see the man examples). Notes:
In case various words are deleted, all added words are linked to all positions
to which the deleted words were linked. $al->sourceLinks information can be
lost for replaced words.
It is more efficient on the "source" side than on the "target" side.
manyToMany2joined: introduces underscores between the words of many-to-many groups
in the source-to-target alignment. Warning: this subroutine only changes the words files,
not the links file.
joined2ManyToMany: recreates the links of words joined by underscores and removes
the underscores.
getAlClusters: gets the alignment as clusters of positions aligned together.
Subroutines which combine source-target and target-source alignments:
intersect
getUnion
selectSideWithLeastLinks: for each sentence pair, selects, from the source-target
and target-source alignments, the one with the lowest number of individual links.
selectSideWithMostLinks: same, but selects the one with the highest number of links.
INPUT parameters:
  alSet (reference string; required): reference to the input Alignment Set.
  AlignmentSub (string or reference string; required): name of a subroutine of the
  Alignment.pm module. This subroutine is applied to each sentence pair. If the
  subroutine takes N arguments, AlignmentSub is a reference to the following array:
  (sub name, arg1, ..., argN).
  location (reference string, cf. section 2.2, or string; required): ref to a hash containing
  the path and name of the files in the NEW format. If “sourceToTarget” is the only
  entry of the hash, the (string) path can be passed instead of the hash ref.
  format ({'BLINKER','GIZA','NAACL','TALP'}; optional, default 'TALP'): required new
  format.
  alignMode ({'as-is','null-align','no-null-align'}; optional, default 'as-is'): take
  alignment “as is” or force NULL alignment or NO-NULL alignment (see section 1.4).
OUTPUT: reference to an Alignment Set object.
As mentioned for the chFormat sub, if the original alSet object contained the sentence
files, the corresponding entries of the location hash are conserved, except if the format is
different or if the range doesn't start with the first sentence pair.
# create the union of source-target and target-source alignments
$s2e = Lingua::AlignmentSet->new([[{"sourceToTarget" => $spa2engTest,
"targetToSource" => $eng2spaTest},"GIZA"]]);
my $union=$s2e->processAlignment("Lingua::Alignment::getUnion","union.naacl","NAACL");
# remove ?!. signs from the Blinker reference corpus and save it as Naacl file:
#1 remove from target side of the corpus:
my $answer = Lingua::AlignmentSet->new([["data/answer/spanish-english","BLINKER"]]);
# need to add the target words file because we remove from that side:
$answer->setTargetFile("data/answer/english.blinker");
#define output location:
$newLocation = {"target"=>"data/english-without.naacl",
"sourceToTarget"=>"data/spanish-english-interm.naacl"};
my $refToArray = ["Lingua::Alignment::replaceWords",'\.|\?|!','',"target"];
my $output = $answer->processAlignment($refToArray,$newLocation);
#2 Remove now from source side:
# we take as input alignment the previous one (already removed in target side).
# however, to remove from source, we need to add the source words:
$output->setSourceFile("data/spanish.naacl");
#define output location:
$newLocation = {"source"=>"data/spanish-without.naacl",
"sourceToTarget"=>"data/spanish-english-without.naacl"};
$refToArray = ["Lingua::Alignment::replaceWords",'\.|\?|!','',"source"];
$output = $output->processAlignment($refToArray,$newLocation);
orderAsBilCorpus method Places the sentence pairs of a secondary corpus at the head of the
Alignment Set, in the same order. See more details with the command-line tool (perl
orderAlSetAsBilCorpus.pl -man).
adaptToBilCorpus method Checks whether the Alignment Set sentence pairs are in another
bilingual corpus and, for each sentence pair which is not in the corpus, searches for the
corpus sentence pair with the best longest common subsequence (LCS) ratio. Finally, it
detects the edits (word insertions, deletions and substitutions) necessary to pass from
the Alignment Set sentences to the corpus sentences with the best LCS ratio, prints the
edit list and transmits these edits to the output links file. See more details with the
command-line tool (perl adaptAlSetToBilCorpus.pl -man).
symmetrize method
INPUT parameters:
  alSet (reference string; required): reference to the input Alignment Set.
  location (reference string, cf. section 2.2, or string; required): ref to a hash containing
  the path and name of the files in the NEW format. If “sourceToTarget” is the only
  entry of the hash, the (string) path can be passed instead of the hash ref.
  format ({'BLINKER','GIZA','NAACL','TALP'}; optional, default 'TALP'): required new
  format.
  groupsDir (directory string; optional, default ""): directory of the asymmetric phrases
  memory files 'groups' and 'subGroups'.
  selectPhrases (boolean; optional, default false): if true, selects the asymmetric
  phrases, saves them in the "$groupsDir/groups" (and 'subGroups') files and updates
  the alignment. If false, only updates the alignment (using the existing memory).
  options (reference string; optional): reference to a hash of options.
The defaults of the options hash are quite reasonable. However, you might want to give
a higher value to "minPhraseFrequency" if you have a lot of data. You can get more recall
or more precision by changing the "defaultActionGrouping" option. The options hash has the
following entries (for more details see (Lambert and Castell, 2004b)):
minPhraseFrequency (default 2) The minimum number of occurrences necessary to
select a phrase in the memory.
onlyGroups (default 1) If 1, the alignment update stage considers only phrases of the
'groups' file (which are strictly asymmetric zones of the alignment combination); if 0,
it also considers phrases of the 'subGroups' file (which are subgroups of the phrases in
the 'groups' file).
defaultActionGrouping (default "Lingua::Alignment::getUnion") Action to take
if there is no applicable phrase in the memory.
defaultActionGeneral (default "Lingua::Alignment::intersect") Action to take if
the asymmetric zone is too small or too big to be reasonably linked as a group
(normally the best is to take the intersection to avoid a drop in precision).
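A possible invocation, following the parameter table above (the file names and option values are illustrative, not prescribed by the toolkit):
# Combine both GIZA directions into a symmetric alignment in TALP format,
# building the phrase memory in data/groupsDir (selectPhrases = 1):
my $biAlSet = Lingua::AlignmentSet->new([[{"sourceToTarget" => $spa2engFile,
                                           "targetToSource" => $eng2spaFile},"GIZA"]]);
$biAlSet->symmetrize({"sourceToTarget" => "data/symal.talp"},"TALP",
                     "data/groupsDir",1,{ minPhraseFrequency => 2 });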
2.7 Miscellaneous
chooseSubsets method Returns a randomly chosen list (in random order) of line numbers
contained in the AlignmentSet object. To sort this list, do:
my @sortedSelection = sort { $a <=> $b; } @selection;
chooseSubsets takes one argument: the size of the desired subset.
getSize method Calculates the number of sentence pairs in the Alignment Set.
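A usage sketch for these two methods (the subset size is illustrative):
my @selection = $alSet->chooseSubsets(100);           # 100 random line numbers
my @sortedSelection = sort { $a <=> $b; } @selection;
my $nbPairs = $alSet->getSize;                        # number of sentence pairs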
3 Known problems
By definition of the Alignment Set, the numbering in the input and output files can differ.
Indeed, the sentence pairs of an Alignment Set could in theory be stored in various file sets.
If the range of your Alignment Set doesn't start from the first sentence pair and you are
converting or processing the alignments, keep in mind that the numbering after conversion
will be different from the numbering before the conversion.
4 To Do List
If you need a tool which is not yet present in this library, why not consider including it?
Examples of further developments of the library could be:
Allow an Alignment Set to be contained in various file sets.
Implement the drawLines way of visualising an alignment (see section 1.3), which is the
most intuitive.
Implement other symmetrisation methods.
Add other types of evaluation measures, which would not have the limitations of the AER
(see (Lambert and Castell, 2004a)).
5 Acknowledgements
The author wants to thank Adrià de Gispert for testing the initial versions of the library and
for his helpful comments. This library includes some code from Rada Mihalcea. This work
has been partially funded by the Spanish Government under grant TIC2002-04447-C02.
References
Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F. J.
Och, D. Purdy, N. A. Smith, and D. Yarowsky. 1999. Statistical ma-
chine translation. Technical report, Johns Hopkins University Summer Workshop.
http://www.clsp.jhu.edu/ws99/projects/mt/.
P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proc. of
the 41st Annual Meeting of the Association for Computational Linguistics.
P. Lambert and N. Castell. 2004a. Alignment of parallel corpora exploiting asymmetrically
aligned phrases. In Proc. of the LREC 2004 Workshop on the Amazing Utility of Parallel
and Comparable Corpora, Lisbon, Portugal, May 25.
Patrik Lambert and Núria Castell. 2004b. Evaluation and symmetrisation of alignments
obtained with the GIZA++ software. Technical Report LSI-04-15-R, Technical University
of Catalonia. http://www.lsi.upc.es/dept/techreps/techreps.html.
I. Dan Melamed. 1998a. Annotation style guide for the Blinker project. Technical Report
98-06, IRCS.
I. Dan Melamed. 1998b. Manual annotation of translational equivalence. Technical Report
98-07, IRCS.
Rada Mihalcea and Ted Pedersen. 2003. An evaluation exercise for word alignment. In Rada
Mihalcea and Ted Pedersen, editors, HLT-NAACL 2003 Workshop: Building and Using
Parallel Texts: Data Driven Machine Translation and Beyond, pages 1–10, Edmonton,
Alberta, Canada, May 31. Association for Computational Linguistics.
Franz Josef Och and Hermann Ney. 2000. A comparison of alignment models for statistical
machine translation. In Proc. of the 18th Int. Conf. on Computational Linguistics, pages
1086–1090, Saarbrücken, Germany, August.
F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models.
Computational Linguistics, 29(1):19–51, March.
Franz Josef Och. 2000. GIZA++: Training of statistical translation models.
http://www.isi.edu/~och/GIZA++.html.
Ted Pedersen and Brian Rassier. 2003. Aligner for parallel corpora.
http://www.d.umn.edu/~tpederse/parallel.html.