Multilinguisme + plurilinguisme

Administrator: Michalis Politis, Associate Professor, Department of Foreign Languages, Translation and Interpreting (DFLTI), Ionian University

Wednesday, 7 November 2012

The JRC-Acquis multilingual parallel corpus and Eurovoc (v. 3.0)

1) Introduction

What are the Acquis Communautaire and the JRC-Acquis

The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States. This collection of legislative text changes continuously and currently comprises selected texts written between the 1950s and now. Since the beginning of 2007, the EU has had 27 Member States and 23 official languages. The Acquis Communautaire texts exist in these languages, although Irish translations are not currently available. The Acquis Communautaire is thus a collection of parallel texts in the following 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
The Language Technology group of the European Commission's Joint Research Centre did not receive an authoritative list of documents that belong to the Acquis Communautaire. In order to compile the document collection distributed here, we selected all those CELEX documents (see below) that were available in at least ten of the twenty EU-25 languages (the official languages of the EU before Bulgaria and Romania joined in 2007) and that additionally existed in at least three of the nine languages that became official languages with the Enlargement of the EU in 2004 (i.e. Czech, Estonian, Hungarian, Lithuanian, Latvian, Maltese, Polish, Slovak and Slovene). The collection distributed here is thus an approximation of the Acquis Communautaire which we call the JRC-Acquis. The JRC-Acquis must not be seen as a legal reference corpus. Instead, the purpose of the JRC-Acquis is to provide a large parallel corpus of documents for computational linguistics research purposes.
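The selection rule described above can be written down as a small check. This is only a sketch: the function name and the use of ISO codes are assumptions made for illustration, not the code the JRC used.

```python
# The twenty EU-25 official languages (i.e. without Bulgarian and Romanian).
EU25_LANGS = {
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "sk", "sl", "sv",
}
# The nine languages that became official with the 2004 Enlargement.
NEW_2004_LANGS = {"cs", "et", "hu", "lt", "lv", "mt", "pl", "sk", "sl"}


def qualifies(available_langs):
    """Return True if a CELEX document meets both selection thresholds.

    Hypothetical helper: `available_langs` is the set of ISO codes in which
    the document was found.
    """
    langs = set(available_langs)
    enough_eu25 = len(langs & EU25_LANGS) >= 10
    enough_new = len(langs & NEW_2004_LANGS) >= 3
    return enough_eu25 and enough_new
```

A document available only in ten pre-2004 languages would be rejected, because it fails the second condition.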

The linguistic research interest of the JRC-Acquis

In (computational) linguistics, parallel corpora are useful resources that are used for many applications and purposes. Most parallel corpora exist for a small number of languages. To our knowledge, the JRC-Acquis with its 22 languages and its approximately 23,000 documents per language is the largest existing parallel corpus, if we take into account both its size and the number of languages covered. 
The AC and other Community legislation is publicly available on the European Commission's web sites. The Language Technology team of the Joint Research Centre (JRC, http://langtech.jrc.ec.europa.eu/) in Ispra, Italy, has attempted to identify the documents that are part of the AC, has downloaded them and converted them to XML format. The Bulgarian and Romanian documents were processed by the Romanian Academy of Sciences (http://www.racai.ro/). In further processing steps, the texts were cleaned of their footers and annexes, and they were sentence-aligned. Instead of using a single pivot language, all possible language pair combinations were aligned individually. This is useful due to the n-to-n relationship between aligned sentences, which often differs depending on the language pair involved.
For some of the documents, only preliminary translations were available. For the online texts in some of the languages, only the title has been translated, but the text displayed is English. An automatic language recognition tool was therefore used to filter out those texts that are displayed as being one language, but which are actually English. No manual check was carried out.
The European Commission's Office for Official Publications OPOCE manages the distribution rights of this aligned multilingual parallel corpus. OPOCE agreed that the corpus can be given to research partners for non-commercial use. See the section on licensing issues, below.

2) Statistics

The JRC-Acquis corpus (version 3.0) is currently available in 22 languages with the following distribution:

Columns: language ISO code; number of texts; text body (total number of words, total number of characters, average number of words per text); signatures (total number of words); annexes (total number of words); total number of words (text + signatures + annexes).

Language            Texts  Body words   Body chars Avg words  Sig. words  Annex words  Total words
bg                  11384    16140819    104522671   1417.85     2170075     14114612     30146967
cs                  21438    22843279    148972981   1065.55     7225300     16763733     46832312
da                  23624    31459627    213468135   1331.68     2629786     16855213     50944626
de                  23541    32059892    232748675   1361.87     2542149     16327611     50929652
el                  23184    36453749    239583543   1572.37     2973574     16459680     55887003
en                  23545    34588383    210692059   1469.03     3198766     17750761     55537910
es                  23573    38926161    238016756    1651.3     3490204     19716243     62132608
et                  23541    24621625    192700704    1045.9     1336051     14995748     40953424
fi                  23284    24883012    212178964   1068.67     2677798     12547171     40107981
fr                  23627    39100499    234758290   1654.91     3021013     19978920     62100432
hu                  22801    28602380    213804614   1254.44     2529488     15056496     46188364
it                  23472    35764670    230677013   1523.72     3120797     18331535     57217002
lt                  23379    26937773    199438258   1152.22     2436585     15018484     44392842
lv                  22906    27592514    196452051    1204.6     1673124     15437969     44703607
mt                  10545    20926909    128906748   1984.53     1336042     15620611     37883562
nl                  23564    35265161    231963539   1496.57     3039580     18467115     56771856
pl                  23478    29713003    214464026   1265.57     2513141     17027393     49253537
pt                  23505    37221668    227499418   1583.56     3034308     19350227     59606203
ro                   6573     9186947     60537301   1397.68      514296     11185842     20887085
ro-19211 (Readme)   19211    30832212    182631277   1604.92         ---          ---     30832212
sk                  21943    26792637    179920434   1221.01     3227852     16190546     46211035
sl                  20642    27702305    178651767   1342.04     3103193     16837717     47643215
sv                  20243    29433037    199004401   1453.99     2575771     14965384     46974192
Total              463792   636216050   4288962348   1387.23    60368893    358999011   1055583954

Statistics on the Vanilla alignment:
  • Total of 4,350,447 aligned documents (all languages);
  • Total of 243,187,303 links (all languages);
  • Average of 18,833 aligned documents per language pair;
  • Average of 1,052,759 links per language pair (average of all language pairs);
  • For details on the alignments per language pair, see the file alignment_statistics_by_language.txt.
  • Average of 85.43% of one-to-one links. For further details, see the file alignment_statistics_by_link_type.txt.
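The averages in this list can be reproduced from the totals. The short check below assumes that both averages are taken over the unordered language pairs of the 22 languages; the variable names are illustrative.

```python
NUM_LANGUAGES = 22
# Number of unordered language pairs: n * (n - 1) / 2
num_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1) // 2

total_aligned_documents = 4_350_447
total_links = 243_187_303

avg_docs_per_pair = total_aligned_documents / num_pairs
avg_links_per_pair = total_links / num_pairs

print(num_pairs, round(avg_docs_per_pair), round(avg_links_per_pair))
# prints: 231 18833 1052759
```

Both rounded values match the figures quoted in the list, which suggests that the document average is also an average per language pair.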

3) Source of the documents

All documents were downloaded from the websites http://europa.eu.int/ and http://ccvista.taiex.be. See the publication at LREC'2006 for details. The texts in the official EU languages (as of 2006) on http://europa.eu.int/ were found in HTML format, while the Bulgarian and Romanian translations on http://ccvista.taiex.be were found in MS-Word format.

4) Document conversion and processing

All documents have a numerical identifier called the CELEX code (see http://eur-lex.europa.eu/). This code helps to find the same text in the various languages.

Conversion from HTML and MS-Word to XML

After the HTML documents had been downloaded (see Section 3), they were converted to XML. The title and body text were isolated, and the paragraph breaks (<P> HTML tags) were kept. All texts were uniformly encoded in UTF-8.

Identification of footers / annexes

A list of rules was used to detect the beginning of the documents' annexes and signatures (repetitive and frequently multilingual text strings ending the documents) and to separate the text body from the less useful text parts. As the rules were hand-written by developers who do not speak the 22 languages, some signatures and annexes may have been missed and some may have been recognised wrongly.
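To give an idea of what such rules could look like, here is a minimal sketch. The two patterns ("Done at ..." for signatures, "ANNEX" for annexes) are English-only assumptions chosen for illustration; the rules actually used are not listed in this text.

```python
import re

# Hypothetical, English-only trigger patterns (assumptions, not the JRC rules).
SIGNATURE_START = re.compile(r"^Done at \w+", re.IGNORECASE)
ANNEX_START = re.compile(r"^ANNEX\b")


def split_document(paragraphs):
    """Split a list of paragraphs into body, signature and annex parts."""
    parts = {"body": [], "signature": [], "annex": []}
    current = "body"
    for para in paragraphs:
        text = para.strip()
        if ANNEX_START.match(text):
            # Everything from the first annex heading onwards is annex text.
            current = "annex"
        elif current == "body" and SIGNATURE_START.match(text):
            current = "signature"
        parts[current].append(para)
    return parts
```

A rule set like this fails silently on any language or wording it does not cover, which is the kind of error the paragraph above warns about.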

Document format / DTD

The documents have the format illustrated below. The DTD for this format is also provided with the distribution.
    <TEI.2 id="jrcCELEX-LG" n="CELEX" lang="LG">
        <teiHeader lang="en" date.created="DATE">
            <fileDesc>
                <titleStmt>
                    <title>JRC-ACQUIS CELEX LANGUAGE</title>
                    <title>Document Title</title>
                </titleStmt>
                <extent>nb_of_paragraphs paragraph segments</extent>
                <publicationStmt>
                    <distributor>
                        <xref url="http://wt.jrc.it/lt/acquis/">http://wt.jrc.it/lt/acquis/</xref>
                    </distributor>
                </publicationStmt>
                <notesStmt>
                    ....
                </notesStmt>
                <sourceDesc>
                    <bibl>Downloaded from <xref url="Downloading_URL">Downloading_URL</xref> on <date>Downloading_DATE</date></bibl>
                </sourceDesc>
            </fileDesc>
            <profileDesc>
                <textClass>
                    <classCode scheme="eurovoc">Eurovoc_Code</classCode>
                    .....
                </textClass>
            </profileDesc>
        </teiHeader>
        <text>
            <body>
                <head n="1">Document Title</head>
                <div type="body">
                    <p n="paragraph_number">... TEXT...</p>
                    .......
                </div>
                <div type="signature">
                    <p n="paragraph_number">... signature text...</p>
                    ....
                </div>
                <div type="annex">
                    <p n="paragraph_number">... annex text...</p>
                    ....
                </div>
            </body>
        </text>
    </TEI.2>
   
Note that the title, body text, signature and annex further contain <p>...</p> tags. Each tag contains as attribute (n) its sequential number in the document, which is used in the paragraph alignment.
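As an illustration of how this format can be read, the sketch below uses Python's standard xml.etree.ElementTree module. The sample document and its paragraph numbers are made up for the example, and `read_paragraphs` is a hypothetical helper, not part of the distribution.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document following the format shown above.
SAMPLE = """<TEI.2 id="jrcCELEX-LG" n="CELEX" lang="LG">
<teiHeader lang="en" date.created="DATE">
<fileDesc><titleStmt><title>JRC-ACQUIS CELEX LANGUAGE</title></titleStmt></fileDesc>
</teiHeader>
<text><body>
<head n="1">Document Title</head>
<div type="body"><p n="2">First paragraph.</p><p n="3">Second paragraph.</p></div>
<div type="signature"><p n="4">Signature text.</p></div>
<div type="annex"><p n="5">Annex text.</p></div>
</body></text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Return (celex, lang, parts), where parts maps each div type
    to a dict {paragraph number: paragraph text}."""
    root = ET.fromstring(xml_text)
    parts = {}
    for div in root.iter("div"):
        parts[div.get("type")] = {
            int(p.get("n")): (p.text or "") for p in div.findall("p")
        }
    return root.get("n"), root.get("lang"), parts


celex, lang, parts = read_paragraphs(SAMPLE)
print(celex, lang, sorted(parts["body"]))
```

Keeping the paragraphs in a dictionary keyed by the n attribute makes it easy to look them up later from the alignment information.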

5) Sentence alignment of the texts across languages

Strictly speaking, the corpus is currently aligned at the paragraph level, as it was the <P> elements that were aligned. However, the paragraphs of the AC Corpus are usually short and usually contain one sentence, or even only part of a sentence.
The alignment was done using two different alignment programs: Vanilla and HunAlign. Vanilla was written by Pernilla Danielsson and Daniel Ridings. It implements the widespread Church and Gale / Dynamic Time Warping algorithm. The C source and documentation of the program are available at http://nl.ijs.si/telri/Vanilla/. HunAlign was described by Varga, Halácsy, Kornai, Nagy, Németh & Trón.
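The general idea behind length-based alignment can be sketched as a dynamic programme over paragraph lengths. The version below is a deliberate simplification: it scores a candidate link by the absolute difference in character length plus an arbitrary penalty for link types other than one-to-one, whereas the original algorithm uses a probabilistic cost. It is not the Vanilla or HunAlign implementation, and all names and penalty values are assumptions.

```python
# Allowed link types (source paragraphs, target paragraphs) with arbitrary penalties.
LINK_TYPES = {(1, 1): 0, (1, 0): 50, (0, 1): 50, (2, 1): 10, (1, 2): 10}


def align_by_length(src, tgt):
    """Return a list of links ([source indices], [target indices]), 0-based."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in LINK_TYPES.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_src - len_tgt) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the cheapest path.
    links = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        links.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    links.reverse()
    return links


print(align_by_length(["aaaa", "bbbbbbbbbb", "cc"], ["AAAA", "BBBBB", "BBBBB", "CC"]))
# prints: [([0], [0]), ([1], [1, 2]), ([2], [3])]
```

In the example, the long second source paragraph is linked to two shorter target paragraphs, which is the kind of one-to-many link counted in the alignment statistics.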
Having two alignments for the same bilingual corpora allows comparison and benchmarking of alignment tools and algorithms for multiple language pairs. To our knowledge, no alignment tools have been tested on so many different language pairs.
We decided to align the sentences of each language pair separately, instead of using one pivot language. As the corpus exists in twenty-two languages, there are 231 possible language pair combinations (462 language pair directions). For each individual language pair, we thus produced files containing the language pair-specific alignment information. These files contain, for each document identifier (CELEX number), pointers to the paragraphs (in the "n" attribute) that are translations of each other. The format used is that of the Text Encoding Initiative (TEI).
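The pair counts can be checked by enumerating the pairs, which is also a convenient way to loop over all alignment files. The language codes are those of the statistics table; the variable names are illustrative.

```python
from itertools import combinations, permutations

LANGS = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

pairs = list(combinations(LANGS, 2))       # unordered pairs, e.g. ("bg", "cs")
directions = list(permutations(LANGS, 2))  # ordered pairs, e.g. ("cs", "bg")

print(len(pairs), len(directions))
# prints: 231 462
```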
Due to the size of the corpus and the number of language pairs, the files do not contain the text itself. If you want to produce the parallel corpus for a specific language pair, you thus need to generate this corpus on the basis of the monolingual corpora (which all contain paragraph identifiers in the <p> tags) and the alignment information, using the tool provided by the JRC.
See the alignment directory for further information on how to generate such an aligned corpus. 
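The principle of this generation step can be sketched as follows. This is not the tool provided by the JRC. The pointer notation used here (paragraph numbers of the first language, a semicolon, then paragraph numbers of the second language, as in "2 3;2") is an assumption, so check the files in the alignment directory for the exact format.

```python
def parse_pointer(pointer):
    """Split a pointer such as '2 3;2' into two lists of paragraph numbers."""
    left, right = pointer.split(";")
    return [int(n) for n in left.split()], [int(n) for n in right.split()]


def build_parallel(paras_a, paras_b, pointers):
    """Join the monolingual paragraphs referenced by each pointer.

    paras_a and paras_b map paragraph numbers (the n attribute) to text.
    """
    rows = []
    for pointer in pointers:
        ids_a, ids_b = parse_pointer(pointer)
        text_a = " ".join(paras_a[n] for n in ids_a)
        text_b = " ".join(paras_b[n] for n in ids_b)
        rows.append((text_a, text_b))
    return rows


paras_en = {1: "Title", 2: "First.", 3: "Second."}
paras_fr = {1: "Titre", 2: "Premier. Deuxième."}
for row in build_parallel(paras_en, paras_fr, ["1;1", "2 3;2"]):
    print(row)
```

Because the alignment files only hold pointers, the same monolingual files can be reused for every language pair.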

6) Eurovoc classification of the texts

Most CELEX (EU) documents have been manually classified according to the subject domains to which they belong. The classification scheme used is the Eurovoc thesaurus (http://eurovoc.europa.eu/), which is a multilingual wide-coverage conceptual thesaurus. The European Parliament, large parts of the European Commission and about twenty national and regional European parliaments use Eurovoc for the classification of their documents. The Eurovoc thesaurus consists of over 6,000 descriptor terms (classes) that are organised hierarchically into up to eight levels, using the relationships Broader Term - Narrower Term (BT-NT) to describe the hierarchical relationship, and Related Term (RT) to link descriptors that are related but not linked hierarchically. Additionally, synonyms and near-synonyms for some of the descriptors are listed, marked with the Use-For (UF) tag. Eurovoc exists in over twenty languages and is maintained actively. As the descriptors are defined precisely with Scope Notes, each descriptor has exactly one translation in each of the languages. Numerical descriptor IDs link the various language versions. This feature makes Eurovoc an ideal means for cross-lingual search and retrieval applications and more.
While the parliaments use professional human indexers to classify their documents manually, the JRC has been working on automating this task. For details, see http://langtech.jrc.ec.europa.eu/Eurovoc.html.
Most Acquis Communautaire texts have been classified manually with Eurovoc descriptors. The file celex-EurovocId.txt contains the lists of numerical descriptor IDs that have been assigned to each of the AC documents. As the AC documents have been written over a period of about fifty years and the Eurovoc thesaurus keeps evolving, the documents are indexed with different Eurovoc versions. The Eurovoc descriptor codes for documents older than 1995 are not currently available. Furthermore, a small number of newer documents also seem not to have been Eurovoc-indexed, so Eurovoc descriptor codes are not available for all AC documents.
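A file of this kind could be loaded as shown below. The layout is an assumption (one document per line, the CELEX identifier first, followed by whitespace-separated numerical descriptor IDs), and the identifiers in the example are placeholders; adapt the parsing to the actual file.

```python
def parse_eurovoc_assignments(lines):
    """Map each CELEX identifier to its list of numerical descriptor IDs.

    Assumed line layout: '<CELEX id> <descriptor id> <descriptor id> ...'
    """
    assignments = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        assignments[fields[0]] = [int(x) for x in fields[1:]]
    return assignments


# Placeholder identifiers, for illustration only.
sample_lines = ["CELEX_A 1234 5678", "", "CELEX_B 42"]
print(parse_eurovoc_assignments(sample_lines))
```

Documents without Eurovoc codes would either be missing from the mapping or have an empty list, so code that uses it should allow for both.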
With this distribution, we provide the numerical Eurovoc descriptor codes. Should you be interested in the descriptor text (the class name in any of the EU languages), you will need to get the licence for Eurovoc from OPOCE.

7) Usage conditions / Licensing issues

Acquis Communautaire corpus

According to an agreement with the European Commission's Office for Official Publications OPOCE, the AC corpus can be used and distributed for research purposes, but the following usage conditions must be adhered to:
The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series as well as charters and treaties and ECJ case-law to be in the public domain. Prior written permission is thus not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgment is given to the European Communities and to the source, and provided that the additional guidelines set out below are respected.
(1) Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read:
'Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.'
(2) For the reasons stated in the disclaimer above, it is advisable to ensure that translations are made from the printed, authentic version of the Official Journal. This precaution, while minimizing the risk of error, does not confer any legal status whatsoever to the translated text. The following notice shall accompany the translated text, printed below the acknowledgment:
'Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder].'
Moreover, please note that we do not consider a "further commercial dissemination" the inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/thesis/studies/reports/books issued by third-party authors or publishers, whatever the means, and disseminated subject to payment.

Eurovoc thesaurus

Unlike the JRC-Acquis corpus, the Eurovoc thesaurus (http://eurovoc.europa.eu/) must not be used or disseminated without prior written permission from the European Commission's Office for Official Publications OPOCE. If you want to get the rights to use Eurovoc and to receive a copy of the multilingual thesaurus, please contact OPOCE at OP-INFO-COPYRIGHT@publications.europa.eu, mentioning the file reference number 2005-COP-395. To our knowledge, the licence is free of charge for research purposes and a commercial licence costs 500 Euro. To obtain a commercial licence, please contact OPOCE.

8) Related information

The JRC Workshop on Exploiting multilingual parallel corpora (26-27 September 2005) was dedicated to exploring methods to exploit the Acquis Communautaire and similar corpora. You can find more information on the workshop web page http://langtech.jrc.europa.eu/0509_EU-Enlargement-Workshop.html.
A description of version 2.2 of the Acquis Communautaire corpus was published in the paper below. Please use this publication as a reference when you mention the JRC-Acquis. You may want to check the web site http://langtech.jrc.europa.eu/ for more up-to-date publications on the subject.
Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. Available at http://langtech.jrc.europa.eu/.

9) Contributors

The following persons have contributed to the gathering, preparation and publication of the aligned Acquis Communautaire corpus:
Bruno Pouliquen   (Joint Research Centre, Italy)
Camelia Ignat   (Joint Research Centre, Italy)
Anna Widiger   (Joint Research Centre, Italy)
Mladen Kolar   (Joint Research Centre, Italy)

Tomaž Erjavec   (Jožef Stefan Institute, Slovenia)
Dan Tufiş   (Romanian Academy of Sciences, Romania)
Dániel Varga   (Budapest University of Technology and Economics, Hungary)
Ralf Steinberger   (Joint Research Centre, Italy)

The Bulgarian and Romanian versions were compiled by Alexandru Ceausu, Radu Ion and Dan Stefanescu from the Romanian Academy of Sciences.

10) Contact

To obtain a licence of the Eurovoc thesaurus, please contact the European Commission's Office for Official Publications OPOCE at opoce-info-copyright@cec.eu.int, mentioning the file reference number 2005-COP-395 (see above).
For information about the AC corpus and related work, please contact Ralf Steinberger or another member of the JRC's Language Technology team (see http://langtech.jrc.europa.eu/JRC_staff.html), using an email address of the form Firstname.Lastname@jrc.ec.europa.eu. The postal address is shown below. We would be pleased to hear how you use the corpus.
European Commission
Joint Research Centre - IPSC
Ralf Steinberger
T.P. 267
Via E. Fermi 2749
21027 Ispra (VA)
Italy
Fax: (+39) 0332 78 5154
http://langtech.jrc.ec.europa.eu/
Ispra, Italy, 29 May 2008
