[GSoC] The Apertium Project and mine
This is the first in a series of articles in English, marked [GSoC] for Google Summer of Code. I am working on the Apertium project (Apertium’s wiki, wikipedia). Here I will present my project, what I have done so far, and what the very next step is.
The Apertium Project
Apertium is a free software machine translation (MT) platform that began with closely related languages such as Catalan, Spanish, Galician and so on. The main goals of the project are:
- to provide a “human comprehensible” translation for language pairs that are not well supported by other MT engines and/or that are closely related.
- to be good (efficient) for building MT systems for smaller languages with little or no parallel corpora.
- to be fast.
It is also a sandbox for linguistics researchers. My supervisor has been working on the Welsh-English and Breton-French pairs, for example.
My project is to build a multi-engine translation synthesizer for Apertium. Why? Because multi-engine machine translation (MEMT) researchers have shown that it can lead to improvements, and Apertium’s translation is far from perfect. Are Google Translate or Systran perfect? No! We would like to combine the strengths (and not the weaknesses!) of different open source (free software compatible) engines. One of them is Moses, a statistical machine translation engine (MTE). As it is very different from Apertium, which is (roughly) a shallow-transfer, rule-based MTE, we hope to exploit these differences to produce a better translation by merging two translations: one from Moses, one from Apertium.
The project is to generate different hypotheses from the “best” possible combinations of the two translations, then rank them and propose the best one(s). It should be able to deal with MTEs other than Moses, so the hypothesis generation will be “generic” (syntactic). My part of the repository for this project is here, and there is also my page on Apertium’s wiki. People who are interested in more details can consult my proposal (pdf).
What has been done
The strong point of Moses is that it needs only a corpus of paired translations (a parallel corpus) to learn a language pair statistically. Its weak point is that it is very slow with big phrase tables. Spectie (my beloved mentor whose name should be preceded by “Ô”) provided one corpus for Welsh-English (cy-en) that will be used as . The generated phrase table has to be shrunk (it is currently 2.62 GB) using a statistical method: Fisher’s exact test on the frequencies of phrase pairs. What I have done so far is implement Johnson, J.H., Martin, J., Foster, G., and Kuhn, R. (2007): Improving Translation Quality by Discarding Most of the Phrasetable. My implementation prunes 30k lines in 6 minutes. This time can still be lowered, but I have a lot of other things to do right now.
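To give an idea of the significance test behind this pruning, here is a minimal sketch in Python. It is not my actual implementation, and the function names (`log_comb`, `neg_log_p_value`, `keep_pair`) and the threshold value are made up for illustration. It assumes that for each phrase pair we already know four counts: `c_st` (sentence pairs containing both the source and the target phrase), `c_s` and `c_t` (sentence pairs containing the source phrase, resp. the target phrase) and `n` (total number of sentence pairs). The p-value is the upper tail of the hypergeometric distribution, and a pair is kept only if its -log(p) reaches the threshold.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def neg_log_p_value(c_st, c_s, c_t, n):
    """-log of the one-sided Fisher exact test p-value for a phrase pair.

    Assumes consistent counts: c_st <= min(c_s, c_t) and c_s, c_t <= n.
    p is the probability of seeing c_st or more co-occurrences if the
    source and target phrases were distributed independently.
    """
    # Log-probability of exactly k co-occurrences (hypergeometric).
    log_terms = [
        log_comb(c_s, k) + log_comb(n - c_s, c_t - k) - log_comb(n, c_t)
        for k in range(c_st, min(c_s, c_t) + 1)
    ]
    # Sum the tail in log space to avoid underflow on large corpora.
    top = max(log_terms)
    log_p = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_p


def keep_pair(c_st, c_s, c_t, n, threshold):
    """Keep a phrase pair only if it is significant enough."""
    return neg_log_p_value(c_st, c_s, c_t, n) >= threshold


if __name__ == "__main__":
    n = 10  # toy corpus size, purely illustrative
    threshold = math.log(n) + 0.01  # hypothetical threshold
    print(keep_pair(1, 1, 1, n, threshold))  # pair seen once, together
    print(keep_pair(5, 5, 5, n, threshold))  # pair always seen together
```

A pair whose two phrases each occur exactly once, in the same sentence pair, gets p = 1/n, so its -log(p) is log(n). A threshold just above log(n) therefore discards all such pairs, which is the kind of cut that removes a large share of a phrase table.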
And the next step
We will compute both translations and then merge them to form a combined translation that is better than Apertium’s or Moses’ alone. Easy? The very next step is to generate the word alignment between the two translations. My first implementation will be a pairwise (word-to-word) alignment with some kind of “minimal crossing edges” heuristic. To get the idea, it tries to minimize the number of crossings in a word-to-word alignment. I do not claim that this is the way to align two hypotheses, and this is “already worked on” stuff. There are already other things in the TODO list that can be found in my proposal. This is for our next rendez-vous.
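To make the “minimal crossing edges” idea concrete, here is a small Python sketch. It is only an illustration under strong assumptions, not the implementation I will end up with: candidate links are restricted to identical words, each word gets at most one link, and the search is exhaustive, so it is only usable on short sentences. The names `count_crossings` and `align_min_crossings` are made up for this example. Among all one-to-one alignments it first prefers the one with the most links, then the one with the fewest crossings.

```python
from itertools import product


def count_crossings(links):
    """Count pairs of links (i, j) and (k, l) that cross each other.

    Two links cross when their order differs on the two sides,
    e.g. i < k but j > l.
    """
    crossings = 0
    for x in range(len(links)):
        for y in range(x + 1, len(links)):
            (i, j), (k, l) = links[x], links[y]
            if (i - k) * (j - l) < 0:
                crossings += 1
    return crossings


def align_min_crossings(hyp_a, hyp_b):
    """Word-to-word alignment of two hypotheses (lists of words).

    Returns (links, crossings), where links is a list of (i, j) index
    pairs meaning hyp_a[i] is aligned with hyp_b[j].
    """
    # For each word of hyp_a: every identical word in hyp_b, or no link.
    options = []
    for i, word in enumerate(hyp_a):
        matches = [(i, j) for j, other in enumerate(hyp_b) if other == word]
        options.append(matches + [None])

    best_links, best_key = [], (0, 0)
    for choice in product(*options):
        links = [link for link in choice if link is not None]
        targets = [j for _, j in links]
        if len(set(targets)) != len(targets):
            continue  # a word of hyp_b would be linked twice
        # More links first, then fewer crossings.
        key = (-len(links), count_crossings(links))
        if key < best_key:
            best_links, best_key = links, key
    return best_links, best_key[1]


if __name__ == "__main__":
    a = "the cat sat on the mat".split()
    b = "on the mat the cat sat".split()
    links, crossings = align_min_crossings(a, b)
    print(links, crossings)
```

In the example, the word “the” appears twice on each side, and the choice of which “the” goes with which is exactly where minimizing crossings makes a difference. A real version would need candidate links beyond exact matches and a search that does not enumerate every combination.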