Assignments turned in after the deadline but before May 4 are not subject to penalty. Assignments will not be accepted after May 4.
Update (May 2): Added test case with two reference translations.
In this assignment you will implement a program that calculates the BLEU evaluation metric, as defined in Papineni, Roukos, Ward and Zhu (2002): Bleu: a Method for Automatic Evaluation of Machine Translation, ACL 2002. You will run the program on sets of candidate and reference translations, and calculate the BLEU score for each candidate. The assignment will be graded on how closely your calculated BLEU score matches the true BLEU score.
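As a rough sketch of the computation defined in the paper (modified n-gram precisions p1 through p4 pooled over the whole corpus, combined by a geometric mean, and multiplied by a brevity penalty), something along the following lines could serve as a starting point; the function and variable names are illustrative, and the tokenization and tie-breaking details are assumptions rather than requirements of the assignment.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate_sents, reference_sets, max_n=4):
    """candidate_sents: list of tokenized candidate sentences.
    reference_sets: one list of tokenized reference sentences per candidate sentence."""
    clipped = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # candidate n-gram counts, per n-gram order
    cand_len = 0            # total candidate length
    ref_len = 0             # total effective reference length

    for cand, refs in zip(candidate_sents, reference_sets):
        cand_len += len(cand)
        # Effective reference length: the reference closest in length to the candidate
        # (ties broken here toward the shorter reference, an assumption of this sketch).
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram count by its maximum count in any one reference.
            max_ref = Counter()
            for r in refs:
                for gram, cnt in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped[n - 1] += sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())

    # Geometric mean of the modified n-gram precisions (uniform weights 1/max_n).
    if any(c == 0 for c in clipped):
        return 0.0
    log_precision = sum(math.log(float(c) / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty: 1 if the candidate is longer than the effective reference length.
    bp = 1.0 if cand_len > ref_len else math.exp(1.0 - float(ref_len) / cand_len)
    return bp * math.exp(log_precision)
```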
You will write a Python program that takes two paths as parameters: the first parameter is the path to the candidate translation (a single file), and the second is the path to the reference translations (either a single file, or a directory if there are multiple reference translations). The program will write an output file called bleu_out.txt containing a single floating point number, representing the BLEU score of the candidate translation relative to the set of reference translations. If you use Python 2.7, name your program calculatebleu.py; if you use Python 3.4, name your program calculatebleu3.py.
For example, your program will be expected to handle:
> python calculatebleu.py /path/to/candidate /path/to/reference
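One possible way to handle the command-line arguments, the file-versus-directory case, and the required output file is sketched below. It assumes one sentence per line in each file, that every reference file covers the same sentences in the same order, and that a bleu() function like the sketch above is available; the helper names are hypothetical.

```python
import os
import sys

def read_sentences(path):
    """Read a file and return a list of tokenized sentences, one per line."""
    with open(path) as f:
        return [line.split() for line in f if line.strip()]

if __name__ == '__main__':
    candidate_path, reference_path = sys.argv[1], sys.argv[2]
    candidate = read_sentences(candidate_path)

    # The reference path may be a single file or a directory containing
    # one file per alternative reference translation.
    if os.path.isdir(reference_path):
        ref_files = [os.path.join(reference_path, name)
                     for name in sorted(os.listdir(reference_path))]
    else:
        ref_files = [reference_path]
    references = [read_sentences(p) for p in ref_files]

    # Regroup so each candidate sentence is paired with all of its references.
    reference_sets = list(zip(*references))

    score = bleu(candidate, reference_sets)  # bleu() as sketched earlier

    # The output file must contain a single floating point number.
    with open('bleu_out.txt', 'w') as out:
        out.write(str(score))
```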
You can test your program by running it on the following candidate and reference files, and comparing the result to the true BLEU score.
| Language | Candidate | Reference | BLEU score |
|---|---|---|---|
| German | candidate-1.txt | reference-1.txt | 0.151184476557 |
| Greek | candidate-2.txt | reference-2.txt | 0.0976570839819 |
| Portuguese | candidate-3.txt | reference-3.txt | 0.227803041867 |
| English | candidate-4.txt | (two reference translations) | 0.227894952018 |
The German, Greek and Portuguese reference translations above are excerpted from the common test set of the EUROPARL corpus; the candidate translations were obtained by taking the corresponding English sentences and running them through Google Translate. The English reference translations are from two alternative translations of the Passover Haggadah; the candidate translation was obtained by running the original Hebrew text through Google Translate. The actual test will be done with similar files.
Your grade on each test case will be min(your-bleu, true-bleu) / max(your-bleu, true-bleu).
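For example, if the true BLEU score on a test case is 0.151 and your program outputs 0.148, that case would be scored 0.148 / 0.151 ≈ 0.98.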
All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.
Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output. The performance of your program will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.
If you have any issues with Vocareum regarding logging in, submission, code not executing properly, etc., please contact Siddharth.