CSCI 544 — Applied Natural Language Processing


Homework 8

Due: April 29, 2016, at 23:59 Pacific Time (11:59 PM)

Assignments turned in after the deadline but before May 4 are not subject to penalty. Assignments will not be accepted after May 4.


Update (May 2): Added test case with two reference translations.


Overview

In this assignment you will implement a program that calculates the BLEU evaluation metric, as defined in Papineni, Roukos, Ward and Zhu (2002): Bleu: a Method for Automatic Evaluation of Machine Translation, ACL 2002. You will run the program on sets of candidate and reference translations, and calculate the BLEU score for each candidate. The assignment will be graded on how closely your calculated BLEU score matches the true BLEU score.
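
As a concrete reference point, here is a minimal sketch of the corpus-level BLEU computation described in the paper, assuming uniform weights (w_n = 1/4) and a maximum n-gram order of N = 4, as in the paper's baseline setting. The function names and the whitespace-tokenized input format are illustrative assumptions, not requirements of the assignment:

    from __future__ import division  # true division under both Python 2.7 and 3
    import math
    from collections import Counter

    def ngrams(tokens, n):
        # Counter of the n-grams (as tuples) in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, references, max_n=4):
        # candidate: list of token lists, one per segment.
        # references: list of lists of token lists; each inner list holds the
        # alternative reference translations for the corresponding segment.
        clipped = [0] * max_n  # clipped n-gram matches, summed over the corpus
        totals = [0] * max_n   # candidate n-gram counts, summed over the corpus
        cand_len = ref_len = 0
        for cand, refs in zip(candidate, references):
            cand_len += len(cand)
            # Effective reference length: the reference closest in length.
            ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
            for n in range(1, max_n + 1):
                cand_counts = ngrams(cand, n)
                # Clip each n-gram count by its max count in any one reference.
                max_ref = Counter()
                for r in refs:
                    for gram, c in ngrams(r, n).items():
                        max_ref[gram] = max(max_ref[gram], c)
                clipped[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
                totals[n - 1] += sum(cand_counts.values())
        if any(c == 0 for c in clipped):
            return 0.0  # some p_n is zero, so the geometric mean is zero
        # Geometric mean of the modified precisions p_n, with uniform weights.
        log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
        # Brevity penalty: 1 if the candidate is longer than the effective
        # reference length r, otherwise exp(1 - r/c).
        bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
        return bp * math.exp(log_prec)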

Program

You will write a Python program which takes two paths as parameters: the first is the path to the candidate translation (a single file), and the second is a path to the reference translations (either a single file, or a directory if there are multiple reference translations). The program will write an output file called bleu_out.txt containing a single floating-point number: the BLEU score of the candidate translation relative to the set of reference translations. If you use Python 2.7, name your program calculatebleu.py; if you use Python 3.4, name your program calculatebleu3.py. For example, your program will be expected to handle:

> python calculatebleu.py /path/to/candidate /path/to/reference
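
One way such a wrapper might be structured is sketched below (Python 3 syntax shown; it assumes the bleu() function from the sketch in the Overview is defined in the same file, and whitespace tokenization is again an assumption):

    import os
    import sys

    def read_segments(path):
        # Read one file; return a list of token lists, one per line.
        with open(path, encoding='utf-8') as f:
            return [line.split() for line in f]

    if __name__ == '__main__':
        cand_path, ref_path = sys.argv[1], sys.argv[2]
        candidate = read_segments(cand_path)
        if os.path.isdir(ref_path):
            # Multiple references: one file per reference translation.
            ref_files = [read_segments(os.path.join(ref_path, name))
                         for name in sorted(os.listdir(ref_path))]
        else:
            ref_files = [read_segments(ref_path)]
        # Regroup so each segment has a list of alternative references.
        references = list(zip(*ref_files))
        score = bleu(candidate, references)
        # The grader expects a single floating-point number in bleu_out.txt.
        with open('bleu_out.txt', 'w') as out:
            out.write(str(score))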

You can test your program by running it on the following candidate and reference files, and comparing your result to the true BLEU score.

Language     Candidate          Reference(s)                         BLEU score
German       candidate-1.txt    reference-1.txt                      0.151184476557
Greek        candidate-2.txt    reference-2.txt                      0.0976570839819
Portuguese   candidate-3.txt    reference-3.txt                      0.227803041867
English      candidate-4.txt    reference-4a.txt, reference-4b.txt   0.227894952018
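
For the multi-reference English case, the second argument would be a directory containing both reference files; the directory path below is just an example:

> python calculatebleu.py /path/to/candidate-4.txt /path/to/references-4/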

The German, Greek, and Portuguese reference translations above are excerpted from the common test set of the EUROPARL corpus; the candidate translations were obtained by taking the corresponding English sentences and running them through Google Translate. The English reference translations are from two alternative translations of the Passover Haggadah; the candidate translation was obtained by running the original Hebrew text through Google Translate. The actual test will be done with similar files.

Notes

Grading

Collaboration and external resources

Submission

All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.

Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often to iron out any problems, especially issues with the format of the final output. The performance of your program will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.

If you have any issues with Vocareum (logging in, submitting, code not executing properly, etc.), please contact Siddharth.