Assignments may be turned in until March 19 without penalty.
Note: USC Spring Recess is March 12–19, and the due date of the assignment is before Spring Recess. Due to student requests, I am allowing submission until the end of Spring Recess. However, keep in mind that during Spring Recess there will be no office hours, and the instructional staff will spend less time than usual on Piazza. Therefore, students are strongly encouraged to complete the assignment before Spring Recess.
In this assignment you will write a Hidden Markov Model part-of-speech tagger for Catalan. The training data is provided tokenized and tagged; the test data will be provided tokenized, and your tagger will add the tags. The assignment will be graded on the performance of your tagger, that is, how well it performs on unseen test data compared to a reference tagger.
A set of training and development data will be made available as a compressed ZIP archive on Blackboard. The uncompressed archive will have the following format:
The grading script will train your model on all of the tagged training and development data, and test the model on unseen data in a similar format.
You will write two programs: hmmlearn.py will learn a hidden Markov model from the training data, and hmmdecode.py will use the model to tag new data. If using Python 3, you will name your programs hmmlearn3.py and hmmdecode3.py. The learning program will be invoked in the following way:
> python hmmlearn.py /path/to/input
The argument is a single file containing the training data; the program will learn a hidden Markov model, and write the model parameters to a file called hmmmodel.txt. The format of the model is up to you, but it should contain sufficient information for hmmdecode.py to successfully tag new data.
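As a rough illustration of the learning step, the Python 3 sketch below collects transition and emission counts and writes them out as JSON. It rests on assumptions not stated in this description: that each line of the training file is one sentence, that tokens are separated by whitespace, and that each token has the form word/TAG. The tag names, the sample sentences, the `<s>` start state, and the helper names `count_hmm` and `write_model` are all hypothetical, and JSON is only one possible model format.

```python
import json
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sentence state


def count_hmm(lines):
    """Collect raw transition and emission counts from tagged sentences."""
    transitions = defaultdict(lambda: defaultdict(int))  # prev tag -> tag -> count
    emissions = defaultdict(lambda: defaultdict(int))    # tag -> word -> count
    for line in lines:
        prev = START
        for token in line.split():
            # Assumed token format word/TAG; split at the last slash so
            # that words containing "/" stay intact.
            word, tag = token.rsplit("/", 1)
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
    return transitions, emissions


def write_model(transitions, emissions, path="hmmmodel.txt"):
    """Store the counts as JSON; the model format is a free choice."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"transitions": transitions, "emissions": emissions},
                  f, ensure_ascii=False)


# Made-up sample data with illustrative tag names.
sample = ["El/DET gat/NOUN dorm/VERB", "El/DET gos/NOUN"]
transitions, emissions = count_hmm(sample)
```

A real hmmlearn.py would read the file named by its command-line argument instead of the in-memory sample, and would likely convert the counts to smoothed probabilities either before writing the model or when the decoder loads it.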
The tagging program will be invoked in the following way:
> python hmmdecode.py /path/to/input
The argument is a single file containing the test data; the program will read the parameters of a hidden Markov model from the file hmmmodel.txt, tag each word in the test data, and write the results to a text file called hmmoutput.txt in the same format as the training data.
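Decoding with a hidden Markov model is commonly done with the Viterbi algorithm. The Python 3 sketch below shows one way to do it, under assumptions of its own: the probability tables are hand-written toy values rather than a learned model, unseen transitions and emissions receive a small floor probability (one simple choice among many for handling unknown words), and the output uses an assumed word/TAG format. The function name `viterbi` and its parameters are hypothetical.

```python
import math


def viterbi(words, tags, trans, emit, start="<s>", floor=1e-6):
    """Return the most probable tag sequence for words under the given tables."""
    if not words:
        return []

    def lp(table, a, b):
        # Log-probability lookup; unseen events fall back to a small floor.
        return math.log(table.get(a, {}).get(b, floor))

    # Each column maps tag -> (best log-probability so far, previous tag).
    best = [{t: (lp(trans, start, t) + lp(emit, t, words[0]), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + lp(trans, p, t), p)
                              for p in tags)
            column[t] = (score + lp(emit, t, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


# Toy probabilities with illustrative tag names.
tags = ["DET", "NOUN", "VERB"]
trans = {
    "<s>": {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1},
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.2, "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit = {
    "DET": {"el": 0.7, "la": 0.3},
    "NOUN": {"gat": 0.5, "gos": 0.5},
    "VERB": {"dorm": 1.0},
}
words = ["el", "gat", "dorm"]
path = viterbi(words, tags, trans, emit)
tagged = " ".join(f"{w}/{t}" for w, t in zip(words, path))
print(tagged)  # el/DET gat/NOUN dorm/VERB
```

Working in log space avoids numerical underflow on long sentences, since a product of many small probabilities becomes a sum of logarithms.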
We will train your model, run your tagger on new test data, and compute the accuracy of your output compared to a reference annotation. Your grade will be the accuracy of your tagger, scaled to the performance of a reference HMM tagger developed by us. Since part-of-speech tagging can achieve high accuracy by using a baseline tagger that just gives the most common tag for each word, only the performance above the baseline will be scaled:
For example, if the baseline is 90%, the reference is 95%, and your accuracy is 93%, then your grade will be 0.9 + 0.1 × 0.03 / 0.05 = 0.96, or 96%.
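The arithmetic in this example can be reproduced with a small Python function. The formula is inferred from the worked example rather than taken from the grading script: the baseline accuracy is kept as is, and the remaining share (1 minus the baseline) is scaled by how much of the gap between baseline and reference your tagger closes. The name `scaled_grade` is hypothetical, and the behavior for accuracies below the baseline or above the reference is not stated here, so the sketch does not clamp the result.

```python
def scaled_grade(accuracy, baseline, reference):
    """Scale only the performance above the baseline (inferred formula)."""
    gap_closed = (accuracy - baseline) / (reference - baseline)
    return baseline + (1 - baseline) * gap_closed


# Values from the worked example: baseline 90%, reference 95%, accuracy 93%.
grade = scaled_grade(0.93, 0.90, 0.95)
print(round(grade, 4))  # 0.96
```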
hmmlearn.py on the training data and 5 seconds for running hmmdecode.py on the development data, running on a MacBook Pro from 2012.
All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.
Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output. The accuracy of your tagger will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.
If you have any issues with Vocareum (logging in, submission, code not executing properly, etc.), please contact Siddharth.