CSCI 544 — Applied Natural Language Processing


Homework 2

Due: February 4, 2016, at 23:59 Pacific Time (11:59 PM)

Assignments turned in after the deadline but before February 7 are subject to a 30% grade penalty. Students who registered for the class after January 25 are not subject to the penalty.

Deadline extended to February 4 due to server problems on the original deadline.


Overview

In this assignment you will write a naive Bayes classifier to identify hotel reviews as either truthful or deceptive, and either positive or negative. You will be using the word tokens as features for classification. The homework will be graded based on the performance of your classifiers, that is, how well they perform on unseen test data compared to the performance of a reference classifier.

Data

A set of training data will be made available as a compressed ZIP archive on Blackboard. The uncompressed archive will have the following format:

The grading script will train your model on all of the training data, and test the model on unseen data in a similar format. The directory structure and file names of the test data will not reveal the true labels of the individual test files.

Programs

You will write two programs: nblearn.py will learn a naive Bayes model from the training data, and nbclassify.py will use the model to classify new data. If using Python 3, you will name your programs nblearn3.py and nbclassify3.py. The learning program will be invoked in the following way:

> python nblearn.py /path/to/input

The argument is the directory of the training data; the program will learn a naive Bayes model, and write the model parameters to a file called nbmodel.txt. The format of the model is up to you, but it should contain sufficient information for nbclassify.py to successfully classify new data.
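As an illustration of what the learning step might compute, here is a minimal Python 3 sketch. It is not the reference implementation. It assumes lowercased whitespace tokenization, add-one smoothing, and a JSON model file; the assignment prescribes none of these, and the function names (tokenize, train, save_model) are hypothetical. It takes (label, text) pairs that are already in memory, so reading the training directory is left out.

```python
import json
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercased whitespace tokens; the handout only says "word tokens".
    return text.lower().split()


def train(examples):
    """Estimate naive Bayes parameters from an iterable of (label, text) pairs.

    Returns a dict with log priors and add-one smoothed log likelihoods.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)

    total_docs = sum(doc_counts.values())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n_docs in doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / total_docs)
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        model["log_likelihood"][label] = {
            w: math.log((word_counts[label][w] + 1) / denom) for w in vocab
        }
    return model


def save_model(model, path="nbmodel.txt"):
    # Assumption: JSON is one convenient format; the model format is up to you.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f)
```

Since each review carries two labels, one option is to call train twice, once with the truthful/deceptive labels and once with the positive/negative labels, and store both results in the same model file.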

The classification program will be invoked in the following way:

> python nbclassify.py /path/to/input

The argument is the directory of the test data; the program will read the parameters of a naive Bayes model from the file nbmodel.txt, classify each file in the test data, and write the results to a text file called nboutput.txt in the following format:

label_a label_b path1
label_a label_b path2

In the above format, label_a is either “truthful” or “deceptive”, label_b is either “positive” or “negative”, and pathn is the path of the text file being classified.
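The classification step and the output format can be sketched as follows. This is a hedged Python 3 example, not the required design: it assumes the model is available as two dictionaries of log probabilities, scores in log space to avoid underflow, and skips tokens that are not in the model. The function names (classify, format_line, write_output) are hypothetical.

```python
import math


def classify(tokens, log_prior, log_likelihood):
    """Return the label with the highest log posterior.

    Assumption: tokens missing from the model are ignored.
    """
    best_label, best_score = None, -math.inf
    for label in sorted(log_prior):
        score = log_prior[label]
        for token in tokens:
            if token in log_likelihood[label]:
                score += log_likelihood[label][token]
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def format_line(label_a, label_b, path):
    # One line per classified file: "label_a label_b path".
    return "{} {} {}".format(label_a, label_b, path)


def write_output(rows, out_path="nboutput.txt"):
    # rows is an iterable of (label_a, label_b, path) tuples.
    with open(out_path, "w", encoding="utf-8") as f:
        for label_a, label_b, path in rows:
            f.write(format_line(label_a, label_b, path) + "\n")
```

Summing log probabilities instead of multiplying raw probabilities keeps the scores in a safe numeric range even for long reviews.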

Grading

We will train your model, run your classifier on new test data, and compute the F1 score of your output compared to a reference annotation for each of the four classes (truthful, deceptive, positive, and negative). Your grade will be the mean of the four F1 scores, scaled to the performance of a naive Bayes classifier developed by the TAs. For example, if that classifier has F1=0.8, then a score of 0.8 will receive a grade of 100%, and a score of 0.72 will receive a grade of 90%.
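To check your own results before submitting, the grading arithmetic can be approximated with a short Python 3 sketch. This is not the actual grading script: the function names are hypothetical, F1 is computed per class from gold and predicted label lists, and the grade is assumed to be a plain ratio to the reference score. Whether grades above 100% are capped is not stated here, so the sketch does not cap them.

```python
def f1_score(gold, predicted, target):
    """F1 for one class, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scaled_grade(mean_f1, reference_f1):
    # Assumption: the grade is the simple ratio of the two scores, in percent.
    return 100.0 * mean_f1 / reference_f1
```

With a reference score of 0.8, scaled_grade(0.72, 0.8) gives roughly 90, which matches the grading example.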

Notes

Collaboration and external resources

Submission

All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.

Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output. The accuracy of your classifier will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.

Vocareum accounts are only made available to students who are registered for the class. Students who are on the waiting list are welcome to try the exercise on their own, but will not be able to submit until they have registered for the class.

If you have any issues with Vocareum (logging in, submission, code not executing properly, etc.), please contact Siddharth.