This assignment counts for 5% of the course grade.
Assignments turned in after the deadline but before February 5
are subject to a 30% grade penalty.
Students who registered for the class on or after January 23
are not subject to the penalty.
In this assignment you will write a naive Bayes classifier to identify hotel reviews as either truthful or deceptive, and either positive or negative. You will use the word tokens as features for classification. The assignment will be graded based on the performance of your classifiers, that is, how well they perform on unseen test data compared to the performance of a reference classifier.
A ZIP file with training data will be made available on Blackboard. The data will consist of two files, plus readme and license files. The data files are in the following format:

train-text.txt, with a single training instance (hotel review) per line. The first token in each line is a unique 20-character alphanumeric identifier, which is followed by the text of the review.
train-labels.txt, with labels for the corresponding reviews. Each line consists of three tokens: a unique 20-character alphanumeric identifier corresponding to a review, a label (truthful or deceptive), and a label (positive or negative).
Each data file contains 1280 lines, corresponding to 1280 reviews.
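As a starting point, the two files can be read into dictionaries keyed by review identifier. This is only a sketch of one way to parse the format described above (error handling is omitted, and the helper names are illustrative):

```python
def load_text(path):
    """Map each review ID to its text; the first token on a line is the ID."""
    texts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rid, _, review = line.strip().partition(" ")
            texts[rid] = review
    return texts

def load_labels(path):
    """Map each review ID to its (truthful/deceptive, positive/negative) labels."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rid, label_a, label_b = line.split()
            labels[rid] = (label_a, label_b)
    return labels
```

Joining the two dictionaries on the shared identifiers then gives (text, labels) pairs for training.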
You will write two programs: nblearn.py will learn a naive Bayes model from the training data, and nbclassify.py will use the model to classify new data. If using Python 3, you will name your programs nblearn3.py and nbclassify3.py. The learning program will be invoked in the following way:
> python nblearn.py /path/to/text/file /path/to/label/file
The arguments are the two training files; the program will learn a naive Bayes model and write the model parameters to a file called nbmodel.txt. The format of the model is up to you, but it should contain the model parameters (that is, the various probabilities) in a way that can be visually inspected (so no binary files). You may use ordinary probabilities or log probabilities.
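The core of the learning step is estimating class priors and per-class word probabilities. The sketch below shows one common approach, multinomial naive Bayes with add-one smoothing over log probabilities; the smoothing choice and data representation are assumptions, since the assignment leaves both to you:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, tokens) pairs.
    Returns log priors, smoothed log likelihoods, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_lik = {}
    v = len(vocab)
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing so unseen words in a class get nonzero probability.
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + v))
                      for w in vocab}
    return log_prior, log_lik, vocab
```

Since each review carries two independent labels, you might train two such models (truthful/deceptive and positive/negative), or one model over the four combined classes; both are reasonable designs.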
The classification program will be invoked in the following way:
> python nbclassify.py /path/to/text/file
The argument is the test data file, which has the same format as the training text file. The program will read the parameters of a naive Bayes model from the file nbmodel.txt, classify each entry in the test data, and write the results to a text file called nboutput.txt in the same format as the label file from the training data.
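Once the parameters are read back in, classification is a matter of scoring each class and taking the argmax. A minimal sketch, assuming the model has been loaded into dictionaries of log probabilities (the representation is up to you), and skipping words outside the training vocabulary (one common convention):

```python
def classify(tokens, log_prior, log_lik, vocab):
    """Return the class with the highest posterior log probability."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        # Sum of log prior and log likelihoods of the known words.
        score = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best
```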
We will train your model, run your classifier on unseen test data, and compute the F1 score of your output compared to a reference annotation for each of the four classes (truthful, deceptive, positive, and negative). Your grade will be as follows:
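For reference, the per-class F1 score is the harmonic mean of precision and recall for that class. A sketch of the computation (the function name and dictionary-based representation are illustrative, not the actual grading script):

```python
def f1_score(gold, pred, cls):
    """Per-class F1 given gold and predicted labels, keyed by review ID."""
    tp = sum(1 for rid in gold if gold[rid] == cls and pred.get(rid) == cls)
    fp = sum(1 for rid in pred if pred[rid] == cls and gold.get(rid) != cls)
    fn = sum(1 for rid in gold if gold[rid] == cls and pred.get(rid) != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```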
All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.
Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output. The performance of your classifier will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.
Vocareum accounts are only made available to students who are registered for the class. Students who are on the waiting list are welcome to try the exercise on their own, but will not be able to submit until they have registered for the class.
If you have any issues with Vocareum with regards to logging in, submission, code not executing properly, etc., please contact Siddharth.