This assignment counts for 5% of the course grade.
Assignments turned in after the deadline but before February 5
are subject to a 30% grade penalty.
Students who registered for the class on or after January 23
are not subject to the penalty.
In this assignment you will write a naive Bayes classifier to identify hotel reviews as either truthful or deceptive, and either positive or negative. You will use the word tokens as features for classification. The assignment will be graded based on the performance of your classifiers, that is, how well they perform on unseen test data compared to the performance of a reference classifier.
A ZIP file with training data will be made available on Blackboard. The data will consist of two files, plus readme and license files. The data files are in the following format:

train-text.txt, with a single training instance (hotel review) per line. The first token in each line is a unique 20-character alphanumeric identifier, which is followed by the text of the review.
train-labels.txt, with labels for the corresponding reviews. Each line consists of three tokens: a unique 20-character alphanumeric identifier corresponding to a review, a label (truthful or deceptive), and a label (positive or negative).
Each data file contains 1280 lines, corresponding to 1280 reviews.
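As a starting point, the two files can be read into dictionaries keyed by review identifier. This is only a sketch of one way to parse the format described above (error handling is omitted, and the helper names are illustrative):

```python
def load_text(path):
    """Map each review ID to its text; the first token on a line is the ID."""
    texts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rid, _, review = line.strip().partition(" ")
            texts[rid] = review
    return texts

def load_labels(path):
    """Map each review ID to its (truthful/deceptive, positive/negative) labels."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rid, label_a, label_b = line.split()
            labels[rid] = (label_a, label_b)
    return labels
```

Joining the two dictionaries on the shared identifiers then gives (text, labels) pairs for training.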
You will write two programs: nblearn.py will learn a naive Bayes model from the training data, and nbclassify.py will use the model to classify new data. If using Python 3, you will name your programs nblearn3.py and nbclassify3.py. The learning program will be invoked in the following way:
> python nblearn.py /path/to/text/file /path/to/label/file
The arguments are the two training files; the program will learn a naive Bayes model and write the model parameters to a file called nbmodel.txt. The format of the model is up to you, but it should contain the model parameters (that is, the various probabilities) in a way that can be visually inspected (so no binary files). You may use ordinary probabilities or log probabilities.
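The core of the learning step is estimating class priors and per-class word probabilities. The sketch below shows one common approach, multinomial naive Bayes with add-one smoothing over log probabilities; the smoothing choice and data representation are assumptions, since the assignment leaves both to you:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, tokens) pairs.
    Returns log priors, smoothed log likelihoods, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}
    log_lik = {}
    v = len(vocab)
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing so unseen words in a class get nonzero probability.
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + v))
                      for w in vocab}
    return log_prior, log_lik, vocab
```

Since each review carries two independent labels, you might train two such models (truthful/deceptive and positive/negative), or one model over the four combined classes; both are reasonable designs.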
The classification program will be invoked in the following way:
> python nbclassify.py /path/to/text/file
The argument is the test data file, which has the same format as the training text file. The program will read the parameters of a naive Bayes model from the file nbmodel.txt, classify each entry in the test data, and write the results to a text file called nboutput.txt in the same format as the label file from the training data.
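Once the parameters are read back in, classification is a matter of scoring each class and taking the argmax. A minimal sketch, assuming the model has been loaded into dictionaries of log probabilities (the representation is up to you), and skipping words outside the training vocabulary (one common convention):

```python
def classify(tokens, log_prior, log_lik, vocab):
    """Return the class with the highest posterior log probability."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        # Sum of log prior and log likelihoods of the known words.
        score = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best
```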
We will train your model, run your classifier on unseen test data, and compute the F1 score of your output compared to a reference annotation for each of the four classes (truthful, deceptive, positive, and negative). Your grade will be as follows:
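For reference, the per-class F1 score is the harmonic mean of precision and recall for that class. A sketch of the computation (the function name and dictionary-based representation are illustrative, not the actual grading script):

```python
def f1_score(gold, pred, cls):
    """Per-class F1 given gold and predicted labels, keyed by review ID."""
    tp = sum(1 for rid in gold if gold[rid] == cls and pred.get(rid) == cls)
    fp = sum(1 for rid in pred if pred[rid] == cls and gold.get(rid) != cls)
    fn = sum(1 for rid in gold if gold[rid] == cls and pred.get(rid) != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```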
All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.
Multiple submissions are allowed, and your last submission will be graded. The submission script runs the program in a similar way to the grading script (but with different data), so you are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output. The performance of your classifier will be measured automatically; failure to format your output correctly may result in very low scores, which will not be changed.
Vocareum accounts are only made available to students who are registered for the class. Students who are on the waiting list are welcome to try the exercise on their own, but will not be able to submit until they have registered for the class.
If you have any issues with Vocareum with regards to logging in, submission, code not executing properly, etc., please contact Siddharth.