CSCI 544 — Applied Natural Language Processing
Coding Exercise 1
Due: Tuesday, February 2, at 23:59 Pacific Time (11:59 PM)
This assignment counts for 5% of the course grade.
Assignments turned in after the deadline but before
Friday, February 5, are subject to a 20% grade penalty.
- Update 2021-01-23: Due date set to February 2;
links added to explanations about the data;
note added about using word lists and a recommended work flow.
- Update 2021-01-21: Clarified that the test and key files should not be
uploaded into your personal directory.
Overview
Person names in the English language typically consist of one or
more forenames followed by one or more surnames (optionally preceded
by zero or more titles and followed by zero or more suffixes). This
situation can create ambiguity, as it is often unclear whether a
particular name is a forename or a surname. For example, given the
sequence Imogen and Andrew Lloyd Webber, it is not
possible to tell what the full name of Imogen is, since that would
depend on whether Lloyd is part of Andrew’s
forename or surname (as it turns out, it is a surname:
Imogen
Lloyd Webber is the daughter of
Andrew
Lloyd Webber). This exercise explores ways of dealing with this
kind of ambiguity.
You will write a program that takes a string representing the names
of two persons (joined by and), and tries to predict the full
name of the first person in the string. To develop your program, you
will be given a set of names with correct solutions: these are not
names of real people – rather, they have been constructed
based on lists of common forenames and surnames. The names before the
and are the first person’s forenames, any titles they
may have, and possibly surnames; the names after the and are
the second person’s full name. For each entry, your program will
output its best guess as to the first person’s full name. The
assignment will be graded based on accuracy, that is the number of
names predicted correctly on an unseen dataset constructed the same
way.
Data
A set of development data is available as a
compressed ZIP archive: coding-1-dev-data.zip.
The uncompressed archive contains the following files:
- A test file with a list of conjoined names.
- A key file with the same list of conjoined names, and the
correct full name of the first person for each example.
- Three lists of name frequencies from U.S. census data (these
lists are available on the web, and included in the package for your
convenience).
- A readme and license file, which will not be used for the exercise.
Not included in the package are explanations about the U.S. Census Bureau lists:
Explanation of the 1990 tables;
Explanation of the 2010 tables.
The submission script will run your program on the test file and
compare the output to the key file. The grading script will do the
same, but on a different pair of test and key files which you have not
seen before.
Program
You will write a program called full-name-predictor.py
in Python 3 (Python 2 has been deprecated),
which will take the path to the test file as a command-line
argument. Your program will be invoked in the following way:
> python full-name-predictor.py /path/to/test/data
The program will read the test data, and write its answers to a
file called full-name-output.csv
. The output file must be
in the same format of the key file.
Submission
All submissions will be completed through
Vocareum;
please consult the instructions
for how to use Vocareum.
Multiple submissions are allowed; only the final submission will be
graded. Each time you submit, a submission script is invoked, which
runs the program on the test data.
Do not include the test or key files in your submission: the submission script
reads the test file from a central directory, not from your
personal directory.
You should only upload your program file to Vocareum, that is
full-name-predictor.py
;
if your program uses auxiliary files (for example, lists
of common names), then you must also include these in your personal
directory.
You are encouraged to submit early and
often in order to iron out any problems, especially issues
with the format of the final output.
The output of your program will be
graded automatically; failure to format your output correctly may
result in very low scores, which will not be changed.
For full credit, make sure to submit your assignment well before
the deadline. The time of submission recorded by the system is the
time used for determining late penalties. If your submission is
received late, whatever the reason (including equipment failure and
network latencies or outages), it will incur a late penalty.
Grading
After the due date, we will run your program on a unseen test data,
and compare your output to the key to that test data. Your grade will
be the accuracy of your output, scaled to the output of a predictor
developed by the instructional staff (so if, for example, that
predictor has an accuracy of 90%, then an accuracy of 90% or above
will receive full credit, and an accuracy of 81% will receive
90% credit).
Notes
- Even the best models are likely to fall short of 100% accuracy;
the reason is that some names are highly ambiguous between forenames
and surnames.
To take two examples from the development data, both Douglas
and James are reasonably likely as either forenames and surnames;
as it turns out, George Douglas and… happen to be two
forenames, whereas Washington James and… happen to be
a forename and a surname. It is difficult to predict this based on
distributional properties alone.
- You are allowed to use the name lists included in the package.
While these are also available on the web, you
cannot use any code that fetches these
lists (or anything else) from the web at runtime; rather, you should
use the local copy.
You may also find it useful to preprocess these lists into a form
that is more useful for your program. Whatever you do, if your
program requires any auxiliary files such as name lists, you must
include these in your Vocareum submission.
- The name lists are strictly optional: you are not required to
use them at all. Preliminary results reached by the instructional
team are:
- 85% accuracy on the development data without using the name
lists at all.
- 89% accuracy on the development data using only the first
column in the lists of forenames (names only, no numbers, and no
surnames).
The above results may still improve as we refine the reference
solution. Our recommendation is that students start working on the
problem without using the word lists, see how far they get, and what
kind of errors the program makes; this will help identify what kind
of information is missing, and whether such information can be
extracted from the word lists.
- You may use machine learning (if you wish), but if you learn
properties from the key file, it would be better to do this learning
separately, and then store the outcome (for example, parameter
values) in the submitted program. You will not be able to do any
online learning with the key file for the grading data, because that
file will not be provided.
- While there are many properties that can be learned from the key
file, some are better to not learn from
it – for example, the distribution of forenames and
surnames. As explained in the overview section, names in the
development data were sampled from name lists. While it is possible
to learn a distribution of names from the key file, such a
distribution will be less accurate than learning the distribution
from the source lists provided in the readme file. Other properties
(for example, the distribution of single and double names) can be
estimated from the key file, with a caveat that for some names in
the data, we may not be able to determine conclusively whether they
are forenames or surnames.
- There is no list of titles corresponding to the lists of
forenames and surnames. The best we can do with titles is to guess
what the possible titles are based on the given data.
Collaboration and external resources
- This is an individual assignment. You may
discuss ideas for the solution with other students in class or on
class message boards, but you may not work in teams
or share any code with other students. You must be the sole author of 100%
of the code you turn in.
- You may not look for solutions on the web, or use code you find
online or anywhere else.
- You may not download the data from any source other than the
files provided on Blackboard, and you may not attempt to locate the
test data on the web or anywhere else.
- You may use packages in the Python Standard Library. You may not
use any other packages.
- You may use external resources to learn basic functions of
Python (such as reading and writing files, handling text strings, and
basic math), but the extraction and computation of models must be
your own work.
- Failure to follow the above rules is considered a violation of
academic integrity,
and is grounds for failure of the assignment, or
in serious cases failure of the course.
- We use plagiarism detection software to identify
similarities between student assignments, and between student
assignments and known solutions on the web.
Any attempt to fool plagiarism detection, for
example the modification of code to reduce its similarity to the
source, will result in an automatic failing grade for the
course.
- Please discuss any issues you have on the Piazza discussion
boards. Do not ask questions about the assignment by email; if we
receive questions by email where the response could be helpful for
the class, we will ask you to repost the question on the discussion
boards.