University of Southern California

CSCI 544 — Applied Natural Language Processing


Coding Exercise 1

Due: Tuesday, February 2, at 23:59 Pacific Time (11:59 PM)

This assignment counts for 5% of the course grade.

Assignments turned in after the deadline but before Friday, February 5, are subject to a 20% grade penalty.


Overview

Person names in the English language typically consist of one or more forenames followed by one or more surnames (optionally preceded by zero or more titles and followed by zero or more suffixes). This situation can create ambiguity, as it is often unclear whether a particular name is a forename or a surname. For example, given the sequence Imogen and Andrew Lloyd Webber, it is not possible to tell what the full name of Imogen is, since that would depend on whether Lloyd is part of Andrew’s forename or surname (as it turns out, it is a surname: Imogen Lloyd Webber is the daughter of Andrew Lloyd Webber). This exercise explores ways of dealing with this kind of ambiguity.

You will write a program that takes a string representing the names of two persons (joined by and), and tries to predict the full name of the first person in the string. To develop your program, you will be given a set of names with correct solutions: these are not names of real people – rather, they have been constructed based on lists of common forenames and surnames. The names before the and are the first person’s forenames, any titles they may have, and possibly surnames; the names after the and are the second person’s full name. For each entry, your program will output its best guess as to the first person’s full name. The assignment will be graded based on accuracy, that is, the proportion of names predicted correctly on an unseen dataset constructed the same way.
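As an illustration of the task, a minimal baseline heuristic might assume the second person's last token is a single surname and attach it to everything before the and. (This function and its input format are a sketch, not part of the assignment specification; note that it gets the Lloyd Webber example wrong, which is exactly the ambiguity your predictor should try to resolve.)

```python
def predict_full_name(entry: str) -> str:
    """Baseline sketch: append the second person's last token (assumed to
    be a one-word surname) to everything before the 'and'. A stronger
    predictor would consult name lists to decide how many trailing tokens
    of the second name are surnames shared by the first person."""
    first_part, second_part = entry.split(" and ", 1)
    surname = second_part.split()[-1]
    return f"{first_part} {surname}"

print(predict_full_name("Imogen and Andrew Lloyd Webber"))  # Imogen Webber
```

This baseline predicts "Imogen Webber", whereas the correct answer is "Imogen Lloyd Webber" – the gap between the two is what the development data lets you measure and improve.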

Data

A set of development data is available as a compressed ZIP archive: coding-1-dev-data.zip. The uncompressed archive contains the following files:

Not included in the package are explanations about the U.S. Census Bureau lists: Explanation of the 1990 tables; Explanation of the 2010 tables.

The submission script will run your program on the test file and compare the output to the key file. The grading script will do the same, but on a different pair of test and key files which you have not seen before.

Program

You will write a program called full-name-predictor.py in Python 3 (Python 2 has been deprecated), which will take the path to the test file as a command-line argument. Your program will be invoked in the following way:

> python full-name-predictor.py /path/to/test/data

The program will read the test data, and write its answers to a file called full-name-output.csv. The output file must be in the same format as the key file.
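The command-line contract above can be sketched as follows. The exact column layout of the test and key files is not specified here, so this sketch assumes one name string per CSV row and echoes each input alongside its prediction; adapt the I/O to match the actual key-file format.

```python
import csv
import sys


def predict(entry: str) -> str:
    # Placeholder heuristic (an assumption, not the required method):
    # treat the last token after 'and' as a one-word shared surname.
    first, second = entry.split(" and ", 1)
    return f"{first} {second.split()[-1]}"


def main() -> None:
    test_path = sys.argv[1]  # path to the test file, as in the invocation above
    with open(test_path, newline="", encoding="utf-8") as f:
        entries = [row[0] for row in csv.reader(f) if row]
    # Write answers to full-name-output.csv, the required output file name.
    with open("full-name-output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for entry in entries:
            writer.writerow([entry, predict(entry)])


if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Keeping the prediction logic in its own function makes it easy to test locally against the development key before submitting.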

Submission

All submissions will be completed through Vocareum; please consult the instructions for how to use Vocareum.

Multiple submissions are allowed; only the final submission will be graded. Each time you submit, a submission script is invoked, which runs the program on the test data. Do not include the test or key files in your submission: the submission script reads the test file from a central directory, not from your personal directory. You should only upload your program file to Vocareum, that is, full-name-predictor.py; if your program uses auxiliary files (for example, lists of common names), then you must also include these in your personal directory.

You are encouraged to submit early and often in order to iron out any problems, especially issues with the format of the final output.

The output of your program will be graded automatically; failure to format your output correctly may result in very low scores, which will not be changed.

For full credit, make sure to submit your assignment well before the deadline. The time of submission recorded by the system is the time used for determining late penalties. If your submission is received late, whatever the reason (including equipment failure and network latencies or outages), it will incur a late penalty.

Grading

After the due date, we will run your program on unseen test data, and compare your output to the key to that test data. Your grade will be the accuracy of your output, scaled relative to the accuracy of a predictor developed by the instructional staff (so if, for example, that predictor has an accuracy of 90%, then an accuracy of 90% or above will receive full credit, and an accuracy of 81% will receive 90% credit).
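The scaling rule described above amounts to dividing your accuracy by the staff predictor's accuracy and capping the result at 100%. A small sketch (the 90% staff accuracy is only the example figure from the text, not a guarantee):

```python
def scaled_grade(accuracy: float, staff_accuracy: float = 0.90) -> float:
    """Credit is accuracy relative to the staff predictor, capped at 1.0.
    staff_accuracy=0.90 is the illustrative value from the handout."""
    return min(1.0, accuracy / staff_accuracy)

print(scaled_grade(0.95))  # 1.0 (at or above the staff predictor: full credit)
print(scaled_grade(0.81))  # ~0.9 (90% credit)
```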

Notes

Collaboration and external resources