CSCI 544 — Applied Natural Language Processing
Research Project
Updates
(April 24)
- Project report assignment was set up on Crowdmark and sent
to team points of contact on April 21. Please follow the
template of one page per major section. There is no need to fill
each page with text, but each section must be adequately
addressed.
(April 3)
- Code demonstrations/presentations will take place on April 10 in
PHE 516.
- Each team must sign up for a presentation time slot; the link
to the sign-up sheet is on Piazza.
- Come prepared! You have only 5 minutes to demonstrate
your work, so practice how you will present it and what code you
want to show. You will not have time to show more than
2-3 pieces of the project, so choose in advance what you want to
demonstrate, and time your presentation appropriately.
- Proposals are due February 22, 2017, at 23:59 Pacific Time (11:59 PM).
- Code demonstrations/presentations will take place on April 10.
- Final reports are due April 28, 2017, at 23:59 Pacific Time (11:59 PM).
Overview
The research project is an in-depth activity that will be carried
out in teams of four. The project can be on any aspect of natural
language processing. You will formulate a research question, identify
resources and tools to address the question, implement and evaluate a
system that uses these resources and tools, demonstrate the system,
and write up a report.
- The research must involve a language other than English.
- The project must be a new effort, conducted specifically for
this class. You may build on and extend previous research, but this
project needs to add to that research, not just reuse old material.
- You must submit your code by putting it on a publicly available
repository (such as GitHub or
Bitbucket) and making the link
available, at least for the duration of the class.
- Your code may use external, publicly
available tools and resources, but only ones that you are able to
provide when you submit your code.
Procedure
- Form teams of four students; you may use the Piazza forum to
team up based on shared interests such as language and topic. Each
team should have strengths in all of the following areas: theory,
data, coding, and writing.
- Once you have formed your team, add your information on the
team spreadsheet, and create a thread on Piazza to
discuss your project ideas with the instructor and the TAs.
In the discussion, identify the language and problem you’re working
on, the data and tools you will use, and the effort you will put
in. You should aim to receive feedback before writing up
your project proposal.
- Submit your project proposal by the deadline.
- Implement and evaluate your project.
- Demonstrate/present your code (details will be arranged later),
and submit your code through a public repository.
- Submit your final report by the deadline.
Proposal structure
The proposal describes your plan for the research project, and will
serve as the skeleton for the final report. As a plan it is subject to
change and does not represent a firm commitment, but it should show
that you’ve thought through the relevant aspects of your research.
The proposal should be a document of about
500 words, written in English in good academic
style. Proposals that substantially exceed this length (above
600 words) will be penalized. The structure of the document
should be as follows.
- Title for the project.
- Names, USC IDs, and USC emails of the team members.
- Introduction. Motivate a specific problem that you will consider,
which involves the use of real-world natural language data. Describe
the problem you are trying to solve, why it is interesting or
challenging, what existing work has been done, and how your
contribution relates to that.
- Method.
- Materials. Identify the source data that you will use, such as a
specific corpus that you can get access to or collect yourself.
Describe the data in some detail, including the source, the amount
of data, and what kinds of annotation it has or needs.
- Procedure. Describe the methods you will use to process your
data, including algorithms, features, and tools, and any annotations
you will make (if needed).
- Evaluation. Describe how you will evaluate your system’s
performance, and your annotation procedure (if needed). What
measures will you use? What baseline system will you compare to?
- References cited.
- Division of labor between the teammates.
- Word count for the document.
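As an illustration of the Evaluation item above, the comparison between a system and a baseline might be sketched as follows. This is only a minimal sketch: the majority-class baseline, the accuracy and F1 measures, the function names, and the toy part-of-speech labels are all assumptions made for the example, not requirements of the assignment. Choose the measures and baseline that suit your own task.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f1(gold, predicted, label):
    """F1 score for a single label, computed from precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def majority_baseline(train_labels, n):
    """Predict the most frequent training label for every one of n test items."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Toy labels, made up purely for illustration.
train = ["NOUN", "NOUN", "VERB"]
gold = ["NOUN", "VERB", "NOUN", "VERB", "NOUN"]
system = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN"]
baseline = majority_baseline(train, len(gold))

print("system accuracy:  ", accuracy(gold, system))
print("baseline accuracy:", accuracy(gold, baseline))
print("system F1 (VERB):  ", f1(gold, system, "VERB"))
print("baseline F1 (VERB):", f1(gold, baseline, "VERB"))
```

Reporting a per-label measure alongside accuracy helps show where a system actually improves on a baseline that always predicts the most frequent label.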
The proposal should be written after you have received some
feedback about the general direction of your project. You will receive
written feedback about your proposal, which
should help you with writing the final report; however, feedback on
the proposal might take some time, so don’t delay
collecting your data and implementing your system while waiting for
comments on your proposal. For feedback on specific issues that arise
with the project, use Piazza.
Code demonstrations
Code demonstrations/presentations will take place on April 10
in PHE 516.
Each team will have a 5-minute slot to talk
about their work and demonstrate how their code works. No
presentation slides (there’s no time for that). The
code demo is a progress check; teams should have some working code to
show, but it is not expected to be a final version.
Final report
The final report describes the research you have done, reporting on
the method and results, relating the research to other work in the
field, and offering conclusions and directions for future work. The
report should be about 2000 words long, not counting the
references; reports that
substantially exceed this length will be penalized. The structure is
similar to the proposal, but with more detail, and two additional
sections following the method section.
- Title for the project.
- Names, USC IDs, and USC emails of the team members.
- Introduction. Motivate the problem that you have worked on,
which involves the use of real-world natural language data.
There is no need to motivate Natural Language Processing in general;
motivate your specific application instead.
Describe the problem, why it is interesting or
challenging, what existing work has been done, and how your
contribution relates to that.
- Method.
- Materials. Identify the source data that you
use, such as a specific corpus that you accessed or collected
yourself.
Describe the data in some detail, including the source, the amount
of data, what kinds of annotation it has, and what annotations
needed to be added (if any).
- Procedure. Describe the experimental
procedure, that is, the methods you used to process the data.
This may include algorithms, features, and tools, and any
annotations you made. Well-known methods (such as Naive Bayes or
Conditional Random Fields) do not need to be explained, but you do
need to explain how you use them, for example the features you
choose. If you created a tool (such as an annotation or
visualization tool), describe it in some detail.
- Evaluation. Describe your method for
evaluating the system’s performance, and for evaluating your
annotation procedure (if any).
Include a description of the specific measures you used,
and the baseline to which you compare your system.
- Results. Report how your system performs, and how it compares to
the baseline or to other comparable work. Discuss what it gets right,
what it gets wrong, and why.
- Discussion. Discuss conclusions that can be drawn from the
research, implications of your findings, the overall contribution to
the general NLP community, and directions for future research.
- References cited. You may choose your preferred style for
in-text citations (for example, numerical or author-year) and for
the reference listing, but please keep it consistent across the
document. The reference listing should contain all the information
required for accessing the reference – author(s), year,
title, and publication information (such as the conference, journal,
volume etc.).
- Division of labor between the teammates.
- Word count for the document, excluding references.
The six main content sections (introduction, materials, procedure,
evaluation, results, and discussion) carry equal weight. Therefore,
they should be of similar lengths – this means reserving
about 300–350 words for each section. This is only a general
guideline, as you may find that some sections require more text than
others. However, if you find you have more to say than fits within the
length requirement, then you’ll need to concentrate on the more
important aspects of your project.
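A rough way to check a draft against these length guidelines is sketched below. It assumes that a "word" is any whitespace-separated token and that the report text is available as one string per section; the function names and the dictionary layout are hypothetical, chosen only for this example.

```python
TARGET_TOTAL = 2000         # "about 2000 words", references excluded
SECTION_RANGE = (300, 350)  # rough per-section guideline


def count_words(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def word_budget(sections):
    """Return per-section counts and their total, ignoring the references."""
    counts = {
        name: count_words(text)
        for name, text in sections.items()
        if name != "references"
    }
    return counts, sum(counts.values())


def outside_guideline(counts, low=SECTION_RANGE[0], high=SECTION_RANGE[1]):
    """List the sections whose length falls outside the rough guideline."""
    return [name for name, n in counts.items() if not low <= n <= high]


# Placeholder text standing in for a real draft.
draft = {
    "introduction": "word " * 320,
    "materials": "word " * 280,
    "references": "word " * 150,
}
counts, total = word_budget(draft)
print(counts, total, outside_guideline(counts))
```

Since the per-section range is only a general guideline, a section flagged here is a prompt to reconsider the balance of the report, not an error.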
When giving examples of text in languages other than English,
please use the following multi-line format, to make the examples
readable to English speakers. Below is an example of how to present a
sentence in Hindi.
किस | ने | दवाई | को | खरीदा | (the original text in its native script)
kis | ne | davaaii | ko | khariidaa | (a transcription into Latin script)
who | ERG | medicine | ACC | bought | (a word-by-word gloss)
‘Who bought the medicine?’ | (a translation into English)
The second line is not needed if the language natively uses a version of the
Latin script.
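If you have many examples, the format above can be generated programmatically. The following is a minimal sketch, assuming each example is stored as parallel word lists; the function name and its arguments are hypothetical, and the " | " separator simply mirrors the cell boundaries in the example above.

```python
def format_gloss(original, transcription, gloss, translation):
    """Return the example as a list of lines.

    original, transcription, and gloss are lists with one entry per word.
    Pass transcription=None when the language natively uses a Latin script.
    """
    rows = [original]
    if transcription is not None:
        rows.append(transcription)
    rows.append(gloss)
    if any(len(row) != len(original) for row in rows):
        raise ValueError("every row needs exactly one entry per word")
    lines = [" | ".join(row) for row in rows]
    # The free translation goes last, in single quotation marks.
    lines.append("\u2018" + translation + "\u2019")
    return lines


example = format_gloss(
    ["किस", "ने", "दवाई", "को", "खरीदा"],
    ["kis", "ne", "davaaii", "ko", "khariidaa"],
    ["who", "ERG", "medicine", "ACC", "bought"],
    "Who bought the medicine?",
)
print("\n".join(example))
```

The length check catches the most common mistake in glossed examples: a gloss line that does not line up word for word with the original.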
Grading
The grade for the assignment will be broken down as follows.
- 10% Discussion of the project with the instructor and TAs.
- 10% On-time submission of a coherent proposal.
- 10% Difficulty, creativity and originality.
- 20% Code demonstration and code quality.
- 50% Final report.
The research project counts for 30% of the overall course grade.
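The arithmetic behind this breakdown can be sketched as follows. The weights and the 30% share come from the text above; the assumption that component scores combine as a simple weighted sum, the component names, and the sample scores are illustrative only.

```python
# Percent of the project grade carried by each component (from the breakdown above).
WEIGHTS = {
    "discussion": 10,
    "proposal": 10,
    "difficulty": 10,
    "code": 20,
    "report": 50,
}
PROJECT_SHARE = 30  # percent of the overall course grade


def project_grade(scores):
    """scores maps each component to a fraction in [0, 1]; returns a percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def course_points(scores):
    """Percentage points the project contributes to the overall course grade."""
    return project_grade(scores) * PROJECT_SHARE / 100


# Hypothetical scores, made up for illustration.
sample = {
    "discussion": 1.0,
    "proposal": 1.0,
    "difficulty": 0.5,
    "code": 0.5,
    "report": 0.8,
}
print(project_grade(sample), course_points(sample))
```

Because the final report carries half of the project grade, it accounts for up to 15 percentage points of the overall course grade on its own.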