CSCI 544 — Applied Natural Language Processing
Research Project
Due dates
- March 2, 2020: Proposal and article selection
- March 24/31, 2020: Article presentations (in class)
- April 6, 2020: Code demonstration video
- April 21, 2020: Poster presentations (in class)
- May 1, 2020: Final implementation and final report
Overview
The research project is an in-depth activity that will be carried
out in teams of four. The project can be on any aspect of natural
language processing. You will formulate a research question, identify
resources and tools to address the question, implement and evaluate a
system that uses these resources and tools, demonstrate the system,
and write up a report.
- The research must involve a language other than English.
- The project must be a new effort, conducted specifically for
this class. You may build on and extend previous research, but this
project needs to add to that research, not just reuse old material.
- You must submit your code by putting it on a
repository (such as GitHub or
Bitbucket) and making the link
accessible to the instructional staff, at least for the duration
of the class (you may choose to make it public or private).
- Your code may use external, publicly
available tools and resources, but only ones that you are able to
provide when you submit your code.
Procedure
- Form teams of four students; you may use the Piazza forum to
team up based on shared interests such as language and topic. Each
team should have strengths in all of the following areas: theory,
data, coding, and writing.
- Once you have formed your team, make a private post on Piazza
with the names of the team members and a team name; we will create a
group on Piazza where you can discuss your project ideas with the
instructor and the course producers.
In the discussion, identify the language and problem you’re working
on, the data and tools you will use, and the effort you will put
in.
Also use this to discuss your choice of article for presentation.
You should aim to receive feedback before writing up
your project proposal.
- Submit your project proposal by the deadline.
- Present your research article in class at your assigned time.
- Implement and evaluate your project.
- Submit a demonstration video of your code,
and submit a link to your code repository.
- Present a poster of your work in class on the assigned date.
- Submit your final report by the deadline.
Proposal structure
The proposal describes your plan for the research project, and will
serve as the skeleton for the final report. As a plan it is subject to
change and does not represent a firm commitment, but it should show
that you’ve thought through the relevant aspects of your research.
The proposal should be a document of about
500 words, written in English in good academic
style. Proposals that substantially exceed this length (above
600 words) will be penalized. The structure of the document
should be as follows.
- Title for the project.
- Names, USC IDs, and USC emails of the team members.
- Introduction. Motivate a specific problem that you will consider,
which involves the use of real-world natural language data. Describe
the problem you are trying to solve, why it is interesting or
challenging, what existing work has been done, and how your
contribution relates to that.
- Method.
- Materials. Identify the source data that you use, such as a
specific corpus that you can get access to or collect yourself.
Describe the data in some detail, including the source, the amount
of data, and what kinds of annotation it has or needs.
- Procedure. Describe the methods you will use to process your
data, including algorithms, features, and tools, and any annotations
you will make (if needed).
- Evaluation. Describe how you will evaluate your system’s
performance, and your annotation procedure (if needed). What
measures will you use? What baseline system will you compare to?
- References cited.
- Division of labor between the teammates.
- Word count for the document, excluding references.
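As an illustration of the evaluation item above, here is a minimal sketch in Python of comparing a system against a baseline. It assumes a classification task scored by accuracy and a majority-class baseline; both are assumptions chosen for the example, and your project may call for different measures and a different baseline. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def majority_baseline(train_labels, n_items):
    """Predict the most frequent training label for every test item."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n_items


# Toy data, invented for illustration only.
train_labels = ["pos", "pos", "neg", "pos"]
gold = ["pos", "neg", "pos", "neg"]
system = ["pos", "neg", "neg", "neg"]

baseline = majority_baseline(train_labels, len(gold))
print("system accuracy:  ", accuracy(gold, system))
print("baseline accuracy:", accuracy(gold, baseline))
```

Reporting the baseline next to the system score shows how much of the performance comes from your method rather than from the label distribution.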
The proposal should be written after you have received some
feedback about the general direction of your project.
You will receive
written feedback about your proposal, which
should help you with writing the final report; however, feedback on
the proposal might take some time, so don’t delay
collecting your data and implementing your system while waiting for
comments on your proposal.
Article presentations
Each project team has selected an original research article to present to
the class, related to their research project.
Time slots for the presentation are as follows (note that class starts
at 3:30 PM as usual – I will use the first 10 minutes of
class for announcements; also, there will be a lecture following the
presentations).
March 24
- 3:40 PM Chef BERT: Jacob Devlin, Ming-Wei
Chang, Kenton Lee, and Kristina Toutanova (2019): BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. NAACL.
- 3:55 PM Jambo2020: David
M. Blei, Andrew Y. Ng, and Michael I. Jordan (2001): Latent
Dirichlet Allocation. NIPS 14.
- 4:10 PM Linja: Xiaoya
Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li
(2019): Is Word Segmentation Necessary for Deep Learning of Chinese
Representations? ACL.
- 4:25 PM Char-Aadmi: Rada Mihalcea and
Paul Tarau (2004): TextRank: Bringing Order into Text. EMNLP.
- 4:40 PM NLP-grams: Shashi Narayan, Shay
B. Cohen, and Mirella Lapata (2019): What is this Article about?
Extreme Summarization with Topic-aware Convolutional Neural
Networks. Journal of Artificial Intelligence Research 66: 243–278.
- 4:55 PM Team Phoenix: Ramesh Nallapati, Bowen
Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang (2016): Abstractive Text
Summarization Using Sequence-to-Sequence RNNs and Beyond. CoNLL.
- 5:10 PM Suits: Aishwarya
Padmakumar and Akanksha Saran: Unsupervised Text Summarization
Using Sentence Embeddings. University of Texas report.
March 31
- 3:40 PM Jors: Kaitao
Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu (2019): MASS: Masked
Sequence to Sequence Pre-training for Language Generation. ICML.
- 3:55 PM RASR: Sayan
Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and
Stefan Scherer (2017): Affect-LM: A Neural Language Model for
Customizable Affective Text Generation. ACL.
- 4:10 PM MindReese: Yunsu Kim, Petre
Petrov, Pavel Petrushkov, Shahram Khadivi, and Hermann Ney (2019):
Pivot-based Transfer Learning for Neural Machine Translation between
Non-English Languages. EMNLP.
- 4:25 PM Unnamed team: Guillaume Lample, Alexis
Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato (2018): Unsupervised
Machine Translation Using Monolingual Corpora Only. ICLR.
- 4:40 PM Jarvis: Jiasen
Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra (2017):
Best of Both Worlds: Transferring Knowledge from Discriminative
Learning to a Generative Visual Dialog Model. CVPR.
- 4:55 PM NLI: Sean Welleck,
Jason Weston, Arthur Szlam, and Kyunghyun Cho (2019): Dialogue Natural
Language Inference. ACL.
April 7
- 3:40 PM Luo times Chen Cube: Francesco
Barbieri, Jose Camacho-Collados, Francesco Ronzano, Luis
Espinosa-Anke, Miguel Ballesteros, Valerio Basile, Viviana Patti, and
Horacio Saggion (2018): SemEval 2018 Task 2: Multilingual Emoji
Prediction. SemEval.
- 3:55 PM Language Scientists: Md
Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya (2016): Aspect
based Sentiment Analysis in Hindi: Resource Creation and
Evaluation. LREC.
- 4:10 PM Lambda: Siyuan Zhao,
Yaqiong Zhang, Xiaolu Xiong, Anthony Botelho, and Neil Heffernan
(2017): A Memory-Augmented Neural Model for Automated Grading. ACM
Conference on Learning at Scale.
- 4:25 PM Oracle: Efthimios
Gianitsos, Thomas Bolt, Pramit Chaudhuri, and Joseph Dexter (2019):
Stylometric Classification of Ancient Greek Literary Texts by
Genre. Workshop on Computational Linguistics for Cultural Heritage,
Social Sciences, Humanities and Literature.
- 4:40 PM Music Analysis: Braja Gopal
Patra, Dipankar Das, and Sivaji Bandyopadhyay (2017): Retrieving Similar
Lyrics for Music Recommendation System. ICON.
- 4:55 PM Unnamed team: You Jin Kim, Yun
Gyung Cheong, and Jung Hoon Lee (2019): Prediction of a Movie’s
Success From Plot Summaries Using Deep Learning Models. Second
Workshop on Storytelling.
Presenting your article
- You have a slot of 15 minutes, including transition to the
next team. Plan on presenting for about 10 minutes, leaving
about 4 minutes for questions. Practice your presentation to
make sure it fits within the allotted time.
- You may use presentation slides. If you do, please
upload
your slides (in PDF format) the day before your presentation, so
that your slides can be presented from the instructor's computer if
necessary.
- The point of the presentation is to have a conversation about
the research described in the article. Any slides you prepare should
support this conversation; but the conversation should revolve around
the content of the research, not about the slides.
- Concentrate on presenting the main idea of the article.
- Present enough details and results to support the main idea,
but don’t get into such detail that the main idea gets lost.
- Assume knowledge of all the methods and techniques covered in
class, but not much beyond that; if the article uses a new method or
technique, explain it.
- Explain the linguistic side of what the article is trying to
accomplish; you will probably also need to explain a little about
the language that is the object of study (unless it is English).
- As of 2020-03-11, the university has announced that
all classes until April 14 will be held online; this includes
all dates scheduled for student presentations. We will use Zoom to
hold classes online, as directed by the university, including both
the presentations and the following lecture.
Code demonstration video
Submit a 5-minute video which explains the work, demonstrates how
the code works, and gives an idea of how you intend to proceed. Also
submit a link to the code repository.
The code demo is a progress check; you should have some working code to
show, but it is not expected to be a final version.
To submit the video, just put it somewhere on the web that is
accessible, and submit the URL as a note on Piazza. One good option is
to use Zoom with cloud recording, which should be available if you use
your USC Zoom account. With Zoom, all teammates can speak and share
screens on the same recording; just remember that it may take Zoom
several hours to process a recording, and you want to leave time to
review the final recording and retake if necessary.
Poster presentations
Create a poster presentation of your work, including preliminary
results. We will hold a poster session online, where you will
present your posters to your classmates and the instructor, and view
the posters of all of your classmates.
- Poster format:
- Landscape orientation
- 16:9 aspect ratio or thereabouts
- Even the smallest characters must be clearly legible when the
poster fills a standard HD computer monitor (1920x1080 pixels).
- Concentrate on the main point, such as your research results. It
is OK to present tentative results, or expected results –
whatever you have. If you are not sure what your results are,
think about the one thing you want the audience to remember from
your research, and make that your main point.
- Minimize the amount of text, use clear graphics, and use the
visual structure of your poster. More tips for poster design are in
the Zoom lecture from 2020-04-14 and the accompanying slide
presentation on Blackboard.
Procedure for the online poster session:
- The session will take place on a Discord server; the link to
access the server has been published on Piazza.
- The server has 20 text channels and 20 voice channels: one
“general” channel of each type, and one channel of
each type for each team.
- The channels work like the space around a physical poster:
everyone who wants to discuss team X’s poster will be on the
team X channel. People can move freely between channels, but each
person can be in only one voice channel at a time.
- Every team should post their poster to the respective text
channel by the beginning of class. You can attach a file by
clicking the “+” button at the left of the message
box.
- Each team should monitor their own channels to see if someone
wants to discuss their poster. You’re not expected to be at
your poster the whole time, but someone from the team should keep
an eye out for new messages, answer text questions within a
reasonable amount of time, and be ready to move to the voice
channel if someone wants to talk.
- The class will start at 3:30 as usual. We will not take a
scheduled break: instead we will finish 20 minutes early, at
6:30. Students may take breaks as they see fit, but please
coordinate with your teammates so that your poster/channel is
monitored by someone throughout the class.
- We will all start on the general voice channel at 3:30, and
then split to viewing and presenting posters shortly thereafter.
- I will visit all the posters, which gives me a little less
than 10 minutes for each poster, on average; some visits may be
longer and some shorter. Students should visit whichever posters
interest them.
- I understand that due to connectivity issues, time zone
differences, and other difficulties, some students may not be able
to attend the entire poster session. Please coordinate with your
teammates to make sure the poster is presented. If you find that
such issues prevent the presentation of the poster, please let me
know ahead of time.
Final report
The final report describes the research you have done, reporting on
the method and results, relating the research to other work in the
field, and offering conclusions and directions for future work. The
report should be about 2000 words long, not counting the
references; reports that
substantially exceed this length will be penalized. The structure is
similar to the proposal, but with more detail, and two additional
sections following the method section.
- Title for the project.
- Names, USC IDs, and USC emails of the team members.
- Introduction. Motivate the problem that you have worked on,
which involves the use of real-world natural language data.
There is no need to motivate Natural Language Processing in general,
but rather your specific application.
Describe the problem, why it is interesting or
challenging, what existing work has been done, and how your
contribution relates to that.
- Method.
- Materials. Identify the source data that you
use, such as a specific corpus that you accessed or collected
yourself.
Describe the data in some detail, including the source, the amount
of data, what kinds of annotation it has, and what annotations
needed to be added (if any).
- Procedure. Describe the experimental
procedure, that is, the methods you used to process the data.
This may include algorithms, features, and tools, and any
annotations you made. Well-known methods (such as Naive Bayes or
Conditional Random Fields) do not need to be explained, but you do
need to explain how you use them, for example the features you
choose. If you created a tool (such as an annotation or
visualization tool), describe it in some detail.
- Evaluation. Describe your method for
evaluating the system’s performance, and for evaluating your
annotation procedure (if any).
Include a description of the specific measures you took,
and the baseline to which you compare your system.
- Results. Report how your system performs, and how it compares to
the baseline or to other comparable work. Discuss what it gets right,
what it gets wrong, and why.
- Discussion. Discuss conclusions that can be drawn from the
research, implications of your findings, the overall contribution to
the general NLP community, and directions for future research.
- References cited. You may choose your preferred style for
in-text citations (for example, numerical or author-year) and for
the reference listing, but please keep it consistent across the
document. The reference listing should contain all the information
required for accessing the reference – author(s), year,
title, and publication information (such as the conference, journal,
volume etc.).
- Division of labor between the teammates.
- Word count for the document, excluding references.
The six main content sections (introduction, materials, procedure,
evaluation, results, and discussion) carry equal weight. Therefore,
they should be of similar lengths – this means reserving
about 300–350 words for each section. This is only a general
guideline, as you may find that some sections require more text than
others. However, if you find you have more to say than fits within the
length requirement, then you’ll need to concentrate on the more
important aspects of your project.
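To check your draft against these lengths, a short script can help. The following is a minimal sketch in Python, assuming the report is saved as plain text with each section heading alone on its own line. The function names, the heading list, and the sample text are hypothetical, and a whitespace-based count may differ slightly from the count your word processor reports.

```python
def section_word_counts(report_text, headings):
    """Count whitespace-separated words under each heading.

    Assumes every heading in `headings` appears alone on its own line;
    text before the first heading (title, names) is ignored.
    """
    counts = {}
    current = None
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            current = stripped
            counts.setdefault(current, 0)
        elif current is not None:
            counts[current] += len(stripped.split())
    return counts


def total_excluding_references(counts, excluded=("References",)):
    """Sum the counts of all sections except the excluded ones."""
    return sum(n for name, n in counts.items() if name not in excluded)


HEADINGS = ["Introduction", "Materials", "Procedure", "Evaluation",
            "Results", "Discussion", "References"]

# Tiny invented sample; replace with the text of your own report.
SAMPLE = (
    "Introduction\n"
    "We study Hindi tagging.\n"
    "Materials\n"
    "A small corpus.\n"
    "References\n"
    "Some citation goes here.\n"
)

counts = section_word_counts(SAMPLE, HEADINGS)
print(counts)
print("total without references:", total_excluding_references(counts))
print("rough per-section target:", round(2000 / 6))
```

Dividing the 2000-word target evenly over the six content sections gives roughly 333 words each, which is where the 300–350 guideline comes from.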
When giving examples of text in languages other than English,
please use the following multi-line format, to make the examples
readable to English speakers. Below is an example for how to present a
sentence in Hindi.
किस | ने | दवाई | को | खरीदा | (the original text in its native script)
kis | ne | davaaii | ko | khariidaa | (a transcription into Latin script)
who | ERG | medicine | ACC | bought | (a word-by-word gloss)
‘Who bought the medicine?’ | (a translation into English)
The explanations on the right (in parentheses) are part of the
instructions: they do not need to be repeated with the example.
The second line (transcription into Latin script) is not needed if the
language natively uses a version of the Latin script.
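If you have many examples, you may want to generate this format from your data rather than type it by hand. The following is a minimal sketch in Python; the function `format_gloss` is hypothetical, and it assumes each tier is already split into a list with one token per word of the original sentence. It joins tokens with " | " as in the example above and omits the transcription line when none is given.

```python
def format_gloss(original, gloss, translation, transcription=None):
    """Build the multi-line example format from token lists.

    `original`, `gloss`, and the optional `transcription` are lists of
    tokens; `translation` is a plain English string.
    """
    tiers = [original]
    if transcription is not None:
        tiers.append(transcription)
    tiers.append(gloss)
    # Every tier must line up word by word with the original sentence.
    if any(len(tier) != len(original) for tier in tiers):
        raise ValueError("every tier needs one token per word of the original")
    lines = [" | ".join(tier) for tier in tiers]
    lines.append(f"‘{translation}’")
    return "\n".join(lines)


example = format_gloss(
    original=["किस", "ने", "दवाई", "को", "खरीदा"],
    transcription=["kis", "ne", "davaaii", "ko", "khariidaa"],
    gloss=["who", "ERG", "medicine", "ACC", "bought"],
    translation="Who bought the medicine?",
)
print(example)
```

The length check catches the most common mistake in glossed examples, a gloss line that has drifted out of step with the original words.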
Grading
The grade for the assignment will be broken down as follows.
- 10% Discussion of the project with the instructor.
- 10% On-time submission of a coherent proposal.
- 10% Difficulty, creativity and originality.
- 10% Research article presentation.
- 10% Code demonstration and code quality.
- 10% Poster presentation.
- 40% Final report.
The research project counts for 30% of the overall course grade.