Assignment 3 for CS 6120 (Natural Language Processing)

1. Vector Space Model [40 points]
Your goal in this part of the assignment is to understand how vector space models work. We will
provide you a corpus of Shakespeare plays and you will use it to create a term-document matrix
and a term-context matrix. You will use the term-document matrix to find Shakespeare plays
that are similar to each other and the term-context matrix to find words that are most similar to
some sample words.
You should download the following artifacts:
This is a CSV file in which each line has 6 columns. I suggest that you use a “CSV reader”
to parse this file. The only columns of interest for this assignment are:
(a) Column 2: which contains the “document name”
(b) Column 6: which contains the “one line in your document”
For example, here is one line from the CSV file:
The only columns from this example line that you need to use for this assignment are:
(a) Column 2: Henry IV
(b) Column 6: And breathe short-winded accents of new broils
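The column layout above can be read with Python's standard `csv` module. The sketch below is a minimal example under assumptions: the delimiter is taken to be `;` and the contents of the four unused columns are made up for illustration; check the actual file before relying on either.

```python
import csv
from io import StringIO

def read_corpus(fileobj, delimiter=';'):
    # Keep only Column 2 (play name) and Column 6 (one line of text),
    # lowercasing the text as the assignment requires.
    rows = []
    for row in csv.reader(fileobj, delimiter=delimiter):
        play, line = row[1], row[5].lower()   # Columns 2 and 6, 0-indexed
        rows.append((play, line))
    return rows

# Illustrative sample line; the middle columns here are placeholders.
sample = StringIO('1;Henry IV;HENRY IV;1.1.1;KING;And breathe short-winded accents of new broils\n')
rows = read_corpus(sample)
```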
1.1 Construct Term-Document Matrix [10 points]
Your goal in this part of the assignment is to build a term document matrix from your corpus.
Section 6.3 in the reading material from Lecture 5 describes how to do this.
1. Read the CSV file and process it so that you only keep Column 2 and Column 6 for
further processing. Note that it is important for you to convert all the words in Column 6
to lower case before further processing.
2. Read the “list of plays” file to get the document names.
3. Read the “vocab file” to get all words in your vocab.
4. Construct the term-document matrix as described in Section 6.3 by treating Column 2 in
your corpus as “document name” and Column 6 as “one line in your document”. In the
term-document matrix, the columns represent “documents” and rows represent
“words”. So, the number of columns in your matrix should be the same as the number of
plays in the “list of plays” file and number of rows should be the same as the number of
words in your “vocab file”.
5. Once you have the term-document matrix, print any 5 words with non-zero frequency
from each play along with their frequency. The output should be in a file called
“term_doc_sample.txt” in the following format:
name_of_play, word, frequency

name_of_play, word, frequency
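The steps above can be sketched as follows. This is one possible in-memory representation (a dict of Counters rather than a literal 2-D array); `corpus`, `vocab`, and `plays` are assumed to come from steps 1–3, and whitespace `split()` is a simplifying tokenization assumption.

```python
from collections import Counter

def term_document_matrix(corpus, vocab, plays):
    # corpus: list of (play_name, lowercased_line) pairs from the CSV.
    # Conceptually, plays are the columns and vocab words are the rows;
    # counts[play][word] holds the frequency of `word` in `play`.
    counts = {play: Counter() for play in plays}
    for play, line in corpus:
        for word in line.split():          # assumption: whitespace tokenization
            if word in vocab:
                counts[play][word] += 1
    return counts

corpus = [('Henry IV', 'and breathe short-winded accents of new broils')]
matrix = term_document_matrix(corpus, {'breathe', 'accents', 'broils'}, ['Henry IV'])
```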
1.2 Compute Document Similarity [10 points]
Your goal in this part of the assignment is to compute document similarity based on cosine
similarity. Section 6.4 in the reading material from Lecture 5 describes how to compute cosine
similarity.
1. Treat each column in your term-document matrix from Section 1.1 of this assignment as
a vector representing a play, i.e., document whose dimensionality is the same as the size
of your “vocab”.
2. Implement a function that takes as input two vectors and returns the cosine similarity
between them. Section 6.4 in the reading material from Lecture 5 describes how to
compute cosine similarity.
3. For each play in the corpus find its cosine similarity with all other plays.
4. For each play in the corpus, print the name of the play which has the highest cosine
similarity. Print the output in a file called “doc_sim.txt” in the following format:
name_of_play, name_of_most_similar_play, cosine_similarity
name_of_play, name_of_most_similar_play, cosine_similarity

name_of_play, name_of_most_similar_play, cosine_similarity
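The cosine similarity function from step 2 can be written directly from the textbook definition: the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch, with zero vectors mapped to 0.0 by convention:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                     # convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)
```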
1.3 Measuring Word Similarity using Term-Context Matrix [10 points]
Your goal in this part of the assignment is to find 5 most similar words to each of the words
“romeo”, “juliet”, “nobleman”, “caesar”, and “friend”. To do this you will have to build a
term-context matrix. Section 6.3.2 in the reading material from Lecture 5 describes how to build
a term-context matrix.
1. Read the CSV file and process it so that you only keep Column 2 and Column 6 for
further processing. Note that it is important for you to convert all the words in Column 6
to lower case before further processing.
2. Read the “list of plays” file to get the document names. This step is optional for this part
of the assignment.
3. Read the “vocab file” to get all words in your vocab.
4. Construct the term-context matrix as described in Section 6.3.2 by treating Column 6 in
your corpus as “one line in your document”.
a. For each word, look at 4 words to its left (when available) and 4 words to its right
(when available) and treat them as context.
b. Each row in your term-context matrix represents a word vector for the
corresponding word and each column corresponds to context word.
c. Note that in your term-context matrix the number of rows and columns should
both be the same as the number of words in your “vocab file”.
5. Use the cosine similarity function that you implemented in Section 1.2 of this
assignment to compute cosine similarity between the vectors for the words “romeo”,
“juliet”, “nobleman”, “caesar”, and “friend” with vectors for all other words in your
vocab. Print the top-5 most similar words to the words “romeo”, “juliet”, “nobleman”,
“caesar”, and “friend” in a file called “term_context_sim.txt” in the following format:
target_word, similar_word, cosine_similarity
target_word, similar_word, cosine_similarity

target_word, similar_word, cosine_similarity
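The windowed counting in step 4 can be sketched as below. This version treats each line as its own context (so windows do not cross line boundaries) and uses a dict of Counters for the matrix, both simplifying assumptions; the ±4 window matches step 4a.

```python
from collections import defaultdict, Counter

def term_context_matrix(lines, vocab, window=4):
    # counts[w][c] = number of times context word c occurs within `window`
    # words to the left or right of w, within a single line.
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            if w not in vocab:
                continue
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] in vocab:
                    counts[w][words[j]] += 1
    return counts

vocab = {'romeo', 'loves', 'juliet'}
tc = term_context_matrix(['romeo loves juliet'], vocab)
```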
1.4 TF-IDF in the Term-Context Matrix [10 points]
Your goal in this part of the assignment is to find 5 most similar words to each of the words
“romeo”, “juliet”, “nobleman”, “caesar”, and “friend” using TF-IDF weighting. To do this you will
use the term-context matrix that you built in Section 1.3 of this assignment and add TF-IDF
weighting to it. Section 6.5 in the reading material from Lecture 5 describes TF-IDF. However,
for this assignment, we are going to use a simple form of TF-IDF as described below and not the
advanced form that they have in the book.
1. Read the CSV file and process it so that you only keep Column 2 and Column 6 for
further processing. Note that it is important for you to convert all the words in Column 6
to lower case before further processing.
2. Read the “list of plays” file to get the document names.
3. Read the “vocab file” to get all words in your vocab.
4. For each word in your vocab, count the number of documents in your corpus that this
word occurs in by treating Column 6 in your corpus as “one line in your document”. This
is your document frequency, i.e., DF for each word in your vocabulary. Using this, you
can compute IDF simply by this formula:
IDF = 1 / DF
5. Use the term-context matrix that you built in Section 1.3 of this assignment as the
starting point. Remember that each column in your term-context matrix represents a
context word. For each cell in your matrix, multiply the existing term frequency from
Section 1.3 with the IDF of the context word that you computed in Step 4. Note that it is
important that you multiply by the IDF of the context word, i.e., of the word representing
the column.
6. Use the cosine similarity function that you implemented in Section 1.2 of this
assignment to compute cosine similarity between the vectors for the words “romeo”,
“juliet”, “nobleman”, “caesar”, and “friend” and the vectors for all other words. Print the
top-5 most similar words to the words “romeo”, “juliet”, “nobleman”, “caesar”, and
“friend” in a file called “tf_idf_sim.txt” in the following format:
target_word, similar_word, cosine_similarity
target_word, similar_word, cosine_similarity

target_word, similar_word, cosine_similarity
2. Neural Sentiment Classification [35 points]
Your goal for this part of the assignment is to build a binary sentiment classifier using a
feed-forward neural network. The training data consists of 1133 IMDB movie reviews, where
each review is rated either as positive or negative. These reviews have been separated into
folders for “pos” for positive reviews and “neg” for negative reviews. You can download this
training data from here:
2.1 Train a feed-forward Neural Network [10 points]
Your goal for this part of the assignment is to train a feed-forward neural network with 1 hidden
layer for sentiment classification.
1. Treat only word unigrams as features for your neural network classifier.
2. Build a neural network with an input layer, one hidden layer, and an output layer.
3. Train your classifier using the training data.
4. Use 10-fold cross validation to optimize parameters using “accuracy” as the metric. For
example, you can choose the activation function and choose the number of nodes in
your hidden layer.
5. Describe your parameter optimization process and report the parameters for your best
model.
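To make the architecture in step 2 concrete, here is a pure-Python sketch of the forward pass of a 1-hidden-layer network over a bag-of-unigrams feature vector. In practice you would likely use a library (for example scikit-learn's MLPClassifier, which also supports the cross-validation in step 4); the sigmoid activation and all weights below are illustrative assumptions, not prescribed by the assignment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # input layer x -> hidden layer (one sigmoid unit per row of W1)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # hidden layer -> single sigmoid output: P(review is positive)
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

# With all-zero weights every input maps to probability 0.5.
p = forward([1.0, 0.0], [[0.0, 0.0]], [0.0], [0.0], 0.0)
```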
2.2 Test your feed-forward Neural Network [5 points]
Your goal for this part of the assignment is to test your neural network on the “training set”.
1. Use the parameters from best performing model in Section 2.1 of this assignment and
train the neural network on your whole training corpus.
2. Report your accuracy on the entire training set.
2.3 Feed-forward Neural Network with More Hidden Layers [10 points]
Your goal for this part of the assignment is to train a feed-forward neural network with 2 hidden
layers for sentiment classification and report its accuracy on the training set.
1. Treat only word unigrams as features for your neural network classifier.
2. Build a neural network with an input layer, 2 hidden layers, and an output layer. Fix the
number of nodes in your second hidden layer at 10.
3. Train your classifier using the training data.
4. Use 10-fold cross validation to optimize parameters using “accuracy” as the metric. For
example, you can choose the activation function and choose the number of nodes in
your first hidden layer.
5. Use the parameters from best performing model and train this neural network on your
whole training corpus.
6. Report the parameters for your best model.
7. Report your accuracy on the entire training set.
2.4 Neural Network classification on the Test Set [10 points]
Your goal for this part of the assignment is to run your best performing neural network classifier
– depending on results from Sections 2.2 and 2.3 – on the test set.
2. Classify each review in the test set as either positive or negative using your best
performing classifier – either from Section 2.2 or from Section 2.3.
3. Create a file “pos.txt” which contains the “file names” of all test set reviews that your
classifier classified as positive. Similarly, create a file “neg.txt” which contains the “file
names” of all test set reviews that your classifier classified as negative. Note that you
should keep the file names in the test set unchanged.
4. Make sure that the files “pos.txt” and “neg.txt” have one file name per line, i.e., the
format of the files should be the following:
FILE_NAME
FILE_NAME
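Steps 3 and 4 can be sketched as below. The `predictions` mapping and the label strings `'pos'`/`'neg'` are illustrative assumptions about how your classifier reports its output; the file names themselves must be the unchanged test-set names.

```python
def partition_reviews(predictions):
    # predictions: {file_name: 'pos' or 'neg'} (an assumed representation)
    pos = sorted(f for f, label in predictions.items() if label == 'pos')
    neg = sorted(f for f, label in predictions.items() if label == 'neg')
    return pos, neg

def write_names(path, names):
    # One file name per line, as required for pos.txt and neg.txt.
    with open(path, 'w') as f:
        f.write('\n'.join(names) + '\n')

pos_names, neg_names = partition_reviews({'cv001.txt': 'pos', 'cv002.txt': 'neg'})
# write_names('pos.txt', pos_names); write_names('neg.txt', neg_names)
```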