NLP Data Modelling Assignment 4


1. Natural Language Processing (Logistic Regression & Naïve Bayes)

a. The file “assignment_4.txt” contains text data with labels indicating sentiment.

The end of each line contains an “@” followed by its label (positive, neutral, and
negative). Load the file line by line to create a dataframe with two columns “text”
and “label”. This code snippet should help you get started:

    with open("assignment_4.txt", "r") as fin:
        for line in fin:
            text, label = line.strip().rsplit("@", 1)

What is the distribution of labels in the full data?
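One way to turn the snippet above into the requested dataframe and label counts is sketched below. It assumes pandas is installed; the helper name `load_labeled_lines` and the in-memory sample lines are illustrative and not part of the assignment data.

```python
import pandas as pd


def load_labeled_lines(lines):
    """Build a two-column dataframe from an iterable of 'text@label' lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the LAST "@" so that "@" characters inside the text survive.
        text, label = line.rsplit("@", 1)
        rows.append({"text": text, "label": label})
    return pd.DataFrame(rows, columns=["text", "label"])


# Illustrative sample; for the assignment, pass the open file handle instead:
#   with open("assignment_4.txt", "r") as fin:
#       df = load_labeled_lines(fin)
sample = [
    "Profits rose sharply this quarter@positive\n",
    "Contact sales@example.com for details@neutral\n",
    "The company issued a loss warning@negative\n",
    "Revenue was unchanged@neutral\n",
]
df = load_labeled_lines(sample)

# Absolute and relative label frequencies.
counts = df["label"].value_counts()
shares = df["label"].value_counts(normalize=True)
print(counts)
print(shares.round(3))
```

Reporting both the counts and the shares makes it easy to check later that the stratified split keeps roughly the same proportions.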

b. Next, split the data randomly into a training (70%) and a test set (30%). Use
stratified sampling to roughly retain the original distribution of labels for both
training and test data.
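A stratified 70/30 split can be sketched with `train_test_split` from sklearn, as shown below. The toy dataframe and the `random_state` value are assumptions made for illustration; in the assignment, `df` would be the dataframe built in part (a).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the real data (hypothetical content).
df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(100)],
    "label": ["neutral"] * 60 + ["positive"] * 30 + ["negative"] * 10,
})

# stratify=df["label"] keeps the label proportions roughly equal in both parts.
train_df, test_df = train_test_split(
    df,
    test_size=0.30,
    stratify=df["label"],
    random_state=42,  # arbitrary seed, fixed only for reproducibility
)

print(train_df["label"].value_counts(normalize=True))
print(test_df["label"].value_counts(normalize=True))
```

Comparing the two printed distributions with the one from the full data is a quick check that stratification worked.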

c. Using sklearn, create a binary CountVectorizer() and a
TfidfVectorizer(). Use the original single words (unigrams) as well as bigrams.
Also, use the “english” stop word list. Fit these to the training data to extract a
vocabulary and then transform both the train and test data (keep a copy of the
original train and test data for later).
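The vectorizer setup described above might look like the following sketch. The example sentences are made up, and the variable names are illustrative; the parameters `binary`, `ngram_range`, and `stop_words` are the sklearn options that correspond to the requirements in the text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-ins for the original train and test texts (kept unchanged).
train_texts = [
    "Profits rose and profits rose again",
    "The outlook is weak",
    "Sales were flat this year",
]
test_texts = ["Profits were weak", "Unseen words vanish"]

# Unigrams and bigrams, English stop words removed.
shared = dict(ngram_range=(1, 2), stop_words="english")
binary_vec = CountVectorizer(binary=True, **shared)  # 0/1 presence features
tfidf_vec = TfidfVectorizer(**shared)

# Fit on the training data only, then reuse the learned vocabulary for test.
X_train_bin = binary_vec.fit_transform(train_texts)
X_test_bin = binary_vec.transform(test_texts)
X_train_tfidf = tfidf_vec.fit_transform(train_texts)
X_test_tfidf = tfidf_vec.transform(test_texts)

print(sorted(binary_vec.vocabulary_))
print(X_train_bin.shape, X_test_bin.shape)
```

Fitting only on the training texts means that words seen only in the test data are ignored, which avoids leaking test information into the vocabulary.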

d. Create LogisticRegression() and BernoulliNB() models. For all
settings, keep the default values. In a single plot, show the ROC curve for both
classifiers and both the binary and tf-idf feature sets. In the legend, include the
area under the ROC curve (AUC). Do not forget to label your axes. Your final plot
will be a single window with 4 curves.

Which model do you think does a better job?
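The four-curve plot could be produced along the lines of the sketch below. Since the data has three labels and a ROC curve is defined for binary decisions, the sketch makes an assumption the assignment does not state: it micro-averages the one-vs-rest curves. The tiny corpus is invented for illustration, and the output file name is arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import label_binarize

# Invented toy corpus; replace with the real train/test split.
train_texts = [
    "great profit growth", "strong sales gain", "profit growth strong",
    "meeting held monday", "report published today", "meeting report today",
    "heavy loss reported", "sales decline sharply", "loss decline heavy",
]
train_labels = ["positive"] * 3 + ["neutral"] * 3 + ["negative"] * 3
test_texts = ["strong profit gain", "report held today", "heavy sales loss"]
test_labels = ["positive", "neutral", "negative"]

shared = dict(ngram_range=(1, 2), stop_words="english")
vectorizers = {
    "binary": CountVectorizer(binary=True, **shared),
    "tf-idf": TfidfVectorizer(**shared),
}
model_classes = {"LogisticRegression": LogisticRegression, "BernoulliNB": BernoulliNB}

fig, ax = plt.subplots()
aucs = {}
for feat_name, vec in vectorizers.items():
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    for model_name, model_cls in model_classes.items():
        clf = model_cls().fit(X_train, train_labels)  # default settings
        scores = clf.predict_proba(X_test)  # columns follow clf.classes_
        y_true = label_binarize(test_labels, classes=clf.classes_)
        # Micro-average: treat every (sample, class) pair as one binary decision.
        fpr, tpr, _ = roc_curve(y_true.ravel(), scores.ravel())
        aucs[(model_name, feat_name)] = auc(fpr, tpr)
        ax.plot(fpr, tpr, label=f"{model_name}, {feat_name} (AUC = {auc(fpr, tpr):.2f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("Micro-averaged one-vs-rest ROC curves")
ax.legend(loc="lower right")
fig.savefig("roc_curves.png")
```

One detail to keep in mind when comparing the curves: with its default `binarize=0.0`, BernoulliNB thresholds its input at zero, so it sees the same 0/1 pattern from the tf-idf features as from the binary ones.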

2. Natural Language Processing (BERT)

a. Use the original training and test data of the previous section (before any
vectorization).

b. Install the transformers library from huggingface:
pip install transformers[torch]

c. Download a pre-trained BERT for Sequence Classification model. You can use
bert-base-uncased for both the tokenizer and the model.
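Loading the tokenizer and model might look like the sketch below. Using the `AutoTokenizer` and `AutoModelForSequenceClassification` classes is an assumption (the assignment only names a BERT sequence classification model), and the label order and the helper name `load_bert` are illustrative choices.

```python
# Label order is an arbitrary choice; it only has to be used consistently.
LABELS = ["negative", "neutral", "positive"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def load_bert(checkpoint="bert-base-uncased"):
    """Download (or load from cache) the tokenizer and a 3-class classifier."""
    # Imported here so the label mapping above is usable without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),  # new, randomly initialised classification head
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Usage (needs network access on the first call):
#   tokenizer, model = load_bert()
#   batch = tokenizer(["Profits rose sharply"], padding=True,
#                     truncation=True, return_tensors="pt")
#   logits = model(**batch).logits  # one row per text, one column per label
```

The string labels from the dataframe have to be mapped to integer ids with `label2id` before training, since the model expects integer class targets.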

d. Train the model using the original training data from above. (Hugging Face
provides a tutorial on this, and we will also provide a tutorial session on
training in PyTorch.)

Note that you will need an evaluation subset from the training data. Don’t use the
test data for this. Train for at least 5 epochs and monitor whether the training and
evaluation losses decrease. Increase the number of epochs as needed (if the loss
flattens out, you can stop training; if training takes long because you are
training on a CPU, you can stop early).
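The stopping rule described above (at least 5 epochs, then stop once the loss flattens out) can be made concrete with a small helper. This is only one possible reading: the function name `should_stop` and the `patience` and `min_delta` thresholds are hypothetical choices, not values given in the assignment.

```python
def should_stop(eval_losses, min_epochs=5, patience=2, min_delta=1e-3):
    """Decide whether training can stop, given one evaluation loss per epoch.

    Training continues for at least `min_epochs` epochs. After that, it stops
    when the best loss of the last `patience` epochs is not at least
    `min_delta` lower than the best loss seen before them.
    """
    if len(eval_losses) < max(min_epochs, patience + 1):
        return False
    best_before = min(eval_losses[:-patience])
    recent_best = min(eval_losses[-patience:])
    return recent_best > best_before - min_delta


# Still improving after 5 epochs: keep training.
print(should_stop([0.9, 0.7, 0.6, 0.5, 0.45]))
# Flat over the last two epochs: stop.
print(should_stop([0.9, 0.6, 0.5, 0.5, 0.5005]))
```

The helper is meant to be called once per epoch with the evaluation losses recorded so far, computed on the evaluation subset and never on the test data.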

e. Plot the training and evaluation loss, accuracy, and AUC over time (each pair of
training and evaluation in their own plots). Observe how your training converges.
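A plotting layout with one panel per metric is sketched below. The numbers in `history` are placeholders that only make the example runnable; they are not results. How the per-epoch values are collected depends on the training loop, so the dictionary structure is an assumption.

```python
import matplotlib.pyplot as plt

# Placeholder values, NOT real results: one entry per epoch.
history = {
    "loss": {"train": [1.0, 0.8, 0.6, 0.5, 0.4], "eval": [1.0, 0.9, 0.7, 0.7, 0.7]},
    "accuracy": {"train": [0.5, 0.6, 0.7, 0.8, 0.8], "eval": [0.5, 0.6, 0.6, 0.7, 0.7]},
    "AUC": {"train": [0.6, 0.7, 0.8, 0.9, 0.9], "eval": [0.6, 0.7, 0.7, 0.8, 0.8]},
}

# One panel per metric, each showing the training and evaluation curve.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (metric, curves) in zip(axes, history.items()):
    epochs = range(1, len(curves["train"]) + 1)
    ax.plot(epochs, curves["train"], marker="o", label=f"training {metric}")
    ax.plot(epochs, curves["eval"], marker="o", label=f"evaluation {metric}")
    ax.set_xlabel("Epoch")
    ax.set_ylabel(metric)
    ax.set_title(f"Training vs. evaluation {metric}")
    ax.legend()
fig.tight_layout()
fig.savefig("training_curves.png")
```

A growing gap between the training and evaluation curves in the loss panel is the usual sign that further epochs mostly fit the training data.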

f. Compute the ROC curve for the final model on the test data including its AUC.
Compare it to the ROC curves of the previous section by plotting the ROC curves
of the best model of the previous section and the current model.

g. Create a confusion matrix on the test data for the BERT model.
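The confusion matrix can be computed with sklearn once the model outputs have been converted to label strings. In the sketch below the true and predicted labels are invented; for the BERT model, the predictions would typically come from the argmax of the logits mapped back to label names.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]

# Invented labels for illustration only.
y_true = ["negative", "negative", "neutral", "neutral", "neutral",
          "positive", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive",
          "positive", "positive", "negative"]

# Rows are true labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)
print(cm_df)
```

Passing `labels` explicitly fixes the row and column order, which keeps the matrix readable even if one class never appears in the predictions.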

3. Model Explanations

a. Install the shapley Python package.

b. Compute Shapley explanations for your model from the previous section. For this,
pick three example inputs from each cell in the confusion matrix (skip if a cell has
0 entries; you will have to compute the Shapley explanations for at most 27
examples).
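Selecting up to three inputs per confusion-matrix cell can be done before any explanation is computed. The sketch below assumes a dataframe with the columns `text`, `true`, and `pred`; these column names, the helper `sample_per_cell`, and the toy rows are hypothetical.

```python
import pandas as pd


def sample_per_cell(results, n=3, seed=0):
    """Pick up to `n` rows for every (true, pred) combination that occurs."""
    parts = []
    # Cells with 0 entries never show up as a group, so they are skipped.
    for _, cell in results.groupby(["true", "pred"]):
        parts.append(cell.sample(n=min(n, len(cell)), random_state=seed))
    return pd.concat(parts)


# Toy predictions: 5 rows in one cell, 2 in another, 1 in a third.
results = pd.DataFrame({
    "text": [f"example {i}" for i in range(8)],
    "true": ["positive"] * 5 + ["neutral"] * 2 + ["negative"],
    "pred": ["positive"] * 5 + ["positive"] * 2 + ["neutral"],
})

picked = sample_per_cell(results)
print(picked.sort_values(["true", "pred"]))
```

With three labels there are nine cells, so the selection contains at most 27 rows; the `text` column of `picked` is what would then be passed to the explainer.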

c. Do the explanations match your intuition of which part of a sentence contributes
to the final output? Looking at examples where the model failed to predict
correctly, try to use the explanations to formulate why the model failed.

d. Create Shapley explanations, as in subsection (b), for the Naïve Bayes model
from above. Note that you have to use a different approach for computing the
explanations when using tabular data as input. Do you notice a difference in how
the two models (NB vs. BERT) utilize their inputs? Can you use the explanations
to infer a likely reason why one of the models performs better than the other?