Python Assignment | CSC 485H/2501H: Computational linguistics, Fall 2020

This Python assignment is about word sense disambiguation: resolving ambiguous words given the sentences in which they appear.

CSC 485H/2501H: Computational linguistics, Fall 2020

0. Warming up with WordNet and NLTK
WordNet is a lexical database; like a dictionary, WordNet provides definitions and example
usages for different senses of word lemmas. But WordNet does the job of a thesaurus as well: it
provides synonyms for the senses, grouping synonymous senses together into a set called a synset.
But wait; there’s more! WordNet also provides information about semantic relationships beyond synonymy, such as antonymy, hyperonymy/hyponymy, and meronymy/holonymy. Throughout this assignment, you will be making use of WordNet via the NLTK package, so the first step
is to get acquainted with doing so. Consult sections 4.1 and 5 of chapter 2 as well as section
3.1 of chapter 3 of the NLTK book for an introduction along with examples that you will likely
find useful for this assignment. You may also find section 3.6 useful for its discussion of
lemmatization, although you will not be doing any lemmatization for this assignment.
(a) A root hyperonym is a synset with no hyperonyms. A synset s is said to have depth d if there
are d hyperonym links between s and a root hyperonym. Keep in mind that, because synsets
can have multiple hyperonyms, they can have multiple paths to root hyperonyms.
Implement the deepest function in q0.py that finds the synset in WordNet with the
largest maximum depth and report both the synset and its depth on each of its paths to a
root hyperonym. (Hint: you may find the wn.all_synsets and synset.max_depth
methods helpful.)
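For reference, here is a minimal sketch of one way deepest might be implemented using the hinted NLTK calls; the return format (the synset plus its depth on each path) is an assumption of this sketch, not part of the specification.

    from nltk.corpus import wordnet as wn

    def deepest():
        # Scan every synset and keep the one whose longest hyperonym path is deepest.
        best = max(wn.all_synsets(), key=lambda s: s.max_depth())
        # Each hypernym_paths() entry runs from a root hyperonym down to the synset
        # itself, so the depth along that path is its length minus one.
        depths = [len(path) - 1 for path in best.hypernym_paths()]
        return best, depths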
(b) Implement the superdefn function in q0.py that takes a synset s and returns a list consisting of all of the tokens in the definitions of s, its hyperonyms, and its hyponyms. Use
word_tokenize as shown in chapter 3 of the NLTK book.
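A possible sketch of superdefn, gathering only the definitions as the handout describes; the exact return order is an assumption.

    from nltk import word_tokenize

    def superdefn(s):
        # Tokens from the definitions of s itself, its hyperonyms, and its hyponyms.
        tokens = []
        for synset in [s] + s.hypernyms() + s.hyponyms():
            tokens.extend(word_tokenize(synset.definition()))
        return tokens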
(c) NLTK’s word_tokenize only tokenizes text; it doesn’t filter out any of the tokens.
You will be calculating overlaps between sets of strings, so it will be important to remove
stop words and any tokens that consist entirely of punctuation symbols.
Implement the stop_tokenize function in q0.py that takes a string, tokenizes it using
word_tokenize, removes any tokens that occur in NLTK’s list of English stop words
(which has already been imported for you), and also removes any tokens that consist entirely
of punctuation characters. For a list of punctuation symbols, use Python’s punctuation
characters from the string module (this has also already been imported for you). Keep
in mind that NLTK’s list contains only lower-case tokens, but the input string to stop_tokenize
may contain upper-case symbols. Maintain the original case in what you return.
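One hedged sketch of stop_tokenize is shown below; in the starter code the stop-word list and punctuation characters are already imported, so the explicit imports here are only for self-containment.

    from string import punctuation
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    STOPS = set(stopwords.words('english'))

    def stop_tokenize(text):
        # Tokenize, then drop stop words (compared in lower case) and tokens made
        # up entirely of punctuation; surviving tokens keep their original case.
        return [tok for tok in word_tokenize(text)
                if tok.lower() not in STOPS
                and not all(ch in punctuation for ch in tok)]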
1. The Lesk algorithm & word2vec
Recall the problem of word sense disambiguation (WSD): given a semantically ambiguous
word in context, determine the correct sense. A simple but surprisingly hard-to-beat baseline
method for WSD is Most Frequent Sense (MFS): just select the most frequent sense for each
ambiguous word, where sense frequencies are provided by some corpus.
(a) Implement the mfs function that returns the most frequent sense for a given word in a sentence. Note that wordnet.synsets() orders its synsets by decreasing frequency.
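A minimal sketch of mfs, assuming the sentence and the target word are passed as separate arguments (the exact signature is an assumption about the starter code):

    from nltk.corpus import wordnet as wn

    def mfs(sentence, word):
        # wn.synsets() lists senses in decreasing frequency order, so the first
        # synset is the most frequent sense; the sentence is unused here.
        return wn.synsets(word)[0]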
As discussed in class, the Lesk algorithm is a venerable method for WSD. The Lesk algorithm
variant that we will be using for this assignment selects the sense with the largest number of
words in common with the ambiguous word’s sentence. This version is called the simplified Lesk
algorithm.
Algorithm 1: The simplified Lesk algorithm.
input : a word to disambiguate and the sentence in which it appears
best_sense ← most_frequent_sense(word)
best_score ← 0
context ← the set of word tokens in sentence
for each sense of word do
    signature ← the set of word tokens in the definition and examples of sense
    score ← Overlap(signature, context)
    if score > best_score then
        best_sense ← sense
        best_score ← score
    end
end
return best_sense
(b) In the lesk function, implement the simplified Lesk algorithm as specified in Algorithm 1,
including Overlap. Overlap(signature, context) calculates the number of words
that signature and context have in common, i.e., the cardinality of the intersection of
the two sets. Use your stop_tokenize function to tokenize the examples and definitions.
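A sketch of the simplified Lesk algorithm built on the mfs and stop_tokenize sketches above; it assumes the sentence is given as a list of word tokens, which is an assumption about the starter code rather than something stated in the handout.

    from nltk.corpus import wordnet as wn

    def lesk(sentence, word):
        # Simplified Lesk: pick the sense whose signature (definition + examples)
        # shares the most tokens with the sentence context.
        context = set(stop_tokenize(' '.join(sentence)))
        best_sense, best_score = mfs(sentence, word), 0
        for sense in wn.synsets(word):
            signature = set(stop_tokenize(sense.definition()))
            for example in sense.examples():
                signature |= set(stop_tokenize(example))
            score = len(signature & context)  # Overlap(signature, context)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense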
Next, we’re going to extend the simplified Lesk algorithm so that the sense signatures are more
informative.
(c) In the lesk_ext function, implement a version of Algorithm 1 where, in addition to including the words in sense’s definition and examples, signature also includes the words in
the definition and examples of sense’s hyponyms, holonyms, and meronyms. Beware that
NLTK has separate methods to access member, part, and substance holonyms/meronyms;
use all of them. Use stop_tokenize as you did for lesk.
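A sketch of lesk_ext under the same assumptions; the only change from the lesk sketch is the larger set of synsets contributing to each signature.

    from nltk.corpus import wordnet as wn

    def lesk_ext(sentence, word):
        # As in lesk, but the signature also draws on the definitions and examples
        # of the sense's hyponyms and all three kinds of holonyms and meronyms.
        context = set(stop_tokenize(' '.join(sentence)))
        best_sense, best_score = mfs(sentence, word), 0
        for sense in wn.synsets(word):
            related = ([sense] + sense.hyponyms()
                       + sense.member_holonyms() + sense.part_holonyms()
                       + sense.substance_holonyms() + sense.member_meronyms()
                       + sense.part_meronyms() + sense.substance_meronyms())
            signature = set()
            for synset in related:
                signature |= set(stop_tokenize(synset.definition()))
                for example in synset.examples():
                    signature |= set(stop_tokenize(example))
            score = len(signature & context)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense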
(d) This extension should yield improvement in the algorithm’s accuracy. Why is this extension
helpful? Justify your answer.
Beyond Overlap, there are other scores we could use. Recall cosine similarity from the lectures:
for vectors v and w with angle θ between them, the cosine similarity CosSim is defined as:

CosSim(v, w) = cos θ = (v · w) / (|v| |w|)
Cosine similarity can be applied to any two vectors in the same space. In the Lesk algorithm,
we compare contexts with sense signatures, both of which are sets of words. If, instead of sets, we
produced vectors from the relevant sources (i.e., the words in the sentence for the contexts and the
words in the relevant definitions and examples for the sense signatures), we could then use cosine
similarity to score the two.
Perhaps the simplest technique for constructing vectors in this type of scenario is to treat the
relevant sources (sentence, definitions, etc.) as bags or multisets of words. Bags are like sets but
allow for repeated elements; the bag {a, a, b} is different from the bag {a, b}. (This is not the case
for sets.) Perhaps the simplest technique for assigning vectors to bags of words is to assign one
vector dimension for every word, setting the value for each dimension to the number of occurrences
of the associated word in the bag. So {a, a, b} might be represented with the vector [2 1] and
{a, b} with [1 1]. If we were comparing {new, buffalo, york} and {buffalo, buffalo, like}, we
might use [1 0 1 1] and [2 1 0 0], respectively.
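To make the construction concrete, here is a small illustration of building such count vectors and comparing them with cosine similarity; the helper name bag_vectors and the use of numpy are choices made for this sketch, not part of the handout.

    from collections import Counter
    import numpy as np

    def bag_vectors(tokens_a, tokens_b):
        # Count vectors over the combined vocabulary of two bags of words.
        vocab = sorted(set(tokens_a) | set(tokens_b))
        counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
        a = np.array([counts_a[t] for t in vocab], dtype=float)
        b = np.array([counts_b[t] for t in vocab], dtype=float)
        return a, b

    v, w = bag_vectors(['new', 'buffalo', 'york'], ['buffalo', 'buffalo', 'like'])
    # vocab is ['buffalo', 'like', 'new', 'york'], so v = [1, 0, 1, 1] and w = [2, 1, 0, 0]
    cos_sim = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))  # about 0.52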
(e) In the lesk_cos function, implement a variant of your lesk_ext function that uses
CosSim instead of Overlap. You will have to modify signature and context so that
they are vector-valued; construct the vectors from the relevant tokens for each in the manner
described above. (Again, use stop_tokenize to get the tokens for the signature.)
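A sketch of lesk_cos, reusing the expanded signature from the lesk_ext sketch and the bag_vectors helper above (both of which are assumptions of these sketches rather than names from the starter code):

    import numpy as np
    from nltk.corpus import wordnet as wn

    def lesk_cos(sentence, word):
        # As in lesk_ext, but score each sense by the cosine similarity between
        # count vectors for the signature tokens and the context tokens.
        context_tokens = stop_tokenize(' '.join(sentence))
        best_sense, best_score = mfs(sentence, word), 0.0
        for sense in wn.synsets(word):
            related = ([sense] + sense.hyponyms()
                       + sense.member_holonyms() + sense.part_holonyms()
                       + sense.substance_holonyms() + sense.member_meronyms()
                       + sense.part_meronyms() + sense.substance_meronyms())
            signature_tokens = []
            for synset in related:
                signature_tokens.extend(stop_tokenize(synset.definition()))
                for example in synset.examples():
                    signature_tokens.extend(stop_tokenize(example))
            v, w = bag_vectors(signature_tokens, context_tokens)
            denom = np.linalg.norm(v) * np.linalg.norm(w)
            score = (v @ w) / denom if denom else 0.0
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense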
(f) Suppose that, instead of using word counts as values for the vector elements, we instead used
binary values, so that {new, buffalo, york} and {buffalo, buffalo, like} would be represented
with [1 0 1 1] and [1 1 0 0], respectively. This is a vector representation of a set.
If we use CosSim for such vectors, how would this be related to Overlap? (You do not
need to implement this.)
Finally, let’s try to incorporate modern word vectors in place of the bag of words–based method
above. Relatively simple models such as the skip-gram model of word2vec can be trained on large
amounts of unlabelled data; because of the large size of their training data, they are exposed to
many more tokens and contexts. Once trained, word vectors can be extracted from the model and
used to represent words for other tasks, usually yielding substantial performance gains.
They also seem to exhibit some interesting semantic properties; you may have heard that if we
take the vector for “king”, subtract the vector for “man”, add the vector for “woman”, and then ask
which existing word vector is closest to the result, the answer will be the vector for “queen”. It
stands to reason that incorporating these vectors might help improve the Lesk algorithm.
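For instance, with a pretrained word2vec model loaded through gensim (the file name below is just a placeholder for whatever word2vec-format vectors you have available), the analogy can be queried directly:

    from gensim.models import KeyedVectors

    # Placeholder path: substitute whichever word2vec-format vectors you have.
    wv = KeyedVectors.load_word2vec_format('pretrained-vectors.bin', binary=True)

    # "king" - "man" + "woman": with typical pretrained vectors the top answer is "queen".
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))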