Python代写 | CMPT 456 Assignment 3


CMPT 456 Assignment 3
Due: 11:59 pm, July 17, 2020
100 points in total
Please submit your assignment in Coursys.
Every student has to complete the assignment independently. While you are encouraged to learn
through discussion with the instructor, the TAs and the peer students, any plagiarisms are serious
violation of the university’s academic integrity policy. We have absolutely zero tolerance of such
This assignment covers the materials in the sections of queries and interfaces, and information
retrieval models. The questions use the data sets D1 and D2 in Q1 of Assignment 2. Please also
merge D1 and D2 to form data set D. Do NOT stem the data sets.
Question 1 (50 points)
This question asks you to build a simple spelling correction function.
Your program should take a word w as input. It searches all words x in D that have edit distance
edit-dist(w, x) <= 2, where each of the editing operations, insertion, deletion, and replacement,
takes weight 1. Then, your program should output the top-5 words x that have the highest termfrequency in D as the suggested corrections as well as their term-frequencies in D.
1. Describe your implementation of the above spelling correction function and use 3
examples to show the correction results. Particularly, explain how you can find those
candidate corrections and the top-5 candidates efficiently. Please submit your codes. (20
2. The above design has a problem. If you input a correctly spelt word, like “heart”, it still
may suggest some similar words as possible corrections, such as “heard”. Can you
propose a fix to the problem so that your program can avoid suggesting corrections of
words frequently used in D? Describe your idea use examples, implement it, and test the
results. Does your program need any parameters? If so, discuss how those parameters
may contribute to or affect the performance of the program, that is, how those
parameters are tuned. Please submit your codes. (10 points)
3. A data set of tweets may often contain misspelt words. Your spelling correction program
trained from D may be used on D to improve the data quality. Apply your program to D
itself to catch misspelt words and correct them. Show the corrections you program makes.
Does your program make some mistakes in correction? Use the first 100 corrections your
program makes (or as many as possible if your program makes less than 100 corrections)
to calculate the precision. Explain (1) how your program picks the correction of a word
among all possible candidates; and (2) how your program avoids changing every word in
D, particularly those words correctly spelt but infrequent in D. Submit your codes. (20
Question 2 (25 points)
This question asks you to use language modelsto understand the differences between the usages
of words in D1 and D2. Please remove the stop words in D1 and D2.
For a word w that appears in document D1 (D2), we collect all the tweets containing w and form
a document 𝑑𝑑!
“! (𝑑𝑑!
“” ). Compute the language model 𝑀𝑀!
“! (𝑀𝑀!
“” ) of 𝑑𝑑!
“! (𝑑𝑑!
“” ). Please use
Dirichlet smoothing, use D1 (D2) as the collection, and set 𝜇𝜇 = 25. For the top-100 most frequent
words in D, compute 𝐾𝐾𝐾𝐾(𝑀𝑀!
“” || 𝑀𝑀!
“!) and sort the words in KL-divergence descending order.
Please submit your list of the words, the KL-divergence values and your codes.
Question 3 (25 points)
This question asks you to conduct clustering on the tweets.
Let the tweets in D2 be {𝑡𝑡#
$, … ,𝑡𝑡%
$}, where n is the number of tweets in D2. For each tweet t
in D1, we can form a vector 𝑡𝑡⃗ = ⟨𝑥𝑥#, 𝑥𝑥$, … , 𝑥𝑥%⟩, where 𝑥𝑥& = 𝐾𝐾𝐾𝐾 5𝑀𝑀’|| 𝑀𝑀’#
“6, and 𝑀𝑀’ and 𝑀𝑀’#
” are
the language models of t and 𝑡𝑡&
$, respectively. Again, please use Dirichlet smoothing, use D1 (D2)
as the collection, and set 𝜇𝜇 = 25.
For any two tweets 𝑡𝑡# and 𝑡𝑡$ in D1, the similarity between them is the cosine similarity between
the two associated vectors 𝑡𝑡#777⃗ and 𝑡𝑡$777⃗ . Run the k-means algorithm to cluster all tweets in D1. You
can learn the k-means algorithm on pages 382-384 in the textbook. You can also use any
implementations or libraries, such as scikit in Python <>, to
implement the k-means algorithm. Report your results when k is set to 3, 5, and 10, respectively.
For each cluster, report the number of tweets and the top-5 most frequent words in the tweets
in the cluster. Submit your codes as well.


本网站支持淘宝 支付宝 微信支付  paypal等等交易。如果不放心可以用淘宝交易!

E-mail: [email protected]  微信:itcsdx