# Python代写 | COMPSCI 753 Algorithms for Massive Data

COMPSCI 753 Algorithms for Massive Data

Assignment 1: Locality-sensitive Hashing
1 Assignment problem (50 pts)
The assignment aims at investigating MinHash and Locality-sensitive Hashing (LSH)
framework on real-world data sets. In class, we have seen how to construct signature
matrices using random permutations. However, in practice, random permutation on
very large matrix is prohibitive. Section 3.3.5 of your textbook (Chapter 3, Mining of
Massive Datasets1 by J. Leskovec, A. Rajaraman, J. Ullman) introduces a simple but
fast method to “simulate” this randomness using different hash functions. We encourage
you to read through that section before attempting the assignment.
In the assignment, you write a program2
to compute all pairs similarity on the “bag
of words” data set from the UCI Repository3 using the Jaccard similarity. This problem is the core component for detecting plagiarism and finding similar documents in
information retrieval.
The bag of words data set contains 5 text datasets which share the same pre-processing
procedure. That is, after tokenization and removal of stopwords, the vocabulary of
unique words was truncated by only keeping important words that occurred more than 10
times for large data sets. For small data sets, there was not truncation.
It has the format: docID wordID count, where docID is the document ID, wordID
is the word ID in the vocabulary, and count is the word frequency. Since the Jaccard
similarity does not take into account the word frequency, we simply ignore this information in the assignment. This means that we consider count = 1 for each pair (docID,
wordID). We consider a document as a set and each word as a set element, and make
use the Jaccard similarity.
We only make use the KOS blog entries data set for this assignment since we still need
to run bruteforce algorithm to measure the accuracy of MinHash and LSH. However,
students are encouraged to try on larger data sets, such as NYTimes news article and
PubMed abstracts. Note that the data set is very sparse, so you might think of the
relevant data structures for fast processing.

E-mail: [email protected]  微信:itcsdx