Comp 480/580: Mid-Term Exam
1 Trivia (True or False) 15 points (3 points each)
Make sure to provide short explanation.
1. Given a set S, our goal is to be able to compress S. We want to answer whether any given x is an
element of S or not. We can use bloom ﬁlters on S to compress it for this task.
2a. Given a randomvariable X, we know itsmean and variance E(X) and V ar(X). Given another random
variable Y , which is possibly correlated with X. We also know its mean and variance E(Y ) and V ar(Y ).
The above information is suﬃcient to compute the mean of X + Y.
2b. The information in 2a is suﬃcient to compute the variance of X + Y
3. We have two random variables X and Y (and they are not constants). It is possible to have situations
Y ) = E(X)
4. Count-min sketch returns an underestimate.
5. Given any LSH function h, where Pr(h(x) = h(y)) = pxy for any given x and y. We can always construct
another LSH g with probability Pr(g(x) = g(y)) = p3
2 Estimation and Minwise Hashing 10 points
Given two sets S1 and S2, minwise hashing given by h guarantees that the probability of hash agreement
is Jaccard similarity J, i.e.,
Pr(h(S1) = h(S2)) = jS1 \ S2j
jS2 [ S2j
In the class we showed that if we generate K independent minwise hashes of S1 and S2, then simply
counting the number of hash collisions, call it Nc, out of K gives us an estimator of J given by Nc
K . In
addition, let us say that the variance of this counting estimator is given by V = J(1 J)
1. Identify the random variable in the above problem. 1 point.
2. Given the values of the sizes of the sets, i.e., values of jS1j and jS2j (assume these sizes to be known
constants). Can we get an estimator for jS1 S2j (Number of elements in S1 but not in S2)? (only in terms
of Nc, K, jS1j, and jS2j). 5 points
3. What can we say about the variance of the estimator given in 2 above? (in terms of J, K, jS1j, and
jS2j) 4 points
Known property: V ariance(aX) = a2V ariance(X), where a is constant and X is any random variable.
3 Count-Min Sketch with weights 10 points
(Please see cheat-sheet for reference) Let us assume that every item i has a weight associated with it, call
it vi. We modify the count-min sketch algorithm as follows:
• Addition: Instead of “Each counter is initialized with zero, and every update ct
i (to item i at time t) is
added for all d rows to countersM(j; hj(i)), where j = f1; 2; :::; dg”, we update by multiplying every
update of item i with vi, i.e., for update ct
i we update the countersM(j; hj(i)), where j = f1; 2; :::; dg
with vi ct
• Querying: While estimating the count for element i, we use roughly the same estimation as count-
min sketch, except that we divide the estimate by vi. Essentially, a query for the count of item i
reports the minimum of the corresponding d counters (after division by vi) i.e., min
Note, this is generalization of count-min sketch. We recover count-min sketch if we put vi = 1 for all i
1. Consider the element j that has the largest weight, i.e., vj = maxivi. What can we say (more error
or less error) about the estimate of this largest element compared with the estimate of usual count-min
sketch (where all vi = 1) 4 points.
Questions: Maybe just consider what happens with only 1 hash function (i.e. d =1).
2. Consider the element j that has the smallest weight, i.e., vj = minivi. What can we say (in one
line) about its estimate of this element compared with the estimate of usual count-min sketch (where all
vi = 1) 4 points
3. In light of 1. and 2. above, can you describe (in few words) what do you think is going on. What is
the signiﬁcance of weights here? 2 points
4 Minwise Hashing (Advanced) 15 points
The proof of the fact
Pr(Minhash(S1) = Minhash(S2)) = jS1 \ S2j
jS2 [ S2j
follows from the following arguments.
1. Given any set S, Apply hash mapping h (assuming perfect hashing i.e. for two diﬀerent element
x = y, implies h(x) = h(y)), to every element of S. The element that gets the minimum value (call
it Minhash(S)) is a perfectly random draw of an element from S
2. Now consider set S1 [ S2. Apply hash mapping h to this union of two sets. Thus, we expect
Minhash(S1 [ S2) to give a random draw from S1 [ S2.
3. We haveMinhash(S1) = Minhash(S2) = Minhash(S1[S2), if an only if theminimumof h(S1[S2)
comes from a token belonging to S1\S2. Only when both the minimum values coming from the sets
h(S1) and h(S2) will be equal. Essentially, the collision event indicates a random draw from S1 \S2.
(Venn diagram may help visualise)
4. Thus, the collision event is a special event where Minhash(S1 [ S2) satisﬁes Minhash(S1) =
Minhash(S2) = Minhash(S1[S2). The event has jS1\S2j choices. The total choices forMinhash(S1[
S2) is jS1 [ S2j (any random draw irrespective of the collision).
A good idea is to look at standard Venn diagram in Figure 1.
1. Consider the case Minhash(S1) = Minhash(S2). What is the probability of this event? 2 points.
2. Can we come up with an expression of Pr(Minhash(S1) > Minhash(S2)), in terms of jS1j, jS2j,
jS1[S2j, and jS1\S2j? Can you instead use this to estimate Jaccard similarity J (Generate K independent
minwise hashes of S1 and S2 and …. (similar to Problem 2 deﬁne the estimator)) 8 points
3. Lets say we have three sets. S1, S2, and S3. What is the expression for Pr(Minhash(S1) =
Minhash(S2) = Minhash(S3)) (in terms of set deﬁnitions). Can you generalize the expressions if we
have instead n sets S1, S2, …, Sn and we are looking for Pr(Minhash(S1) = Minhash(S2) = ::: =
Minhash(Sn)) (Hint: Draw a venn diagram with three sets). 5 points
4 (Keep thinking, No points)Which estimator for J is better? The one derived in Problem2 previously
or derived in the second part of this problem.
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: email@example.com 微信:itcsdx