Python代写 | Comp 480/580: Assignment #1

本次Python代写是完成哈希函数和布隆过滤

Comp 480/580: Assignment #1

1 Testing Hash Functions [Optional: Either 1 or 2] 20 Points
Avalanche Analysis: A common test of hash function performance is whether or not it achieves “avalanche.”
This refers to the desirable characteristic that
P(Output bit i changes | Input bit j changes) = 0.5 ∀i, j
We are going to analyze three hash functions. Lets fix the range to 10 bits, and P = 1048573 Pick,
a,b,c,d uniformly between [1−1048573], and store them. (you need to report the numbers you generated)
Your hash functions h(x) are
1. ((ax + b) mod P) mod 1024 (2-universal)
2. ((ax2 + bx + c) mod P) mod 1024 (3-universal)
3. ((ax3 + bx2 + cx + d) mod P) mod 1024 (4-universal)
4. murmurhash3 with a fixed seed (use murmurhash from sklearn.utils import murmurhash3_32. Feel
free to take extra mod to constrain it in range [0-1023])
Randomly generate 5000 positive integers 31-bit keys (values of x) to start with. Design your experiment to estimate the probability values (you need to flip bits now).
Create a 2 dimensional heatmap : On x-axis, we have bits in the input. For every input bit j, we have
to calculate 10 values, which is the probability
P(Output bit i changes | Input bit j changes) = 0.5 ∀i, j
Find the most convenient way to have this heatmap. (value of 0.5 should be dark for best visualization).
Write a paragraph on the plot and your conclusion.
3 Inequalities 20 points
Extend the proof in the class for 5-independent hash functions. Prove that linear probing has O(1) expected cost for searching with load factor α = 1/3.
You can assume from the lecture notes that E(cost of searching) = O(1)Plog n
i=1 2
s×P r[Bs ≥ 2∗E[Bs]),
where Bs is a random variable that denotes the number of elements (our of total elements) that maps into
a given region of size 2
s
, using the 5-independent hash function.
Hint 1: Use 4
th Moment Bound:
P r(|Bs − E[Bs]| ≥ a) ≤
4thMoment(Bs)
a
4
,
where
4thMoment(Bs) = E[(Bs − E(Bs))4
]
Hint 2: 5-Independence implies, with Xi as defined in class, that E[XiXjXkXl
] = E[Xi
]E[Xj ]E[Xk]E[Xl
]
for any quadruples [Xi
, Xj , Xk, Xl
]
4 Implement and Test Bloom Filters 40 points
Build your own Bloom Filter
The instructions are given in python for demonstration. If using any other language, feel free to
choose equivalent alternatives. Never hesitate to ask question, when in doubt
In class we saw the basic Bloom filter, where we used k independent random hash-functions h1, h2, …, hk
to hash a set S of N elements into an array A of R bits. Recall that the formulas for computing the best k
to use for a given key set size M and maximum false positive rate at given range,
k =
R
N
ln2 (1)
with false positive rate
f p = 0.618
R
N (2)
4.1 Warmup Tasks
• 1. Implement a simple hash function factory. Given an argument m, the desired hash table size,
the factory should return a hash function that maps integers into a table of that length. (Or use
murmurhash from sklearn.utils import murmurhash3_32 )
• 2. Implement a Bloom filter as a class. A Bloom filter’s primary constructor receives two arguments:
the desired false positive rate, c, and the expected number of keys to store, n. You should not use
arrays but instead use bitmaps. You can make use of an example of bitarrays in python from here
url and make sure that your range is power of 2.
2
• 3. You will generate two dataset: a membership set: a list of 10,000 unique integers selected randomly from the range [10000..99999], and a test set: a list of 1000 unique integers not in the membership set and 1000 selected randomly from the membership test. Demonstrate your Bloom filter
for each of these false positive rates:
0.01
0.001
0.0001
In all cases, insert the items in the membership set into a Bloom filter using its primary constructor.
Then look up all the items in the test and compute the false positive rate.