# CMPT 459 Assignment 3

Question 1 (10 points)
Given a transaction database T, let $x_1, x_2, \ldots, x_k$ be the k most frequent items. Prove that, for any
length-k itemset Y, $\mathrm{sup}(Y) \le \min\{\mathrm{sup}(x_1), \mathrm{sup}(x_2), \ldots, \mathrm{sup}(x_k)\}$.
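
As a sanity check before attempting the proof, the inequality (read here as a bound by the minimum support among the k most frequent items) can be tested by brute force on a small database. The sketch below is only an illustration: the toy database `T`, the helper `sup`, and the choice `k = 2` are made up for this example, and support is taken as an absolute transaction count.

```python
from itertools import combinations

def sup(itemset, transactions):
    """Absolute support: number of transactions that contain every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

# Toy transaction database, invented for illustration only.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"a"}]

k = 2
items = sorted({i for t in T for i in t})

# The k most frequent single items x_1, ..., x_k.
top_k = sorted(items, key=lambda i: sup({i}, T), reverse=True)[:k]
bound = min(sup({x}, T) for x in top_k)

# Check sup(Y) <= min{sup(x_1), ..., sup(x_k)} for every length-k itemset Y.
holds = all(sup(Y, T) <= bound for Y in combinations(items, k))
print(top_k, bound, holds)
```

A check like this does not replace the proof, but it helps confirm the statement is being read correctly.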
Question 2 (10 points)
Given a transaction database T, let X and Y be two itemsets such that $\mathrm{sup}(X) = \mathrm{sup}(Y)$ and
$X \cap Y \neq \emptyset$. For example, X = abc and Y = cde. Does $\mathrm{sup}(X) = \mathrm{sup}(Y) = \mathrm{sup}(X \cup Y)$ always hold?
If so, please give a mathematical proof; otherwise, give a counter example.
Questions 3 and 4 use the tweet data sets D1 and D2 formed in Assignment 1. You can build your
solutions to Questions 3 and 4 on top of any tools or code you find on the web, as long as you
cite them properly.
Question 3 (50 points)
In D1 and D2, treat each token as an item, and each tweet as a transaction. That is, we ignore
the order of tokens within a tweet. If a token appears multiple times in a tweet, keep only one
occurrence. Write a program to find the top 100 patterns of lengths 1, 2, 3, 4, and 5 for D1 and
D2, respectively. Here, a pattern of length k is a set of k tokens. Some patterns may have the same
length and the same support. Report all such ties, so that at least 100 patterns are reported for each length. More
specifically, a pattern X of length k is a top-100 pattern if there do not exist 100 other patterns
$X_1, \ldots, X_{100}$, each of length k, such that $\mathrm{sup}(X_i) > \mathrm{sup}(X)$ for $1 \le i \le 100$.
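
One possible starting point for the top-100 rule is sketched below. It is a brute-force illustration, not a required method: the function name `top_patterns` and its parameters are hypothetical, each tweet is assumed to be already tokenized into a list of strings, and every length-k subset of every tweet is enumerated, which may be too slow or memory-hungry on the full D1 and D2 (a pruning approach such as Apriori or FP-growth is the usual alternative).

```python
from collections import Counter
from itertools import combinations

def top_patterns(tweets, k, n=100):
    """Return (pattern, support) pairs for the top-n length-k patterns, ties included."""
    counts = Counter()
    for tokens in tweets:
        # A tweet is a transaction: drop duplicate tokens and ignore order.
        items = sorted(set(tokens))
        counts.update(combinations(items, k))
    ranked = counts.most_common()  # sorted by support, descending
    if len(ranked) <= n:
        return ranked
    # X is a top-n pattern unless n other patterns have strictly larger
    # support, so everything tied with the n-th largest support is kept.
    cutoff = ranked[n - 1][1]
    return [(p, s) for p, s in ranked if s >= cutoff]

# Tiny made-up example: with n=1, both tied patterns are reported.
tweets = [["a", "b", "a"], ["a", "b", "c"], ["b", "c"], ["a"]]
print(top_patterns(tweets, 1, n=1))
print(top_patterns(tweets, 2, n=1))
```

Only patterns that occur in at least one tweet are counted here, so fewer than n patterns are returned when fewer than n exist with non-zero support.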
1. Describe your algorithm and implementation. (20 points)
2. Submit those frequent patterns and their supports. (10 points)
3. Plot a figure where the x-axis is the length k and the y-axis is the support of the most
frequent pattern at length k. Do the curves of D1 and D2 follow a power-law distribution?
Try to estimate the parameters of the distribution. (20 points)
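
For the parameter estimate in part 3, one common rough approach is a least-squares line fit on log-log axes: if support ≈ c · k^(−α), then log(support) = log(c) − α · log(k). The sketch below assumes that model; the function name `fit_power_law` and the sample numbers are hypothetical, and with only five data points the fit is a coarse indication rather than a statistical test.

```python
import math

def fit_power_law(lengths, supports):
    """Least-squares fit of log(support) = log(c) - alpha * log(k); returns (c, alpha)."""
    xs = [math.log(k) for k in lengths]
    ys = [math.log(s) for s in supports]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic supports that follow 1000 * k^(-2) exactly (not real data).
lengths = [1, 2, 3, 4, 5]
supports = [1000 * k ** -2.0 for k in lengths]
c, alpha = fit_power_law(lengths, supports)
print(round(c, 3), round(alpha, 3))
```

Plotting the same points on log-log axes gives a quick visual check: a power law appears as a roughly straight line with slope −α.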
Question 4 (30 points)
For a pattern X, we are interested in the ratio of the supports of X in the two data sets D1 and D2.