CMPT 459 Assignment 3
Due: 11:59 pm, July 10, 2020
100 points in total
Please submit your assignment in Coursys.
Every student has to complete the assignment independently. While you are encouraged to learn
through discussion with the instructor, the TAs, and your peers, any plagiarism is a serious
violation of the university's academic integrity policy. We have absolutely zero tolerance for such violations.
This assignment covers the material in the pattern mining section.
Question 1 (10 points)
Given a transaction database T, let $x_1, x_2, \ldots, x_k$ be the k most frequent items. Prove that, for any
length-k itemset Y, $\mathrm{sup}(Y) \le \min\{\mathrm{sup}(x_1), \mathrm{sup}(x_2), \ldots, \mathrm{sup}(x_k)\}$.
Question 2 (10 points)
Given a transaction database T, let X and Y be two itemsets such that $\mathrm{sup}(X) = \mathrm{sup}(Y)$ and
$X \cap Y \neq \emptyset$. For example, X = abc and Y = cde. Does $\mathrm{sup}(X) = \mathrm{sup}(Y) = \mathrm{sup}(X \cup Y)$ always hold?
If so, please give a mathematical proof; otherwise, give a counterexample.
Questions 3 and 4 use the tweet data sets D1 and D2 formed in Assignment 1. You can build your
solutions to Questions 3 and 4 on top of any tools or code you find on the web, as long as you
cite them properly.
Question 3 (50 points)
In D1 and D2, treat each token as an item, and each tweet as a transaction. That is, we ignore
the order of tokens within a tweet. If a token appears multiple times in a tweet, keep only one
occurrence. Write a program to find the top 100 patterns of lengths 1, 2, 3, 4, and 5 for D1 and
D2, respectively. Here, a pattern of length k is a set of k tokens. Some patterns may have the same
length and the same support. Report all of them, so that at least 100 patterns are reported for each length. More
specifically, a pattern X of length k is a top-100 pattern if there do not exist 100 other patterns
$X_1, \ldots, X_{100}$, each of length k, such that $\mathrm{sup}(X_i) > \mathrm{sup}(X)$ for $1 \le i \le 100$.
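The tie rule above can be read as: keep every length-k pattern whose support is at least that of the 100th-ranked pattern. The following is a minimal Python sketch of that reading, assuming the tweets are already tokenized into lists of strings. The names `count_patterns` and `top_with_ties` and the toy data are hypothetical, and the brute-force enumeration only stands in for Apriori or FP-growth, which real data would likely need.

```python
from collections import Counter
from itertools import combinations


def count_patterns(tweets, k):
    """Count the support of every length-k token set by brute force."""
    counts = Counter()
    for tokens in tweets:
        items = sorted(set(tokens))  # one occurrence per token, order ignored
        for pattern in combinations(items, k):
            counts[pattern] += 1
    return counts


def top_with_ties(counts, n=100):
    """Keep each pattern that fewer than n patterns beat with strictly higher support."""
    if not counts:
        return []
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    # Support of the n-th ranked pattern (or of the last one if fewer than n exist).
    cutoff = ranked[min(n, len(ranked)) - 1][1]
    return [(pattern, sup) for pattern, sup in ranked if sup >= cutoff]


# Hypothetical toy data standing in for tokenized tweets.
toy_tweets = [["a", "b", "a"], ["a", "b", "c"], ["b", "c"], ["a", "c"], ["b", "a"]]
pairs = count_patterns(toy_tweets, 2)
top2 = top_with_ties(pairs, n=2)
```

In the toy data, two pairs tie for second place, so `top_with_ties(pairs, n=2)` returns three patterns rather than two.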
1. Describe your algorithm and implementation. (20 points)
2. Submit those frequent patterns and their supports. (10 points)
3. Plot a figure where the x-axis is the length k and the y-axis is the support of the most
frequent pattern at length k. Do the curves of D1 and D2 fit the power law distribution?
Try to estimate the parameters of the distribution. (20 points)
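One common way to estimate power-law parameters is a least-squares line fit on log-log axes. The sketch below assumes the model sup(k) ≈ C · k^(-alpha), which is an assumption and not something the assignment prescribes. The function name `fit_power_law` is hypothetical, and the supports are synthetic values generated from known parameters, not results from D1 or D2.

```python
import math


def fit_power_law(lengths, supports):
    """Least-squares fit of log(sup) = log(C) - alpha * log(k); returns (C, alpha)."""
    xs = [math.log(k) for k in lengths]
    ys = [math.log(s) for s in supports]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic supports generated with C = 1000 and alpha = 2 (illustrative only).
lengths = [1, 2, 3, 4, 5]
supports = [1000.0 * k ** -2 for k in lengths]
C, alpha = fit_power_law(lengths, supports)
```

With only five lengths the fit is rough at best. Whether the points fall close to a straight line on a log-log plot is a more informative check than the fitted numbers alone.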
Question 4 (30 points)
For a pattern X, we are interested in $\mathrm{odd}(X) = \frac{\mathrm{sup}_{D_1}(X)}{\mathrm{sup}_{D_2}(X)}$. Write a program to find the top 100
patterns of length up to 4 that have the highest odd values and $\mathrm{sup}_{D_2}(X) \ge 5$. If you cannot find
100 such patterns in your data sets, lower the threshold on $\mathrm{sup}_{D_2}(X)$ to a smaller value.
1. Describe your algorithm and implementation. (20 points)
2. Submit those patterns and their odds. (10 points)
Hints: think about which algorithm you would like to use, Apriori or FP-growth, and what further
pruning criteria you may employ to reduce your search space.
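The ranking step of Question 4 can be sketched as follows. The sketch assumes that odd(X) is the ratio of the support of X in D1 to its support in D2, and that the minimum-support threshold applies to the support in D2. Both are assumptions about the intended definition. The name `top_odds` and the support tables are hypothetical; in practice the tables would come from a frequent-pattern miner run on each data set.

```python
def top_odds(sup_d1, sup_d2, min_sup=5, n=100):
    """Rank patterns by sup_d1 / sup_d2 among those with sup_d2 >= min_sup."""
    odds = {
        pattern: sup_d1.get(pattern, 0) / s2
        for pattern, s2 in sup_d2.items()
        if s2 >= min_sup  # also guarantees a non-zero denominator
    }
    ranked = sorted(odds.items(), key=lambda kv: -kv[1])
    return ranked[:n]


# Hypothetical support tables keyed by sorted token tuples.
sup_d1 = {("a",): 50, ("a", "b"): 30, ("c",): 4}
sup_d2 = {("a",): 10, ("a", "b"): 5, ("c",): 2, ("d",): 8}
best = top_odds(sup_d1, sup_d2, min_sup=5, n=2)
```

Under this reading, the support threshold on D2 is anti-monotone, so any itemset below it can be pruned together with all of its supersets. The ratio itself is not anti-monotone, so it can only be applied after candidate generation.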

