2 Question 1: Clustering (16 marks)
The following sub-questions are about clustering. In general, the topics covered are as follows:

* Question 1.1 focuses on the behaviour of clustering algorithms and their sensitivity to the data.
* Question 1.2 focuses on methods for estimating the number of clusters.
* Question 1.3 uses a large real-world dataset to look at how clustering can be used for knowledge discovery.
The following reading is likely to be useful for these questions. Note that only certain sections may be
relevant; you are not expected to read it all!
For this question, we will use multiple datasets, which can all be found on Blackboard. To load
these datasets, we can use the following code. Note that you may need to adjust the path to each
dataset, depending on where they are located on your system.
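A minimal loading sketch is shown below. The data directory and exact file names are assumptions here; adjust them to wherever your Blackboard downloads live.

```python
# Minimal dataset-loading sketch. DATA_DIR and the file names are
# assumptions -- point DATA_DIR at the folder holding the Blackboard files.
from pathlib import Path
import pandas as pd

DATA_DIR = Path(".")  # adjust to your download location

def load_dataset(name: str) -> pd.DataFrame:
    """Load one of the assignment CSV files as a DataFrame."""
    path = DATA_DIR / name
    if not path.exists():
        raise FileNotFoundError(f"Expected {path} -- adjust DATA_DIR")
    return pd.read_csv(path)

# Example usage (files must exist locally):
# simple = load_dataset("simple.csv")                 # Q1.1
# retail = load_dataset("online_retail_full.csv")     # Q1.3
```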
2.1 Q1.1 (6 marks)
2.1.1 Q1.1a (2 marks)
Using simple.csv, run the K-Means and single-linkage algorithms (available in scikit-learn) on
the dataset, using the true number of clusters (K = 5). Produce a graph (e.g. a bar chart) showing
the performance (measured using the adjusted Rand index) across 10 independent runs. Discuss the
results obtained, using your knowledge of how the algorithms work to explain why the behaviour
occurs.

Hints:

* This question is more difficult without the use of error bars!
* For single linkage, use AgglomerativeClustering(n_clusters=5, linkage="single").
* For K-Means, use KMeans(n_clusters=5, init="random", n_init=1) as arguments.
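A sketch of this experiment follows. Since simple.csv is not bundled here, synthetic blobs stand in for it; swap in the real data for the assignment. The plotting step assumes matplotlib is available.

```python
# Sketch of the Q1.1a experiment: ARI for K-Means vs. single linkage over
# 10 runs. Synthetic blobs stand in for simple.csv -- replace with the
# real data for the actual assignment.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=5, random_state=0)

kmeans_scores, single_scores = [], []
for run in range(10):
    # Random initialisation with a single restart, so scores vary per run.
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=run).fit(X)
    kmeans_scores.append(adjusted_rand_score(y_true, km.labels_))
    # Single linkage is deterministic, so its score is identical every run.
    sl = AgglomerativeClustering(n_clusters=5, linkage="single").fit(X)
    single_scores.append(adjusted_rand_score(y_true, sl.labels_))

# Bar chart with error bars (standard deviation across the 10 runs).
means = [np.mean(kmeans_scores), np.mean(single_scores)]
errs = [np.std(kmeans_scores), np.std(single_scores)]
plt.bar(["K-Means", "Single linkage"], means, yerr=errs)
plt.ylabel("Adjusted Rand Index")
plt.savefig("q11a_ari.png")
```

The error bars make the contrast visible at a glance: K-Means varies across runs because of its random initialisation, while single linkage gives one fixed answer.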
2.3 Q1.3 (5 marks)
For this question, we will use the online_retail_full.csv dataset, which is a real-world dataset
of transactions for an online retail store. Full information about the dataset can be found here.
Here, we do not have true labels and must explore the data instead. This is a common scenario
in practice: you will need to explore the data and use clustering (likely over multiple
iterations and tweaks) to try to find patterns.
We’re going to investigate whether there are groups of customers, how they are similar, and what
they may represent. For simplicity, we will start by using KMeans as our model, and we’ll remove
some of the columns from our input data. Use a range of K values and whichever techniques from
Q1.2 are useful to propose interesting value(s) of K. Comment on the clusters that are produced
in the context of the data.
Hints:

* As this dataset has no ground truth, there is a lot of scope in this question. Remember to have some justification for the steps you have taken.
* The quality of your final clusters is not important for marks, as long as you have taken reasonable steps.
* The overall aim is to find patterns in the data. KMeans is suggested as a starting point, but as we have seen in previous questions, it is not always the best algorithm.
* You can create features from the existing ones. For example, quantity and price can be multiplied to get a total amount (thus simplifying the data). Other features may require transformation before they can be used.
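One possible workflow is sketched below: derive a total-amount feature, aggregate per customer, standardise, then scan K values with the silhouette score. The column names (CustomerID, Quantity, UnitPrice) follow the classic UCI Online Retail schema and are assumptions here; check them against online_retail_full.csv. Random data stands in for the real file.

```python
# Hedged sketch of one Q1.3 workflow. Column names are assumptions based
# on the UCI Online Retail schema; random data stands in for the real CSV.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CustomerID": rng.integers(1, 60, size=1000),
    "Quantity": rng.integers(1, 20, size=1000),
    "UnitPrice": rng.uniform(0.5, 30.0, size=1000),
})

# Derived feature: total amount per transaction, then aggregate per customer.
df["TotalAmount"] = df["Quantity"] * df["UnitPrice"]
customers = df.groupby("CustomerID").agg(
    n_purchases=("Quantity", "size"),
    total_spend=("TotalAmount", "sum"),
)

# Standardise so both features contribute equally to the distances.
X = StandardScaler().fit_transform(customers)

# Scan a range of K values and score each clustering.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real data you would then inspect each cluster's feature averages (e.g. spend and purchase counts) to interpret what the customer groups represent.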
3 Question 2: Itemset Rule Mining (4 marks)
For this question, we will be using a real-world dataset which gives the votes of 435 U.S. congressmen
on 16 key issues gathered in the mid-1980s, and also includes their party affiliation as a binary
attribute. This is a purely nominal dataset with some missing values (corresponding to abstentions).
It is normally treated as a classification problem, the task being to predict party affiliation based
on voting patterns. However, association-rule mining can also be applied to this data to seek
interesting associations among the attributes.
We will be using Weka, both for its itemset rule mining functionality and as a different approach
to exploring data. You should have some experience using Weka from the first (non-assessed) lab.
You may need to take screenshots of Weka and include them in your answer below, or copy & paste
the relevant rules. Please ensure that your answer and rules are clearly legible.
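The assignment itself uses Weka's Apriori implementation, but the core idea behind it, counting which itemsets occur together frequently enough, can be sketched in a few lines of Python. The toy vote records and item names below are hypothetical, not the real dataset.

```python
# Toy illustration of frequent-itemset counting, the first stage of Apriori.
# The transactions below are hypothetical vote records, not the real dataset;
# Weka's Apriori adds candidate pruning and rule generation on top of this.
from itertools import combinations
from collections import Counter

transactions = [
    {"party=rep", "budget=yes", "crime=yes"},
    {"party=rep", "budget=yes", "crime=yes"},
    {"party=dem", "budget=no", "crime=no"},
    {"party=dem", "budget=no", "crime=yes"},
]

min_support = 0.5  # itemset must appear in at least half the transactions

# Count every itemset of size 1 and 2 across all transactions.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

# Keep only the itemsets whose support clears the threshold.
n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
```

Rules such as "party=rep => budget=yes" are then read off the frequent itemsets by comparing the support of the pair against the support of the antecedent alone (its confidence).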