60711-Cwk3-S-Third
2 Question 1: Clustering (16 marks)
The following sub-questions are about clustering. In general, the topics covered are as follows:

* Question 1.1 focuses on the behaviour of clustering algorithms and their sensitivity to the data.
* Question 1.2 focuses on methods for estimating the number of clusters.
* Question 1.3 uses a large real-world dataset to look at how clustering can be used for knowledge discovery.
The following reading is likely to be useful for these questions. Note that only certain sections may be relevant; you are not expected to read it all!
For this question, we will use multiple datasets, all of which can be found on Blackboard. They can be loaded with the following code. Note that you may need to adjust the paths, depending on where the files are located on your system.
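A minimal loading sketch, assuming the CSVs sit in the working directory (adjust the paths for your system), with `pandas`:

```python
import pandas as pd

def load_dataset(path):
    """Read one of the coursework CSVs into a DataFrame."""
    return pd.read_csv(path)

# e.g. simple = load_dataset("simple.csv")
#      retail = load_dataset("online_retail_full.csv")
# For simple.csv we assume the true cluster label is stored in one of
# the columns; check df.head() to confirm before splitting features
# from labels.
```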
 2.1 Q1.1 (6 marks)
 2.1.1 Q1.1a (2 marks)
 Using simple.csv, run the K-Means and single-linkage algorithms (available in scikit-learn) on
 the dataset, using the true number of clusters (K = 5). Produce a graph (e.g. bar chart) showing
 the performance (measured using the adjusted rand index) across 10 independent runs. Discuss the
 results obtained, using your knowledge of how the algorithms work to explain why the behaviour
 observed occurred.
Hints:

* This question is more difficult to answer without the use of error bars!
* For single linkage, use AgglomerativeClustering(n_clusters=5, linkage="single").
* For K-Means, use KMeans(n_clusters=5, init="random", n_init=1) as the arguments.
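One way to structure the experiment is sketched below. `make_blobs` stands in for simple.csv (an assumption; swap in the real features and labels once loaded), and the estimator arguments follow the hints above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Placeholder data with K = 5 true clusters; replace with simple.csv.
X, y_true = make_blobs(n_samples=300, centers=5, random_state=0)

kmeans_scores, single_scores = [], []
for run in range(10):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=run)
    kmeans_scores.append(adjusted_rand_score(y_true, km.fit_predict(X)))
    # Single linkage is deterministic, so its ARI will not vary across
    # runs -- a point worth raising in the discussion.
    sl = AgglomerativeClustering(n_clusters=5, linkage="single")
    single_scores.append(adjusted_rand_score(y_true, sl.fit_predict(X)))

print("KMeans ARI:", np.mean(kmeans_scores), "+/-", np.std(kmeans_scores))
print("Single ARI:", np.mean(single_scores), "+/-", np.std(single_scores))
# A bar chart with error bars can then be drawn with, e.g.,
# plt.bar(["KMeans", "Single"], means, yerr=stds).
```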
 2.3 Q1.3 (5 marks)
 For this question, we will use the online_retail_full.csv dataset, which is a real-world dataset
 of transactions for an online retail store. Full information about the dataset can be found here.
Here, we do not have true labels, and need to explore the data instead. This is a common scenario in practice, and will require you to explore the data and use clustering (likely requiring multiple iterations and tweaks) to try to find patterns.
 We’re going to investigate whether there are groups of customers, how they are similar, and what
 they may represent. For simplicity, we will start by using KMeans as our model, and we’ll remove
 some of the columns from our input data. Use a range of K values and whichever techniques in
 Q1.2 are useful to propose interesting K value(s). Comment on the clusters that are produced in
 terms of the context of the data.
Hints:

* As this dataset has no truth, there is a lot of scope in this question – remember to have some justification for why you have taken the steps you have.
* The quality of your final clusters is not important for marks, as long as you have taken reasonable steps.
* The overall aim is to try to find patterns in the data. KMeans is suggested as a starting point, but it is not always the best algorithm to use, as we have seen in previous questions.
* You can create features from the existing ones. For example, the quantity and price can be multiplied to get a total amount (thus simplifying the data). Other features may require transformation before they can be used.
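The feature-creation hint can be sketched as follows. The column names (`CustomerID`, `Quantity`, `UnitPrice`) are assumptions based on the standard Online Retail schema; verify them against your copy of online_retail_full.csv:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def customer_features(df):
    """Derive simple per-customer features for clustering."""
    df = df.dropna(subset=["CustomerID"]).copy()
    # Combine quantity and price into one total-amount column.
    df["Total"] = df["Quantity"] * df["UnitPrice"]
    # Aggregate transactions down to one row per customer.
    feats = df.groupby("CustomerID").agg(
        n_purchases=("Total", "size"),
        total_spent=("Total", "sum"),
    )
    # KMeans is distance-based, so scale the (often skewed) features.
    return StandardScaler().fit_transform(feats), feats.index

# scaled, customers = customer_features(retail)
# ...then sweep a range of K values with KMeans and apply the
# selection techniques from Q1.2 to propose interesting K value(s).
```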
 3 Question 2: Itemset Rule Mining (4 marks)
 For this question, we will be using a real-world dataset which gives the votes of 435 U.S. congressmen
 on 16 key issues gathered in the mid-1980s, and also includes their party affiliation as a binary
 attribute. This is a purely nominal dataset with some missing values (corresponding to abstentions).
 It is normally treated as a classification problem, the task being to predict party affiliation based
 on voting patterns. However, association-rule mining can also be applied to this data to seek
 interesting associations.
We will be using Weka, both for its built-in support for itemset rule mining and as a different approach to exploring data. You should have some experience using Weka from the first (non-assessed) week.
 You may need to take screenshots of Weka and include them in your answer below, or copy & paste
 the relevant rules. Please ensure that your answer and rules are clearly legible.