CS 486 – Machine Learning Fall, 2020
Assignment 01 – Clustering
There are four parts for this assignment, all detailed below:
1. Implement k-means as described above.
2. Extend k-means so that it balances the number of instances per cluster.
3. Run the clustering algorithms against some datasets and determine the performance of
each; compare performances of the algorithms.
4. Do a performance analysis between your implementation of K-Means (excluding the
extended version) and the version offered by the scikit-learn library. The dataset to be
used for the performance analysis is the one used for Lab-02-K-Means. Access this
dataset using this link.
Create a Python-based implementation of the K-Means algorithm.
This implementation must be a subclass of cluster.py, available here. As such, it must
implement two member functions: __init__(…) and fit(…), as described below.
● __init__(…) must allow the class’s users to set the algorithm’s hyperparameters: k,
the target number of cluster centroids, and max_iterations, the maximum number of
times to execute the convergence attempt (the repeat loop in the above Background
section). The default values are required to be k = 5 and
max_iterations = 100.
● fit(…) must accept one parameter X, where X is a list (not columns of a
DataFrame) of n instances, each described by d dimensions (features). A
successful call to the fit(…) function must return the following two items, in order:
A. A list (of length n) of the cluster hypotheses, one for each instance.
B. A list (of length at most k) containing lists (each of length d) of the cluster
centroids.
For example, if the input (X) contains the following values in 2-dimensional space:
[ [0, 0], [2, 2], [0, 2], [2, 0], [10, 10], [8, 8], [10, 8], [8, 10] ]
… and k = 2, we expect the centroids to be [1, 1] and [9, 9]. The output of the fit(…)
function should be as follows:
A. [0, 0, 0, 0, 1, 1, 1, 1] — indicating that the first four instances belong to one cluster and
the second four belong to a different cluster.
B. [ [1, 1], [9, 9] ] — the values for the first and second centroid, respectively.
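The interface above can be illustrated with a minimal sketch. Several things here are assumptions rather than requirements of the assignment: the contents of cluster.py are not shown, so the base class below is a hypothetical stand-in; the centroids are seeded with the first k instances (instead of a random choice) so the run is repeatable; and a cluster that ends up empty simply keeps its previous centroid.

```python
class cluster:
    """Hypothetical stand-in for the base class provided in cluster.py."""

    def fit(self, X):
        raise NotImplementedError


class KMeans(cluster):
    def __init__(self, k=5, max_iterations=100):
        self.k = k
        self.max_iterations = max_iterations

    def fit(self, X):
        # Assumption: seed the centroids with the first k instances (repeatable runs).
        centroids = [[float(v) for v in x] for x in X[:self.k]]
        labels = None
        for _ in range(self.max_iterations):
            # Assignment step: nearest centroid by squared Euclidean distance.
            # Ties go to the lowest centroid index.
            new_labels = [
                min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
                for x in X
            ]
            if new_labels == labels:
                break  # assignments stopped changing, so the centroids will not move
            labels = new_labels
            # Update step: move each centroid to the mean of its members.
            for c in range(len(centroids)):
                members = [x for x, label in zip(X, labels) if label == c]
                if members:  # an empty cluster keeps its previous centroid
                    centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        return labels, centroids


X = [[0, 0], [2, 2], [0, 2], [2, 0], [10, 10], [8, 8], [10, 8], [8, 10]]
labels, centroids = KMeans(k=2).fit(X)
print(labels)     # [0, 0, 0, 0, 1, 1, 1, 1]
print(centroids)  # [[1.0, 1.0], [9.0, 9.0]]
```

With a random choice of starting centroids, the same input can occasionally settle on a different (worse) pair of clusters, which is one reason to run the algorithm several times.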
Test the Python-based implementation using scikit-learn. Generate clusters using the
make_blobs function with the following commands:
from sklearn.datasets import make_blobs
X, cluster_assignments = make_blobs(n_samples=200, centers=4,
This will generate 200 instances of data points in 2-dimensional space, with each of the
instances belonging to one of 4 clusters. The coordinates for the 200 instances are returned as
X. The cluster assignments are returned as cluster_assignments. Use X as the parameter
to your fit(…) function listed above, and use cluster_assignments to determine
whether your implementation’s hypotheses are correct. (Given multiple iterations, say 10, of
your implementation with k=4, the values for X from the commands above should generate no
errors; however, the values in cluster_assignments may not align to the values from your
implementation.)
Please include a sample of your implementation’s output from the input as a .txt file.
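Because cluster numbers are arbitrary, a hypothesis list such as [2, 2, 0, 0] can describe exactly the same grouping as cluster_assignments of [0, 0, 1, 1]. One possible way to check correctness despite this is to try every relabeling of the predicted clusters and keep the best agreement. The helper below is a sketch under that assumption; its name, best_label_accuracy, is hypothetical, and the brute-force search is only practical for small k such as the k=4 used here.

```python
from itertools import permutations


def best_label_accuracy(true_labels, predicted_labels):
    """Fraction of instances that agree under the best relabeling of the predictions."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than true clusters")
    best = 0.0
    # Try every way of mapping predicted cluster ids onto distinct true ids.
    for candidate in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, candidate))
        hits = sum(mapping[p] == t for p, t in zip(predicted_labels, true_labels))
        best = max(best, hits / len(true_labels))
    return best


print(best_label_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # 1.0
print(best_label_accuracy([0, 0, 1, 1], [0, 1, 1, 1]))              # 0.75
```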
Change your implementation to include an additional optional Boolean (True/False) argument,
balanced. The default value must be False. When balanced is set to True, the
implementation changes so that each of the k clusters is (roughly) equal with respect to the
number of instances per cluster; i.e., the implementation generates clusters of (roughly) the
same size. When balanced is set to False, the logic is the canonical K-Means, described in
the Background section.
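The assignment does not prescribe how balancing should be achieved. As one illustration only, the assignment step could be given a per-cluster capacity of ceil(n / k) and filled greedily, closest pairs first. The function name balanced_assign and the greedy strategy are assumptions of this sketch, not the required method; other approaches are equally valid.

```python
from math import ceil


def balanced_assign(X, centroids):
    """Greedy capacity-limited assignment: no cluster receives more than ceil(n / k) instances."""
    n, k = len(X), len(centroids)
    capacity = ceil(n / k)
    # Every (squared distance, instance index, cluster index) pair, closest first.
    pairs = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)), i, j)
        for i, x in enumerate(X)
        for j, c in enumerate(centroids)
    )
    labels = [None] * n
    sizes = [0] * k
    for _, i, j in pairs:
        # Take the pair only if the instance is still free and the cluster has room.
        if labels[i] is None and sizes[j] < capacity:
            labels[i] = j
            sizes[j] += 1
    return labels


X = [[0], [1], [2], [3], [4], [10]]
print(balanced_assign(X, [[1], [10]]))  # [0, 0, 0, 1, 1, 1]
```

In the example, plain nearest-centroid assignment would place five instances in the first cluster and one in the second; the capacity limit pushes the two instances farthest from the first centroid into the second cluster.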
Choose and run clustering algorithms
Execute one or more clustering algorithms (k-means, DBSCAN, Hierarchical, Spectral) against
the datasets below. Explain the following:
1. The reason why you chose the clustering algorithm(s)
2. Any pre-processing of the data or any hyperparameter settings
3. Output from the algorithm(s) — show what clusters were generated
4. The metrics you used to evaluate the output. What kind of performance did you get from
that algorithm? Is that what you expected?
Use the following datasets for this part:
● Chicago taxi data, an approximately week-long subset of the full dataset (which can be
found here). Use either the pickup or dropoff location coordinates.
● Finnish location data (taken from Mopsi data)
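For the metrics question above, neither dataset comes with ground-truth cluster labels, so an internal metric is one reasonable choice. A common one is the within-cluster sum of squared distances, the quantity K-Means itself tries to reduce (scikit-learn's KMeans reports it as inertia_). The helper below is a minimal sketch; the name within_cluster_ss is hypothetical, and it assumes labels holds an index into centroids for each instance.

```python
def within_cluster_ss(X, labels, centroids):
    """Sum of squared Euclidean distances from each instance to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        for x, label in zip(X, labels)
    )


X = [[0, 0], [2, 2], [0, 2], [2, 0], [10, 10], [8, 8], [10, 8], [8, 10]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(within_cluster_ss(X, labels, [[1, 1], [9, 9]]))  # 16
```

Lower values mean tighter clusters, but the value always shrinks as k grows, so it is most useful for comparing runs that use the same k.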
Please submit the Jupyter notebooks in which you performed your analyses (one notebook per
dataset).
1. Fit K-Means as implemented in the scikit-learn toolkit on the dataset used in Lab 02.
2. Using your own version of K-Means (Implement K-Means section), fit the same dataset
from the above step.
3. Determine what differences there are between the results (outputs) of the two
implementations. Explain any differences in results.
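When comparing the two implementations, note that the centroids may come back in a different order (scikit-learn's fitted KMeans exposes them as cluster_centers_, and its labels as labels_). One possible way to quantify the difference is to pair up the two centroid lists in the best possible order and report the largest remaining distance. The helper name centroid_gap is hypothetical, and the exhaustive search over orderings is a sketch that only suits small k.

```python
from itertools import permutations
from math import dist


def centroid_gap(centroids_a, centroids_b):
    """Smallest achievable worst-case distance between paired centroids."""
    if len(centroids_a) != len(centroids_b):
        raise ValueError("both implementations must return the same number of centroids")
    # Try every ordering of the second list and keep the one whose
    # largest pairwise distance is smallest.
    return min(
        max((dist(a, b) for a, b in zip(centroids_a, ordering)), default=0.0)
        for ordering in permutations(centroids_b)
    )


print(centroid_gap([[1, 1], [9, 9]], [[9, 9], [1, 1]]))     # 0.0
print(centroid_gap([[0, 0], [10, 0]], [[10, 0], [3, 4]]))   # 5.0
```

A gap near zero suggests the two implementations converged to the same solution; a large gap usually points to different starting centroids or a different stopping rule.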