Python代写 | CMPSC 448: Machine Learning and AI HW4

本次美国CS代写主要是使用Python机器学习完成分类、聚类的实现

Instruction

CMPSC 448: Machine Learning and AI

Homework 4 (Sunday April 11, 11:59 PM)

This HW includes both theory and implementation problems. Please note,

  • Your code must work with Python 3.5+ (you may install the Anaconda distribution of Python)
  • You need to submit a report in PDF including all written deliverable and plots via Gradescope, and all implementation codes via Canvas so one could regenerate your results.
  • For non-linear classifier problem, you need to submit a Jupyter notebook for each algorithm and all complementary python codes so one could reproduce your results. Submit your code in Problem4*.py and replace * witn the name of classifier.
  • For clustering problem, you are NOT allowed to use the implementation from scikit-learn or import it and should submit your own implementation. Submit your code in Problem5.py.

    The submission for this homework should be a single PDF file via Gradescope and a single ‘zip‘ file containing all of the relevant code, figures, and any text explaining your results via Canvas.

    Kernel Methods

    Problem 1 (10 points) In this problem you are asked to find the explicit mapping function Φ(·) for the Gaussian kernel k (x, x′) = exp ?− ∥x − x′∥2 /2γ2? by showing that it can be expressed as the inner product of an infinite-dimensional feature space. Assume x, x ∈ Rd and ∥x∥2 = ∥x′∥2 = 1.

    Problem 2 (20 points) In the lectures we used the KKT optimally conditions to drive the dual optimization problem for hard SVM. In this problem you are asked to do the same for soft SVM. Consider the constrained optimization problem for soft SVM:

1 2 ?n min ∥w∥ +C ξi

w,b,ξ1,ξ2,…,ξn 2 i=1

subjecttoyi?w⊤xi +b?≥1−ξi,i=1,2,…,n ξi ≥ 0,i = 1,2,…,n

a) Consider two sets of dual variables: α1, . . . , αn for first set of constraints, i.e., yi ?w⊤xi + b? ≥ 1 − ξi and another set of dual variables β1, . . . , βn for second set of constrains ξi ≥ 0 and write down the Lagrangian functions.

b) For the Lagrangian function in part (a), write down the KKT optimally conditions. c) Derive the dual problem.

1

Boosting

Problem 3 (15 points) Consider the AdaBoost algorithm we discussed in the class 1. Ad- aBoost is an example of ensemble classifiers where the weights in next round are decided based on the training error of the weak classifier learned on the current weighted training set. We wish to run the AdaBoost on the dataset provided in Table 1.

Instance Color Size Shape Edible? D1 Yellow Small Round Yes

D2

Yellow

Small

Round

No

D3

Green

Small

Irregular

Yes

D4

Green

Large

Irregular

No

D5

Yellow

Large

Round

Yes

D6

Yellow

Small

Round

Yes

D7

Yellow

Small

Round

Yes

D8

Yellow

Small

Round

Yes

D9

Green

Small

Round

No

D10

Yellow

Large

Round

No

D11

Yellow

Large

Round

Yes

D12

Yellow

Large

Round

No

D13

Yellow

Large

Round

No

D14

Yellow

Large

Round

No

D15

Yellow

Small

Irregular

Yes

D16

Yellow

Large

Irregular

Yes

Table 1: Mushroom data with 16 instances, three categorical features, and binary labels.

a) Assume we choose the following decision stump f1 (a shallow tree with a single decision node), as the first predictor (i.e., when training instances are weighted uniformly):

  if(Color is Yellow):
        predict  Eadible = Yes
  else:
        predict Eadible = No

What would be the weight of f1 in final ensemble classifier (i.e., α1 in f(x) = ?Ki=1 αifi(x))?

b) After computing f1, we proceed to next round of AdaBoost. We begin by recomputing data weights depending on the error of f1 and whether a point was (mis)classified by f1. What is the weight of each instance in second boosting iteration, i.e., after the points have been re-weighted? Please note that the weights across the training set are to be uniformly initialized.

c) In AdaBoost, would you stop the iteration if the error rate of the current weak classifier on the weighted training data is 0?

1Please read reading assignement from textbook and (optional) the Introduction chapter from Boosting book for detailed explanation https://bit.ly/2HvKObl

2

Experiment with non-linear classifiers
Problem 4 (40 points) For this problem, you will need to learn to use software libraries for

at least two of the following non-linear classifier types:
• Boosted Decision Trees (i.e., boosting with decision trees as weak learner)

• Random Forests

• Support Vector Machines with Gaussian Kernel

All of these are available in scikit-learn, although you may also use other external libraries (e.g., XGBoost 2 for boosted decision trees and LibSVM for SVMs). You are welcome to implement learning algorithms for these classifiers yourself, but this is neither required nor recommended.

Pick two different types of non-linear classifiers from above for classification of Adult dataset. You can the download the data from a9a in libSVM data repository. The a9a data set comes with two files: the training data file a9a with 32,561 samples each with 123 features, and a9a.t with 16,281 test samples. Note that a9a data is in LibSVM for- mat. In this format, each line takes the form <label> <feature-id>:<feature-value> <feature-id>:<feature-value> ….. This format is especially suitable for sparse datasets. Note that scikit-learn includes utility functions (e.g., load svmlight file in example code below) for loading datasets in the LibSVM format.

For each of learning algorithms, you will need to set various hyperparameters (e.g., the type of kernel and regularization parameter for SVM, tree method, max depth, number of weak classifiers, etc for XGBoost, number of estimators and min impurity decrease for Random Forests). Often there are defaults that make a good starting point, but you may need to adjust at least some of them to get good performance. Use hold-out validation or K-fold cross-validation to do this (scikit-learn has nice features to accomplish this, e.g., you may use train test split to split data into train and test data and sklearn.model selection for K-fold cross validation). Do not make any hyperparameter choices (or any other similar choices) based on the test set! You should only compute the test error rates after you have settled on hyperparameter settings and trained your two final classifiers.

What to submit (in PDF file and Jupyter/python codes):

  1. Names of the two classifiers you opt to learn and a brief description of each algorithm and how it works.
  2. Description of your training methodology, with enough details so that another ma- chine learning enthusiast can reproduce the your results. You need to submit all the codes (python and Jupyter notebooks) to reproduce your code. Please use pre- fix Problem4*.py where you need to replace ‘*‘ with the name of non-linear classifier for your coding files.
  3. The list of hyperparameters and breif description of each hyperparameter you tuned in training, their default values, and the final hyperparameter settings you use to get the best result.

2A simple blog post on how to use XGBoost please check this. 3

  1. Training error rates, hold-out or cross-validation error rates, and test error rates for your two final classifiers. You are also encouraged to report other settings you tried with the accuracy it achieved (please make a table with a column with each hyperparamter and accuracy of configuration of parameters.
  2. Please do your best to obtain the best achievable accuracy for each classifier on given dataset. Note: The amount of effort you put on tuning the parameters will be deter- mined based on the discrepancy between the accuracy you get and the best achievable accuracy on a9a data for each algorithm.

Parameters to be tuned for XGBoost: 1. n estimators
2. max depth
3. lambda

4. learning rate 5. missing
6. objective

Parameters to be tuned for SVM: 1. kernel type
2. gamma
3. C

Parameters to be tuned for Random Forests: 1. n estimators
2. bootstrap
3. max depth

4. min impurity decrease 5. min samples leaf

Example code to use XGBoost

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier

4

# load data in LibSVM sparse data format
X, y = load_svmlight_file("a9a")
# split data into train and test sets
seed = 6
test_size = 0.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
# fit model on training data
# for simplicity we fit based on default values of hyperparameters
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred] # evaluate predictions
accuracy = accuracy_score(y_test, predictions) print(“Accuracy: %.2f%%” % (accuracy * 100.0))

5

Clustering

Problem 5 (45 points) For this problem, you will implement the k-means++ algorithm in Python. You will then use it to cluster Iris dataset from the UCI Machine Learning Repository. The data is contained the iris.data file under ”Data Folder”, while the file iris.names contains a description of the data. The features x are given as the first four comma-separated values in each row in the data file. The labels y are the last entry in each row, but you do NOT need the class label as it is unknown for clustering.

1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)
5. class: {Iris Setosa, Iris Versicolour, Iris Virginica}

You need to,

  • Create a new data set with two features by computing the ratio of raw features (x =
    (x1, x2) where x1 = (sepal length/sepal width) and x2 = (petal length/petal width)) and plot the data to observe the clusters in data by yourself (use class label to color
    the data points for better illustration of clusters).
  • Implement the k-means++ algorithm. You are provided with the skeleton of the code with main functions to be implemented (Problem5.py file in assignment directory). Submit the source code (documented!) of your implementation.
  • Cluster the modified Iris dataset with with two features explained above. Run your algorithm 50 times over the data with different values of clusters k = 1,2,…,5 and plot the accuracies (x and y axes should be the number of clusters and the clustering objective, respectively).
  • Based on the above plot, decide the number of final clusters and justify your answer. For the chosen number of clusters,

    1. Create a plot showing how objective changes with number of iterations.
    2. Create a plot with the data colored by assignment, and the cluster centers.

6