Heap Allocate Assignment




This HW includes both theory and implementation problems. Please note,

  • Your code must work with Python 3.5+ (you may install the Anaconda distribution of Python)
  • You need to submit a report in PDF including all written deliverable, and all implementation codes so one could regenerate your results.
  • For programming part of PCA problem, you can import any module you would like from scikit-learn or other external libraries to minimize the amount of implementation required to get the coding problem done. Submit your code in a Jupyter notebook named PCA.ipynb.

The submission for this homework should be a single Jupyter notebook file containing all of the relevant code to reporduce your results via Canvas and a single PDF file containing the solutions and figures, and any text explaining your results via Gradescope (please include all the figures/results of Problem 2 in the PDF file for grading purposes).


Problem 1. (10 points) As we discussed in the lecture, the PCA algorithm is sensitive to scaling and pre-processing of data. In this problem, we explore few data pre-processing operations on data matrices.

  1. a) Explain what does each of the following data processing operation mean, why it is important, and how you apply to a data matrix X Rn×d, n samples each with d features 1 (if it helps, use mathematical equations to show the calculations).

(a) Centering or mean removal

(b) Scaling of a feature to a range [a, b]

(c) Standardization (feature level)

(d) Normalization (sample level)

  1. b) Apply the above operations separately to the following data matrix with n = 5 samples and d = 2 features and show the processed data matrix. For scaling pick a = 0 and b = 1.

Problem 2. (25 points) In this assignment, you will explore PCA as a technique for discerning whether low-dimensional structure exists in a set of data and for finding good representations of the data in that subspace. To this end, you will do PCA on Iris dataset which can be loaded in scikit-learn using following commands:

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data

y = iris.target

a) Carry out a principal component analysis on the entire raw dataset (4-dimensional instances) for k = 1, 2, 3, 4 components. How much of variance in data can be explained by the first principal component? How does the fraction of variation explained in data vary as k varies?

b) Apply the standardization operations from Problem 1 on raw features and repeat part

(a) on processed data. Explain any differences you observe compared to part (a) and justify.

c) Project the raw four dimensional data down to a two dimensional subspace generated by first two top principle components (PCs) from part (b) and produce a scatter plot of the data. Make sure you plot each of the three classes differently (using color or different markers). Can you see the three Iris flower clusters?

d) Either use your k-means++ implementation from previous homework or from scikit-learn to cluster data from part (c) into three clusters. Explain your observations.

Matrix Factorization

Problem 3. (15 points) Recall the following objective for extracting latent features from a partially observed rating matrix via matrix factorization (MF) for making recommendations,discussed in the class:


  • n: number of users
  • m: number of items
  • R Rn×m: input partially observed rating matrix
  • [n]×[m]: index of observed entries in rating matrix, where [n] denotes the sequence of numbers {1, 2, . . . , n}.
  • k: number of latent features
  • U Rn×k : the (unknown) matrix of latent feature vectors for n users (the ith row

ui Rk is the latent features for ith user)

  • V Rm×k : the (unknown) matrix of latent feature vectors for m items (the jth row

vj Rk is the latent features for jth movie)

Please do the followings:

  1. In solving Equation (1) with iterative Alternating Minimization algorithm (fixing V (t) and taking gradient step for U(t) and vice verse), discuss what happens if U (0) and V (0) are initialized to zero?
  1. Discuss why when there is no regularization in basic MF formulated in Equation (1),i.e., α = β = 0, each user must have rated at least k movies, and each movie must have been rated by at least k users. 2
  1. Computing the closed form solution in part (2) could be computational burden for large number of users or movies. A remedy for this would be using iterative optimization algorithms such as Stochastic Gradient Descent (SGD). Assume we run SGD for T iterations, where at each iteration t = 1, 2, . . . , T we sample a rating (it , jt) Ω uniformly at random and update the latent features for user it and movie jt . Derive the updating rules for u(t)it and v(t)jt at tth iteration of SGD. Show the detailed steps and write the pseudo code clearly.