# Machine Learning Assignment Writing | COMS 4732: Computer Vision 2

Forward Pass: Multi-layer Perceptron
Hidden Layer
u = xW + c
h = max(0, u)
Output Layer
z = hw + b
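
The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, assuming a batch of row-vector inputs; the layer sizes and random initial values are made up for the example and are not part of the notes.

```python
import numpy as np

def forward(x, W, c, w, b):
    """Forward pass of a one-hidden-layer MLP; returns (u, h, z)."""
    u = x @ W + c            # hidden pre-activation, shape (N, H)
    h = np.maximum(0, u)     # ReLU: element-wise max(0, u)
    z = h @ w + b            # output scores, shape (N, K)
    return u, h, z

# Hypothetical sizes: N=4 examples, D=3 features, H=5 hidden units, K=2 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W, c = rng.normal(size=(3, 5)), np.zeros(5)
w, b = rng.normal(size=(5, 2)), np.zeros(2)

u, h, z = forward(x, W, c, w, b)
print(u.shape, h.shape, z.shape)
```

The biases `c` and `b` are 1-D arrays, so NumPy broadcasting adds them to every row of the batch.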
Softmax Cross-Entropy Loss Layer

Note that $y_i$ is the true class label.

The total loss is the sum of two terms: the data loss and the regularization loss.
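
As a rough sketch of how the two loss terms might be computed, the snippet below assumes the data loss is the mean negative log-probability of the true class and that the regularization loss is an L2 penalty of the form `0.5 * lam * (sum(W^2) + sum(w^2))`. The exact penalty form and the name `lam` are assumptions, not taken from the notes.

```python
import numpy as np

def softmax_cross_entropy_loss(z, y, W, w, lam):
    """Return (data_loss, reg_loss) for scores z of shape (N, K) and integer labels y of shape (N,)."""
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = z - z.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Data loss: mean negative log-probability of the true class y_i.
    data_loss = -log_p[np.arange(len(y)), y].mean()
    # Regularization loss: assumed L2 penalty on both weight matrices.
    reg_loss = 0.5 * lam * (np.sum(W * W) + np.sum(w * w))
    return data_loss, reg_loss

# With all-zero scores every class is equally likely, so the data loss is log(K).
z = np.zeros((4, 3))
y = np.array([0, 1, 2, 1])
W, w = np.ones((2, 5)), np.ones((5, 3))
data_loss, reg_loss = softmax_cross_entropy_loss(z, y, W, w, lam=0.1)
print(data_loss, reg_loss)
```

Checking the loss against the uniform-scores case (expected value `log(K)`) is a common sanity test before training.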
Backpropagation: Multi-layer Perceptron
Loss Layer
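Denote the softmax probability for element $z_k$ as $p_k$:

$$p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$$

Recall the Softmax Cross-Entropy Loss function for example $i$:

$$L_i = -\log p_{y_i}$$

Differentiating with respect to $z_k$ and simplifying, we get:

$$\frac{\partial L_i}{\partial z_k} = p_k - \mathbb{1}(y_i = k)$$

where $\mathbb{1}$ is the indicator function.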
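
The loss-layer gradient can be checked numerically. The sketch below assumes the per-example loss is the negative log of the softmax probability of the true class, and compares the analytic gradient (softmax probabilities minus a one-hot indicator) against central finite differences. The score vector and label are made-up values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

def loss(z, y):
    return -np.log(softmax(z)[y])

z = np.array([1.0, -0.5, 2.0])   # scores for one example (made-up values)
y = 2                            # true class label

# Analytic gradient: p_k minus 1 for the true class, p_k otherwise.
grad = softmax(z)
grad[y] -= 1.0

# Central finite-difference approximation of dL/dz_k for each k.
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad, numeric, atol=1e-6))
```

Because the softmax probabilities sum to 1, the entries of this gradient always sum to 0.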
Output Layer
First, note that:

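Hence, we can perform gradient descent using the following gradients:

$$\frac{\partial L}{\partial w} = h^\top \frac{\partial L}{\partial z} + \lambda w$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$$

(Note: the $\lambda w$ term is obtained by taking the gradient of the regularization term within our loss function.)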
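
A possible NumPy version of the output-layer gradients is sketched below. It assumes the data loss is averaged over a batch of `N` examples (so the bias gradient sums over rows) and an L2 penalty with strength `lam`; the sizes, labels, and `lam` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, K = 4, 5, 3                              # hypothetical sizes
h = np.maximum(0, rng.normal(size=(N, H)))     # hidden activations
w = rng.normal(size=(H, K))
b = np.zeros(K)
y = np.array([0, 2, 1, 2])                     # true class labels
lam = 1e-3                                     # assumed regularization strength

# Forward: scores and softmax probabilities.
z = h @ w + b
p = np.exp(z - z.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# Loss-layer gradient: softmax probabilities minus the one-hot labels.
dz = p.copy()
dz[np.arange(N), y] -= 1.0
dz /= N                        # the data loss is averaged over the batch

dw = h.T @ dz + lam * w        # weight gradient, including the regularization term
db = dz.sum(axis=0)            # the bias is shared by all examples, so sum over rows
dh = dz @ w.T                  # gradient passed back to the hidden layer
```

`dh` is what the hidden layer receives as its upstream gradient.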
Hidden Layer
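Weight updates:

$$\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial u} + \lambda W$$

$$\frac{\partial L}{\partial c} = \frac{\partial L}{\partial u}$$

But how do we obtain $\frac{\partial h}{\partial u}$ and, by extension, $\frac{\partial L}{\partial u}$?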
Derivative of a vector with respect to another vector: Using the Jacobian Matrix to compute $\frac{\partial h}{\partial u}$
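But what is a Jacobian?
Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a function that takes $x \in \mathbb{R}^n$ as input and produces the vector $f(x) \in \mathbb{R}^m$ as output. The Jacobian matrix of $f$ is then defined to be an $m \times n$ matrix, denoted by $J$, whose $(i, j)$-th entry is $J_{ij} = \frac{\partial f_i}{\partial x_j}$, or explicitly:

$$J = \begin{bmatrix} \nabla^\top f_1 \\ \vdots \\ \nabla^\top f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

where $\nabla^\top f_i$ (now a row vector) is the transpose of the gradient of the $i$-th component.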
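
To make the definition concrete, here is a small sketch that approximates a Jacobian column by column with central differences and compares it with a hand-derived one. The function `f` and the helper `numerical_jacobian` are hypothetical examples chosen for illustration.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = f(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # Column j holds the partial derivatives of every f_i with respect to x_j.
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

# Hypothetical example function from R^3 to R^2.
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

x0 = np.array([1.0, 2.0, 3.0])
J = numerical_jacobian(f, x0)

# Hand-derived Jacobian: row i is the gradient of component f_i at x0.
J_exact = np.array([[2.0, 1.0, 0.0],
                    [0.0, 1.0, 6.0]])
print(np.allclose(J, J_exact, atol=1e-5))
```

Note the shape: two outputs and three inputs give a 2 x 3 matrix, matching the $m \times n$ convention in the definition.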
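In our case, the ReLU activation function serves as the function $f$, so the chain rule gives:

$$\frac{\partial L}{\partial u} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial u}$$

where $\frac{\partial h}{\partial u}$ is the Jacobian of $h$ with respect to $u$.

However, since our activation function acts on each element individually, the partials with respect to the other dimensions are 0. Thus, the Jacobian is a diagonal matrix, and we can simplify this expression (making it easier to implement in our code) into an element-wise product as follows:

$$\frac{\partial L}{\partial u} = \frac{\partial L}{\partial h} \odot \mathbb{1}(u > 0)$$

where $\odot$ is the element-wise multiplication operator, also known as the Hadamard operator.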
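
The equivalence between the diagonal Jacobian and the element-wise product can be checked directly. In this sketch the values of `u` and the upstream gradient `dh` are made up, and the ReLU derivative at exactly 0 is taken to be 0, which is a common convention rather than something stated in the notes.

```python
import numpy as np

u = np.array([1.5, -0.3, 0.0, 2.0])      # hidden pre-activations for one example
dh = np.array([0.2, -1.0, 0.7, 0.5])     # upstream gradient dL/dh (made-up values)

mask = (u > 0).astype(float)             # ReLU derivative per element (0 at u = 0 by convention)

# Full Jacobian route: dh/du is diagonal, with the mask on its diagonal.
J = np.diag(mask)
du_jacobian = dh @ J

# Element-wise (Hadamard) route: same result without building the matrix.
du_hadamard = dh * mask

print(np.allclose(du_jacobian, du_hadamard))
```

For a hidden layer with `H` units the Jacobian has `H * H` entries, while the mask has only `H`, which is why the element-wise form is preferred in code.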
Note: We also derive the bias gradient by leveraging the Jacobian.
Similar to above, for the bias vector $c$, we need the Jacobian matrix $\nabla_c L$. However, this is also a diagonal matrix, and we can use the same "rewriting as an element-wise product" trick.
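
Putting the pieces together, the hidden-layer gradients can be sketched and checked numerically as below. The function names `loss_fn` and `hidden_grads`, the batch-averaged data loss, the L2 penalty with strength `lam`, and all sizes and values are assumptions made for this illustration.

```python
import numpy as np

def loss_fn(x, y, W, c, w, b, lam):
    """Total loss: mean softmax cross-entropy plus an assumed L2 penalty."""
    h = np.maximum(0, x @ W + c)
    z = h @ w + b
    z = z - z.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    data = -log_p[np.arange(len(y)), y].mean()
    return data + 0.5 * lam * (np.sum(W * W) + np.sum(w * w))

def hidden_grads(x, y, W, c, w, b, lam):
    """Backpropagate to the hidden layer; returns (dW, dc)."""
    N = len(y)
    u = x @ W + c
    h = np.maximum(0, u)
    z = h @ w + b
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    dz = p.copy()
    dz[np.arange(N), y] -= 1.0       # softmax probabilities minus one-hot labels
    dz /= N                          # batch average
    dh = dz @ w.T                    # back through the output layer
    du = dh * (u > 0)                # ReLU: element-wise product with the mask
    dW = x.T @ du + lam * W          # weight gradient with regularization term
    dc = du.sum(axis=0)              # bias gradient, summed over the batch
    return dW, dc

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6)
W, c = rng.normal(size=(3, 4)), rng.normal(size=4)
w, b = rng.normal(size=(4, 2)), np.zeros(2)
lam = 0.01

dW, dc = hidden_grads(x, y, W, c, w, b, lam)

# Central finite-difference check of the bias gradient.
eps = 1e-6
dc_num = np.zeros_like(c)
for j in range(c.size):
    step = np.zeros_like(c)
    step[j] = eps
    dc_num[j] = (loss_fn(x, y, W, c + step, w, b, lam)
                 - loss_fn(x, y, W, c - step, w, b, lam)) / (2 * eps)
print(np.allclose(dc, dc_num, atol=1e-5))
```

The finite-difference check can disagree if some pre-activation sits almost exactly at 0, where ReLU is not differentiable; with random inputs like these that is very unlikely.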