Forward Pass: Multi-layer Perceptron
u = xW + c
h = max(0, u)
z = hw + b
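The three equations above can be sketched directly in NumPy. This is a minimal illustration; the shapes (N, D, H, K) and the random initialization are assumptions for the example, not part of the notes.

```python
import numpy as np

# Illustrative shapes: batch size N, input dim D, hidden dim H, K classes
rng = np.random.default_rng(0)
N, D, H, K = 4, 3, 5, 2
x = rng.standard_normal((N, D))
W = rng.standard_normal((D, H))  # first-layer weights
c = np.zeros(H)                  # first-layer bias
w = rng.standard_normal((H, K))  # second-layer weights
b = np.zeros(K)                  # second-layer bias

u = x @ W + c          # pre-activation: u = xW + c
h = np.maximum(0, u)   # ReLU: h = max(0, u)
z = h @ w + b          # class scores (logits): z = hw + b
```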
Softmax Cross-Entropy Loss Layer
For an example with score vector z, the loss is

L = -log( e^{z_{y_i}} / Σ_k e^{z_k} )

Note that y_i is the true class label.
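A short sketch of this loss, written in a numerically stable form (subtracting the max score before exponentiating). The helper name `softmax_ce` is an illustrative choice, not from the notes.

```python
import numpy as np

def softmax_ce(z, y):
    """Softmax cross-entropy loss L = -log(e^{z_y} / sum_k e^{z_k})."""
    z = z - z.max()                       # stability: shift scores by the max
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[y]                  # -log p_y

z = np.array([2.0, 1.0, -1.0])
loss = softmax_ce(z, y=0)
```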
Backpropagation: Multi-layer Perceptron
Denote the softmax probability for element z_k as p_k, i.e. p_k = e^{z_k} / Σ_j e^{z_j}.
Recall the Softmax Cross-Entropy Loss function:

L = -log p_{y_i} = -z_{y_i} + log Σ_k e^{z_k}

Differentiating with respect to z_k and simplifying, we get:

∂L/∂z_k = p_k - 1(k = y_i)

where 1 is the indicator function.
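The gradient p_k - 1(k = y_i) can be verified numerically with a central finite-difference check, a standard sanity test when implementing backprop. The setup below is illustrative.

```python
import numpy as np

def softmax_ce(z, y):
    z = z - z.max()
    return -(z[y] - np.log(np.exp(z).sum()))

z = np.array([2.0, 1.0, -1.0])
y = 0

# Analytic gradient: dL/dz_k = p_k - 1(k == y)
p = np.exp(z - z.max())
p /= p.sum()
grad = p.copy()
grad[y] -= 1.0

# Central finite differences, one component at a time
eps = 1e-6
num = np.array([
    (softmax_ce(z + eps * np.eye(3)[k], y)
     - softmax_ce(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
```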
First, note that u = XW + c, so by the chain rule:

∂L/∂W = X^T (∂L/∂u)

Hence, we can perform gradient descent using the following gradient (Note: the λW term is obtained by taking the gradient of the regularization term within our Loss function):

∂L/∂W = X^T (∂L/∂u) + λW
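The weight gradient and a single gradient-descent step can be sketched as follows. The upstream gradient values, the L2 strength `lam`, and the learning rate `lr` are placeholder assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3))       # batch of inputs X
W = rng.standard_normal((3, 5))       # first-layer weights
dL_du = rng.standard_normal((4, 5))   # upstream gradient dL/du (placeholder)

lam, lr = 1e-3, 1e-2                  # illustrative L2 strength and step size
dL_dW = x.T @ dL_du + lam * W         # data term + regularization term
W -= lr * dL_dW                       # gradient-descent update
```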
But how do we obtain ∂h/∂u, and by extension ∂L/∂u?
Derivative of a vector with respect to another vector: Using the Jacobian Matrix to compute ∂h/∂u
But, what is a Jacobian?
Let f : R^n → R^m be a function that takes x ∈ R^n as input and produces the vector f(x) ∈ R^m as output. The Jacobian matrix of f is then defined to be an m × n matrix, denoted by J, whose (i, j)-th entry is

J_ij = ∂f_i/∂x_j

so the i-th row of J is ∇^T f_i, where ∇^T f_i (now a row vector) is the transpose of the gradient of the i-th component f_i.
In our case, the ReLU activation function serves as the function f.
However, since our activation function acts on each element individually, the partial derivatives with respect to the other dimensions are 0. Thus, the Jacobian is a diagonal matrix, and hence we can simplify this expression (making it easier to implement in our code) into an element-wise product as follows:

∂L/∂u = (∂L/∂h) ⊙ 1(u > 0)

where ⊙ is the element-wise multiplication operator, also known as the Hadamard product.
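The equivalence between the full diagonal-Jacobian product and the Hadamard form can be checked in a few lines. The vectors below are placeholder values for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(5)           # pre-activations
dL_dh = rng.standard_normal(5)       # upstream gradient dL/dh (placeholder)

# Full Jacobian of ReLU: diagonal, with 1 wherever u > 0
J = np.diag((u > 0).astype(float))
via_jacobian = J @ dL_dh

# Equivalent element-wise (Hadamard) form: no matrix ever materialized
via_hadamard = dL_dh * (u > 0)
```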
Note: We derive the bias gradient by also leveraging the Jacobian. Similar to the above, for the bias vector c we need ∇_c L, which involves the Jacobian ∂u/∂c. However, this is also a diagonal matrix (in fact the identity, since u = xW + c), so we can use the same ‘rewriting as element-wise product’ trick.
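Because ∂u/∂c is the identity, the bias gradient reduces to the upstream gradient itself, summed over the batch. A minimal sketch with placeholder values:

```python
import numpy as np

rng = np.random.default_rng(3)
dL_du = rng.standard_normal((4, 5))   # per-example gradient dL/du (placeholder)

# du_j/dc_j = 1 and all cross-partials are 0, so the Jacobian du/dc is
# the identity; dL/dc is therefore just dL/du accumulated over the batch.
dL_dc = dL_du.sum(axis=0)
```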