- Warm-up (25 points)
Answer the following in your own words (read and understand the concepts, do not copy answers from the internet):
(a.) Why do deep neural networks typically outperform shallow networks?
(b.) What are the issues involved in training very deep neural networks from an optimization perspective?
(c.) What are the issues involved in training very deep neural networks from a computation and memory perspective?
(d.) What is a validation dataset used for?
(e.) The following learning rate plots show the loss of an algorithm versus time. Label each plot with one of these labels: (i) Low learning rate (ii) Optimal learning rate (iii) High learning rate (iv) Very high learning rate
(f.) What is the following activation function, and why is it used? How is it different from ReLU?
(g.) For a minibatch with mean M and variance V, what does batch normalization do?
(h.) In one or more sentences, and using sketches as appropriate, contrast: AlexNet, VGG-Net, GoogLeNet and ResNet. What was one defining characteristic of each?
(i.) Define the following terms: (i) Hard attention (ii) Soft attention. Which attention models can be trained with backpropagation only? What other training method is required? Briefly explain why.
(j.) List 4 transformations that can be applied to images in a dataset to augment the dataset for CNN training.
- Cross correlation (15 points)
Implement the 2D cross-correlation process in the cross_corr_2d function, which accepts an input tensor X and a kernel tensor K and returns an output tensor Y. Refer to lecture slides for the details.
def cross_corr_2d(X, K):
    """
    Computes the 2D cross-correlation operation.

    - X: Input data
    - K: The kernel tensor
    - Y: The output tensor after performing cross-correlation
    """
    Y = None
    # TODO: Implement the 2D cross-correlation here #
    # END OF YOUR CODE #
    return Y
X = torch.tensor([[0.0, 5.0, 8.0], [2.0, -1.0, 4.0], [4.0, 3.0, 7.0]])
K = torch.tensor([[5.0, -1.0], [3.0, 1.0]])
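The sliding-window computation can be sketched on plain nested lists. This is only an illustration under stated assumptions, not the lecture's reference solution: the helper name cross_corr_2d_lists is hypothetical, the inputs are Python lists rather than torch tensors so the sketch has no dependencies, and it assumes stride 1 with no padding.

```python
def cross_corr_2d_lists(X, K):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists.

    Hypothetical helper for illustration; the assignment itself expects
    torch tensors.
    """
    h, w = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    Y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel elementwise with the window whose
            # top-left corner is at (i, j), then sum the products.
            Y[i][j] = sum(
                X[i + a][j + b] * K[a][b]
                for a in range(kh)
                for b in range(kw)
            )
    return Y


X = [[0.0, 5.0, 8.0], [2.0, -1.0, 4.0], [4.0, 3.0, 7.0]]
K = [[5.0, -1.0], [3.0, 1.0]]
print(cross_corr_2d_lists(X, K))  # [[0.0, 18.0], [26.0, 7.0]]
```

Note that, unlike a true convolution, the kernel is not flipped before it is applied. A 3x3 input with a 2x2 kernel gives a 2x2 output.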
- Spatial batch-norm (25 points)
One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization, which was proposed by Ioffe and Szegedy [1].
The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this ensures that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance, since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.
The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.
It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
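The train-time and test-time behavior described above can be sketched for the fully-connected case. This is a sketch under assumptions rather than the reference implementation from [1]: the function name batchnorm_forward_sketch is hypothetical, NumPy arrays of shape (N, D) are assumed, and a running variance is tracked (as in the bn_param description used later in this assignment) instead of a running standard deviation.

```python
import numpy as np


def batchnorm_forward_sketch(x, gamma, beta, bn_param):
    """Vanilla batch normalization forward pass for x of shape (N, D).

    Hypothetical helper for illustration; updates bn_param in place.
    """
    mode = bn_param["mode"]
    eps = bn_param.get("eps", 1e-5)
    momentum = bn_param.get("momentum", 0.9)
    D = x.shape[1]
    running_mean = bn_param.get("running_mean", np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get("running_var", np.zeros(D, dtype=x.dtype))

    if mode == "train":
        # Per-feature statistics estimated from the current minibatch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        # Exponential running averages, used later at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    elif mode == "test":
        # No minibatch statistics at test time: reuse the running averages.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    else:
        raise ValueError(f"Invalid batchnorm mode: {mode!r}")

    bn_param["running_mean"] = running_mean
    bn_param["running_var"] = running_var
    # Learnable scale and shift restore representational power.
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out = batchnorm_forward_sketch(x, np.ones(4), np.zeros(4), {"mode": "train"})
print(out.mean(axis=0))  # each entry close to 0
print(out.std(axis=0))   # each entry close to 1
```

With gamma set to the per-feature standard deviation and beta set to the per-feature mean, the layer can undo the normalization, which is why the learnable parameters preserve the network's representational power.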
We have seen that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called "spatial batch normalization."
Normally batch normalization accepts inputs of shape (N, D) and produces outputs of shape (N, D), where we normalize across the minibatch dimension N. For data coming from convolutional layers, batch normalization needs to accept inputs of shape (N, C, H, W) and produce outputs of shape (N, C, H, W), where the N dimension gives the minibatch size and the (H, W) dimensions give the spatial size of the feature map.
If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and between different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the C feature channels by computing statistics over both the minibatch dimension N and the spatial dimensions H and W.
[1] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML 2015.
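The per-channel statistics described above can be illustrated with a short NumPy sketch. The array sizes and variable names here are assumptions chosen for the example. It shows that reducing over the axes (N, H, W) directly gives the same result as moving the channel axis last and flattening to shape (N*H*W, C), which is the layout a vanilla batch normalization routine expects.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 2, 3, 4, 5  # arbitrary example sizes
x = rng.normal(size=(N, C, H, W))

# Option 1: reduce directly over minibatch and spatial axes.
mean_direct = x.mean(axis=(0, 2, 3))  # shape (C,)
var_direct = x.var(axis=(0, 2, 3))    # shape (C,)

# Option 2: move channels last, then flatten to (N*H*W, C) so that
# every spatial location of every image counts as one "sample".
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
mean_flat = x_flat.mean(axis=0)
var_flat = x_flat.var(axis=0)

# Normalize in the flat layout, then undo the reshape and transpose.
x_hat_flat = (x_flat - mean_flat) / np.sqrt(var_flat + 1e-5)
x_hat = x_hat_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)

print(np.allclose(mean_direct, mean_flat), np.allclose(var_direct, var_flat))  # True True
print(x_hat.shape)  # (2, 3, 4, 5)
```

The transpose before the reshape matters: reshaping an (N, C, H, W) array straight to (-1, C) would mix values from different channels into the same column.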
def spatial_batchnorm_forward(x, gamma, beta, bn_param):
    """
    Computes the forward pass for spatial batch normalization.

    Inputs:
    - x: Input data of shape (N, C, H, W)
    - gamma: Scale parameter, of shape (C,)
    - beta: Shift parameter, of shape (C,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance. momentum=0 means that
        old information is discarded completely at every time step, while
        momentum=1 means that new information is never incorporated. The
        default of momentum=0.9 should work well in most situations.
      - running_mean: Array of shape (C,) giving running mean of features
      - running_var: Array of shape (C,) giving running variance of features

    Returns a tuple of:
    - out: Output data, of shape (N, C, H, W)
    - cache: Values needed for the backward pass
    """
    out, cache = None, None
    # TODO: Implement the forward pass for spatial batch normalization. #
    # HINT: You can implement spatial batch normalization using the vanilla #
    # version of batch normalization. #
    return out, cache
def spatial_batchnorm_backward(dout, cache):
    """
    Computes the backward pass for spatial batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, C, H, W)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient with respect to inputs, of shape (N, C, H, W)
    - dgamma: Gradient with respect to scale parameter, of shape (C,)
    - dbeta: Gradient with respect to shift parameter, of shape (C,)
    """
    dx, dgamma, dbeta = None, None, None
    # TODO: Implement the backward pass for spatial batch normalization. #
    return dx, dgamma, dbeta
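One possible shape for the backward computation is sketched below, under assumptions: the function name spatial_bn_backward_sketch and its explicit arguments (x_hat, gamma, std) are hypothetical stand-ins for whatever the forward pass stores in cache, training-mode statistics with a biased variance estimate are assumed, and NumPy arrays are used. It flattens dout to shape (N*H*W, C), applies the standard batch normalization gradient, and restores the original layout.

```python
import numpy as np


def spatial_bn_backward_sketch(dout, x_hat, gamma, std):
    """Backward pass sketch for training-mode spatial batch normalization.

    Hypothetical helper for illustration.
    - dout, x_hat: arrays of shape (N, C, H, W); x_hat is the normalized input
    - gamma: scale parameter, shape (C,)
    - std: sqrt(var + eps) per channel, shape (C,)
    """
    N, C, H, W = dout.shape
    # Channels last, then flatten so each row is one (image, location) sample.
    dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
    x_hat_flat = x_hat.transpose(0, 2, 3, 1).reshape(-1, C)

    # Parameter gradients: one value per channel.
    dbeta = dout_flat.sum(axis=0)
    dgamma = (dout_flat * x_hat_flat).sum(axis=0)

    # Input gradient: accounts for the dependence of the batch mean and
    # variance on every sample in the channel.
    dx_hat = dout_flat * gamma
    dx_flat = (
        dx_hat
        - dx_hat.mean(axis=0)
        - x_hat_flat * (dx_hat * x_hat_flat).mean(axis=0)
    ) / std

    dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return dx, dgamma, dbeta


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
std = np.sqrt(var + 1e-5)
dx, dgamma, dbeta = spatial_bn_backward_sketch(
    rng.normal(size=x.shape), (x - mean) / std, np.ones(3), std.reshape(3)
)
print(dx.shape, dgamma.shape, dbeta.shape)  # (2, 3, 4, 4) (3,) (3,)
```

A numerical gradient check against the forward pass is a reasonable way to validate an implementation like this before relying on it.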