# Robotics Coursework Writing | 6CCE3ROB/7CCEMROS COURSE WORK 1


**Warm-up (25 points)**

**Answer the following in your own words (read and understand the concepts, do not copy answers from the internet):**

**(a.) Why do deep neural networks typically outperform shallow networks?**

**(b.) What are the issues involved in training very deep neural networks from an optimization perspective?**

**(c.) What are the issues involved in training very deep neural networks from a computation and memory perspective?**

**(d.) What is a validation dataset used for?**

**(e.) The following learning rate plots show the loss of an algorithm versus time. Label each plot with one of these labels: (i) Low learning rate (ii) Optimal learning rate (iii) High learning rate (iv) Very high learning rate**

[Figure: model_training2 (loss-versus-time plots for question (e); image not included)]

**(f.) What is the following activation function, and why is it used? How is it different from ReLU?**

[Figure: activation (plot of the activation function for question (f); image not included)]

**(g.) For a minibatch with mean M and variance V, what does batch normalization do?**

**(h.) In one or more sentences, and using sketches as appropriate, contrast: AlexNet, VGG-Net, GoogleNet and ResNet. What was one defining characteristic of each?**

**(i.) Define the following terms: (i) Hard attention (ii) Soft attention. Which attention models can be trained with backpropagation only? What other training method is required? Briefly explain why.**

**(j.) List 4 transformations that can be applied to images in a dataset to augment the dataset for CNN training.**

**Cross correlation (15 points)**

Implement 2D cross-correlation in the cross_corr_2d function, which accepts an input tensor X and a kernel tensor K and returns an output tensor Y. Refer to the lecture slides for details.

In [2]:

    def cross_corr_2d(X, K):
        """
        Computes the 2D cross-correlation operation.

        Inputs:
        - X: Input data
        - K: The kernel tensor

        Returns:
        - Y: The output tensor after performing cross-correlation
        """
        # Initialise the result before the student code so it is not overwritten.
        Y = None
        # TODO: Implement the 2D cross-correlation here #
        pass
        # END OF YOUR CODE #
        return Y

In [3]:

    # Example
    import torch  # needed here unless an earlier cell already imports it

    X = torch.tensor([[0.0, 5.0, 8.0], [2.0, -1.0, 4.0], [4.0, 3.0, 7.0]])
    K = torch.tensor([[5.0, -1.0], [3.0, 1.0]])
    cross_corr_2d(X, K)
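As a point of comparison for the example above, here is a minimal sketch of naive 2D cross-correlation. It assumes no padding and a stride of 1, which the text does not state, and it uses NumPy arrays rather than torch tensors so that it runs without PyTorch. The name `corr2d_reference` is hypothetical and not part of the assignment.

```python
import numpy as np


def corr2d_reference(X, K):
    """Naive 2D cross-correlation of X with K (assumes no padding, stride 1)."""
    k_h, k_w = K.shape
    out_h = X.shape[0] - k_h + 1
    out_w = X.shape[1] - k_w + 1
    Y = np.zeros((out_h, out_w), dtype=X.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel elementwise with the window whose top-left
            # corner is (i, j), then sum the products.
            Y[i, j] = np.sum(X[i:i + k_h, j:j + k_w] * K)
    return Y


X = np.array([[0.0, 5.0, 8.0], [2.0, -1.0, 4.0], [4.0, 3.0, 7.0]])
K = np.array([[5.0, -1.0], [3.0, 1.0]])
Y = corr2d_reference(X, K)
print(Y)  # [[ 0. 18.] [26.  7.]]
```

For instance, the top-left output is 0·5 + 5·(−1) + 2·3 + (−1)·1 = 0. Note that the kernel is not flipped, which is what distinguishes cross-correlation from convolution in the strict mathematical sense.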

**Spatial batch-norm (25 points)**

One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization, which was proposed in [1].

The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this ensures that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance, since they are output from earlier layers in the network. Even worse, during training the distribution of features at each layer of the network shifts as the weights of each layer are updated.

The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.

It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.
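The train-time and test-time behaviour described above can be sketched for inputs of shape (N, D) as follows. This is an illustrative sketch, not the assignment's reference solution: the function name and signature are hypothetical, NumPy is used in place of torch, and the running statistics track the variance (as in the bn_param description later in this document) using the convention that `momentum` weights the old value.

```python
import numpy as np


def batchnorm_forward_sketch(x, gamma, beta, running_mean, running_var,
                             mode="train", eps=1e-5, momentum=0.9):
    """Batch normalization for x of shape (N, D).

    Returns the output and the (possibly updated) running statistics.
    """
    if mode == "train":
        mu = x.mean(axis=0)   # per-feature minibatch mean
        var = x.var(axis=0)   # per-feature (biased) minibatch variance
        x_hat = (x - mu) / np.sqrt(var + eps)
        # Exponential running averages, used later at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test time: normalize with the stored running statistics.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    # Learnable scale and shift restore representational power.
    out = gamma * x_hat + beta
    return out, running_mean, running_var


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
out, run_mean, run_var = batchnorm_forward_sketch(
    x, gamma, beta, running_mean=np.zeros(4), running_var=np.ones(4))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

With `gamma = 1` and `beta = 0` the output is simply the normalized input; other values let the layer recover any per-feature mean and scale, including the identity mapping.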

Batch normalization is a very useful technique for training deep fully-connected networks. It can also be used for convolutional networks, but it needs a small modification; the modified version is called "spatial batch normalization."

**Normally batch normalization accepts inputs of shape (N, D) and produces outputs of shape (N, D), where we normalize across the minibatch dimension N. For data coming from convolutional layers, batch normalization needs to accept inputs of shape (N, C, H, W) and produce outputs of shape (N, C, H, W), where the N dimension gives the minibatch size and the (H, W) dimensions give the spatial size of the feature map.**

**If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and between different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the C feature channels by computing statistics over both the minibatch dimension N and the spatial dimensions H and W.**
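To make the per-channel statistics concrete, the sketch below computes them in two ways: directly over the N, H and W axes, and by moving the channel axis last and flattening to shape (N·H·W, C) so that ordinary (N, D) batch normalization applies. It is an illustration under assumptions: NumPy replaces torch, the tensor sizes and `eps` value are arbitrary, and the learnable scale and shift are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 2, 3, 4, 5
x = rng.normal(size=(N, C, H, W))
eps = 1e-5

# Direct per-channel statistics over minibatch and spatial dimensions.
mean_c = x.mean(axis=(0, 2, 3))  # shape (C,)
var_c = x.var(axis=(0, 2, 3))    # shape (C,)

# Equivalent view: (N, C, H, W) -> (N, H, W, C) -> (N*H*W, C).
# Every row is now one "sample" with C features.
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
print(np.allclose(x_flat.mean(axis=0), mean_c))  # True
print(np.allclose(x_flat.var(axis=0), var_c))    # True

# Normalize in the flat view, then undo the reshape and the transpose.
x_hat_flat = (x_flat - x_flat.mean(axis=0)) / np.sqrt(x_flat.var(axis=0) + eps)
x_hat = x_hat_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
print(x_hat.shape)  # (2, 3, 4, 5)
```

The order of the transpose matters: reshaping (N, C, H, W) straight to (N·H·W, C) without first moving the channel axis would mix values from different channels into the same column.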

[1] Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015

In [4]:

    def spatial_batchnorm_forward(x, gamma, beta, bn_param):
        """
        Computes the forward pass for spatial batch normalization.

        Inputs:
        - x: Input data of shape (N, C, H, W)
        - gamma: Scale parameter, of shape (C,)
        - beta: Shift parameter, of shape (C,)
        - bn_param: Dictionary with the following keys:
          - mode: 'train' or 'test'; required
          - eps: Constant for numeric stability
          - momentum: Constant for running mean / variance. momentum=0 means
            that old information is discarded completely at every time step,
            while momentum=1 means that new information is never incorporated.
            The default of momentum=0.9 should work well in most situations.
          - running_mean: Array of shape (C,) giving running mean of features
          - running_var: Array of shape (C,) giving running variance of features

        Returns a tuple of:
        - out: Output data, of shape (N, C, H, W)
        - cache: Values needed for the backward pass
        """
        out, cache = None, None
        # TODO: Implement the forward pass for spatial batch normalization. #
        # HINT: You can implement spatial batch normalization using the vanilla #
        # version of batch normalization. #
        pass
        return out, cache

    def spatial_batchnorm_backward(dout, cache):
        """
        Computes the backward pass for spatial batch normalization.

        Inputs:
        - dout: Upstream derivatives, of shape (N, C, H, W)
        - cache: Values from the forward pass

        Returns a tuple of:
        - dx: Gradient with respect to inputs, of shape (N, C, H, W)
        - dgamma: Gradient with respect to scale parameter, of shape (C,)
        - dbeta: Gradient with respect to shift parameter, of shape (C,)
        """
        dx, dgamma, dbeta = None, None, None
        # TODO: Implement the backward pass for spatial batch normalization. #
        pass
        return dx, dgamma, dbeta
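One way the backward pass could be organised is sketched below. It is not the assignment's reference solution. It assumes training mode, a cache layout chosen purely for illustration, and NumPy in place of torch; the helper names `bn_forward`, `bn_backward` and `spatial_bn_backward_sketch` are hypothetical. The idea mirrors the forward hint: flatten `dout` to (N·H·W, C), apply the ordinary batch normalization gradient, and reshape back.

```python
import numpy as np


def bn_forward(x, gamma, beta, eps=1e-5):
    """Train-mode batch normalization for x of shape (M, C)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    cache = (x_hat, gamma, var, eps)  # illustrative cache layout
    return gamma * x_hat + beta, cache


def bn_backward(dout, cache):
    """Gradients of train-mode batch normalization for dout of shape (M, C)."""
    x_hat, gamma, var, eps = cache
    m = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    # Compact form that accounts for the dependence of the batch mean and
    # variance on every input in the minibatch.
    dx = gamma / (m * np.sqrt(var + eps)) * (m * dout - dbeta - x_hat * dgamma)
    return dx, dgamma, dbeta


def spatial_bn_backward_sketch(dout, cache):
    """Reshape (N, C, H, W) gradients to (N*H*W, C), reuse bn_backward."""
    N, C, H, W = dout.shape
    dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
    dx_flat, dgamma, dbeta = bn_backward(dout_flat, cache)
    dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return dx, dgamma, dbeta


rng = np.random.default_rng(1)
N, C, H, W = 2, 3, 2, 2
x = rng.normal(size=(N, C, H, W))
gamma, beta = rng.normal(size=C), rng.normal(size=C)
dout = rng.normal(size=(N, C, H, W))

# Forward in the flat view to build the cache, then run the backward sketch.
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
_, cache = bn_forward(x_flat, gamma, beta)
dx, dgamma, dbeta = spatial_bn_backward_sketch(dout, cache)
print(dx.shape, dgamma.shape, dbeta.shape)  # (2, 3, 2, 2) (3,) (3,)
```

A finite-difference gradient check against the forward pass is a reasonable way to validate any implementation of this function before relying on it.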