Deep Learning – Practical-1: Pen and paper exercises Solved
Instructions
• Answer the questions in the following exercises thoroughly. However, this does not necessarily mean long answers: be precise and compact.
• We highly recommend that you write your solutions in LaTeX; they must have an easy-to-follow format (e.g. sections/subsections, etc.).
• This exercise belongs to Practical-1; therefore, please include your solution in your main submission file, for which the instructions are given separately. (Do not submit it individually.)
• For this assignment, use the name below:
paper assignment 1 solved.pdf
• The bonus points are only valid for those submissions with a grade lower than 40.
Exercise-1 (15 pts)
In Figure 1, you are given a 2-layer feedforward neural network. Its variables are given as follows:
• s1 = W1 · xin
• z1 = f1(s1) = f1(W1 · xin)
• s2 = W2 · z1
• z2 = f2(s2) = f2(W2 · f1(W1 · xin))
• sout = Wout · z2
• zout = f3(sout) = f3(Wout · f2(W2 · f1(W1 · xin))) = yout
where fi, i ∈ {1, 2, 3}, denotes any differentiable function (e.g. sigmoid, tanh, ReLU, etc.). Using these, compute the (generalized) derivatives of the loss function, L = 0.5 · (yout − ygt)², with respect to the weights: ∂L/∂Wout, ∂L/∂W2 and ∂L/∂W1.
Hint: The final solution for ∂L/∂Wout is of the form (yout − ygt) · f3'(sout) · z2. Show your work to get to this result and also solve for the remaining gradients with this form in mind. (Note: In this exercise, the weights are real numbers, not matrices.)
Figure 1: A network with 3 weights
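As a sanity check on the hinted form, the scalar chain can be evaluated numerically. A minimal sketch, assuming tanh for all three activations and arbitrary example values for xin, ygt and the weights (all assumptions for illustration):

```python
import numpy as np

# Scalar 2-layer network from Exercise-1; tanh stands in for the
# generic differentiable f1, f2, f3.
x, ygt = 0.5, 1.0
W1, W2, Wout = 0.3, -0.2, 0.7

s1 = W1 * x;   z1 = np.tanh(s1)
s2 = W2 * z1;  z2 = np.tanh(s2)
sout = Wout * z2
yout = np.tanh(sout)

# Hinted closed form: dL/dWout = (yout - ygt) * f3'(sout) * z2,
# with tanh'(s) = 1 - tanh(s)**2.
analytic = (yout - ygt) * (1 - np.tanh(sout) ** 2) * z2

# Finite-difference check of the same derivative (z2 does not depend on Wout).
eps = 1e-6
def loss(w):
    return 0.5 * (np.tanh(w * z2) - ygt) ** 2
numeric = (loss(Wout + eps) - loss(Wout - eps)) / (2 * eps)
```

The finite-difference quotient agreeing with the closed form confirms the chain-rule derivation.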
Prelude
This exercise has shown that a neural network is nothing but a composite function of the form fN ∘ fN−1 ∘ ··· ∘ f1(x) = fN(fN−1(…(f1(x)))). From calculus, we are familiar with how to take derivatives (in machine learning, these are usually referred to as gradients) of such functions, namely via the chain rule.
By the end of Exercise-1 you realized that computing these derivatives this way becomes intractable (at least by hand) as we increase the network dimensions in both width and depth. However, you have seen a pattern emerging in your solutions. In order to have an efficient update mechanism, we can exploit this recursive pattern.
What this recursive pattern tells us is that updating the parameters of a network boils down to determining the rates at which the error, or loss, varies with small perturbations to unit inputs (i.e. sj). Mathematically, this is equivalent to computing δj = ∂L/∂sj. For an output unit this is straightforward. On the other hand, for units in inner layers, we do not have an immediate error signal. We just know how much influence (good or bad) a unit has over the overall error by looking at the weighted edges that branch out from it. It turns out that we can propagate (distribute) the error towards the hidden units by following the edges from top to bottom. The procedure (error propagation) is summarized in Figure 2. We can further formulate it using the chain rule as follows:

δj = ∂L/∂sj = (Σk δk · Wkj) · f'(sj)
Once the backward pass computes the δj's, we can determine the gradients with respect to the weights, ∂L/∂Wk.
(i) Write down ∂L/∂Wk in terms of δk.

(a) A neuron in a hidden layer is unaware of its contribution towards network error.

(b) The errors from the consecutive layer are propagated to the neuron in the hidden layer.
Figure 2: Propagation of error from one layer to another
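The error-propagation rule summarized in Figure 2 can be made concrete for the scalar chain of Exercise-1. A minimal sketch, again assuming tanh activations and illustrative values (both assumptions):

```python
import numpy as np

# delta_j = dL/ds_j for the scalar chain x -> s1 -> s2 -> sout.
x, ygt = 0.5, 1.0
W1, W2, Wout = 0.3, -0.2, 0.7
dtanh = lambda s: 1 - np.tanh(s) ** 2   # tanh'(s)

s1 = W1 * x;    z1 = np.tanh(s1)
s2 = W2 * z1;   z2 = np.tanh(s2)
sout = Wout * z2
yout = np.tanh(sout)

delta_out = (yout - ygt) * dtanh(sout)    # output unit: direct error signal
delta_2   = dtanh(s2) * Wout * delta_out  # hidden unit: weighted error from above
delta_1   = dtanh(s1) * W2 * delta_2

# dL/dW_k = delta_k * (input feeding into that weight)
grads = {'Wout': delta_out * z2, 'W2': delta_2 * z1, 'W1': delta_1 * x}
```

Each δ is computed from the one above it, which is exactly the recursive pattern the Prelude describes.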
Exercise-2 (15 pts)
In this exercise you are going to execute the forward and backward passes of a neural network on paper for 3 iterations. We illustrate our model in Figure 3. This network comprises a hidden layer with 3 ReLU units and a squared-error loss as in the first exercise. (Note: Please use tanh as the activation function in the output unit.)
ReLU(x) = { x, if x ≥ 0
          { 0, otherwise
(Hint: Notice that this function is non-differentiable at 0. Therefore, when computing your gradients, you will have to compute one gradient if the input is larger than 0, another if it is smaller than 0, and ignore the case when the input is exactly 0 and thus non-differentiable.)
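In code, the hint's convention amounts to a (sub)gradient that is 1 for positive inputs and 0 otherwise. A minimal sketch (returning 0 at exactly 0 is a choice, not a derivative):

```python
import numpy as np

def relu(x):
    # ReLU(x) = x for x >= 0, else 0
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient: 1 where x > 0, 0 where x < 0;
    # at x == 0 the function is non-differentiable, so we pick 0.
    return np.where(x > 0, 1.0, 0.0)
```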
Step-1: Initialize the weights to these random values as shown here.

[ W11 W21 W31 ]   [ 0.60 0.70 0.00 ]
[ W12 W22 W32 ] = [ 0.01 0.43 0.88 ]
and
[ w11 ]   [ 0.02 ]
[ w12 ] = [ 0.03 ]
[ w13 ]   [ 0.09 ]
The data is provided in the following matrix where samples are stored in rows.

and their corresponding labels are y = [1,1, 1, 1]T.
Step-2: Exploit the recursive patterns you discovered in the first exercise to update parameters of this network. Pseudo-code:
1. Forward the input and record (i) input to the unit, sj, and (ii) output of the unit, zj, for each unit j in all layers.
2. Compute the loss and record it.
Use L = 0.5 · (yout y)2
3. Compute the error signal, δout, at the output.
4. Propagate δout backwards to compute δj at the hidden units.
5. Compute the gradients (i.e. ∇W, ∇w) using these δ's and update the weights using gradient descent.
Hint: W(t+1) = W(t) − α · ∇W
6. Repeat 3 times.
Figure 3: A network with one hidden layer
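The pseudo-code above can be sketched end-to-end in NumPy. The weights are Step-1's values; since the data matrix is missing from this copy, the samples X below are placeholders, the labels are taken as printed (signs may have been lost in extraction), and the learning rate α = 0.1 and per-sample updates are assumptions:

```python
import numpy as np

W = np.array([[0.60, 0.70, 0.00],
              [0.01, 0.43, 0.88]])   # input (2 dims) -> hidden (3 ReLU units)
w = np.array([0.02, 0.03, 0.09])     # hidden -> tanh output
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])           # placeholder samples, one per row
y = np.array([1.0, 1.0, 1.0, 1.0])   # labels as printed in this copy
alpha = 0.1                          # assumed learning rate

losses = []
for it in range(3):
    total = 0.0
    for xi, yi in zip(X, y):
        # 1. forward: record pre-activations s_j and outputs z_j
        s = xi @ W
        z = np.maximum(s, 0.0)               # ReLU
        y_out = np.tanh(z @ w)               # tanh output unit
        # 2. loss
        total += 0.5 * (y_out - yi) ** 2
        # 3. error signal at the output (tanh'(s) = 1 - tanh(s)^2)
        d_out = (y_out - yi) * (1 - y_out ** 2)
        # 4. propagate backwards; ReLU subgradient is the (s > 0) mask
        d_hid = (s > 0) * (w * d_out)
        # 5. gradient-descent updates
        w -= alpha * d_out * z
        W -= alpha * np.outer(xi, d_hid)
    losses.append(total)
```

Recording s, z, and the δ's at each step mirrors exactly what the exercise asks you to do on paper.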
Exercise-3 (10 pts)
Although a softmax layer is usually coupled with a cross-entropy loss, this is not necessary and you can use a different loss function. In this exercise, you are going to couple a softmax layer with a multiclass hinge loss and derive the gradient with respect to the inputs (to the softmax).
Let us set up our problem now. Assume we want to classify points, xi ∈ R^(d×1), sampled from a distribution with k (where k > 2) classes. Hence, our network architecture boils down to a k-way softmax layer before the loss. Let us denote the inputs to this layer with oj where j ∈ {1, …, k}. Let us also denote the outputs of the softmax layer with pj where j ∈ {1, …, k}. The softmax layer applies the following transformation elementwise:
pi = exp(oi) / Σj exp(oj), where j ∈ {1, …, k}
(Hint: Notice that pi depends on all other dimensions j ∈ {1, …, k}. Hence the derivative of the softmax layer, i.e. ∂pi/∂oj, implies a Jacobian matrix of size k × k.)
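The Jacobian from the hint works out to ∂pi/∂oj = pi(δij − pj), i.e. J = diag(p) − p pᵀ. A small numerical sketch (the example vector o is arbitrary):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())        # shift by max for numerical stability
    return e / e.sum()

def softmax_jacobian(o):
    # dp_i/do_j = p_i * (delta_ij - p_j)  ->  J = diag(p) - p p^T
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

o = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(o)            # 3 x 3 Jacobian, as the hint predicts

# Finite-difference check of one column (derivatives w.r.t. o_1).
eps = 1e-6
num = (softmax(o + eps * np.eye(3)[1]) - softmax(o - eps * np.eye(3)[1])) / (2 * eps)
assert np.allclose(J[:, 1], num, atol=1e-8)
```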
Then, our multiclass hinge loss takes the vector p̄ of softmax outputs as its argument to compute the following:
Li = Σ(j ≠ yi) max(0, pj − pyi + 1)
where Li is the loss resulting from the ith point in our training set and yi is the ground-truth class for it.
(i) Explain, in your own words, what this particular loss function is designed to achieve.
(ii) Derive the gradient ∂Li/∂oj.
(Bonus) Exercise-4 (5 pts)
As you have been told, the deep learning 'breakthrough' was mainly thanks to developments in hardware technology (e.g. graphics processing units (GPUs)). However, even the compute power offered by the most advanced GPUs is finite. For that very reason, the network architectures one can think of are still bounded by hardware limits (e.g. memory, number of cores on your GPU, etc.). Should you happen to have designed a very complicated model with billions of parameters, you are likely to encounter memory problems during training. What would you do in such a case, where you either cannot afford a more expensive compute device (say, one with twice the memory) or there exists no better hardware solution than what you have now?
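One common family of answers trades compute or update granularity for memory, e.g. gradient checkpointing or gradient accumulation. A sketch of the latter, with a hypothetical grad_fn standing in for the per-micro-batch gradient computation (grad_fn, micro_size and all values are assumptions for illustration):

```python
import numpy as np

def accumulated_step(params, batch, grad_fn, micro_size, lr):
    # Split a batch that does not fit in memory into micro-batches,
    # sum their gradients, and apply one averaged update: same result
    # as a full-batch step, but with a much smaller memory footprint.
    grads = np.zeros_like(params)
    n = len(batch)
    for start in range(0, n, micro_size):
        micro = batch[start:start + micro_size]
        grads += grad_fn(params, micro) * len(micro)  # re-weight by micro size
    return params - lr * grads / n                    # single averaged update
```

Because the gradients are averaged exactly as in a full-batch step, the update is mathematically unchanged; only the peak memory use shrinks.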
