Neural Networks
1 Introduction
There is no doubt that neural networks are a very important class of machine learning models. Given the sheer number of people who are achieving impressive results with neural networks, one might think that it’s relatively easy to get them working. This is partly an illusion. One reason so many people have success is that, thanks to GitHub, they can copy the exact settings that others have used to achieve success. In fact, in most cases they can start with “pre-trained” models that already work for a similar problem, and “fine-tune” them for their own purposes. It’s far easier to tweak and improve a working system than to get one working from scratch. If you create a new model, you’re largely on your own to figure out how to get it working: there’s not much theory to guide you, and the rules of thumb do not always work. Understanding even the most basic questions, such as the preferred variant of SGD to use for optimization, is still a very active area of research.
One thing is clear, however: If you do need to start from scratch, or debug a neural network model that doesn’t seem to be learning, it can be immensely helpful to understand the low-level details of how your neural network works – specifically, back-propagation. With this assignment, you’ll have the opportunity to linger on these low-level implementation details. Every major neural network type (RNNs, CNNs, ResNets, etc.) can be implemented using the basic framework we’ll develop in this assignment.
To help things along, we have designed a minimalist framework for computation graphs and put together some support code. The intent is for you to read, or at least skim, every line of code provided, so that you’ll know you understand all the crucial components and could, in theory, create your own from scratch. In fact, creating your own computation graph framework from scratch is highly encouraged – you’ll learn a lot.
2 Computation Graph Framework
To get started, please read the tutorial on the computation graph framework we’ll be working with. (Note that it renders better if you view it locally.) The use of computation graphs is not specific to machine learning or neural networks. Computation graphs are just a way to represent a function that facilitates efficient computation of the function’s values and its gradients with respect to inputs. The tutorial takes this perspective, and there is very little in it about machine learning, per se.
To see how the framework can be used for machine learning tasks, we’ve provided a full implementation of linear regression. You should start by working your way through the __init__ of the LinearRegression class in linear_regression.py. From there, you’ll want to review the node class definitions in nodes.py, and finally the class ComputationGraphFunction in graph.py. ComputationGraphFunction is where we repackage a raw computation graph into something that’s more friendly to work with for machine learning. The rest of linear_regression.py is fairly routine, but it illustrates how to interact with the ComputationGraphFunction.
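If it helps to have a mental model before reading nodes.py and graph.py, here is a minimal sketch of the kind of node interface such a framework typically exposes. The handout itself only mentions ValueNode, forward/backward passes, and a stored output self.out; the d_out attribute and get_predecessors method below are assumptions for illustration and may differ from the provided files.

    import numpy as np

    class ValueNode:
        """Sketch of a leaf node holding a value (an input, parameter, or label).
        Hypothetical interface: the provided nodes.py may differ in its details."""
        def __init__(self, node_name):
            self.node_name = node_name
            self.out = None      # value produced by the forward pass
            self.d_out = None    # accumulated gradient of the objective w.r.t. self.out

        def forward(self):
            # Leaf values are set externally; forward just resets the accumulated gradient.
            self.d_out = np.zeros_like(self.out)
            return self.out

        def backward(self):
            # Nothing to propagate further back from a leaf node.
            return self.d_out

        def get_predecessors(self):
            return []

Operation nodes would follow the same pattern, but return their predecessor nodes and accumulate gradient contributions into each predecessor's d_out during backward.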
Once you’ve understood how linear regression works in our framework, you’re ready to start implementing your own algorithms…
3 Ridge Regression
When moving to a new system, it’s always good to start with something familiar. But that’s not the only reason we’re doing ridge regression in this homework. As we discussed in class, in ridge regression the parameter vector is “shared”, in the sense that it’s used twice in the objective function. In the computation graph, this can be seen in the fact that the node for the parameter vector has two outgoing edges. While we don’t have this sharing in the multilayer perceptron, we do have it in RNNs and CNNs, which are two of the most important neural network architectures in use today. In the context of RNNs and CNNs, this parameter sharing is also referred to as parameter tying. So being able to handle the shared parameters in ridge regression is an important prerequisite to handling more sophisticated models.
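To make the sharing concrete, a typical form of the ridge objective (the exact scaling and variable names in ridge_regression.py may differ) is

    J(w, b) = (1/n) Σi (wᵀxi + b − yi)² + λ‖w‖²,

so the single ValueNode holding w feeds both the prediction node and the penalty node, and during back-propagation its gradient is the sum of the contributions arriving along both outgoing edges.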
We’ve provided some skeleton code in ridge_regression.py and some test code in ridge_regression.t.py, which you should eventually be able to pass.
1. Complete the class L2NormPenaltyNode in nodes.py. (A sketch of one possible implementation of this node and the next appears after this list.)
2. Complete the class SumNode in nodes.py.
3. Implement ridge regression with w regularized and b unregularized. Do this by completing the __init__ method in ridge_regression.py, using the classes created above. When complete, you should be able to pass the tests in ridge_regression.t.py. Report the average square error on the training set for the parameter settings given in the main() function.
4. [Optional] Create a new implementation of ridge regression that supports efficient minibatching. You will replace the ValueNode x, which contains a vector, with a ValueNode X, which contains a matrix. The convention is that the first dimension indexes examples and the second indexes features, as we have done throughout the course. Many of the nodes will have to be adapted to this use case. Demonstrate the use of minibatching for your ridge regression network, and note the amount of speedup you get.
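Before diving in, it may help to see roughly what the two new node classes could look like. The sketch below assumes a node interface with forward and backward methods and out/d_out attributes, and constructor signatures that may not match the provided skeleton exactly; treat it as an illustration rather than the required implementation.

    import numpy as np

    class L2NormPenaltyNode:
        """Computes l2_reg * ||w||^2 for a scalar l2_reg and a weight-vector node w.
        A sketch under the assumed interface; the provided skeleton may differ."""
        def __init__(self, l2_reg, w, node_name):
            self.l2_reg = np.array(l2_reg)
            self.w = w                    # node whose out is the weight vector
            self.node_name = node_name
            self.out = None
            self.d_out = None

        def forward(self):
            self.out = self.l2_reg * np.sum(self.w.out ** 2)
            self.d_out = np.zeros_like(self.out)
            return self.out

        def backward(self):
            # d/dw of l2_reg * ||w||^2 is 2 * l2_reg * w
            self.w.d_out += self.d_out * 2 * self.l2_reg * self.w.out
            return self.d_out

        def get_predecessors(self):
            return [self.w]

    class SumNode:
        """Computes a + b for two nodes whose outputs have the same shape."""
        def __init__(self, a, b, node_name):
            self.a, self.b = a, b
            self.node_name = node_name
            self.out = None
            self.d_out = None

        def forward(self):
            self.out = self.a.out + self.b.out
            self.d_out = np.zeros_like(self.out)
            return self.out

        def backward(self):
            # Addition passes the incoming gradient through unchanged to both inputs.
            self.a.d_out += self.d_out
            self.b.d_out += self.d_out
            return self.d_out

        def get_predecessors(self):
            return [self.a, self.b]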
4 Multilayer Perceptron
In this problem, we’ll implement a multilayer perceptron (MLP) with a single hidden layer and a square loss. We’ll implement the computation graph illustrated below:
[Figure: computation graph for a multilayer perceptron with one hidden layer and square loss, with groups of nodes labeled Parameters, Hidden Layer, Prediction, Data, and Objective Value.]
The crucial new piece here is the nonlinear hidden layer, which is what makes the multilayer perceptron a significantly larger hypothesis space than linear prediction functions.
4.1 The standard non-linear layer
The multilayer perceptron consists of a sequence of “layers” implementing the following non-linear function
h(x) = σ (Wx + b),
where x ∈ Rd, W ∈ Rm×d, and b ∈ Rm, and where m is often referred to as the number of hidden units or hidden nodes. σ is some non-linear function, typically tanh or ReLU, applied element-wise to the argument of σ. Referring to the computation graph illustration above, we will implement this nonlinear layer with two nodes, one implementing the affine transform L = W1x+b1, and the other implementing the nonlinear function h = tanh(L). In this problem, we’ll work out how to implement the backward method for each of these nodes.
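As a quick check on the shapes involved, the forward computation of one hidden layer is a few lines of numpy (the dimensions below are arbitrary):

    import numpy as np

    d, m = 3, 4                         # input dimension and number of hidden units (arbitrary)
    x = np.random.randn(d)              # input vector, shape (d,)
    W1 = 0.01 * np.random.randn(m, d)   # first-layer weights, shape (m, d)
    b1 = np.zeros(m)                    # first-layer bias, shape (m,)

    L = W1 @ x + b1                     # affine transform, shape (m,)
    h = np.tanh(L)                      # element-wise nonlinearity, shape (m,)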
4.1.1 The Affine Transformation
Consider an affine node that computes y = Wx + b, where W ∈ Rm×d, x ∈ Rd, and b ∈ Rm, so that y ∈ Rm. Suppose the objective value J is a function of y, and that during the backward pass we have already received ∂J/∂y, the gradient of J with respect to the node’s output. Our goal is to express ∂J/∂W, ∂J/∂x, and ∂J/∂b in terms of ∂J/∂y.
1. Show that ∂J/∂Wij = (∂J/∂y)i xj, where x = (x1,…,xd)T. [Hint: Although not necessary, you might find it helpful to use the Kronecker delta notation δij, which equals 1 when i = j and 0 otherwise, so that ∂xi/∂xj = δij. So, for example, Σi 2xi δij = 2xj.]
2. Now let’s vectorize this. Let’s write ∂J/∂y for the column vector whose ith entry is ∂J/∂yi. Let’s also define the matrix ∂J/∂W ∈ Rm×d, whose ij’th entry is ∂J/∂Wij. Generally speaking, we’ll always take ∂J/∂A to be an array of the same size (“shape” in numpy) as A. Give a vectorized expression for ∂J/∂W in terms of the column vectors ∂J/∂y and x. [Hint: Outer product.]
3. In the usual way, define ∂J/∂x ∈ Rd, whose ith entry is ∂J/∂xi. Show that ∂J/∂x = Wᵀ (∂J/∂y).
4. Show that ∂J/∂b = ∂J/∂y, where ∂J/∂b ∈ Rm is defined in the usual way.
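In numpy, the three vectorized identities above translate directly into the backward computation for the affine node. A small shape check (variable names are illustrative; dJ_dy stands in for ∂J/∂y received from downstream):

    import numpy as np

    m, d = 4, 3
    W = np.random.randn(m, d)
    x = np.random.randn(d)
    dJ_dy = np.random.randn(m)   # gradient of the objective with respect to y = Wx + b

    dJ_dW = np.outer(dJ_dy, x)   # shape (m, d): the outer product from part 2
    dJ_dx = W.T @ dJ_dy          # shape (d,):  from part 3
    dJ_db = dJ_dy                # shape (m,):  from part 4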
4.1.2 Element-wise Transformers
Our nonlinear activation function nodes take an array (e.g. a vector, matrix, or higher-order tensor) and apply the same nonlinear transformation σ : R → R to every element of the array. Let’s abuse notation a bit, as is usually done in this context, and write σ(A) for the array that results from applying σ(·) to each element of A. If σ is differentiable at x ∈ R, then we’ll write σ′(x) for the derivative of σ at x, with σ′(A) defined analogously to σ(A).
Suppose the objective function value J is written as J = f(σ(A)), for some function f : S ↦ R, where S is an array of the same dimensions as σ(A) and A. As before, we want to find the array ∂J/∂A for any A. Suppose for some A we have already computed the array ∂J/∂S for S = σ(A).
At this point, we’ll want to use the chain rule to figure out ∂J/∂A. However, because we’re dealing with arrays of arbitrary shapes, it can be tricky to write down the chain rule. Appropriately, we’ll use a tricky convention: we’ll assume all entries of an array A are indexed by a single variable. So, for example, to sum over all entries of an array A, we’ll just write Σi Ai.
1. Show that ∂J/∂A = (∂J/∂S) ⊙ σ′(A), where we’re using ⊙ to represent the Hadamard product. If A and B are arrays of the same shape, then their Hadamard product A ⊙ B is an array of the same shape as A and B, for which (A ⊙ B)i = Ai Bi. That is, it’s just the array formed by multiplying corresponding elements of A and B. Conveniently, in numpy, if A and B are arrays of the same shape, then A*B is their Hadamard product.
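In code, this identity is a one-liner. For example, taking σ = tanh (so σ′(A) = 1 − tanh²(A)), a sketch of the backward step is:

    import numpy as np

    A = np.random.randn(4, 3)        # any shape works, since everything is element-wise
    S = np.tanh(A)                   # forward pass: S = sigma(A)
    dJ_dS = np.random.randn(4, 3)    # gradient with respect to S, assumed already computed

    dJ_dA = dJ_dS * (1 - S ** 2)     # Hadamard product with sigma'(A) = 1 - tanh(A)^2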
4.2 MLP Implementation
1. Complete the class AffineNode in nodes.py. Be sure to propagate the gradient with respect to x as well, since when we stack these layers, x will itself be the output of another node that depends on our optimization parameters.
2. Complete the class TanhNode in nodes.py. As you’ll recall, (d/dx) tanh(x) = 1 − tanh²(x).
Note that in the forward pass, we’ll already have computed tanh of the input and stored it in self.out. So make sure to use self.out and not recalculate it in the backward pass.
3. Implement an MLP by completing the skeleton code in mlp_regression.py and making use of the nodes above. Your code should pass the tests provided in mlp_regression.t.py. Note that to break the symmetry of the problem, we initialize our weights to small random values, rather than all zeros, as we often do for convex optimization problems. Run the MLP for the two settings given in the main() function and report the average training error. Note that with an MLP, we can take the original scalar as input, in the hopes that it will learn nonlinear features on its own, using the hidden layers. In practice, it is quite challenging to get such a neural network to fit as well as one where we provide features.
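To give a sense of how the pieces fit together, here is one possible way the computation graph could be wired up; the constructor signatures and the name of the square-loss node are guesses for illustration, and the skeleton in mlp_regression.py may expect a different structure.

    import nodes  # the provided framework module; class names below are assumptions

    x = nodes.ValueNode(node_name="x")      # input
    y = nodes.ValueNode(node_name="y")      # target
    W1 = nodes.ValueNode(node_name="W1")    # hidden-layer weights, shape (m, d)
    b1 = nodes.ValueNode(node_name="b1")    # hidden-layer bias, shape (m,)
    w2 = nodes.ValueNode(node_name="w2")    # output weights
    b2 = nodes.ValueNode(node_name="b2")    # output bias

    L = nodes.AffineNode(W1, x, b1, node_name="L")                     # L = W1 x + b1
    h = nodes.TanhNode(L, node_name="h")                               # h = tanh(L)
    prediction = nodes.AffineNode(w2, h, b2, node_name="prediction")   # f = w2 h + b2
    # The objective is the square loss between prediction and y; the actual class
    # name of the loss node in nodes.py may differ from this guess.
    objective = nodes.SquaredL2DistanceNode(prediction, y, node_name="objective")

A ComputationGraphFunction from graph.py would then be built on top of these nodes, just as in the linear regression example.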
4.3 [OPTIONAL]
1. [Optional] Implement a Softmax node.
2. [Optional] Implement a negative log-likelihood loss node for multiclass classification.
3. [Optional] Use the classes above to apply an MLP to the simple multiclass classification dataset we had on a previous assignment.
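If you attempt these optional parts, the following numerically stable formulas are the standard starting point; wrapping them in node classes with the appropriate backward methods is the actual exercise, and the function names below are just illustrative.

    import numpy as np

    def softmax(z):
        """Numerically stable softmax of a score vector z of shape (K,)."""
        z = z - np.max(z)            # shifting by a constant doesn't change the result
        expz = np.exp(z)
        return expz / np.sum(expz)

    def nll_loss(p, y):
        """Negative log-likelihood of the true class index y under probabilities p."""
        return -np.log(p[y])

    def softmax_nll_grad(z, y):
        """Gradient of nll_loss(softmax(z), y) with respect to the scores z.
        Composing the two nodes gives the familiar form p - onehot(y)."""
        p = softmax(z)
        p[y] -= 1.0
        return p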



