DSGA1003 – Homework 6: Multiclass, Trees, and Gradient Boosting

1 Reformulations of Multiclass Hinge Loss
1.1 Multiclass setting review
In the spirit of empirical risk minimization, we would like to find f ∈ F minimizing the empirical cost-sensitive loss:
(1/n) Σ_{i=1}^n ∆(yi, f(xi)),
possibly with some regularization. But this is clearly intractable, since we already know binary classification is intractable and that’s a special case of this formulation. In lecture we proposed an alternative, tractable objective function: the multiclass SVM based on the convex multiclass hinge loss.
1.2 Two versions of multiclass hinge loss (or generalized hinge loss)
In lecture, we defined the margin of the compatibility score function h on the ith example (xi,yi) for class y as mi,y(h) = h(xi,yi) − h(xi,y).
We also gave a formulation of a multiclass SVM objective function, where the loss on an individual example (xi,yi) was
ℓ1(h,(xi,yi)) = max_{y∈Y−{yi}} max(0, ∆(yi,y) − mi,y(h)).
There’s an alternative formulation, called the generalized hinge loss in SSBD Section 17.2. There they define
ℓ2(h,(xi,yi)) = max_{y∈Y} [∆(yi,y) + h(xi,y) − h(xi,yi)].
1. Show that if ∆(y,y) = 0 for all y ∈ Y, then ℓ2(h,(xi,yi)) = ℓ1(h,(xi,yi)). [Hint: Note that max_{y∈Y} φ(y) = max{φ(yi), max_{y∈Y−{yi}} φ(y)}.]
2. In the context of the generalized hinge loss, we’ve said that ∆(yi,y) is like the “target margin” between the score for true class yi and the score for class y. Suppose that for our compatibility function h, all target margins are reached or exceeded on xi. That is
mi,y(h) = h(xi,yi) − h(xi,y) ≥ ∆(yi,y),
for all y ∈ Y − {yi}. Assume that ∆(yi,y) > 0 for all y ≠ yi and ∆(yi,y) = 0 for y = yi.
(a) Show that under the conditions above, ℓ1(h,(xi,yi)) = ℓ2(h,(xi,yi)) = 0.
(b) Show that under the conditions above, we make the correct prediction on xi. That is, f(xi) = argmax_{y∈Y} h(xi,y) = yi.
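As a quick numeric sanity check of these definitions (illustrative numbers only, not part of the assignment): with three classes and scores that meet every target margin of 1, both losses are zero and the prediction is correct.

import numpy as np

scores = np.array([2.0, 0.5, -1.0])              # h(x_i, y) for y = 0, 1, 2
y_true = 0
Delta = (np.arange(3) != y_true).astype(float)   # 0-1 class-sensitive loss

# ell1: max over y != y_true of max(0, Delta(y_true, y) - margin m_{i,y})
ell1 = max(max(0.0, Delta[y] - (scores[y_true] - scores[y]))
           for y in range(3) if y != y_true)
# ell2: generalized hinge loss, max over all y
ell2 = max(Delta[y] + scores[y] - scores[y_true] for y in range(3))

print(ell1, ell2, int(np.argmax(scores)) == y_true)   # 0.0 0.0 True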
2 SGD for Multiclass Linear SVM
Suppose our output space and our action space are given as follows: Y = A = {1,…,k}. We are given a non-negative class-sensitive loss function ∆ : Y × A → [0,∞) and a class-sensitive feature mapping Ψ : X × Y → R^d. Our prediction function f : X → Y is given by
fw(x) = argmax_{y∈Y} ⟨w, Ψ(x,y)⟩.
For training data (x1,y1),…,(xn,yn) ∈ X × Y, let J(w) be the ℓ2-regularized empirical risk function for the multiclass hinge loss. We can write this as
J(w) = λ‖w‖² + (1/n) Σ_{i=1}^n max_{y∈Y} [∆(yi,y) + ⟨w, Ψ(xi,y) − Ψ(xi,yi)⟩],
for some λ > 0.
3. Give an expression for the stochastic subgradient based on the point (xi,yi).
4. Give an expression for a minibatch subgradient, based on the points (xi,yi),…,(xi+m−1,yi+m−1).
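For concreteness, here is a minimal NumPy sketch of a stochastic subgradient of the per-example term of J(w) above, assuming the multivector feature construction and the 0-1 loss as ∆; the helper names (delta, psi, stochastic_subgradient) are illustrative and not part of the skeleton code.

import numpy as np

def delta(y, y_hat):
    """Illustrative class-sensitive loss: 0-1 loss."""
    return float(y != y_hat)

def psi(x, y, num_classes):
    """Multivector construction: copy x into the block of coordinates for class y."""
    d = len(x)
    out = np.zeros(num_classes * d)
    out[y * d:(y + 1) * d] = x
    return out

def stochastic_subgradient(w, x_i, y_i, num_classes, lam):
    """Subgradient of lam*||w||^2 + max_y [Delta(y_i,y) + <w, psi(x_i,y) - psi(x_i,y_i)>]."""
    scores = [delta(y_i, y) + w @ (psi(x_i, y, num_classes) - psi(x_i, y_i, num_classes))
              for y in range(num_classes)]
    y_star = int(np.argmax(scores))  # class achieving the max
    return 2 * lam * w + psi(x_i, y_star, num_classes) - psi(x_i, y_i, num_classes)

The minibatch version simply averages this subgradient over the m points in the batch.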
3 [Optional] Hinge Loss is a Special Case of Generalized Hinge Loss
Let Y = {−1,1}. Let ∆(y,ŷ) = 1(y ≠ ŷ). If g(x) is the score function in our binary classification setting, then define our compatibility function as
h(x,1) = g(x)/2
h(x,−1) = −g(x)/2.
Show that for this choice of h, the generalized hinge loss reduces to the standard binary hinge loss:
ℓ(h,(x,y)) = max_{y′∈Y} [∆(y,y′) + h(x,y′) − h(x,y)] = max{0, 1 − yg(x)}.
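A quick numeric check of this reduction (not a proof; the helper name gen_hinge is made up for illustration):

def gen_hinge(g_x, y):
    """Generalized hinge loss for h(x, 1) = g(x)/2 and h(x, -1) = -g(x)/2."""
    h = {1: g_x / 2, -1: -g_x / 2}
    return max(float(y != yp) + h[yp] - h[y] for yp in (1, -1))

print(gen_hinge(0.3, 1))   # 0.7, which equals max(0, 1 - 1 * 0.3)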
4 Multiclass Classification – Implementation
In this problem we will work on a simple three-class classification example, similar to the one given in lecture. The data is generated and plotted for you in the skeleton code.
4.1 One-vs-All (also known as One-vs-Rest)
In this problem we will implement one-vs-all multiclass classification. Our approach will assume we have a binary base classifier that returns a score, and we will predict the class that has the highest score.
1. Complete the class OneVsAllClassifier in the skeleton code. Following the OneVsAllClassifier code is a cell that extracts the results of the fit and plots the decision region. Include these results in your submission.
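One possible shape for this implementation (a sketch, not the official skeleton): an sklearn-style wrapper that clones a binary base classifier for each class, fits it on a class-vs-rest relabeling, and predicts the class with the largest decision score. LinearSVC is just one illustrative choice of base classifier.

import numpy as np
from sklearn.base import clone
from sklearn.svm import LinearSVC  # illustrative binary base classifier

class OneVsAllClassifier:
    """Fit one binary scorer per class; predict the class with the highest score."""

    def __init__(self, estimator, n_classes):
        self.n_classes = n_classes
        self.estimators = [clone(estimator) for _ in range(n_classes)]

    def fit(self, X, y):
        for k, est in enumerate(self.estimators):
            est.fit(X, (y == k).astype(int))  # class k vs. the rest
        return self

    def decision_function(self, X):
        # one column of scores per class
        return np.column_stack([est.decision_function(X) for est in self.estimators])

    def predict(self, X):
        return np.argmax(self.decision_function(X), axis=1)

Usage might look like OneVsAllClassifier(LinearSVC(), n_classes=3).fit(X, y).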
4.2 Multiclass SVM
In this question, we will implement stochastic subgradient descent for the linear multiclass SVM, as described in lecture and in this problem set. We will use the class-sensitive feature mapping approach with the “multivector construction”, as described in our multiclass classification lecture and in SSBD Section 17.2.1.
1. Complete the skeleton code for multiclass SVM. Following the multiclass SVM implementation, we have included another block of test code. Make sure to include the results from these tests in your assignment, along with your code.
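As a rough reference, one way the SGD training loop could look with the multivector construction is sketched below; the fixed step size, the 0-1 loss standing in for ∆, and the function names are illustrative assumptions rather than requirements of the skeleton.

import numpy as np

def fit_multiclass_svm(X, y, num_classes, lam=0.01, lr=0.01, epochs=10, seed=0):
    """SGD on the l2-regularized multiclass hinge objective with multivector features."""
    n, d = X.shape
    w = np.zeros(num_classes * d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # scores[y] = Delta(y_i, y) + <w, psi(x_i, y)>, with 0-1 loss as Delta
            scores = w.reshape(num_classes, d) @ X[i]
            scores += (np.arange(num_classes) != y[i]).astype(float)
            y_star = int(np.argmax(scores))
            g = 2 * lam * w                     # subgradient of the regularizer
            if y_star != y[i]:                  # hinge term contributes a subgradient
                g[y_star * d:(y_star + 1) * d] += X[i]
                g[y[i] * d:(y[i] + 1) * d] -= X[i]
            w -= lr * g
    return w

def predict(w, X, num_classes):
    n, d = X.shape
    return np.argmax(X @ w.reshape(num_classes, d).T, axis=1)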
5 [Optional] Audio Classification
In this problem, we will work on the urban sound dataset URBANSOUND8K from the Center for Urban Science and Progress (CUSP) at NYU. (You should download the data from that link.) We will first extract features from the raw audio data using the LibROSA package, and then we will train multiclass classifiers to classify the sounds into 10 sound classes. The URBANSOUND8K dataset contains 8732 labeled sound excerpts broken into 10 folds. Use folds 1 and 2 for training, and folds 3 and 4 for validation.
1. In LibROSA, there are many functions for visualizing audio waves and spectra, such as display.waveplot() and display.specshow(). Load a random audio file from each class as a floating point time series with librosa.load(), and plot their waveforms and linear-frequency power spectrograms. If you are interested, you can also play the audio in the notebook with the functions display() and Audio() in IPython.display.
2. Mel-frequency cepstral coefficients (MFCC) are a commonly used feature for sound processing. We will use MFCC and its first and second differences (like discrete derivatives) as our features for classification. First, use the function feature.mfcc() from LibROSA to extract MFCC features from each audio sample. (The first MFCC coefficient is typically discarded in sound analysis, but you do not need to discard it; you can test whether doing so helps in the optional problem below.) Next, use the function feature.delta() to calculate the first and second differences of the MFCC. Finally, combine these features and normalize each feature to zero mean and unit variance. (A sketch of this feature-extraction pipeline appears after this list.)
4. [More Optional] Compare results to any other multiclass classification methods of your choice.
5. [More Optional] Try different feature sets and see how they affect performance.
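As referenced in step 2 above, here is a minimal LibROSA sketch of the MFCC-plus-deltas pipeline. Averaging each coefficient over time to get one fixed-length vector per clip is just one convenient option (the skeleton may expect something different), and the normalization helper assumes statistics computed on the training folds.

import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    """MFCCs plus first and second differences, averaged over time (one simple option)."""
    signal, sr = librosa.load(path)                        # floating point time series
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc, order=1)              # first difference
    d2 = librosa.feature.delta(mfcc, order=2)              # second difference
    feats = np.concatenate([mfcc, d1, d2], axis=0)         # shape (3 * n_mfcc, n_frames)
    return feats.mean(axis=1)                              # fixed-length summary per clip

def normalize(train_feats, feats):
    """Standardize each feature to zero mean and unit variance using training statistics."""
    mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + 1e-8
    return (feats - mu) / sigma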
6 [Optional] Decision Tree Implementation
In this problem we’ll implement decision trees for both classification and regression. The strategy will be to implement a generic class, called Decision_Tree, which we’ll supply with the loss function we want to use to make node splitting decisions, as well as the estimator we’ll use to come up with the prediction associated with each leaf node. For classification, this prediction could be a vector of probabilities, but for simplicity we’ll just consider hard classifications here. We’ll work with the classification and regression data sets from previous assignments.
1. [Optional] Complete the class Decision_Tree, given in the skeleton code. The intended implementation is as follows: Each object of type Decision_Tree represents a single node of the tree. The depth of that node is represented by the variable self.depth, with the root node having depth 0. The main job of the fit function is to decide, given the data provided, how to split the node or whether it should remain a leaf node. If the node will split, then the splitting feature and splitting value are recorded, and the left and right subtrees are fit on the relevant portions of the data. Thus tree-building is a recursive procedure. We should have as many Decision_Tree objects as there are nodes in the tree. We will not implement pruning here. Some additional details are given in the skeleton code.
3. [Optional] Complete the function mean_absolute_deviation_around_median (MAE). Use the code provided to fit the Regression_Tree to the krr dataset using both the MAE loss and median predictions. Include the plots for the 6 fits.
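A minimal NumPy sketch of the mean_absolute_deviation_around_median helper described above (the matching leaf prediction is the median of the node's labels):

import numpy as np

def mean_absolute_deviation_around_median(y):
    """Mean absolute deviation of y from its median (the MAE splitting criterion)."""
    return float(np.mean(np.abs(y - np.median(y))))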
7 Gradient Boosting Machines
Recall the general gradient boosting algorithm, for a given loss function ℓ and a hypothesis space F of regression functions (i.e. functions mapping from the input space to R):
1. Initialize f0(x) = 0.
2. For m = 1 to M:
(a) Compute the functional gradient at the current fit:
gm ∈ R^n, with (gm)_i = ∂ℓ(yi, f(xi)) / ∂f(xi) evaluated at f(xi) = fm−1(xi), for i = 1,…,n.
(b) Fit a regression model to −gm:
hm = argmin_{h∈F} Σ_{i=1}^n ((−gm)_i − h(xi))².
(c) Choose a fixed step size νm = ν ∈ (0,1], or take
νm = argmin_{ν>0} Σ_{i=1}^n ℓ(yi, fm−1(xi) + ν hm(xi)).
(d) Take the step: fm(x) = fm−1(x) + νmhm(x)
3. Return fM.
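To make the loop concrete, here is a short Python sketch of the algorithm above with a fixed step size ν; using sklearn's DecisionTreeRegressor as the base hypothesis space F is an illustrative assumption, not part of the problem statement.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # illustrative base regression model

def gradient_boost(X, y, loss_grad, n_rounds=100, step=0.1, max_depth=3):
    """Generic gradient boosting: each round fits a base learner to the negative gradient.

    loss_grad(y, f) should return the vector of partial derivatives of the loss with
    respect to the current predictions f (one entry per training point).
    """
    f = np.zeros(len(y))                                          # step 1: f_0 = 0
    learners = []
    for _ in range(n_rounds):                                     # step 2
        g = loss_grad(y, f)                                       # (a) gradient at f_{m-1}
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g) # (b) fit h_m to -g_m
        f += step * h.predict(X)                                  # (c)-(d) fixed step nu
        learners.append(h)
    return learners

For example, with square loss ℓ(y, f) = (y − f)²/2 the gradient is f − y, so one would call gradient_boost(X_train, y_train, loss_grad=lambda y, f: f - y).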
In this problem we’ll derive two special cases of the general gradient boosting framework: ℓ2-Boosting and BinomialBoost.
1. Consider the regression framework, where Y = R. Suppose our loss function is given by
ℓ(ŷ, y) = ½ (ŷ − y)²,
and at the beginning of the m’th round of gradient boosting, we have the function fm−1(x). Show that the hm chosen as the next basis function is given by
hm = argmin_{h∈F} Σ_{i=1}^n [(yi − fm−1(xi)) − h(xi)]².
In other words, at each stage we find the base prediction function hm ∈ F that is the best fit to the residuals from the previous stage. [Hint: Once you understand what’s going on, this is a pretty easy problem.]
2. Now let’s consider the classification framework, where Y = {−1,1}. In lecture, we noted that AdaBoost corresponds to forward stagewise additive modeling with the exponential loss, and that the exponential loss is not very robust to outliers (i.e. outliers can have a large effect on the final prediction function). Instead, let’s consider the logistic loss
ℓ(m) = log(1 + e^{−m}),
where m = yf(x) is the margin. Similar to what we did in the ℓ2-Boosting question, write an expression for hm as an argmin over F.
8 Gradient Boosting Implementation
This method goes by many names, including gradient boosting machines (GBM), generalized boosting models (GBM), AnyBoost, and gradient boosted regression trees (GBRT), among others. Although one of the nice aspects of gradient boosting is that it can be applied to any problem with a subdifferentiable loss function, here we’ll keep things simple and consider the standard regression setting with square loss.
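For this square-loss setting, the negative functional gradient is just the residual yi − fm−1(xi), so each round of the generic loop sketched in Section 7 fits a regression tree to the current residuals. Below is a small, hypothetical helper for evaluating the staged predictions, handy for plotting the fit as rounds are added (it assumes a list of fitted base learners like the one returned by the gradient_boost sketch above).

import numpy as np

def boosted_predict(learners, X, step=0.1, n_rounds=None):
    """Evaluate f_m(x) = sum over the first m learners of step * h_k(x)."""
    used = learners if n_rounds is None else learners[:n_rounds]
    return step * np.sum([h.predict(X) for h in used], axis=0)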
2. [Optional] Repeat the previous runs on the classification data set, but use a different classification loss, such as logistic loss or hinge loss. Include the new code and plots of your results. Note that you should still use the same regression tree settings for the base regression algorithm.
