TAs: Chao Jiang, Chase Perry, Rucha Sathe
Piazza: https://piazza.com/gatech/spring2022/cs4650a
1 Logistic vs Softmax
(a) (2 pts) Recall the Logistic and Softmax functions:

P_Logistic(y = 1 | x) = e^{w^T x} / (1 + e^{w^T x})

P_Softmax(y | x) = e^{w_y^T x} / Σ_{y' ∈ Y} e^{w_{y'}^T x}
Given Y = {0,1}, what should be the value of w such that
PLogistic(y|x) = PSoftmax(y|x) ∀ y ∈ Y? Show your work.
Hint: Expand the summation term and think about w in terms of w0 and w1.
(b) (2 pts) Recall that the Softmax function is a generalization of the logistic sigmoid for multiclass classification. In practice, machine learning software such as PyTorch uses a Softmax implementation for both binary and multiclass classification. Recall that the Softmax function produces a vector output z ∈ R|Y| and the logistic function a single scalar value z, representing class probabilities. Write the equation for a decision rule to produce ˆy from the Softmax function in the binary case (when Y = {0,1}; you can break ties arbitrarily). Write the decision rule to produce ˆy from the logistic function. Compare the two rules. How are they similar and/or different? (1-2 sentences).
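Not part of the assignment, but the relationship in part (a) is easy to probe numerically. The sketch below compares a two-class softmax against the logistic function on a made-up input and made-up weight vectors (x, w0, w1 are arbitrary choices, not values from the problem):

```python
import math

def sigmoid(s):
    # logistic function on a scalar score s = w^T x
    return 1.0 / (1.0 + math.exp(-s))

def softmax2(s0, s1):
    # two-class softmax over scores s_y = w_y^T x; returns (P(y=0), P(y=1))
    m = max(s0, s1)                       # subtract max for numerical stability
    e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

# made-up example: x in R^2, arbitrary two-class softmax weights
x = [2.0, -1.0]
w0, w1 = [0.5, 1.0], [1.5, -0.5]
s0 = sum(a * b for a, b in zip(w0, x))
s1 = sum(a * b for a, b in zip(w1, x))

p0, p1 = softmax2(s0, s1)
# the softmax probabilities depend only on the score difference s1 - s0,
# which is the quantity the hint in part (a) points you toward
print(p1, sigmoid(s1 - s0))
```

Running this for a few different weight choices is a quick way to check the algebra you derive by hand.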
2 Multiclass Naive Bayes with Bag of Words
Dr. Smith’s clinic has recently begun to provide a preliminary analysis of whether or not a patient is affected by Virus X, based on a description of the patient’s condition uploaded online. The resulting variable can take 3 values: Affected, Unaffected, and No Diagnosis. The description of the patient’s condition is filtered down to a select few symptoms, to determine whether or not the patient might be affected. The collected data is displayed in the table below. Dr. Smith’s team wishes to put the Naive Bayes algorithm to use to carry out the task at hand.
S.No. body-ache dehydrated headache cold nauseous fever energetic hungry Y
1 0 0 1 1 0 0 0 0 Unaffected
2 1 0 0 1 0 1 1 1 No Diagnosis
3 0 0 1 1 0 1 1 1 Affected
4 0 1 1 1 1 1 0 0 No Diagnosis
5 0 1 0 1 1 0 1 0 Unaffected
6 0 1 1 1 0 0 1 1 Affected
7 0 0 0 1 0 0 0 1 Unaffected
8 1 0 0 0 1 0 0 0 Unaffected
(a) (1 pt) What is the probability θy of each label y ∈ {Unaffected, No Diagnosis, Affected}?
(b) (3 pts) The parameter ϕ_{y,j} is the probability of token j appearing with label y. It is defined by the following equation, where V is the size of the vocabulary:

ϕ_{y,j} = count(y, j) / Σ_{j'=1}^{V} count(y, j')

The probability of a count-of-words vector x and a label y is defined as follows:

p(x, y; θ, ϕ) = θ_y · Π_{j=1}^{V} (ϕ_{y,j})^{x_j}

Find the most likely label ŷ for the word-count vector x = (0, 1, 0, 1, 1, 0, 0, 1) using add-1 smoothing,

ϕ_{y,j} = (1 + count(y, j)) / (V + Σ_{j'=1}^{V} count(y, j')),

and the decision rule ŷ = argmax_y log p(x, y; θ, ϕ). Show the final log (base-10) probabilities for each label rounded to 3 decimals. Treat log(0) as −∞.
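To make the mechanics of the smoothed estimate concrete, here is a minimal sketch on a tiny made-up two-label dataset (three word types, three documents — not the table above), following the same formulas:

```python
import math

# made-up toy data: rows are word-count vectors over V = 3 word types
X = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]
Y = ["a", "a", "b"]
V = 3

labels = sorted(set(Y))
theta = {y: Y.count(y) / len(Y) for y in labels}   # label prior θ_y

def phi(y, j):
    # add-1 smoothed ϕ_{y,j} = (1 + count(y,j)) / (V + Σ_{j'} count(y,j'))
    num = 1 + sum(x[j] for x, lab in zip(X, Y) if lab == y)
    den = V + sum(sum(x) for x, lab in zip(X, Y) if lab == y)
    return num / den

def log10_joint(x, y):
    # log10 p(x, y; θ, ϕ) = log10 θ_y + Σ_j x_j · log10 ϕ_{y,j}
    return math.log10(theta[y]) + sum(x[j] * math.log10(phi(y, j)) for j in range(V))

x_new = [1, 0, 1]
scores = {y: log10_joint(x_new, y) for y in labels}
y_hat = max(scores, key=scores.get)
print({y: round(s, 3) for y, s in scores.items()}, y_hat)
```

The same procedure, applied to the eight-feature table above, produces the three log probabilities the question asks for.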
3 Perceptron: Linear Separability and Weight Scaling
(a) (2 pts) Suppose we have the following data representing the XOR function:
x1 x2 f(x1,x2)
0 0 -1
0 1 1
1 0 1
1 1 -1
Table 1: XOR function data
Evidently, the data is not linearly separable. Therefore the perceptron algorithm will not be able to learn a classifier for XOR, based on this data.
However, we can add a third dimension/feature to each input such that the data becomes linearly separable. If we assign the values (1, 0, 0, 1) to the third dimension (x3) of the four data points, in order, will the new data be linearly separable? Assume 0 is the threshold for classification. Justify your answer.
After the addition of the third dimension, can we say that the perceptron algorithm is actually capable of learning the XOR function? Why or why not?
(b) (2 pts) Suppose we have a trained Perceptron with parameters (W,b). If we scale W by a positive constant factor c, will the new set of weights produce the exact same prediction for all the test data? Assume the threshold for classification is 0. Justify your answer.
(c) (2 pts) With the same setting as in (b), this time we translate W by a positive constant c (add c to each element of W). Will the new set of weights produce the exact same prediction for all the test data? Justify your answer.
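For parts (b) and (c) it can help to experiment numerically before writing a proof. The scaffold below uses made-up weights and test points (they are placeholders, not part of the problem) and simply prints the three predictions side by side; it is a tool for building intuition, not the answer:

```python
def predict(W, b, x):
    # perceptron decision rule with threshold 0: sign(W · x + b)
    s = sum(w * xi for w, xi in zip(W, x)) + b
    return 1 if s > 0 else -1

# made-up "trained" parameters and test points
W, b = [2.0, -1.0], 0.5
c = 3.0
points = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5], [0.2, 0.9]]

scaled = [c * w for w in W]     # part (b): W scaled by c, b unchanged
shifted = [w + c for w in W]    # part (c): c added to each element of W

for x in points:
    print(x, predict(W, b, x), predict(scaled, b, x), predict(shifted, b, x))
```

Try varying b, c, and the test points, and then argue in general from the sign of the decision score.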
4 Feedforward Neural Network
(2 pts) In Question 3, we tried to design a perceptron architecture in order to learn the XOR function represented by Table 1. Now, we want you to design a feedforward neural network to compute the XOR function.
Use a single output node and specify the activation function you choose for it. Also use a single hidden layer with ReLU activation function. Describe all weights and offsets (bias terms).
(Hint: In class, we discussed a neural network design that solves the XOR problem using tanh activation functions.)
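If you want to sanity-check a candidate design before writing it up, a small verifier like the one below can evaluate it against Table 1. The 2-unit ReLU hidden layer and sign-thresholded output used here are one assumed parameterization of the architecture described above:

```python
def relu(v):
    # elementwise ReLU
    return [max(0.0, u) for u in v]

def computes_xor(W1, b1, w2, b2):
    """Check a 2-2-1 ReLU network with a sign-threshold output against Table 1.

    W1: 2x2 hidden weights, b1: 2 hidden biases,
    w2: 2 output weights, b2: output bias.
    """
    data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
    for x, target in data:
        h = relu([sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(W1, b1)])
        y = sum(w * hi for w, hi in zip(w2, h)) + b2
        if (1 if y > 0 else -1) != target:
            return False
    return True

# a network with all-zero parameters certainly does not compute XOR
print(computes_xor([[0, 0], [0, 0]], [0, 0], [0, 0], 0))  # → False
```

Once `computes_xor` returns True for your weights and offsets, you have a concrete witness for your written answer.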
5 Dead Neurons
The ReLU activation function can lead to “dead neurons”, which can never be activated on any input. Consider a feedforward neural network with a single hidden layer and ReLU nonlinearity, assuming a binary input vector x ∈ {0, 1}^D and scalar output y:

z_i = ReLU(θ_i^{(x→z)} · x + b_i)
y = θ^{(z→y)} · z
Assume the above function is optimized to minimize a loss function (e.g., mean squared error) using stochastic gradient descent.
(a) (2 pts) Under what condition is node z_i “dead”? Your answer should be expressed in terms of the parameters θ_i^{(x→z)} and b_i.
(b) (2 pts) Suppose that the gradient of the loss on a given instance is ∂ℓ/∂y = 1. Derive the gradients ∂ℓ/∂θ_i^{(x→z)} and ∂ℓ/∂b_i for such an instance.
(c) (2 pts) Using your answers to the previous two parts, explain why a “dead” neuron can never be brought back to life during gradient-based learning.
(Hint: The notation used for this question is in line with that used in Eisenstein Chapter 3.)
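Not part of the question, but the gradient behavior at a dead unit is easy to see numerically. This sketch (all parameter values are made up) computes the ReLU chain-rule factor for the bias gradient, taking the upstream gradient ∂ℓ/∂z_i to be 1 for simplicity:

```python
def relu_grad_wrt_bias(theta_i, b_i, x, upstream=1.0):
    # z_i = ReLU(a_i) with pre-activation a_i = theta_i · x + b_i.
    # By the chain rule, dl/db_i = (dl/dz_i) * 1[a_i > 0]: the gradient
    # vanishes whenever the pre-activation is not positive.
    a_i = sum(t * xi for t, xi in zip(theta_i, x)) + b_i
    return upstream * (1.0 if a_i > 0 else 0.0)

x = [1, 0, 1]  # a binary input, as in the problem
print(relu_grad_wrt_bias([0.5, 0.2, 0.1], 0.3, x))     # active unit
print(relu_grad_wrt_bias([-0.5, 0.2, -0.1], -2.0, x))  # inactive unit
```

Part (c) asks you to combine this observation with your answer to (a): if the unit is inactive on every input, this factor is zero on every instance.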