CS771 – 1 Solved

We need to minimise the objective function given in the question with respect to w_c and M_c. We start off by optimising with respect to w_c: setting the derivative to 0 gives

w_c = (1/N_c) Σ_{n: y_n = c} x_n    (1)

so the optimal w_c is the mean of the class-c points. Now taking the derivative with respect to M_c and equating it to 0 we get:

M_c^{-1} = (1/N_c) Σ_{n: y_n = c} (x_n − w_c)(x_n − w_c)^T    (2)

Here the RHS of (2) is the covariance matrix of the x_n around w_c (assuming a uniform probability distribution over the x_n). Thus,

M_c = [ (1/N_c) Σ_{n: y_n = c} (x_n − w_c)(x_n − w_c)^T ]^{-1}, where x_n ∈ c    (3)

(3) shows that M_c is the inverse of the covariance matrix.
In the special case where M_c is the identity matrix, the equation in the question boils down to the squared Euclidean distance between x_n and w_c:

Σ_{x_n: y_n = c} ||x_n − w_c||^2    (4)
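The derivation above (class mean for the prototype, inverse class covariance for the metric) can be sanity-checked numerically. A minimal sketch with synthetic data for a single class; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_c = rng.normal(size=(100, 3))  # hypothetical points of one class c

# Optimal prototype w_c is the class mean
w_c = X_c.mean(axis=0)

# Optimal metric M_c is the inverse of the class covariance
diffs = X_c - w_c
cov_c = diffs.T @ diffs / len(X_c)
M_c = np.linalg.inv(cov_c)

# With M_c = I the Mahalanobis distance reduces to squared Euclidean distance
x = X_c[0]
d_maha = (x - w_c) @ M_c @ (x - w_c)
d_eucl = (x - w_c) @ np.eye(3) @ (x - w_c)
```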
2
An algorithm is consistent when its test error rate converges to the Bayes-optimal error rate; consistency thus also implies a definite decision boundary.
In a noise-free setting, with a sampled subset of the population used for training, the decision boundary is determined by the sampled points; even if training shows a 0 error rate, that boundary is not necessarily the optimal one. Thus, in a noise-free setting the one-nearest-neighbour algorithm is not consistent.
3
We can use the same Information Gain formula but replace Entropy with a cost function that measures co-linearity between the outputs, say cosine similarity. That is,

IG(D, A) = Sim(D) − Σ_{v ∈ values(A)} (|D_v| / |D|) Sim(D_v)

where

Sim(D) = Σ_{i,j}^{|D|} (y_i · y_j)

The attribute with the highest information gain is then selected as the splitting attribute.
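A minimal sketch of the proposed criterion, assuming scalar outputs and the usual |D_v|/|D| weighting over the children (the weighting is an assumption, since only the entropy replacement is specified); `sim` and `info_gain` are illustrative helper names:

```python
import numpy as np

def sim(y):
    """Sim(D) = sum over all pairs (i, j) of y_i * y_j.
    For scalar labels this equals (sum of y)^2."""
    return float(np.sum(np.outer(y, y)))

def info_gain(y, mask):
    """Information gain with Sim in place of entropy, splitting D into
    the points where mask is True and where it is False."""
    n = len(y)
    parts = [y[mask], y[~mask]]
    return sim(y) - sum(len(p) / n * sim(p) for p in parts)

y = np.array([1.0, 2.0, 3.0, 4.0])
g = info_gain(y, np.array([True, True, False, False]))
```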
4
Let’s start with the basic formula for linear regression to find f(x∗):

f(x∗) = x∗^T w, where w = (X^T X)^{-1} X^T y

Transposing both sides (f(x∗) is a scalar, so its transpose yields the same value):

f(x∗) = y^T X (X^T X)^{-1} x∗ = Σ_{n=1}^{N} y_n (x_n^T (X^T X)^{-1} x∗)    (5)

In (5), x_n^T (X^T X)^{-1} x∗ corresponds to the w_n of the equation given in the question. This is quite different from the typical style of obtaining the test response by weighting the training responses by the inverse of the distance between the training points and the test point. Here we are taking a dot-product similarity (through (X^T X)^{-1}) between the test data point and all the training data points.
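The equivalence between the primal prediction x∗^T w and the weighted sum of training responses in (5) can be checked numerically; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # training inputs, rows x_n
y = rng.normal(size=20)        # training responses
x_star = rng.normal(size=3)    # test input

# Primal prediction: f(x*) = x*^T w with w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
f_primal = float(x_star @ w)

# Eq. (5): f(x*) = sum_n y_n * (x_n^T (X^T X)^{-1} x*)
weights = X @ np.linalg.solve(X.T @ X, x_star)  # one weight per training point
f_dual = float(y @ weights)
```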
5

Now let’s write this equation in vector and matrix form to make the algebra easier: stack the y_n into a vector y and the x_n into a matrix X.
Let’s also assume that X̃ = X ⊙ M, where M is a mask matrix of 0s and 1s drawn from a Bernoulli distribution, with p the probability of 1 (include the entry) and (1 − p) the probability of 0 (exclude the entry).
Therefore, E[X̃] = pX and cov(X̃, X̃) = p(1 − p)Γ², where Γ² comes from the variance of M; as all the mask values are independent, it is just a diagonal matrix carrying the variance terms.

L(w) = (y^T − w^T X̃^T)(y − X̃w)
E[L(w)] = E[(y^T − w^T X̃^T)(y − X̃w)]
= E[y^T y − 2y^T X̃w + w^T X̃^T X̃w]
= y^T y − 2p y^T Xw + w^T E[X̃^T X̃] w

From cov(M, M) = E[MM^T] − E[M]E[M]^T we get E[X̃^T X̃] = p² X^T X + p(1 − p)Γ², so

E[L(w)] = y^T y − 2p y^T Xw + p² w^T X^T Xw + p(1 − p) w^T Γ² w
= (y^T − p w^T X^T)(y − pXw) + p(1 − p)||Γw||²
= Σ_{n=1}^{N} (y_n − p w^T x_n)² + p(1 − p)||Γw||²

This is similar to an L2-regularised loss function: an L2 penalty with λ = p(1 − p) on the Γ-scaled weights.
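This closed form for the expected loss can be checked against a Monte Carlo average over Bernoulli masks. A sketch with synthetic data, taking Γ² = diag(X^T X) so that p(1 − p) w^T Γ² w matches the expansion of E[X̃^T X̃] (an interpretation of the Γ defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 50, 4, 0.8
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Closed form: E[L(w)] = ||y - pXw||^2 + p(1-p) w^T diag(X^T X) w
gamma2 = np.diag(np.sum(X ** 2, axis=0))  # Γ² taken as diag(X^T X)
closed = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * w @ gamma2 @ w

# Monte Carlo estimate of E[||y - (X ⊙ M)w||^2] over Bernoulli(p) masks
trials = 20000
total = 0.0
for _ in range(trials):
    M = rng.binomial(1, p, size=(N, D))
    total += np.sum((y - (X * M) @ w) ** 2)
mc = total / trials
```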
6
1 Learning with Prototypes
We have data with 4096 features per point (extracted from a deep learning model). Our task is to predict the class labels of unseen datapoints.
1. We start off by first looking at the data.
2. Then we compute the means µ_k of the 40 seen classes.
1.1 Method 1 – Using class attributes similarity
We compute the similarity between seen and unseen classes by taking the dot product between the class-attribute vectors of the seen and unseen classes:

similarity = A_unseen A_seen^T

Then the similarity values are normalised by dividing each value in the similarity matrix by the sum of its row.
Then we can estimate the means of the unseen classes by:

means_unseen = similarity × means_seen
This method gives an accuracy of 46.893%.
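Method 1 can be sketched as below; the attribute dimensionality (85) and the number of unseen classes (10) are illustrative assumptions, and random data stands in for the real features:

```python
import numpy as np

rng = np.random.default_rng(0)
A_seen = rng.normal(size=(40, 85))     # hypothetical seen-class attribute vectors
A_unseen = rng.normal(size=(10, 85))   # hypothetical unseen-class attribute vectors
mu_seen = rng.normal(size=(40, 4096))  # class means computed from training features

# Dot-product similarity between unseen and seen class attributes
S = A_unseen @ A_seen.T

# Normalise each row of the similarity matrix to sum to 1
S = S / S.sum(axis=1, keepdims=True)

# Estimate unseen-class means as similarity-weighted combinations of seen means
mu_unseen = S @ mu_seen
```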
1.2 Method 2 – Linear regression
This method starts like Method 1 by computing the means of the seen classes. Then we perform linear regression between the class-attribute values and the means, giving a model that maps class attributes to means. We then use this model to compute the means of the unseen classes and make predictions.
The linear regression is regularised with a parameter λ. The chosen values of λ are: 0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10, 20, 50, 100.
The prediction accuracy is then compared across the values of λ to find the optimal λ, i.e. the one with the highest accuracy.
This model achieved its highest accuracy of 73.722% with an optimal λ of 6.50.

Figure 1: Accuracy vs lambda. The left chart shows the accuracy score for each value of lambda on a linear scale, while the right chart shows lambda on a log scale. The optimal lambda is marked with a vertical blue line.
