DSO560 – Solved – Assignment Hub

DSO 560 – Text Analytics & Natural Language Processing

No exams will be accepted past 8:35pm PST

Instructions:
• WRITE ALL ANSWERS ON SEPARATE SHEETS OF PAPER

• DO ALL SECTIONS.

ONCE YOU SUBMIT YOUR EXAM, YOU CAN LEAVE CLASS

SHOW ALL WORK TO RECEIVE PARTIAL CREDIT

Short Answer (5 pts, recommended 30 minutes)
Pick 5 of the short answer questions below to answer. Write no more than 2 sentences in your explanation. Each question is 1pt: 0.5pts for the correct answer and 0.5pts for a correct explanation.

Use the following word2vec embedding table for Q1-Q2:

Word           Dim 1   Dim 2   Dim 3
A              -1.0     2.0     1.0
B               1.0     0.0     1.0
C               2.0    -2.0     0.0
D               1.0    -1.0     0.0
NULL/UNKNOWN    0.0     0.0     0.0
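
As a study aid, the table above can be loaded into a small script to check cosine similarities and to build a fixed-length input sequence. This is only a sketch under stated assumptions: the function names are hypothetical, out-of-vocabulary tokens are mapped to the NULL/UNKNOWN row, and padding is placed at the end by default (some libraries pad at the front instead, so confirm the convention used in class).

```python
import math
from itertools import combinations

# Embedding table copied from the exam sheet (word -> 3-dimensional vector).
EMBEDDINGS = {
    "A": (-1.0, 2.0, 1.0),
    "B": (1.0, 0.0, 1.0),
    "C": (2.0, -2.0, 0.0),
    "D": (1.0, -1.0, 0.0),
    "NULL": (0.0, 0.0, 0.0),
}


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarities(words):
    """Cosine similarity for every unordered pair of the given words."""
    return {
        (w1, w2): cosine_similarity(EMBEDDINGS[w1], EMBEDDINGS[w2])
        for w1, w2 in combinations(words, 2)
    }


def to_input_sequence(document, seq_len, padding="post"):
    """Map whitespace tokens to vectors, then pad with NULL up to seq_len.

    Assumptions: documents longer than seq_len are truncated from the end,
    and padding="post" appends the NULL vectors (padding="pre" prepends them).
    """
    tokens = document.split()[:seq_len]
    vectors = [EMBEDDINGS.get(tok, EMBEDDINGS["NULL"]) for tok in tokens]
    pad = [EMBEDDINGS["NULL"]] * (seq_len - len(vectors))
    return vectors + pad if padding == "post" else pad + vectors


if __name__ == "__main__":
    for pair, score in pairwise_similarities(["A", "B", "C", "D"]).items():
        print(pair, round(score, 3))
    print(to_input_sequence("A B A D", 5))
```

Cosine similarity depends only on the angle between two vectors, not on their lengths, so two rows that are scalar multiples of each other score the maximum value. The NULL/UNKNOWN row is left out of the pairwise comparison because its length is zero.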

1. Which pair of words would have the highest cosine similarity score with each other? Why?

2. Write what the input sequence to a sequential model would look like for the document “A B A D” and sequence length 5.

3. Write a regex with named capture groups that extracts the username and domain (username@domain) from an email address.

4. Provide an example of when an InvalidContinuationByte error could occur.

5. What is one way a Naïve Bayes model could deal with a test document containing a token it has not seen before during training?

6. Provide the binary encoding for the character 🥰 in UTF-8. Identify which bytes are continuation bytes.
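
For the two UTF-8 questions above, a short script can be used to inspect encoded bytes and to reproduce a decoding failure. This is a sketch, not a model answer: the helper names are hypothetical, and the Latin-1 example string is just one assumed scenario in which text written in one encoding is read back as UTF-8.

```python
def utf8_bytes_as_bits(text):
    """Return each UTF-8 byte of `text` as an 8-character bit string."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


def is_continuation_byte(byte):
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def decode_error_reason(raw):
    """Try to decode `raw` as UTF-8; return the error reason, or None on success."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.reason
    return None


if __name__ == "__main__":
    emoji = "\U0001F970"  # the character from Q6
    for byte in emoji.encode("utf-8"):
        kind = "continuation" if is_continuation_byte(byte) else "leading"
        print(format(byte, "08b"), kind)

    # Assumed scenario: bytes written as Latin-1 but read as UTF-8.
    # 0xE9 ("é" in Latin-1) looks like the start of a 3-byte UTF-8 sequence,
    # but the byte after it is a plain space, not a continuation byte.
    latin1_bytes = "caf\u00e9 au lait".encode("latin-1")
    print(decode_error_reason(latin1_bytes))
```

The leading byte of a multi-byte sequence announces how many bytes follow (`110xxxxx` for two, `1110xxxx` for three, `11110xxx` for four), and every following byte must start with `10`. A decoder reports an invalid continuation byte when that expectation is broken.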

The following sections will be condensed, and only 2 of the following sections will appear. You can look through prior years’ exams – not much has changed / will change:
Naïve Bayes (3 pts, recommended 15 minutes)
Vectorization and Similarity (3 pts, recommended 15 minutes)
N-Gram Language Models (3 pts, recommended 15 minutes)

True/False (5 pts, recommended 20 minutes)

Pick 5 of the statements below and indicate whether each is true or false. In both cases (true or false), explain your reasoning in a brief sentence. Each question is worth 1pt: 0.5pts for the correct answer, 0.5pts for the explanation.

A. The documents “John loves cats” and “cats love John” would have the same vector if we used TF-IDF or word count vectorization.
B. In a Hidden Markov Model, the word we observe at sequence step 2 is determined by the word we observe at sequence step 1.
C. If you want to make sure you keep human-readable tokens, you should pick lemmatization over stemming.
D. According to Zipf’s Law, there is a relatively uniform distribution of word counts across all the words in a corpus’ vocabulary.
E. Using a longer sequence length for your RNN models will reduce the chance of vanishing/exploding gradients.
F. LSTMs calculate all of their hidden states across all sequences at once.
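
When reasoning about statement A, it can help to build the word-count vectors by hand. The sketch below assumes the simplest possible preprocessing (lowercasing and whitespace tokenization, with no stemming or lemmatization); `word_counts` is a hypothetical helper, and a real vectorizer may normalize tokens differently.

```python
from collections import Counter


def word_counts(document):
    """Bag-of-words counts: lowercase, split on whitespace, ignore word order."""
    return Counter(document.lower().split())


if __name__ == "__main__":
    doc_1 = word_counts("John loves cats")
    doc_2 = word_counts("cats love John")
    print(doc_1)
    print(doc_2)
    # Word order never affects the counts, but "loves" and "love" are
    # distinct tokens unless they are stemmed or lemmatized first.
    print(doc_1 == doc_2)
```

TF-IDF reweights these counts by how rare each token is across the corpus, but like raw counts it carries no information about word order.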
