Description
(ENGR-E 511) Homework 2
Instructions
• Submission format: Jupyter Notebook + HTML)
– Your notebook should be a comprehensive report, not just a code snippet. Mark-ups are mandatory to answer the homework questions. You need to use LaTeX equations in the markup if you’re asked.
– Google Colab is the best place to begin with if this is the first time using iPython notebook.
– Download your notebook as an .html version and submit it as well, so that the AIs can check out the plots and audio. Here is how to convert to html in Google Colab.
– Meaning you need to embed an audio player in there if you’re asked to submit an audio file
• Avoid using toolboxes.
P1: De-beeper [5 points]
1. x.wav is a speech signal contaminated by a beep sound. As I haven’t taught you guys how to properly do speech enhancement yet, you’re not supposed to know a machine learning-based solution to this problem (don’t worry I’ll cover it soon). Instead, you did learn how to do STFT, so I want you to at least manually erase the beep sound from this signal to recover the clean speech source. If you have a perfect pitch, you might be able to know the pitch of the beep, but I assume that you don’t. That’s why you have to see the spectrogram to find out the beep frequency.
2. First off, create a DFT matrix F using the first equation in M02-S12. You’ll of course create a N × N complex matrix, but if you see its real and imaginary versions separately, you’ll see something like the ones in M02-S14 (the ones in the slide are 20 × 20). Determine your N, which is the frame size shown in M02-S17. For example, since the signal’s sampling rate is 16kHz, if your N is 1600, your frequency resolution (the range a Fourier coefficient covers) will be 10Hz. If your N is large, you’ll get finer frequency resolution and vice versa. Feel free to try out a few different choices, but N = 1600 must be enough.
3. Prepare your data matrix X. You extract the first frame of N samples from the input signal, and apply a Hann window (or any other windows that can overlap-and-add to one). What that means is that from the definition of Hann window , you create a window of size N and element-wise multiply the window and your N audio samples. Place it as your first column vector of the data matrix X. Move by N/2 samples. Extract another frame of N samples and apply the window. This goes to the second column vector of X. Do it for your third frame (which should start from (N + 1)’th sample), and so on. Since you moved just by the half of the frame size, your frames are overlapping each other by 50%.
5. Apply the inverse DFT matrix, which you can also create by using the equation in M2 S12. Let’s call this F∗. Since it’s the inverse transform, F∗F≈ I (you can check it, although the off diagonals might be a very small number rather than zero). You apply this matrix to your spectrogram, which is free from the beep tones, to get back to the recovered version of your data matrix, Xˆ . In theory this should give you a real-valued matrix, but you’ll still see some imaginary parts with a very small value. Ignore them by just taking the real part. Reverse the procedure in P1.3 to get the time domain signal. Basically it must be a procedure that transpose every column vector of Xˆ and overlap-and-add the right half of t-th row vector with the left half of the (t + 1)-th row vector and so on. Listen to the signal to check if the beep tones are gone.
6. Submit your code and the de-beepped audio file, by embedding it into your notebook.
HW2 P2: Parallax [5 points]
1. You live in a planet far away from the earth. Your solar system belongs to a galaxy, which is about to merge with another galaxy (it is not rare in the outer space, but don’t worry, the merger takes a few billions of years). Anyhow, because of this merger, in the deep sky you see lots of stars from your galaxy as well as the other stars in the other neighboring galaxy. Of course you don’t know which one is from which galaxy though.
2. You are going to use a technique called “parallax” to solve this problem. It’s actually very similar to the computer vision algorithm called “stereo matching,” where stereophonic cameras find out the 3D depth information from the visual scene. That’s actually why we humans can recognize the distance of a visual object (we have two eyes). See Figure 1 for an example.
Figure 1: The tree is closer than the mountain. So, from the left camera, the tree is located on the right hand side, while the right camera captures it on the left hand side. On the contrary, the mountain in the back does not have this disparity.
A star in your galaxy
Your planet in Dec.
The other stars in the neighboring galaxy
Dec
Figure 3: The oscillation of the close star (green) in the two pictures. Note that the other stars (blue) don’t move much.
7. Perform k-means clustering on this disparity dataset. Find out the cluster means and report the values. Which one do you think corresponds to the stars in your galaxy and which is for the other galaxy? Why do you think so? Justify your answer in the report.
8. Note that you’re not supposed to use someone else’ implementation. Write up your own code for k-means clustering.
9. One tip for implementing k-means clustering from scratch is to make sure that each cluster contains at least one sample throughout the iterative updates. It’s to avoid any trivial solution, e.g., the algorithm learns a gigantic cluster that contains everything.
HW2 P3: GMM for Parallax [5 points]
1. Implement your own EM algorithm for GMM and propose an alternative solution to the previous parallax problem. Report the estimated means, variances, and prior weights. Explain which one you prefer between the k-means solution and the GMM results. Justify your answer.



Reviews
There are no reviews yet.