Exercises – Homework 1

Global key and local key detection of audio and symbolic music
Li Su
Institute of Information Science, Academia Sinica, Taiwan
lisu@iis.sinica.edu.tw

(Don’t worry if you are not yet familiar with the music theory of tonality, modes, and keys. Taking this assignment will help you learn the theory!)

In this assignment, we will design algorithms for global and local music key detection on both audio and symbolic data, with full or limited contextual information. You will learn how to extract audio features, how to handle MIDI data, and the basics of the music theory of tonality together with its computational aspects.

The concept of a musical key

Let’s start from the notion of the major and minor scales. Denoting a whole tone by T and a semitone by S, a major scale is a note sequence following the pattern T-T-S-T-T-T-S, while a (natural) minor scale follows T-S-T-T-S-T-T. The functions of these seven notes are tonic, supertonic, mediant, subdominant, dominant, submediant, and leading tone, respectively (see Figure 1). The major and minor scales are the two most commonly seen diatonic scales. If the tonic of a major scale is C, we call it a C major scale; if the tonic of a minor scale is C, we call it a C minor scale (see Figures 2 and 3). In Western classical music there are in general 24 keys (two modes times 12 tonic notes).

A major key and a minor key that have the same tonic are called parallel keys. For example, the parallel minor of the C major key is the C minor key. A major key and a minor key that have the same key signature are called relative keys. For example, the relative minor of the C major key is the A minor key, the relative minor of E major is C# minor, etc.

(PS: Do not confuse the major/minor key with the major/minor chord. A chord is the co-occurrence of (usually 3) notes, like the major triad and the minor triad, while a key represents the structural information in a diatonic scale.)

Figure 1: the diatonic scale.

Figure 2: the C major scale.

Figure 3: the C minor scale.
Prerequisites:

The following libraries are suggested for this assignment:
 librosa, a Python library for music and audio signal processing
 pretty_midi, a Python library for MIDI data processing
 mir_eval, a Python library for MIR evaluation

The following datasets will be used:

The GTZAN dataset and Alexander Lerch’s key annotations:
[Dataset] https://drive.google.com/open?id=1Xy1AIWa4FifDF6voKVutvmghGOGEsFdZ
[Annotation] https://github.com/alexanderlerch/gtzan_key
Each sample in the GTZAN dataset is a 30-sec clip of music from one of 10 genres. We will use the data from the following 9 genres in the dataset for the experiments: blues, country, disco, hip-hop, jazz, metal, pop, reggae, and rock.

Schubert Winterreise Dataset (SWD):
[Dataset and Annotation] https://zenodo.org/record/4122060#.YituDHpBy5e
This is a multimodal dataset comprising various representations and annotations of Franz Schubert’s song cycle Winterreise. Schubert’s seminal work is an outstanding example of the Romantic song cycle, a central genre within Western classical music. Some of the versions are unavailable online; our TAs will collect as many of them for you as they can.

The GiantStep dataset:
[Dataset and Annotation]
https://drive.google.com/drive/folders/1D-PKkNWkWIQYcUDQokdzAFU0EL-oa3lc?usp=sharing
https://drive.google.com/drive/folders/1eS_UUX2MrEbEeTVmiDZwIrSW5VBamNrX

Task 1: Global key detection based on template matching

We assume that the tonic pitch is the one which appears the most often in a music recording. Based on this assumption, the tonic pitch of a music recording can be estimated by the following process:
1. Compute the chromagram Z = [z_1, z_2, …, z_i, …, z_N]. Each z_i is a 12-dimensional chroma vector at the i-th frame, and N is the number of frames in the song.
2. Take the average of the chroma vectors over all time frames in each song to obtain the song-level chroma vector x (this process is usually referred to as mean pooling):

   x = (1/N) Σ_{i=1}^{N} z_i
3. The maximal value of the song-level chroma vector indicates the tonic pitch. For example, if the maximal value of x is at the index of the C note, our estimate of the tonic is C.
4. Based on the estimated tonic, the final step is to find the mode (we consider only the major and minor modes in this assignment) by template matching. In this step, the mode is determined by the correlation coefficient R(x, y) between x and a binary-valued template y. For example, if the tonic is C, we consider two mode templates, one for the C major mode and one for the C minor mode: y_C major = [1 0 1 0 1 1 0 1 0 1 0 1] and y_C minor = [1 0 1 1 0 1 0 1 1 0 1 0] (the first index of y corresponds to the C note, the second to the C# note, …, and the 12th to the B note). If R(x, y_C major) > R(x, y_C minor), the estimated key is C major. The correlation coefficient between the song-level chroma and the template is defined as

R(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ),

where x̄ and ȳ are the means of the elements of x and y, and the sums run over the 12 chroma bins.
For reference, the 24 key labels are indexed as follows (upper case: major keys; lower case: minor keys):

A  A#  B  C  C#  D  D#  E  F  F#  G  G#
0  1   2  3  4   5  6   7  8  9   10 11

a  a#  b  c  c#  d  d#  e  f  f#  g  g#
12 13  14 15 16  17 18  19 20 21  22 23
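Steps 1–4 above can be sketched as follows. This is a minimal sketch assuming a chromagram whose row 0 corresponds to C (as produced by, e.g., librosa.feature.chroma_stft); the function name detect_global_key is ours, not from any library.

```python
import numpy as np

# Binary scale templates from the text: index 0 = C, 1 = C#, ..., 11 = B.
MAJOR = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], dtype=float)
MINOR = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0], dtype=float)
PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def detect_global_key(chromagram):
    """chromagram: (12, N) array, row 0 = C. Returns e.g. 'C major'."""
    x = chromagram.mean(axis=1)            # step 2: mean pooling
    tonic = int(np.argmax(x))              # step 3: tonic = strongest chroma bin
    # Step 4: rotate the mode templates so their tonic aligns with `tonic`.
    r_major = np.corrcoef(x, np.roll(MAJOR, tonic))[0, 1]
    r_minor = np.corrcoef(x, np.roll(MINOR, tonic))[0, 1]
    mode = 'major' if r_major > r_minor else 'minor'
    return f'{PITCHES[tonic]} {mode}'
```

np.corrcoef computes exactly the correlation coefficient R(x, y) defined above; the circular shift with np.roll generates the templates of all 12 tonics from the two C-rooted prototypes.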

5. To evaluate a key-finding algorithm, first define the raw accuracy:

   ACC = (number of correct detections) / (number of all music pieces in the dataset)

The raw accuracy, however, cannot resolve the ambiguity in key perception. For example, the C major key is easily detected as G major (a perfect-fifth error), A minor (a relative-major/minor error), or C minor (a parallel-major/minor error), because these erroneous keys are intrinsically “close” to C major. To address this issue, we also consider a weighted score, which gives partial credit to results that are related to the ground-truth key:

Relation to correct key Points
Same 1.0
Perfect fifth 0.5
Relative major/minor 0.3
Parallel major/minor 0.2
Other 0.0

Therefore, the weighted accuracy is defined as:

ACC = (# Same + 0.5(# Fifth) + 0.3(# Relative) + 0.2(# Parallel)) / (# of all music pieces in the dataset)

You can directly use the evaluation function mir_eval.key.evaluate in the mir_eval library.
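For illustration, here is a standalone sketch of the weighted scoring table. In practice you should use mir_eval.key.evaluate; note that conventions differ between implementations (e.g., whether the perfect-fifth credit is granted in one or both directions), so treat this sketch as one reasonable reading of the table, not as a re-implementation of mir_eval.

```python
PITCH_CLASS = {'C': 0, 'C#': 1, 'D': 2, 'D#': 3, 'E': 4, 'F': 5,
               'F#': 6, 'G': 7, 'G#': 8, 'A': 9, 'A#': 10, 'B': 11}

def weighted_score(ref_key, est_key):
    """Score one estimate, e.g. weighted_score('C major', 'G major') -> 0.5."""
    ref_tonic, ref_mode = ref_key.split()
    est_tonic, est_mode = est_key.split()
    ref_pc, est_pc = PITCH_CLASS[ref_tonic], PITCH_CLASS[est_tonic]
    if (ref_pc, ref_mode) == (est_pc, est_mode):
        return 1.0                                  # same key
    # Perfect fifth: here credited in both directions (a design choice;
    # the strict MIREX convention counts only the fifth above).
    if ref_mode == est_mode and (est_pc - ref_pc) % 12 in (5, 7):
        return 0.5
    # Relative major/minor: same key signature, e.g. C major <-> A minor.
    if ref_mode == 'major' and est_mode == 'minor' and (ref_pc - est_pc) % 12 == 3:
        return 0.3
    if ref_mode == 'minor' and est_mode == 'major' and (est_pc - ref_pc) % 12 == 3:
        return 0.3
    if ref_pc == est_pc:
        return 0.2                                  # parallel major/minor
    return 0.0
```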

Besides the binary-valued templates, let’s also consider other designs of the key templates:

The Krumhansl-Schmuckler key-finding algorithm. A more advanced set of templates for key detection is the Krumhansl-Schmuckler (K-S) profile. Instead of using binary-valued templates, we assign values to the template according to human perceptual experiments. The template values are shown in the following table (see the columns labeled K-S). The experiment is done by playing a set of context tones or chords, then playing a probe tone, and asking a listener to rate how well the probe tone fits with the context. In this case, we use the correlation coefficient between the input chroma features and the K-S profile for key detection. Notice that the major and minor templates are here assigned different values, so the templates of C major and A minor are no longer the same. Therefore, in this case we do not need to find the tonic first; we simply find the maximal correlation coefficient over the major profile, the minor profile, and the 12 circular shifts of each. A web resource http://rnhart.net/articles/key-finding/ demonstrates this idea.

Major key                       Minor key
Name          Binary  K-S       Name          Binary  K-S
Tonic         1       6.35      Tonic         1       6.33
              0       2.23                    0       2.68
Supertonic    1       3.48      Supertonic    1       3.52
              0       2.33      Mediant       1       5.38
Mediant       1       4.38                    0       2.60
Subdominant   1       4.09      Subdominant   1       3.53
              0       2.52                    0       2.54
Dominant      1       5.19      Dominant      1       4.75
              0       2.39      Submediant    1       3.98
Submediant    1       3.66                    0       2.69
              0       2.29      Leading tone  1       3.34
Leading tone  1       2.88                    0       3.17
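Because the K-S major and minor profiles differ in value, the 24-key scan can be done in one pass, without estimating the tonic first. A minimal sketch (the profile values are copied from the table above; the function name ks_key is ours):

```python
import numpy as np

# Krumhansl-Schmuckler profiles from the table (index 0 = tonic).
KS_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KS_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def ks_key(x):
    """x: 12-dim song-level chroma vector (index 0 = C).
    Scans all 24 keys and returns the label with maximal correlation."""
    best, best_r = None, -np.inf
    for mode, profile in (('major', KS_MAJOR), ('minor', KS_MINOR)):
        for tonic in range(12):
            r = np.corrcoef(x, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = f'{PITCHES[tonic]} {mode}', r
    return best
```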

The harmonic templates. We assume that the strength of the fundamental frequency of a note is 1 and that the strength of the k-th harmonic above the fundamental is α^k, with 0 < α < 1. Considering harmonics up to order seven (i.e., up to the 8th partial), the chroma template of a single C note is
u_C = (1 + α + α³ + α⁷, 0, 0, 0, α⁴, 0, 0, α² + α⁵, 0, 0, α⁶, 0)
The template of the C major key is then u_C + u_D + u_E + u_F + u_G + u_A + u_B. The harmonic templates for all 24 major/minor keys are constructed in the same way.
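This construction can be sketched as follows. The sketch assumes equal temperament, where partials 2, 4, and 8 fold back onto the fundamental’s pitch class, partials 3 and 6 onto the fifth, partial 5 onto the major third, and partial 7 approximately onto the minor seventh; the function names are ours.

```python
import numpy as np

# Pitch-class offsets (semitones above the fundamental) of partials 1..8.
HARMONIC_OFFSETS = [0, 0, 7, 0, 4, 7, 10, 0]

def note_template(alpha=0.9):
    """Chroma template of a single C note; partial k+1 has strength alpha**k."""
    u = np.zeros(12)
    for k, offset in enumerate(HARMONIC_OFFSETS):
        u[offset] += alpha ** k          # k = 0 gives strength 1 for the f0
    return u

def key_template(tonic, mode='major', alpha=0.9):
    """Sum the note templates of the seven scale degrees of the key."""
    degrees = [0, 2, 4, 5, 7, 9, 11] if mode == 'major' else [0, 2, 3, 5, 7, 8, 10]
    u = note_template(alpha)
    return sum(np.roll(u, (tonic + d) % 12) for d in degrees)
```

With tonic = 0 and mode = 'major', key_template reproduces u_C + u_D + … + u_B from the text.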

The data-driven templates. All the above templates are determined by our domain knowledge of music. However, constructing the templates from real-world data (i.e., the machine learning approach) is expected to be the “ultimate” solution, because our application scenario always involves real-world data. To develop data-driven key-finding algorithms, practical issues include the size of the data, the consistency between the training and testing data, data imbalance, data augmentation, overfitting, etc. In this assignment, as a bonus question, we consider an external dataset (the GiantStep dataset) as the training set, and we wish to train the templates on it. This process is also known as dictionary learning in the machine learning literature. Three dictionary learning methods are suggested in this assignment: 1) taking the mean of the chroma vectors for each class, 2) randomly sampling chroma vectors from each class, and 3) k-means clustering. See the description of the bonus question for details.
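The three suggested dictionary-learning methods can be sketched as below. This is our own illustrative helper (not from any library); it assumes you have already pooled one 12-dimensional chroma vector per training song and assigned each a key label.

```python
import numpy as np

def learn_templates(chromas, labels, method='mean', n_atoms=1, seed=0):
    """Learn key templates from labeled song-level chroma vectors.

    chromas: (M, 12) array; labels: length-M sequence of key indices (0..23).
    Returns {key_index: (n_atoms_or_1, 12) array of templates}.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    templates = {}
    for key in sorted(set(labels.tolist())):
        X = chromas[labels == key]                   # all examples of this key
        if method == 'mean':                         # 1) per-class mean
            templates[key] = X.mean(axis=0, keepdims=True)
        elif method == 'sample':                     # 2) random exemplars
            idx = rng.choice(len(X), size=min(n_atoms, len(X)), replace=False)
            templates[key] = X[idx]
        elif method == 'kmeans':                     # 3) k-means centroids
            # A tiny Lloyd's algorithm, enough for a sketch.
            C = X[rng.choice(len(X), size=min(n_atoms, len(X)), replace=False)]
            for _ in range(20):
                assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
                C = np.stack([X[assign == j].mean(0) if np.any(assign == j)
                              else C[j] for j in range(len(C))])
            templates[key] = C
    return templates
```

At test time, the learned templates replace the hand-crafted ones: correlate the song-level chroma against every atom of every key and pick the key with the highest score.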

Hint: the chroma features can be obtained from the following functions:
• librosa.feature.chroma_stft
• librosa.feature.chroma_cqt
• librosa.feature.chroma_cens

Q2 (30%): Repeat the process of Q1 on the MIDI data and all the available audio versions (i.e., HU33, SC06, FI66, FI80) of the Schubert Winterreise Dataset. Report the average raw accuracy and weighted accuracy for each version. Are there any differences among the versions? Is MIDI data easier for key finding? Discuss your results.

Q3 (bonus) Construct the templates for the 24 major/minor keys using the GiantStep dataset. There are many possible ways to construct the templates. There can also be multiple templates for each key.

Task 2: Local key detection

Similarly, the raw accuracy and weighted accuracy of local key finding can be defined as

ACC = (# of correct detections) / (# of time instances (detections) in all music pieces)

ACC = (# Same + 0.5(# Fifth) + 0.3(# Relative) + 0.2(# Parallel)) / (# of all time instances (detections) in all music pieces)

Note that these accuracies count the number of time instances rather than the number of pieces.

Q4 (20%): Based on Task 1, design a local key detector that outputs the key of the music every 0.1 second. That is, there is a key detection output for every time step, and in this task we set the time step to 0.1 second. Run your method on the MIDI data and all the available audio versions of the Schubert Winterreise Dataset. For simplicity, evaluate the results against annotator 1. Report the raw accuracy and the weighted accuracy.
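One simple way to turn any global key detector into a local one is a sliding window. The sketch below is one possible setup, not the required method: the window length win_sec is a free parameter you should tune, and detect_key can be any of the Task 1 detectors.

```python
import numpy as np

def local_keys(chromagram, detect_key, hop_sec, win_sec=4.0, step_sec=0.1):
    """Run a global key detector over a sliding window of the chromagram.

    chromagram: (12, N) array with one column every `hop_sec` seconds.
    detect_key: any function mapping a (12, n) chroma excerpt to a key label.
    Returns one key label per `step_sec` time step.
    """
    n_frames = chromagram.shape[1]
    n_steps = int(round(n_frames * hop_sec / step_sec))
    half = win_sec / 2
    labels = []
    for k in range(n_steps):
        t = k * step_sec
        # Frames within +/- half a window of the current time instant.
        lo = max(0, int((t - half) / hop_sec))
        hi = min(n_frames, int((t + half) / hop_sec) + 1)
        labels.append(detect_key(chromagram[:, lo:hi]))
    return labels
```

A shorter window tracks modulations more quickly but makes each decision noisier; this trade-off is worth discussing in your report.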

Q5 (10%): The local key detection problem can be regarded as a segmentation problem. There have been evaluation metrics for segmentation performance in the chord recognition problem, but such metrics have not been applied to local key detection. Please apply the over-segmentation, under-segmentation, and average segmentation measures (please refer to the directional Hamming divergence; see page 33 of the Lecture 3 slides) to the local key detection results on the Schubert Winterreise Dataset.

Hint: these metrics have been implemented somewhere in mir_eval.chord.
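Below is a pure-NumPy-free sketch of the directional Hamming idea behind those metrics (mir_eval.chord provides overseg, underseg, and seg). Conventions for which direction counts as over- versus under-segmentation vary across papers, so check the lecture slides; this sketch follows one common reading.

```python
def directional_hamming(a, b):
    """Directional Hamming divergence: how badly do segments of `a` fail to
    be covered by a single segment of `b`?  a, b: lists of (start, end)."""
    total = sum(e - s for s, e in a)
    mismatch = 0.0
    for s, e in a:
        # Duration of this segment not covered by its best-matching b-segment.
        best = max(max(0.0, min(e, be) - max(s, bs)) for bs, be in b)
        mismatch += (e - s) - best
    return mismatch / total

def segmentation_scores(ref, est):
    """Over-/under-segmentation scores in [0, 1] (1 = best) and their minimum,
    in the spirit of mir_eval.chord.overseg / underseg / seg."""
    overseg = 1.0 - directional_hamming(ref, est)   # penalizes fragmented est
    underseg = 1.0 - directional_hamming(est, ref)  # penalizes merged est
    return overseg, underseg, min(overseg, underseg)
```

For example, if the estimate merges two reference key segments into one, the under-segmentation score drops while the over-segmentation score stays at 1.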

Please submit a .zip file containing your report (PDF) and your code, with the file name
“HW1_[your ID]”, to the course website.
