on
Accompaniment Generation for Melodies
This project concerns the creation of a machine learning model capable of adding accompaniments to melodies. The hope is that a user can upload a recording of themselves playing a melody, and then download the recording with a generated accompaniment added on.
Acousitics and Music Theory
Sounds and Waves
A sound is simply a vibration that propogates as an audible wave of pressure through a medium. We can represent sound mathematically as a wave. The simplest wave is arguably a sinusoidal wave, such as $A\cos(wt)$ where $A$ is the amplitude, $\omega$ is the angular frequency, and $t$ is time.
For example, we plot a 2 Hz sinusoidal wave, with an amplitude of 0.5, below:
It turns out that the average human can hear frequencies between 20 Hz to 22 kHz, and so the above wave would actually be inaudible. However, if we change the frequency to be within that range, say 440 Hz, it sounds something like this:
Since $\cos{t}$ is $2\pi$ periodic, the temporal frequency, $f$ of the wave is given by \(f = \frac{\omega}{2\pi}.\)
The percieved pitch of a sound wave is directly proportional to this frequency. However, the percieved pitch difference between two sound waves is logarithmic w.r.t. the frequencies. For example, the pitch difference between a 220 Hz and 440 Hz wave is the same as that between the 440 Hz and an 880 Hz wave.
If we play the 220 Hz, 440 Hz and 880 Hz waves in succession, they sound like this:
Musical Notes
Each pitch interval is seperated by 12 uniformally spaced points forming the 12 notes of the Western scale. The frequency of a note is thus exactly $\sqrt[12]{2}$ times that of the previous note.
Conventionally, we denote these notes (in increasing pitch) with $\mathrm{A}$, $\mathrm{A}^♯$ or $\mathrm{B}^♭$, $\mathrm{B}$, $\mathrm{B}^♯$ or $\mathrm{C}^♭$, $\mathrm{C}$, $\mathrm{C}^♯$ or $\mathrm{D}^♭$, $\mathrm{D}$, $\mathrm{D}^♯$ or $\mathrm{E}^♭$, $\mathrm{E}$, $\mathrm{F}$, $\mathrm{F}^♯$ or $\mathrm{G}^♭$, and $\mathrm{G}$.
Any set of 12 consecutive notes is called an octave. Since music typically consists of notes from multiple octaves, subscripts are commonly used to denote which one a given note belongs to. For example, $\mathrm{G}_4$ refers to the $\mathrm{G}$ note of the 4th octave.
The frequency of $\mathrm{A}_4$ is 440 Hz, and so, the frequency (in Hz) of the $i$th note from $\mathrm{A}_4$, is given by \(440\times 2^{\frac{i}{12}},\) where positive (resp. negative) values for $i$ indicate higher (resp. lower) pitch notes.
Note durations are defined in terms of beats. Commonly used notes and rests are defined in the table below:
Type | Note | Rest | Beats |
---|---|---|---|
Whole |
𝅝 | 𝄻 | $4$ |
Half |
𝅗𝅥 | 𝄼 | $2$ |
Quarter |
𝅘𝅥 | 𝄽 | $1$ |
Eighth |
♪ | 𝄾 | $\frac{1}{2}$ |
Sixteenth |
𝅘𝅥𝅯 | 𝄿 | $\frac{1}{4}$ |
Durations smaller than sixteenth do exist, but are used less often.
The absolute duration of a beat depends on the tempo of the piece, which is usually expressed in beats / min.
The Piano
For the purpose of this project, we will be focusing on just piano music. On a piano, notes correspond to particular keys as shown below:
A standard piano consists of 7 octaves (starting with $\mathrm{C}_1$), plus a few additional notes, namely, $\mathrm{A}_0$, $\mathrm{A}^♯_0$ or $\mathrm{B}^♭_0$, $\mathrm{B}_0$ and $\mathrm{C}_8$, for a total of 88 keys.
Hence, the frequency (in Hz) of the $i$th key is given by \(440\times 2^{\frac{i - 49}{12}}, 1 \leq i \leq 88.\)
Digital Music Representation
There are many ways to represent music digitally. We will consider three, starting with the most raw format, and ending with the most semantic one.
Raw Samples
An obvious way is to simply sample the original wave periodically at some frequency, $f_s$.
For example, we can sample the 2 Hz sinusodial wave from earlier at a frequecy of 4 Hz as shown below:
At first, it appears that a lot of information is being lost in sampling. However, the Nyquist sampling theorem tells us perfect reconstruction is possible provided $f_s$ is at least twice the maximum frequency present in the signal. This minimum frequency is called the Nyquist rate.
Recalling from earlier that the maximum frequency an average human can hear is ~22 kHz, this explains why audio is typically sampled at 44.1 kHz.
For us though, since we are only dealing with piano music, we can get away with a sampling rate of only ~8392 Hz, since the maximum frequency a standard piano can produce is ~4196 Hz.
If a signal is sampled below the Nyquist rate, the signal cannot be uniquely reconstructed. This phenomena is called aliasing, and an example of it can be seen if we sample the 2 Hz wave from before at less than 4 Hz, as shown below:
This representation will not be suitable for training our model.
Musical patterns are very difficult to learn by just seeing raw samples, and even if we could in theory, the representation is extremely space inefficient, making it infeasible to do so in practice.
Piano-Rolls
Of course, with music, a more natural representation can be obtained by plotting the notes that are playing at each time step, such as below:
For a standard piano, such a plot is essentially an $88 \times T$ matrix, $\mathbf{P}$, where $\mathbf{P}_{i,t} \in \lbrace 0, 1 \rbrace$ indicates if the $i$th note is playing at the $t$th time-step, for $1 \leq i \leq 88$ and $1 \leq t \leq T$.
We will refer to such a matrix as as a piano-roll - the one above sounds like this:
The problem with this representation is that the resulting matrix is typically very sparse since any given note is not playing for the vast majority of the time.
While it is arguably easier to see musical patterns in this representation, it still poses a problem for learning - a model could easily learn to never do anything since technically, it would be right most of the time.
Sheet Music
Of course, the most semantic representation of music is probably in the form of sheet music, such as below:
Reading sheet music requires a considerable amount of musical background, but the most important takeway for us is that it essentially encodes music as a sequence of instructions.
Our Representation
For our purposes, sheet music instructions are a bit too nuanced and so we will do something slightly different, while still keeping to its core idea.
First, we will assume the smallest duration possible is $\frac{1}{4}$ of a beat, i.e., a quarter-beat.
Then, we let "0.i"
and "1.i"
respectively denote the “note off” and “note on” instructions for the $i$th piano key, and "w.t"
denote a “wait” instruction, where $t \in \lbrace 1, 2, 4, 8, 16 \rbrace$ is the number of quarter-beats to wait. We will also define two additional markers, "BOS"
and "EOS"
to denote the beginning and end of a sequence.
With 88 notes, we have 183 possible instructions.
In this scheme, the above sheet music could be represented with the following sequence:
"w.4", "1.50", "1.53", "1.58", "w.4", "0.50", "0.53, "0.58" "w.4", "1.50", "1.53", "1.58", "w.4", "0.50", "0.53, "0.58" "w.4", "1.50", "1.53", "1.58", "w.4", "0.50", "0.53, "0.58" "w.4", "1.50", "1.53", "1.58", "w.4", "0.50", "0.53, "0.58"
The character representations above are only for human readability. In practice, each of these instructions will be encoded as a one-hot vector. In this way, we can represent music as a sequence of one-hot vectors, $\mathbf{i}^{(1)},\dots,\mathbf{i}^{(S)} \in \mathbb{1}^{183}$, where $\mathbb{1}^{183}$ is the set of all possible 183-dimensional one-hot vectors.
Actually, we could simply associate each character representation with a numerical index, $k$. The corresponding one-hot vector, $\mathbf{i}$, will be such that $\mathbf{i}_k = 1$ and $\mathbf{i}_j = 0, \forall k \neq j$. In this way, we can represent music as a single vector, $\mathbf{s} \in \mathbb{N}^{S}$.
Both representations will be useful as we will see later. For now, we write some code to convert back and forth between them:
types = ["note_on", "note_off"]
waits = [48, 24, 12, 6, 4, 3, 2, 1]
instruction_set = []
#delimiter
DELIM = "."
#note markers.
NOTE_INS = "n"
NOTE_OFF = "00"
NOTE_ON = "01"
instruction_set.extend([NOTE_INS + DELIM + NOTE_OFF + DELIM + str(note) for note in range(1, 89)])
instruction_set.extend([NOTE_INS + DELIM + NOTE_ON + DELIM + str(note) for note in range(1, 89)])
#rest markers.
WAIT_INS = "w"
for i in range(12):
instruction_set.extend([WAIT_INS + DELIM + str(i).zfill(2) + DELIM + str(wait) for wait in waits])
#sequence markers.
BOS = "BOS"
EOS = "EOS"
instruction_set.extend([BOS, EOS])
instruction_to_index = {instruction: index for index, instruction in enumerate(instruction_set)}
index_to_instruction = {index: instruction for index, instruction in enumerate(instruction_set)}