Machine Learning in Audio

Speech Synthesis

Speech synthesis is the artificial production of human speech. Think of what you hear when you listen to Siri or Alexa. There are different approaches to speech synthesis, some more successful than others:

Concatenative TTS (Traditional Method)

Record a large database of sentences from a single voice actor, then divide those recordings into chunks (words or phrases) that can be reassembled into whatever you want the system to say. This is often contrasted with parametric TTS, which generates speech from a statistical model of vocal parameters rather than from stored recordings.
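As a rough sketch of the concatenative idea (not any production system), the Python example below joins pre-recorded word clips from a tiny hypothetical database and crossfades the joins so the seams are less audible. The database contents, sample rate, and the synthesize helper are all invented for illustration.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sample rate of the recorded clips

# Hypothetical database: one pre-recorded clip per word, all from the same
# voice actor, stored as float waveforms. Real systems index thousands of
# sub-word units along with pitch and duration metadata.
database = {
    "hello": np.random.randn(SAMPLE_RATE // 2),
    "world": np.random.randn(SAMPLE_RATE // 2),
}

def synthesize(text, crossfade_ms=20):
    """Reassemble pre-recorded chunks into an utterance, crossfading each join."""
    fade = int(SAMPLE_RATE * crossfade_ms / 1000)
    words = text.split()
    out = database[words[0]].copy()
    for word in words[1:]:
        clip = database[word]
        ramp = np.linspace(0.0, 1.0, fade)
        # Overlap-add the end of the running output with the start of the next clip.
        out[-fade:] = out[-fade:] * (1 - ramp) + clip[:fade] * ramp
        out = np.concatenate([out, clip[fade:]])
    return out

utterance = synthesize("hello world")
print(utterance.shape)  # roughly one second of audio
```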

WaveNet

This is the speech synthesis system developed by Google’s DeepMind. Rather than working with words and grammar, it models the audio samples themselves, much the way an image model works with individual pixels.

It uses a convolutional neural network trained to generate raw waveforms at 24,000 samples per second, predicting each sample from the ones that came before it, so the transitions between samples are seamless. Each sample has a resolution of 16 bits. Think of the sample rate like pixel density in an image (DPI) and the bit depth like the number of shades each pixel can hold: the more you downsample either one, the more subtlety you lose.
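A minimal, untrained stand-in for this kind of model is sketched below, assuming PyTorch: a small stack of dilated causal convolutions whose dilation doubles at each layer, mapping a raw waveform to per-sample logits over quantized amplitude levels. The layer count, channel width, and 256 output classes (the original paper's 8-bit mu-law encoding) are illustrative choices, not the production configuration described above.

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """One dilated causal convolution: output at time t sees only inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # left-pad by the dilation so the layer stays causal
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        x = nn.functional.pad(x, (self.pad, 0))  # pad the past, never the future
        return torch.tanh(self.conv(x))

class TinyWaveNet(nn.Module):
    """Stack of causal convolutions with dilations 1, 2, 4, ... so the
    receptive field grows exponentially with depth."""
    def __init__(self, channels=32, num_layers=6, num_classes=256):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [CausalDilatedConv(channels, dilation=2 ** i) for i in range(num_layers)]
        )
        self.out = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, 1, time) raw waveform -> (batch, num_classes, time) logits,
        # one categorical distribution over amplitude levels per output sample.
        h = self.inp(x)
        for layer in self.layers:
            h = h + layer(h)  # residual connection
        return self.out(h)

wave = torch.randn(1, 1, 24000)  # one second of audio at 24,000 samples per second
logits = TinyWaveNet()(wave)
print(logits.shape)              # torch.Size([1, 256, 24000])
```

At generation time such a model is run autoregressively: each new sample is drawn from the predicted distribution and fed back in as input for the next step.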

Tone based: Because it models the audio signal itself rather than text units, WaveNet can mimic features of human speech such as breaths, lip smacks, and pauses, which makes it sound more natural.

ML driven: Rather than having pronunciation and grammar rules enforced by hand, WaveNet lets the machine learning model infer those rules from the recordings.

Experimental Music Using Audio Spectra

Using convolutional neural networks trained on various styles of music, artist Memo Akten "morphs" between styles (rather than crossfading between them), with techniques such as smearing and time-stretching the audio spectra across the model's latent space.
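As a shape-level illustration of what morphing in a latent space means (not Akten's actual pipeline), the toy example below encodes two spectrogram frames with a small untrained autoencoder, walks a straight line between their latent codes, and decodes every step. All of the dimensions and network sizes here are assumed for the example.

```python
import torch
import torch.nn as nn

N_BINS = 513   # frequency bins per spectrogram frame (assumed)
LATENT = 32    # size of the latent space (assumed)

# Toy autoencoder over single spectrogram frames; in practice this would be a
# convolutional model trained on spectrograms of each musical style.
encoder = nn.Sequential(nn.Linear(N_BINS, 128), nn.ReLU(), nn.Linear(128, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 128), nn.ReLU(), nn.Linear(128, N_BINS))

frame_a = torch.rand(1, N_BINS)  # a spectrum drawn from style A
frame_b = torch.rand(1, N_BINS)  # a spectrum drawn from style B

z_a, z_b = encoder(frame_a), encoder(frame_b)

# Morph by walking between the two latent codes and decoding each step.
# Unlike a crossfade of the raw audio, every intermediate spectrum is a
# point the model itself decodes from its latent space.
morph = [decoder((1 - t) * z_a + t * z_b) for t in torch.linspace(0, 1, 8)]
print(len(morph), morph[0].shape)  # 8 torch.Size([1, 513])
```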