
Research & Development

Research Objective: To create an effective and convincing singing voice synthesis system, with a simple process for building new voice styles, that music makers could use to create new toplines without a live singer.

Technological music tools such as digital audio workstations and electronic instruments have enabled musical creation without limitations from performance ability. Many instruments have been transferred to digital space through audio-based sampling and synthetic replication. However, the human singing voice has not yet been effectively represented in synthetic form. Automating vocals is particularly difficult because, beyond pitch and timbre, the vocalization of language requires additional parameters for control. Because producing a vocal synthesizer and its available voices is complex, the current market reflects these difficulties in a handful of products with unnatural voices that do not adapt to vocal trends. Developments in deep neural networks open up the possibility of creating a highly realistic vocal synthesizer that does not depend on large stores of audio. The aim of this project was to bring the human voice into the realm of digital music production, enabling a music maker to include a large range of vocal styles within their production tool set.

Generating natural, high-quality synthetic speech has been made accessible by open-source implementations of neural network architectures for text-to-speech synthesis. While training a model can take weeks, a trained model can generate realistic speech in seconds. Since state-of-the-art technology is now available to replicate non-musical speech, these technologies were repurposed for singing.
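For illustration, here is a minimal inference sketch using Coqui TTS, one such open-source framework. The model name below is one of the pretrained checkpoints the library distributes; any comparable text-to-speech framework would serve the same purpose.

```python
# Minimal text-to-speech inference using the open-source Coqui TTS library.
# The checkpoint downloads on first use; inference itself takes seconds.
from TTS.api import TTS

# Load a pretrained English model (an example checkpoint name from the library).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Render a line of text to a wav file.
tts.tts_to_file(text="Hello from a synthetic voice.", file_path="speech.wav")
```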

While neural speech synthesis models often require extensive training data, this system aims to achieve low data requirements through novel solutions. Vocal systems are often built around malleability in a parametric sense, but not a cultural one. Creative applications, to be effective in their ever-evolving mediums, need to be malleable by the users themselves. Vocal style is the most important active parameter for experimentation in modern music: the story, the texture, personal stylings, I rank these all above performance virtuosity. Therefore, this system emphasizes a simple process for creating new voice presets over highly configurable parameters. If future vocal synthesis systems are to succeed, they must be agile enough to adapt to trends, and the sound must not be defined solely by the engineers behind the technology. The sound has to be driven by listeners, artists, and trends.

The audio data used in this project features the voice of artist Maya Malkin, a friend and frequent collaborator.


Prototype 1

Create music using a non-singing synthetic voice.

First Prototype Question: What is needed as output from a standard speech synthesis model in order to create vocal music?

 

Prototype 1 audio sample.

The scale of audio data needed to train a model for effective speech synthesis means that errors in the creation of a vocal dataset can be expensive in terms of resources spent (hiring a vocalist, time spent recording and prepping data, cost to train models). To mitigate the risk of spending resources unnecessarily, it is important to decide what is actually needed as output from a model in order to create convincing vocal music. To explore this question, I began by attempting to create a topline with regular speech from a previously trained model.

After a deep dive into vocal production techniques, the results sounded satisfactory: simple pitch and rhythm adjustments created a representation of singing.
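As a rough sketch of the kind of adjustment involved, assuming the synthesized line has been saved as speech.wav, librosa's pitch-shift and time-stretch effects can stand in for the manual production work:

```python
# Basic manipulations used to push spoken output toward singing:
# shift the phrase toward a target note and stretch it to a beat length.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)  # the synthesized spoken line

# Shift the whole phrase up four semitones toward a target note.
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

# Stretch the phrase to sit on the beat (rate < 1 slows it down).
y_sung = librosa.effects.time_stretch(y_pitched, rate=0.8)

sf.write("sung_line.wav", y_sung, sr)
```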


Prototype 2

Develop a prototype using a dataset of singing at a single pitch.

Second Prototype Question: What parameters of singing will be necessary to create modern popular music? How far can singing be stripped back while still yielding a convincing topline?

 

Prototype 2 audio sample.

In the first prototype, it was seen that pitch and rhythm adjustments could communicate lyrics in a musical style. The question to address at this stage is whether a singing voice synthesis system needs multiple highly adjustable parameters that simulate the techniques used in complex vocal performances. The conclusion drawn from the current state of popular music is that an extreme range of vocal articulations and pitches is not needed. An interesting timbre and a few notes might be enough; style is more important than demonstrating technical prowess. The first prototype involved extensive pitch stretching, where a complete line of regular speech was tuned to a single pitch. This eliminates the guesswork when re-tuning elements of the phrase to create a melody, but that amount of audio manipulation results in unnatural artefacts. To simplify this step, a prototype dataset could be made by singing at a single pitch.
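A minimal sketch of that pitch-flattening step, assuming librosa for pitch estimation and shifting. A production pipeline would correct pitch frame by frame rather than with one global shift, which is exactly where the artefacts mentioned above creep in:

```python
# Flatten a spoken phrase to a single target pitch: estimate the phrase's
# fundamental frequency with pyin, then shift its median f0 to a chosen note.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)

# Estimate f0 across the phrase (C2..C7 covers a typical vocal range).
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
median_f0 = np.nanmedian(f0)  # unvoiced frames are NaN

# Semitones between the phrase's median pitch and the target note.
target_hz = librosa.note_to_hz("A3")
n_steps = 12 * np.log2(target_hz / median_f0)

y_flat = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(n_steps))
sf.write("single_pitch.wav", y_flat, sr)
```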

Neural speech synthesis systems model datasets that are uniform in tone and pitch more easily than highly divergent ones, such as multi-speaker datasets or a single speaker performing characters. When training on a single note at a consistent tempo, models learn quickly and require less data, and it becomes easy to build out a range of registers and pitches. Single-pitch datasets also create an opportunity to employ a speech synthesis technique known as voice cloning, in which parameters learned from a large speech corpus are adapted to a smaller dataset. If the model does not need to account for pitch and rhythm variation, then training can continue from a model created on a large non-musical speech corpus. Combining voice cloning with single-pitch singing provides a solution to the hours of training data usually needed by text-to-speech systems. This process involves building the dataset to fit an open-source speech synthesis framework, rather than attempting to manipulate the technology to fit the data.
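The voice cloning step can be sketched schematically. This is not the project's actual training code: a one-layer network stands in for the real acoustic model, random tensors stand in for real (text, mel-spectrogram) pairs, and speech_pretrained.pt is a hypothetical checkpoint from the large-corpus run. The pattern of loading pretrained weights and fine-tuning at a low learning rate is the point:

```python
# Voice cloning as fine-tuning: resume from weights learned on a large
# regular-speech corpus, then adapt to a small single-pitch singing dataset.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(64, 80)  # placeholder: text features in, mel frames out
model.load_state_dict(torch.load("speech_pretrained.pt"))  # large-corpus weights

# Placeholder single-pitch dataset: dozens of utterances, not tens of hours.
features, mels = torch.randn(30, 64), torch.randn(30, 80)
loader = DataLoader(TensorDataset(features, mels), batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # gentle adaptation
model.train()
for epoch in range(50):          # far fewer steps than training from scratch
    for x, mel in loader:
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(model(x), mel)
        loss.backward()
        optimizer.step()
```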


Prototype 3

Assemble various models into a complete singing voice synthesis system.

Third Prototype Question: Once the individual pitch models are integrated into a system, does the system meet my needs for a synthetic vocal solution?

 

Prototype 3 audio sample.

The first two prototypes helped develop and test a novel approach to creating synthetic singing vocals with speech synthesis technology. The strategy is to use a small dataset of single-pitch singing combined with parameters learned from regular speech. The second prototype also helped establish guidelines for the minimum data required at the recording stage. The next step was to record multiple pitches, then train a separate model on each pitch.

The first step in generating a vocal section is inputting lyrics into each pitch model. The output is the input text sung at each selected pitch, all following the rhythm of the original recording. Once the audio is formatted, individual syllables can be activated or muted to create a melody.
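A toy sketch of that assembly step, with quarter-second sine tones standing in for each pitch model's output; a real rendering would be the lyric sung at that model's pitch, sliced at syllable boundaries:

```python
# Assemble a melody by choosing, per syllable, which pitch model is audible.
import numpy as np
import soundfile as sf

SR = 22050
SYL = int(SR * 0.25)  # quarter-second syllables, for illustration

def render(pitch_hz, n_syllables):
    """Toy stand-in for one trained pitch model: one tone per syllable."""
    t = np.arange(SYL) / SR
    return [0.3 * np.sin(2 * np.pi * pitch_hz * t) for _ in range(n_syllables)]

# One rendering of the lyric per pitch model.
pitch_models = {"G3": 196.0, "A3": 220.0, "C4": 261.6, "D4": 293.7}
renderings = {name: render(hz, 6) for name, hz in pitch_models.items()}

# The melody: which pitch is active on each syllable (None = muted).
melody = ["C4", "C4", "A3", "G3", None, "D4"]

silence = np.zeros(SYL)
vocal = np.concatenate(
    [renderings[note][i] if note else silence for i, note in enumerate(melody)]
)
sf.write("topline.wav", vocal, SR)
```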

The system uses a very small amount of data because it employs voice cloning from models that were extensively trained on regular speech. Training continues from parameters learned on large regular-speech datasets, and those parameters are adapted to the smaller new dataset. Because the new dataset is highly uniform in pitch and rhythm, the model quickly adapts to the new material. Low data requirements mean new singers, registers, and articulation styles can be added with ease.