
In this post, we introduce our paper "Korean Singing Voice Synthesis based on Auto-regressive Boundary Equilibrium GAN", which was accepted to ICASSP 2020.

Overview

Singing voice synthesis is a complicated task that involves multi-dimensional control of a singer model, including phonemic modulation by lyrics, pitch control by the music score, and natural elements such as breath sounds and vibrato expressions. Recently, end-to-end learning models based on GANs have drawn much interest as a way to overcome the limitations of concatenative synthesis and statistical parametric models. Applying a GAN to the audio domain raises several issues: choosing the audio representation to generate, handling temporal continuity between two adjacent outputs, and finding an effective loss metric for the audio representation. The proposed system addresses these issues with an auto-regressive GAN that generates spectrograms using the boundary equilibrium objective.

Figure.1 Overview of the proposed singing voice synthesis system.
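
To make the boundary equilibrium objective more concrete, here is a minimal sketch of one loss computation in the standard BEGAN formulation (Berthelot et al., 2017). It is an illustration under assumptions, not the exact training code of the proposed system: the function name `began_step`, the default values of `gamma` and `lambda_k`, and the scalar inputs are all hypothetical, and the conditioning on lyrics, pitch, and the previous spectrogram is left out. Here `loss_real` and `loss_fake` stand for the discriminator's auto-encoder reconstruction losses on a real and a generated spectrogram.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN loss computation on scalar reconstruction losses.

    loss_real: reconstruction loss of the discriminator on a real sample
    loss_fake: reconstruction loss of the discriminator on a generated sample
    k:         current balance term k_t, kept in [0, 1]
    gamma:     target ratio loss_fake / loss_real (assumed value)
    lambda_k:  learning rate of the balance term (assumed value)
    """
    # The discriminator reconstructs real samples well and generated ones poorly.
    d_loss = loss_real - k * loss_fake
    # The generator wants its samples to be reconstructed well.
    g_loss = loss_fake
    # Proportional control keeps the two losses near the equilibrium
    # gamma * loss_real == loss_fake.
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)
    # Global convergence measure used in BEGAN to monitor training.
    convergence = loss_real + abs(gamma * loss_real - loss_fake)
    return d_loss, g_loss, k_next, convergence


d_loss, g_loss, k, m = began_step(loss_real=1.0, loss_fake=0.2, k=0.0)
```

In a real training loop, `d_loss` and `g_loss` would be tensors that are back-propagated through the discriminator and the generator respectively, while `k` is carried over from one step to the next.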

Auto-Regressive Method

A fundamental issue with applying an image-based approach to audio data is that the model can span only a short audio segment, so successive segments generated over time can be discontinuous. To address this problem, we propose an auto-regressive conditional GAN that takes the spectrogram from the previous time step as input to produce the spectrogram for the current time step. The following figure shows how the auto-regressive (AR) method helps generate a continuous spectrogram. Without the AR method, the model generates each spectrogram segment as a separate image; with the AR method, each segment is generated with reference to the previous one.

Figure.2 Spectrogram from ground truth and generated spectrograms from the proposed system.
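
The auto-regressive generation loop described in this section can be sketched as follows. This is only an illustration of the feedback idea, not the proposed network: `toy_generator` is a hypothetical stand-in for a trained conditional generator, `N_BINS` and `SEG_FRAMES` are made-up sizes, and starting from a silent (all-zero) segment is an assumption. A spectrogram segment is represented as a list of frames, each frame being a list of bin magnitudes.

```python
from typing import Callable, List

Frame = List[float]    # magnitudes of one time frame
Segment = List[Frame]  # consecutive frames, i.e. a short spectrogram

N_BINS = 4      # hypothetical number of frequency bins (tiny, for illustration)
SEG_FRAMES = 8  # hypothetical number of frames per generated segment


def toy_generator(prev_segment: Segment, condition: Segment) -> Segment:
    """Hypothetical stand-in for a trained conditional generator.

    It cross-fades from the last frame of the previous segment toward the
    conditioning target, so the output depends on what was generated before.
    """
    start = prev_segment[-1]
    out = []
    for t, target in enumerate(condition, start=1):
        w = t / len(condition)  # goes from 1/len(condition) up to 1.0
        out.append([(1.0 - w) * s + w * c for s, c in zip(start, target)])
    return out


def synthesize(conditions: List[Segment],
               generator: Callable[[Segment, Segment], Segment]) -> Segment:
    """Generate segments one by one, feeding each output back as input."""
    prev = [[0.0] * N_BINS for _ in range(SEG_FRAMES)]  # assumed silent start
    frames: Segment = []
    for cond in conditions:
        prev = generator(prev, cond)  # previous output conditions the next one
        frames.extend(prev)
    return frames


# Two conditioning segments standing in for lyrics and pitch features.
conds = [[[1.0] * N_BINS for _ in range(SEG_FRAMES)],
         [[2.0] * N_BINS for _ in range(SEG_FRAMES)]]
spectrogram = synthesize(conds, toy_generator)
```

Because each call receives the previous output, the first frame of a segment stays close to the last frame of the one before it. Calling the generator with the silent segment every time would instead restart each segment from zero, which corresponds to the discontinuous case without the AR method.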

Results

We compared samples generated by the proposed model with ground truth samples and reconstructed samples. The ground truth samples are taken from the original recordings. The reconstructed samples are processed in the same way as the generated samples, which lets us evaluate the sound quality loss caused by the signal processing itself.

Song (Generated / Reconstruction / Ground Truth)
작은 별 (Twinkle Twinkle Little Star)
거미가 줄을 타고 (Itsy Bitsy Spider)
퐁당퐁당 (Plop Plop)
마법의 성 (Magic Castle)
빙고 (Bingo)
나비야 (Butterfly)
알파벳 송 (Alphabet Song)
솜사탕 (Cotton Candy)