Weighted Generative Adversarial Network for Many-to-Many Voice Conversion
The goal of voice conversion (VC) is to convert speech from a source speaker to that of a target speaker without changing the phonetic content. VC usually relies on parallel data for training, which limits its practical applications. Existing approaches are also limited in handling multiple speakers, since a separate model must be trained independently for every speaker pair. To address this, variants of the Generative Adversarial Network (GAN) have been introduced that allow many-to-many mapping instead of learning all pairwise transformations. Moreover, GAN-VC can handle non-parallel data, i.e., speakers do not need to utter the same sentences. In this paper, we propose an algorithmic variation of GAN training in which suitable weights are applied to the gradient of the Generator. The weights are computed so that fake samples that fool the Discriminator carry more weight, which results in a stronger Generator. We refer to this variation as weighted-GAN (weGAN). weGAN accelerates the convergence of training. The proposed weGAN-VC achieves over 10% relative improvement over conventional GAN-VC in objective speech quality for the same number of epochs, while informal subjective evaluation shows improvements in both speech quality and speaker similarity. Formal listening tests are underway and will be reported at the conference.
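The weighting idea described above can be illustrated with a minimal sketch. The function names, the softmax weighting scheme, and the temperature parameter below are illustrative assumptions, not the paper's exact formulation; the sketch only shows the general mechanism of upweighting fake samples that receive a high Discriminator score.

```python
import numpy as np

def generator_weights(d_fake, temperature=1.0):
    """Per-sample weights for the Generator update (hypothetical scheme).

    d_fake: Discriminator outputs in (0, 1) for a batch of fake samples.
    Fakes that fool the Discriminator (high d_fake) receive more weight,
    via a softmax over the scores, normalized to sum to 1.
    """
    logits = np.asarray(d_fake, dtype=float) / temperature
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def weighted_generator_loss(d_fake):
    """Non-saturating generator loss, -sum_i w_i * log D(G(z_i)),
    with each sample's term scaled by its weight."""
    d_fake = np.asarray(d_fake, dtype=float)
    w = generator_weights(d_fake)
    eps = 1e-12  # avoid log(0)
    return -np.sum(w * np.log(d_fake + eps))
```

In a real training loop, the weighted loss (or equivalently, per-sample weights multiplied into the Generator's gradient) replaces the uniform average over the batch; the Discriminator update is unchanged.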