A Simple Strategy for Natural Mandarin Spoken Word Stretching via the Vocoder
In Mandarin, when a word is spoken with a longer duration, its different regions are not stretched uniformly. Moreover, the transition between consonants and vowels may be hard to define, so it is challenging to find the stretchable regions of a word automatically. In this paper, we explore the idea of parsing a Mandarin word into a part that should be played at the original speed, followed by a uniformly stretchable region. The optimal dividing point at which stretching starts can then be determined automatically by minimizing the distance between the stretched version (generated by computer) and a ground truth (spoken by a human). A database of 42 pairs of regular-speed and slow utterances was created, and the dividing points on the regular-speed utterances were found as proposed. These points can be aligned to words with the same pronunciation in full sentences by dynamic time warping, so that the sentences can be synthesized with arbitrary tempo and rhythm. The naturalness of stretching was evaluated subjectively for three methods: uniform stretching by waveform-similarity-based synchronized overlap-add (WSOLA), uniform stretching based on linear interpolation in the vocoder domain (LI-VD), and the proposed strategy. In 74% of the answers given by 41 subjects, the proposed strategy was preferred over the other two methods.
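The search for the dividing point described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each utterance is already represented as a matrix of per-frame features (for example, vocoder parameters), that the slow utterance is at least as long as the regular one, and that the distance is the mean frame-wise Euclidean distance. The abstract does not specify the features or the distance measure, and the function names `stretch_from` and `best_dividing_point` are hypothetical.

```python
import numpy as np


def stretch_from(frames, k, target_len):
    """Keep frames[:k] unchanged and linearly interpolate frames[k:]
    so that the result has exactly target_len frames."""
    head, tail = frames[:k], frames[k:]
    n_out = target_len - k
    # Fractional positions in the tail at which to sample.
    src = np.linspace(0, len(tail) - 1, n_out)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(tail) - 1)
    w = (src - lo)[:, None]
    stretched = (1 - w) * tail[lo] + w * tail[hi]
    return np.vstack([head, stretched])


def best_dividing_point(regular, slow):
    """Try every dividing frame k and return the one whose stretched
    version is closest to the slow (ground-truth) utterance.

    Assumption: distance = mean Euclidean distance between frames.
    """
    n_regular, n_slow = len(regular), len(slow)
    if n_slow < n_regular:
        raise ValueError("slow utterance must not be shorter than the regular one")
    best_k, best_dist = None, np.inf
    # The stretchable tail must contain at least two frames.
    for k in range(n_regular - 1):
        candidate = stretch_from(regular, k, n_slow)
        dist = np.mean(np.linalg.norm(candidate - slow, axis=1))
        if dist < best_dist:
            best_k, best_dist = k, dist
    return best_k, best_dist
```

The exhaustive search is affordable because a single word contains only a modest number of frames. A different distance (for example, one computed after time alignment) could be substituted without changing the structure of the search.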