Generation of Fundamental Frequency Contours of Mandarin in HMM-based Speech Synthesis using Generation Process Model
Miaomiao Wang, Keikichi Hirose, Nobuaki Minematsu, Department of Electrical Engineering and Information Systems, the University of Tokyo, Tokyo
The HMM-based text-to-speech system can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. In this approach, short-term spectra, fundamental frequency (F0), and duration are generated separately by multi-stream HMMs. However, the quality of synthetic speech degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and the corresponding flawed voiced/unvoiced (VU) decisions are the two key factors in voice quality problems. Pitch tracking errors occur more often in Mandarin vowels of Tone 3 and Tone 4, because the pitch of these vowels can be very low and the signal is then sometimes treated as aperiodic. In addition, F0 values in unvoiced regions, such as consonants, are normally defined as unavailable, which makes it impossible to use standard HMMs for F0 modeling. The currently preferred solution is the multi-space distribution HMM (MSDHMM), in which discrete distributions model the VU decision and continuous Gaussian distributions model F0 within the voiced regions. Because of the assumption of undefined F0 values in unvoiced regions and the special structure of the MSDHMM, the generated F0 values are limited in accuracy. In this paper, an F0 generation process model is used to estimate F0 values in regions with pitch tracking errors as well as in unvoiced regions. Prior knowledge of VU is imposed on each Mandarin phoneme, and accumulated VU posterior probabilities are used to search for the optimal VU switching point in each VU or UV segment during generation. F0 can then be modeled within the standard HMM framework.
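The search for a VU switching point can be illustrated with a minimal sketch. The abstract does not state the exact criterion, so the sketch assumes one: given per-frame voiced posteriors for a segment whose label order (UV or VU) is known a priori, it picks the single switching frame that maximizes the accumulated log posterior of the resulting labeling. The function name `find_vu_switch`, its parameters, and the scoring rule are hypothetical and are not taken from the paper.

```python
import math


def find_vu_switch(voiced_post, order="UV", eps=1e-6):
    """Return the frame index k at which the segment switches label.

    Frames [0, k) receive the first label in `order` and frames [k, n)
    the second. Assumed criterion (not from the paper): maximize the
    accumulated log posterior of the labeling.
    """
    if order not in ("UV", "VU"):
        raise ValueError("order must be 'UV' or 'VU'")
    # Clip posteriors away from 0 and 1 so the logarithms stay finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in voiced_post]
    log_v = [math.log(p) for p in clipped]        # log P(voiced)
    log_u = [math.log(1.0 - p) for p in clipped]  # log P(unvoiced)
    first, second = (log_u, log_v) if order == "UV" else (log_v, log_u)

    # Score for k = 0: every frame carries the second label.
    score = sum(second)
    best_k, best_score = 0, score
    for k in range(1, len(clipped) + 1):
        # Moving the switch from k - 1 to k relabels frame k - 1 only.
        score += first[k - 1] - second[k - 1]
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# One isolated noisy frame (0.7) does not create a spurious voiced island:
# the segment still gets exactly one unvoiced-to-voiced switch.
print(find_vu_switch([0.1, 0.7, 0.2, 0.9, 0.95], order="UV"))  # -> 3
```

Constraining each segment to a single switch is what the prior phoneme-level VU knowledge would buy in this sketch: frame-wise thresholding of the same posteriors would instead produce a voiced-unvoiced-voiced pattern around the noisy frame.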