Evaluation of Hidden Semi-Markov Models Training Methods for Greek Emotional Text-to-Speech Synthesis



Author(s)

Alexandros Lazaridis 1,*, Iosif Mporas 1,2

1. Dept. of Electrical and Computer Engineering, University of Patras, Rion-Patras 26500, Greece

2. Dept. of Informatics and Mass Media, Technological Educational Institute of Patras, Pyrgos 27100, Greece

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2013.04.03

Received: 21 Jun. 2012 / Revised: 16 Oct. 2012 / Accepted: 20 Dec. 2012 / Published: 8 Mar. 2013

Index Terms

HMM Synthesis, Emotional Synthesis, HSMM Adaptation

Abstract

This paper describes and evaluates four different HSMM (hidden semi-Markov model) training methods for HMM-based synthesis of emotional speech. The first method, called emotion-dependent modelling, trains an individual model for each emotion separately. In the second method, emotion adaptation modelling, a model is first trained on neutral speech and then adapted to each emotion of the database. The third method, the emotion-independent approach, is based on an average emotion model which is initially trained using data from all the emotions of the speech database; subsequently, an adapted model is built for each emotion. In the fourth method, emotion adaptive training, the average emotion model is trained with simultaneous normalization of the output and state duration distributions. To evaluate these training methods, a Modern Greek speech database consisting of four categories of emotional speech, anger, fear, joy and sadness, was used. Finally, a subjective emotion recognition test was performed in order to measure and compare the ability of each of the four approaches to synthesize recognizable emotional speech. The evaluation results showed that emotion adaptive training achieved the highest emotion recognition rates among the four evaluated methods, across all four emotions of the database.
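The four strategies differ mainly in how the emotional data are pooled, and in whether a base (neutral or average) model is adapted to each emotion. The following minimal Python sketch contrasts the structure of the four pipelines only; the helpers train_hsmm(), adapt_hsmm() and adaptive_train_hsmm() are hypothetical placeholders standing in for the actual HSMM training, adaptation and adaptive-training routines (e.g. as provided by HTS), not the authors' implementation.

    EMOTIONS = ["anger", "fear", "joy", "sadness"]

    def train_hsmm(utterances):
        """Placeholder: estimate HSMM parameters from a list of utterances."""
        return {"trained_on": len(utterances)}

    def adapt_hsmm(base_model, utterances):
        """Placeholder: transform a base model towards the target utterances."""
        return {"base": base_model, "adapted_on": len(utterances)}

    def adaptive_train_hsmm(utterances_by_emotion):
        """Placeholder: average-model training with simultaneous normalization
        of the output and state-duration distributions (adaptive training)."""
        total = sum(len(u) for u in utterances_by_emotion.values())
        return {"normalized": True, "trained_on": total}

    def emotion_dependent(db):
        # Method 1: one independent model per emotion.
        return {e: train_hsmm(db[e]) for e in EMOTIONS}

    def emotion_adaptation(db, neutral):
        # Method 2: train on neutral speech, then adapt to each emotion.
        base = train_hsmm(neutral)
        return {e: adapt_hsmm(base, db[e]) for e in EMOTIONS}

    def emotion_independent(db):
        # Method 3: average emotion model from pooled data, then adapt per emotion.
        pooled = [u for e in EMOTIONS for u in db[e]]
        average = train_hsmm(pooled)
        return {e: adapt_hsmm(average, db[e]) for e in EMOTIONS}

    def emotion_adaptive_training(db):
        # Method 4: average model with simultaneous normalization, then adapt.
        average = adaptive_train_hsmm(db)
        return {e: adapt_hsmm(average, db[e]) for e in EMOTIONS}

    if __name__ == "__main__":
        # Toy database: dummy utterance identifiers per emotion.
        db = {e: [f"{e}_{i:03d}" for i in range(3)] for e in EMOTIONS}
        neutral = ["neutral_000", "neutral_001", "neutral_002"]
        print("emotion-dependent:", emotion_dependent(db))
        print("emotion adaptation:", emotion_adaptation(db, neutral))
        print("emotion-independent:", emotion_independent(db))
        print("emotion adaptive training:", emotion_adaptive_training(db))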

Cite This Paper

Alexandros Lazaridis, Iosif Mporas, "Evaluation of Hidden Semi-Markov Models Training Methods for Greek Emotional Text-to-Speech Synthesis", International Journal of Information Technology and Computer Science (IJITCS), vol.5, no.4, pp.23-29, 2013. DOI:10.5815/ijitcs.2013.04.03
