To include the intonation intended by a speaker in synthetic speech when synthetic voiced speech is generated from non-voiced speech and a lip image.
In a speech synthesis device, the speaker's non-voiced speech and a captured image of the speaker's lips are input synchronously to generate synthetic voiced speech. An image signal analysis means extracts vowel information of the voiced speech from the input lip image, and extracts, as a pitch ratio, the ratio of the lip opening size at vowel pronunciation to a predetermined reference size. A speech signal analysis means extracts consonant information from the input non-voiced speech and from a sound model of the non-voiced vowel corresponding to the vowel extracted by the image signal analysis means; extracts text information using a built-in dictionary, which stores phoneme sequences and words in association with each other, and a language model for calculating word sequences; and extracts the continuation time length (duration) of the whole utterance from the power variation of the input non-voiced speech. A speech synthesis means synthesizes voiced speech with intonation added, based on the information extracted by both analysis means.
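As a rough illustration of the pitch ratio and duration extraction described above, the following is a minimal sketch and not the method of the invention. The abstract states only that the ratio of lip opening size to a reference size is extracted as a pitch ratio and that duration is derived from power variation. The linear scaling of a base pitch by that ratio, the power-threshold rule for duration, and all names (`pitch_ratio`, `vowel_pitches`, `utterance_duration`, `base_f0_hz`, `threshold`) are assumptions introduced here for illustration.

```python
from typing import List, Sequence


def pitch_ratio(lip_opening: float, reference_opening: float) -> float:
    """Ratio of the lip opening at vowel pronunciation to the reference size."""
    if reference_opening <= 0:
        raise ValueError("reference_opening must be positive")
    return lip_opening / reference_opening


def vowel_pitches(lip_openings: Sequence[float],
                  reference_opening: float,
                  base_f0_hz: float) -> List[float]:
    """Assumed mapping: scale a base pitch linearly by each vowel's pitch ratio."""
    return [base_f0_hz * pitch_ratio(o, reference_opening) for o in lip_openings]


def utterance_duration(frame_powers: Sequence[float],
                       frame_shift_s: float,
                       threshold: float) -> float:
    """Assumed rule: the utterance spans the first to the last frame whose
    power exceeds the threshold; returns its length in seconds."""
    active = [i for i, p in enumerate(frame_powers) if p > threshold]
    if not active:
        return 0.0
    return (active[-1] - active[0] + 1) * frame_shift_s
```

Under these assumptions, a vowel spoken with the lips opened wider than the reference size receives a pitch above the base pitch, and one spoken with a smaller opening receives a lower pitch, which is how the speaker's intended intonation could be carried into the synthesized speech.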
COPYRIGHT: (C)2010,JPO&INPIT