To provide a speech section detecting device capable of suppressing the influence of acoustic noise when detecting a speech section, by means of multi-modal speech section detection that uses voice information and image information together.
The speech section detecting device 100 includes: a first multi-modal VAD (voice activity detection) section 131, which creates a sound and image feature amount by combining a sound feature amount and an image feature amount, and which determines a speech section based on the sound and image feature amount; a speech uni-modal VAD section 132 for determining the speech section by using only the sound feature amount; an image uni-modal VAD section 133 for determining the speech section by using only the image feature amount; a second multi-modal VAD section 134 for determining the speech section by combining the determinations of the speech uni-modal VAD section 132 and the image uni-modal VAD section 133; and a third multi-modal VAD section 135 for determining the speech section by combining the determinations of the first multi-modal VAD section 131 and the second multi-modal VAD section 134 by a majority decision rule.
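The arrangement of sections 131 to 135 may be illustrated by the following minimal sketch, which operates on per-frame speech/non-speech determinations. It is an illustration under stated assumptions and not the disclosed implementation. The abstract does not specify how the sound and image feature amounts are combined, how section 134 combines the two uni-modal determinations, or how a tie is resolved when section 135 votes over two inputs. The sketch therefore assumes concatenation for the feature combination, a logical AND for section 134, and a configurable tie rule for section 135. All function and parameter names (`fuse_features`, `combine_unimodal`, `majority_vote`, `tie_is_speech`) are hypothetical.

```python
from typing import List, Sequence


def fuse_features(sound: Sequence[float], image: Sequence[float]) -> List[float]:
    """Feature-level combination for section 131 (assumed: concatenation)."""
    return list(sound) + list(image)


def combine_unimodal(sound_dec: Sequence[bool], image_dec: Sequence[bool]) -> List[bool]:
    """Section 134: combine the uni-modal determinations per frame.

    Assumption: a frame counts as speech only if both sections 132 and 133
    say so (logical AND). The abstract does not state the actual rule.
    """
    if len(sound_dec) != len(image_dec):
        raise ValueError("determinations must cover the same number of frames")
    return [s and v for s, v in zip(sound_dec, image_dec)]


def majority_vote(votes: Sequence[Sequence[bool]], tie_is_speech: bool = False) -> List[bool]:
    """Section 135: per-frame majority decision over several determinations.

    Assumption: with an even number of voters a tie can occur; it is
    resolved by the hypothetical tie_is_speech flag.
    """
    n = len(votes)
    if n == 0:
        raise ValueError("at least one determination is required")
    length = len(votes[0])
    if any(len(v) != length for v in votes):
        raise ValueError("determinations must cover the same number of frames")
    result = []
    for frame in zip(*votes):
        yes = sum(frame)  # number of voters that flagged this frame as speech
        if 2 * yes > n:
            result.append(True)
        elif 2 * yes < n:
            result.append(False)
        else:
            result.append(tie_is_speech)
    return result


# Illustrative per-frame determinations (made-up values, four frames).
first = [True, True, False, False]        # output of section 131
sound_only = [True, False, True, False]   # output of section 132
image_only = [True, True, False, False]   # output of section 133

second = combine_unimodal(sound_only, image_only)  # section 134
third = majority_vote([first, second])             # section 135
```

With these made-up inputs, frame 2 is flagged as speech by section 131 but not by section 134, so the outcome for that frame depends entirely on the assumed tie rule. An actual device would have to fix that rule or vote over an odd number of determinations.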
TAKEUCHI SHINICHI
HAYAMIZU SATORU
Makoto Onda