Seminar by Meysam Shamsi, post-doc at LIUM
Location: IC2, boardroom
Speaker: Meysam Shamsi
The voice corpus plays a crucial role in the quality of synthetic speech generation, especially under a length constraint. Creating a new voice is costly, and recording script selection for an expressive TTS task is generally treated as an optimization problem aiming at a rich yet parsimonious voice corpus.
A Deep Convolutional Neural Network (DCNN) is proposed to project linguistic information into an embedding space. The embedded representation of the corpus is then fed to a selection process that extracts a subset of utterances offering good linguistic coverage while limiting the repetition of linguistic units. We present two selection processes: a clustering approach based on utterance distance, and another method that aims to reach a target distribution of linguistic events.
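The selection idea described above can be sketched as a greedy subset search over embedded/analyzed utterances. The snippet below is a minimal illustrative stand-in, not the paper's actual algorithm: it rewards utterances that add new linguistic units (here, toy character bigrams in place of real phonetic units) and penalizes repetition, under a length budget. All names (`select_utterances`, `repeat_penalty`, the bigram `units_of`) are hypothetical.

```python
from collections import Counter

def select_utterances(utterances, units_of, budget_chars, repeat_penalty=0.5):
    """Greedy coverage-oriented selection (toy sketch).

    Repeatedly picks the utterance with the best per-character gain:
    +1 for each not-yet-covered linguistic unit, -repeat_penalty for
    each already-covered one, stopping at the length budget.
    """
    selected, covered, total = [], Counter(), 0
    remaining = list(utterances)
    while remaining:
        def score(u):
            gain = sum(1.0 if covered[x] == 0 else -repeat_penalty
                       for x in units_of(u))
            return gain / max(len(u), 1)  # favor parsimonious utterances
        best = max(remaining, key=score)
        if total + len(best) > budget_chars or score(best) <= 0:
            break
        selected.append(best)
        covered.update(units_of(best))
        total += len(best)
        remaining.remove(best)
    return selected
```

With `units_of` returning the set of character bigrams of an utterance, `select_utterances(["abcd", "ab", "cdef", "abab"], units_of, budget_chars=10)` picks `["abcd", "cdef"]`: the fragments that jointly cover the most units without repetition within the budget.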
Based on previous results, we simply propose to select the shortest utterances of the book. The study of TTS costs indicates that selecting the shortest utterances can yield better synthetic quality, which is confirmed by a perceptual test.
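The shortest-utterance strategy is straightforward to express; the following sketch (function name and character-budget formulation are assumptions for illustration) simply takes utterances in increasing length until the recording budget is spent.

```python
def select_shortest(utterances, budget_chars):
    """Shortest-first selection: sort by length, fill the budget greedily."""
    out, total = [], 0
    for u in sorted(utterances, key=len):
        if total + len(u) > budget_chars:
            break
        out.append(u)
        total += len(u)
    return out
```

The contrast with a coverage-driven optimizer is the point of the result above: this baseline ignores linguistic coverage entirely, yet short utterances turn out to be easier to synthesize well.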
Afterward, the idea of mixing synthetic and recorded natural speech signals to control the trade-off between the overall quality of an audio-book and its production cost is investigated. First, fully synthetic signals and mixed synthetic/natural signals are compared perceptually at different levels of synthetic quality. Listeners prefer the mixed signals. Next, the order and configuration of the mixed signals are studied.