Comparing human-labeled and AI-labeled speech datasets for TTS

Authors

  • Johannes Wirth, Institute of Information Systems at Hof University, Germany
  • René Peinl, Institute of Information Systems at Hof University, Germany

DOI:

https://doi.org/10.34190/icair.5.1.3030

Keywords:

Text-to-Speech, Dataset Generation, Pseudo Labeling

Abstract

As the output quality of neural networks in the fields of automatic speech recognition (ASR) and text-to-speech (TTS) continues to improve, new opportunities arise to train models in a weakly supervised fashion, minimizing the manual effort required to annotate new audio data for supervised training. While weak supervision has recently shown very promising results in ASR, it has not yet been thoroughly investigated for speech synthesis, even though TTS requires the same training dataset structure of aligned audio-transcript pairs. In this work, we compare the performance of TTS models trained on a well-curated, manually labeled dataset to that of models trained on the same audio data with text labels generated by both grapheme- and phoneme-based ASR models. Phoneme-based approaches seem especially promising, since even a wrongly predicted phoneme is more likely to yield a word that sounds similar to the originally spoken one than a wrongly predicted grapheme. For evaluation and ranking, we synthesize audio from all trained models using input texts sourced from a selection of speech recognition datasets covering a wide range of application domains. The synthesized outputs are then fed into multiple state-of-the-art ASR models, whose text predictions are compared to the original TTS input texts. This comparison enables an objective assessment of the intelligibility of each TTS model's output using metrics such as word error rate (WER) and character error rate (CER). Our results show that models trained on weakly supervised labels not only achieve quality comparable to that of models trained on manually labeled datasets, but can even outperform them, including on small, well-curated speech corpora. These findings suggest that future labeled datasets for supervised TTS training may require no manual annotation at all and can be created fully automatically.
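
The ASR-based intelligibility scoring described in the abstract is straightforward to reproduce. Below is a minimal Python sketch using the jiwer library for WER/CER; the synthesize and transcribe calls are hypothetical placeholders for whatever TTS and ASR models are under test, not the paper's actual implementation.

    import jiwer

    def evaluate_tts(tts_model, asr_model, input_texts):
        """Synthesize each input text, re-transcribe it with a reference
        ASR model, and score the round trip with WER and CER."""
        hypotheses = []
        for text in input_texts:
            audio = tts_model.synthesize(text)              # hypothetical TTS API
            hypotheses.append(asr_model.transcribe(audio))  # hypothetical ASR API
        # Corpus-level error rates; the original input texts are the references.
        return {
            "wer": jiwer.wer(input_texts, hypotheses),
            "cer": jiwer.cer(input_texts, hypotheses),
        }

Lower WER and CER against the original input texts indicate more intelligible synthesis; averaging scores over several ASR models, as the paper does, reduces the bias of any single recognizer.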

Author Biography

Johannes Wirth, Institute of Information Systems at Hof University

Johannes Wirth has been a research fellow in the System Integration research group at the Institute of Information Systems at Hof University of Applied Sciences since 2020. He has been a PhD student since 2023 and is currently researching topics in German speech recognition and synthesis using artificial intelligence. His further research interests lie in the field of Natural Language Understanding.

Published

2024-12-04