Cross-Speaker Training and Adaptation for Electromyography-to-Speech Conversion
- PMID: 40039250
- DOI: 10.1109/EMBC53108.2024.10781707
Abstract
Surface Electromyography (EMG) signals of articulatory muscles can be used to synthesize acoustic speech with Electromyography-to-Speech (ETS) models. Recent models have improved synthesis quality by combining training data from multiple recordings of single speakers. In this work, we evaluated whether using recordings of multiple speakers also increases performance and whether cross-speaker models can be adapted to unseen speakers with limited data. We recorded the EMG-Vox corpus, which consists of EMG and audio signals from four speakers with five sessions each. We compared cross-speaker models with single-speaker counterparts and conducted adaptation experiments. On average, cross-speaker models achieved significantly better performance than single-speaker models. Experiments with balanced data indicated that this improvement stemmed from the larger training set. Speaker adaptation from cross-speaker models yielded higher synthesis quality than training from scratch and was at least on par with session adaptation for most speakers. To the best of our knowledge, this is the first work to report that cross-speaker ETS models yield better results than single-speaker models.