Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from MUSHRA and Similarity Tests

14 November 2024, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Artificial intelligence (AI)-powered voice clones are increasingly used in educational and clinical applications, providing valuable tools for platforms such as IELTS and Duolingo, for audiobooks, and for assistive technologies such as ESTA and socially assistive robots. Despite their potential, AI-generated voices often fall short of replicating human-like prosodic features, such as fundamental frequency (F0) and rhythm, which are crucial for conveying emotion and ensuring intelligibility, particularly in noisy environments. The effects of these prosodic differences on speech perception remain largely unexplored. This study addresses three questions: (1) how AI-generated speech affects perception accuracy and behavioural performance, (2) whether AI speech prompts neural compensation, as indicated by EEG data, and (3) how variations in prosodic features (F0 and rhythm) influence speech entrainment. We present initial findings from a behavioural pilot study that tested listeners’ perceptions of natural speech, AI voice clones, and prosodically manipulated speech, focusing on ratings of naturalness (Multiple Stimuli with Hidden Reference and Anchor, MUSHRA) and similarity. The MUSHRA and similarity tests were implemented on the Gorilla online experiment platform, and data were collected from 60 native English speakers recruited via Prolific (30 participants per test). Participants compared natural voice samples with outputs from the AI models ElevenLabs, XTTS, and StyleTTS, and with a re-synthesized sample whose F0 contours had been manipulated. Results indicate that ElevenLabs achieved high ratings, closely approaching human-level naturalness and similarity, whereas XTTS and StyleTTS showed notable limitations and scored significantly lower. These findings inform the refinement of AI-generated voices and the selection of stimuli for an upcoming EEG experiment on neural responses to AI-generated speech.
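The abstract does not specify how the F0-manipulated stimulus was re-synthesized; a common approach for this kind of prosodic manipulation is PSOLA re-synthesis in Praat, scripted via the parselmouth Python bindings. The sketch below illustrates that approach only; the filenames, pitch floor/ceiling, and scaling factor are illustrative assumptions, not the study's actual parameters.

```python
# Minimal sketch of F0-contour manipulation via PSOLA re-synthesis,
# using the parselmouth (Praat) Python bindings. All file names and
# parameter values are hypothetical examples.
import parselmouth
from parselmouth.praat import call


def rescale_f0(in_wav: str, out_wav: str, factor: float = 1.2) -> None:
    """Re-synthesize a recording with its F0 contour scaled by `factor`."""
    snd = parselmouth.Sound(in_wav)
    # Build a Praat Manipulation object (time step 0.01 s, pitch range 75-600 Hz).
    manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
    pitch_tier = call(manipulation, "Extract pitch tier")
    # Scale every pitch point: factor > 1 raises F0, factor < 1 lowers it.
    call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, factor)
    call([pitch_tier, manipulation], "Replace pitch tier")
    # Overlap-add (PSOLA) re-synthesis keeps duration and spectral envelope intact.
    resynth = call(manipulation, "Get resynthesis (overlap-add)")
    resynth.save(out_wav, "WAV")


rescale_f0("natural_sample.wav", "f0_manipulated_sample.wav", factor=1.2)
```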

Keywords

AI-generated speech
Voice cloning
Prosody
Speech perception
L2 speech learning
EEG
