Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from MUSHRA and Similarity Tests

14 November 2024, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Artificial intelligence (AI)-powered voice clones are increasingly used in educational and clinical applications, providing valuable tools for platforms such as IELTS and Duolingo, for audiobooks, and for assistive technologies such as ESTA and socially assistive robots. Despite their potential, AI-generated voices often fall short of replicating human-like prosodic features, such as fundamental frequency (F0) and rhythm, which are crucial for conveying emotion and ensuring intelligibility, particularly in noisy environments. The effects of these prosodic differences on speech perception remain largely unexplored. This study addresses three questions: (1) how AI-generated speech affects perception accuracy and behavioural performance, (2) whether AI speech prompts neural compensation, as indicated by EEG data, and (3) how variations in prosodic features (F0 and rhythm) influence speech entrainment. We present initial findings from a behavioural pilot study that tested listeners’ perceptions of natural speech, AI voice clones, and prosodically manipulated speech, focusing on ratings of naturalness (Multiple Stimuli with Hidden Reference and Anchor, MUSHRA) and similarity. The MUSHRA and similarity tests were implemented on the Gorilla online experiment platform, and data were collected from 60 native English speakers recruited via Prolific (30 participants per test). Participants compared natural voice samples with outputs from the AI models ElevenLabs, XTTS, and StyleTTS, and with a re-synthesized sample whose F0 contours had been manipulated. Results indicate that ElevenLabs achieved high ratings, closely approaching human-level naturalness and similarity, whereas XTTS and StyleTTS showed notable limitations and scored significantly lower. These findings inform the refinement of AI-generated voices and the selection of stimuli for an upcoming EEG experiment on neural responses to AI-generated speech.
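The abstract does not specify how the F0-manipulated stimulus was re-synthesized; a common approach for this kind of prosodic manipulation is PSOLA re-synthesis in Praat, scripted via the parselmouth Python bindings. The sketch below illustrates that approach only; the filenames, pitch floor/ceiling, and scaling factor are illustrative assumptions, not the study's actual parameters.

```python
# Minimal sketch of F0-contour manipulation via PSOLA re-synthesis,
# using the parselmouth (Praat) Python bindings. All file names and
# parameter values are hypothetical examples.
import parselmouth
from parselmouth.praat import call


def rescale_f0(in_wav: str, out_wav: str, factor: float = 1.2) -> None:
    """Re-synthesize a recording with its F0 contour scaled by `factor`."""
    snd = parselmouth.Sound(in_wav)
    # Build a Praat Manipulation object (time step 0.01 s, pitch range 75-600 Hz).
    manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
    pitch_tier = call(manipulation, "Extract pitch tier")
    # Scale every pitch point: factor > 1 raises F0, factor < 1 lowers it.
    call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, factor)
    call([pitch_tier, manipulation], "Replace pitch tier")
    # Overlap-add (PSOLA) re-synthesis keeps duration and spectral envelope intact.
    resynth = call(manipulation, "Get resynthesis (overlap-add)")
    resynth.save(out_wav, "WAV")


rescale_f0("natural_sample.wav", "f0_manipulated_sample.wav", factor=1.2)
```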

Keywords

AI-generated speech
Voice cloning
Prosody
Speech perception
L2 speech learning
EEG
