Abstract
Artificial intelligence (AI)-powered voice clones are increasingly used in educational and clinical applications, providing valuable tools for language-assessment and language-learning platforms such as IELTS and Duolingo, for audiobook production, and for assistive technologies such as ESTA and socially assistive robots. Despite their potential, AI-generated voices often fall short in replicating human-like prosodic features, such as fundamental frequency (F0) and rhythm, which are crucial for conveying emotion and ensuring intelligibility, particularly in noisy environments. The effects of these prosodic differences on speech perception remain largely unexplored.
This study explores three critical questions: (1) how AI-generated speech affects perception accuracy and behavioural performance; (2) whether AI speech prompts neural compensation, as indexed by EEG data; and (3) how variations in prosodic features (F0 and rhythm) influence speech entrainment.
We present initial findings from a behavioural pilot study that tested listeners’ perceptions of natural speech, AI voice clones, and prosodically manipulated speech, focusing on ratings of naturalness (Multiple Stimuli with Hidden Reference and Anchor, MUSHRA) and similarity. The MUSHRA and similarity tests were implemented on the Gorilla online experiment platform. Data were collected from 60 native English speakers recruited via Prolific (30 participants per test). Participants compared natural voice samples with those generated by AI models (ElevenLabs, XTTS, and StyleTTS) and with re-synthesized samples whose F0 contours had been manipulated, as sketched below.
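As a minimal sketch of how F0-manipulated stimuli of this kind can be produced, the example below re-synthesizes a recording with a scaled F0 contour using Praat via the parselmouth library. The file names and the 20% compression factor are illustrative assumptions; the abstract does not specify the exact manipulation used in the pilot.

```python
# Sketch: re-synthesize a speech sample with a manipulated F0 contour via parselmouth (Praat).
# File names and the scaling factor are hypothetical, not the study's actual parameters.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("natural_sample.wav")                # original recording (hypothetical file)
manipulation = call(snd, "To Manipulation", 0.01, 75, 600)   # time step (s), pitch floor/ceiling (Hz)

pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Formula", "self * 0.8")                    # e.g. compress the F0 contour by 20%
call([pitch_tier, manipulation], "Replace pitch tier")

resynth = call(manipulation, "Get resynthesis (overlap-add)")  # PSOLA-style overlap-add re-synthesis
resynth.save("f0_manipulated_sample.wav", "WAV")
```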
Results indicate that ElevenLabs achieved high ratings, closely approaching human-like naturalness and similarity. In contrast, XTTS and StyleTTS exhibited notable limitations, scoring significantly lower. These findings provide valuable insights for refining AI-generated voices and for selecting stimuli for the upcoming EEG experiment on neural responses to AI-generated speech.