Eerily Realistic AI Voice Sparks Amazement and Discomfort Online

“Near-human quality” AI model for speech synthesis

Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit in which a human pretends to be an embezzler and argues with a boss. The exchange is so dynamic that it is difficult to tell which speaker is the human and which is the AI model. Judging by our own demo, the system is entirely capable of what the video shows.

Sesame’s CSM: A Single-Stage, Multimodal Approach

Under the hood, Sesame’s CSM achieves its realism with two AI models working together (a backbone and a decoder), based on Meta’s Llama architecture, that process interleaved text and audio. Sesame trained three model sizes, the largest totaling 8.3 billion parameters (an 8-billion-parameter backbone plus a 300-million-parameter decoder) on approximately 1 million hours of primarily English audio.
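To make the backbone/decoder split concrete, here is a minimal, purely illustrative sketch (not Sesame’s actual code): a large "backbone" stand-in consumes interleaved text and audio tokens and emits one coarse audio token per step, and a small "decoder" stand-in expands it into fine-grained acoustic tokens, mirroring the 8B + 0.3B division described above. The token rules are arbitrary placeholders.

```python
def backbone(interleaved_tokens):
    """Stand-in for the ~8B-parameter backbone: consumes interleaved
    (modality, token) pairs and emits one coarse audio token per step."""
    # Hypothetical rule: derive a coarse token from the running context.
    return sum(tok for _, tok in interleaved_tokens) % 100

def decoder(coarse_token, n_fine=4):
    """Stand-in for the ~300M-parameter decoder: expands one coarse
    token into several fine-grained acoustic tokens."""
    return [coarse_token * 10 + i for i in range(n_fine)]

def synthesize_step(text_tokens, audio_history):
    # Interleave text and audio context, as the article describes,
    # then run the two models in sequence for one generation step.
    context = [("text", t) for t in text_tokens] + \
              [("audio", a) for a in audio_history]
    coarse = backbone(context)
    return decoder(coarse)

fine_tokens = synthesize_step(text_tokens=[3, 7], audio_history=[12])
print(fine_tokens)
```

The point of the split, as far as the published description goes, is that most of the capacity sits in the backbone while a much smaller decoder handles the fine acoustic detail.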

Single-Stage Processing vs. Two-Stage Approach

Sesame’s CSM doesn’t follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of generating semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame’s CSM integrates both steps into a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI’s voice model uses a similar multimodal approach.
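The structural difference can be sketched in a few lines of invented code. The function names and token rules below are made up for illustration; the point is only where the text/audio boundary sits in each pipeline.

```python
def two_stage_tts(text):
    """Traditional pipeline: two separate models run in sequence."""
    # Stage 1: text -> semantic tokens (high-level speech representation).
    semantic = [ord(c) % 50 for c in text]
    # Stage 2: semantic tokens -> acoustic tokens (fine-grained audio).
    # This second model never sees the original text or prior audio.
    acoustic = [s * 2 for s in semantic]
    return acoustic

def single_stage_tts(text, audio_history):
    """Single-stage pipeline: one model sees both modalities at once."""
    # Interleave text and audio tokens so acoustic output can condition
    # on both modalities jointly, as described above.
    interleaved = [("text", ord(c) % 50) for c in text] + \
                  [("audio", a) for a in audio_history]
    return [v * 2 + (1 if kind == "audio" else 0) for kind, v in interleaved]
```

In the two-stage version, the acoustic model only ever sees the semantic tokens; in the single-stage version, one model conditions on text and prior audio together, which is what lets context shape prosody.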

Evaluation Results

In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when evaluators were given conversational context, they consistently preferred real human speech, indicating that a gap remains in fully contextual speech generation.
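The protocol behind these results is a blind pairwise comparison: evaluators pick which of two clips (one real, one CSM) sounds more natural, and a win rate near 50% means no clear preference. The vote data below is invented solely to show the arithmetic.

```python
# Made-up example votes, NOT Sesame's published numbers.
votes_no_context = ["real", "csm", "csm", "real", "real", "csm"]
votes_with_context = ["real", "real", "csm", "real", "real", "real"]

def win_rate(votes, system="csm"):
    """Fraction of pairwise comparisons the given system won."""
    return votes.count(system) / len(votes)

print(win_rate(votes_no_context))    # near 0.5: no clear preference
print(win_rate(votes_with_context))  # well below 0.5: humans preferred
```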

Challenges and Future Directions

Sesame co-founder Brendan Iribe acknowledged current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow. “Today, we’re firmly in the valley, but we’re optimistic we can climb out,” he wrote.

Conclusion

Sesame’s CSM has demonstrated impressive capabilities in speech synthesis, achieving near-human quality in isolated speech samples. While there is still room for improvement, the model’s potential for generating realistic and dynamic conversations is vast. As the technology continues to evolve, we can expect to see significant advancements in the field of speech synthesis.

FAQs

Q: What is Sesame’s CSM?

A: Sesame’s CSM is a single-stage, multimodal AI model that generates speech by processing interleaved text and audio tokens.

Q: How does CSM compare to traditional two-stage text-to-speech systems?

A: CSM integrates text and audio processing into a single stage, whereas traditional two-stage approaches separate the process into generating semantic tokens and acoustic details.

Q: What are the limitations of CSM?

A: According to Sesame co-founder Brendan Iribe, the system is still “too eager and often inappropriate in its tone, prosody and pacing” and has issues with interruptions, timing, and conversation flow.

Q: What is the potential of CSM for speech synthesis?

A: The model has the potential to generate realistic and dynamic conversations, with significant advancements expected in the field of speech synthesis as the technology continues to evolve.
