Measuring Text to Speech Quality: The Practitioner’s Guide to MOS, MUSHRA, PESQ/POLQA & ABX
The rise of text to speech technology has transformed how people consume content, learn, and interact with digital platforms. From audiobooks and e-learning to accessibility tools for people with disabilities, synthetic voices are now a daily part of modern life. But as demand grows, so does the challenge: how do we measure whether text to speech voices sound natural, engaging, and easy to understand?
In this guide, we’ll explore the most widely used evaluation methods—MOS, MUSHRA, PESQ/POLQA, and ABX. We’ll also dive into the ongoing discussion of MUSHRA vs. MOS for text to speech evaluation, providing clarity for researchers, developers, and organizations that want to ensure their text to speech systems meet the highest quality standards.
Why Quality Evaluation Matters in Text to Speech
The effectiveness of text to speech (TTS) goes far beyond simply converting words into audio. Quality impacts accessibility, learning outcomes, productivity, and even trust in the technology.
For example, a poorly tuned text to speech system might sound robotic or unclear, causing frustration for users with dyslexia who rely on it for reading assignments. In contrast, a high-quality TTS system with natural intonation and smooth delivery can transform the same experience into an empowering tool for independence.
Organizations that deploy text to speech—schools, workplaces, healthcare providers, and app developers—must be confident that their systems are reliable. That’s where standardized evaluation methods come in. They provide a structured way to measure audio quality, ensuring that subjective impressions can be captured in a consistent, scientific manner.
Without evaluation, it’s impossible to know if system updates actually improve quality, or if new AI models genuinely enhance the listening experience.
Key Methods for Measuring Text to Speech Quality
1. MOS (Mean Opinion Score)
The Mean Opinion Score (MOS) is a cornerstone of audio evaluation. Originally developed for telecommunication systems, MOS has been widely adopted in text to speech because of its simplicity and familiarity.
In a MOS test, a group of human listeners rates audio clips on a five-point scale, where 1 = Bad and 5 = Excellent; the mean of those ratings across listeners is the reported MOS. Listeners are asked to consider overall quality, which typically includes clarity, intelligibility, and naturalness.
- Strengths: MOS is easy to set up, inexpensive, and produces results that are widely understood. Because it’s standardized by the International Telecommunication Union (ITU), it’s also trusted across industries.
- Limitations: MOS is coarse-grained. Subtle differences between two high-quality TTS systems may not show up in listener ratings. It also depends heavily on subjective impressions, which can vary by listener background and experience.
For TTS practitioners, MOS is a great starting point. It gives a big-picture view of whether a system sounds “good enough” and allows benchmarking across systems.
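For illustration, here is a minimal Python sketch (with hypothetical ratings) of how the five-point scores described above might be aggregated into a per-system MOS with a rough 95% confidence interval:

```python
import math
from statistics import mean, stdev

# Hypothetical listener ratings (1 = Bad ... 5 = Excellent) for two TTS systems.
ratings = {
    "system_a": [4, 5, 4, 3, 4, 5, 4, 4],
    "system_b": [3, 3, 4, 2, 3, 4, 3, 3],
}

def mos_with_ci(scores, z=1.96):
    """Return the mean opinion score and an approximate 95% confidence interval."""
    m = mean(scores)
    # Standard error of the mean; the normal approximation is crude for
    # small panels but fine as a quick sanity check.
    se = stdev(scores) / math.sqrt(len(scores))
    return m, (m - z * se, m + z * se)

for system, scores in ratings.items():
    mos, (lo, hi) = mos_with_ci(scores)
    print(f"{system}: MOS = {mos:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Reporting the confidence interval alongside the mean makes it easier to see whether a difference between two systems is larger than the noise in the panel.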
2. MUSHRA (Multiple Stimuli with Hidden Reference and Anchor)
MUSHRA is a more advanced evaluation framework created by the ITU (Recommendation ITU-R BS.1534) to assess intermediate audio quality. Unlike MOS, MUSHRA uses a 0–100 scale and requires listeners to compare multiple versions of the same stimulus side by side.
Each test includes:
- A hidden reference (a high-quality version of the sample).
- One or more anchors (low-quality or degraded versions to set context).
- The text to speech systems under test.
Listeners score each version, resulting in a far more detailed picture of performance.
- Strengths: MUSHRA is highly sensitive to small differences, making it particularly useful for comparing text to speech systems that are close in quality. The inclusion of references and anchors helps listeners calibrate their judgments.
- Limitations: It’s more complex to run. Setting up anchors, references, and multiple samples requires careful design. It also assumes listeners are trained enough to understand the rating task.
For text to speech practitioners, MUSHRA is often the preferred method for fine-tuning models or evaluating incremental improvements.
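As a rough sketch of how MUSHRA results might be processed, the snippet below (with hypothetical data) first drops unreliable listeners and then averages the remaining 0–100 scores per condition. The post-screening rule shown is modeled on the hidden-reference check in ITU-R BS.1534, but verify the exact criteria against the current recommendation before relying on it.

```python
# Minimal sketch of MUSHRA post-screening and scoring.
# Assumes each listener's scores are stored as:
#   results[listener][item][condition] = score on the 0-100 scale,
# where "hidden_reference" is one of the conditions.

def screen_listeners(results, threshold=90, max_fail_rate=0.15):
    """Drop listeners who rate the hidden reference below `threshold` on more
    than `max_fail_rate` of items (post-screening in the spirit of ITU-R BS.1534)."""
    kept = {}
    for listener, items in results.items():
        fails = sum(1 for scores in items.values() if scores["hidden_reference"] < threshold)
        if fails / len(items) <= max_fail_rate:
            kept[listener] = items
    return kept

def mean_scores(results):
    """Average each condition's score over all remaining listeners and items."""
    totals, counts = {}, {}
    for items in results.values():
        for scores in items.values():
            for condition, score in scores.items():
                totals[condition] = totals.get(condition, 0) + score
                counts[condition] = counts.get(condition, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

results = {
    "listener_1": {
        "item_1": {"hidden_reference": 98, "anchor": 25, "tts_a": 72, "tts_b": 64},
        "item_2": {"hidden_reference": 95, "anchor": 30, "tts_a": 68, "tts_b": 70},
    },
    "listener_2": {  # fails post-screening: hidden reference rated too low
        "item_1": {"hidden_reference": 60, "anchor": 40, "tts_a": 80, "tts_b": 75},
        "item_2": {"hidden_reference": 55, "anchor": 35, "tts_a": 78, "tts_b": 74},
    },
}

print(mean_scores(screen_listeners(results)))
```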
3. PESQ / POLQA
While MOS and MUSHRA rely on human listeners, PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) and its successor POLQA (Perceptual Objective Listening Quality Analysis, ITU-T P.863) are algorithmic measures. They simulate how the human ear and brain perceive audio, allowing for automated testing without human panels.
Originally designed for voice calls and codecs, PESQ and POLQA are useful for large-scale or repeated evaluations where running human studies would be impractical.
- Strengths: They’re fast, repeatable, and objective. Results don’t depend on listener bias or fatigue.
- Limitations: Because they were designed for telephony, they don’t always capture naturalness or expressiveness—two key dimensions in text to speech.
In practice, PESQ/POLQA are often paired with subjective tests like MOS or MUSHRA. This combination gives both scalability and human-validated accuracy.
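For a sense of what an automated check looks like, here is a minimal sketch assuming the open-source `pesq` Python package (`pip install pesq`) and 16 kHz mono WAV files. POLQA is a licensed algorithm, so it is typically run through commercial tooling rather than an open library. Note that PESQ scores a degraded signal against a time-aligned natural reference of the same utterance, which is one more reason these scores are usually paired with listening tests for text to speech.

```python
# Minimal sketch of an automated wideband PESQ check.
from scipy.io import wavfile
from pesq import pesq

# Reference recording (e.g., studio-quality human speech) and the
# synthesized version to be scored against it (hypothetical file names).
ref_rate, ref = wavfile.read("reference.wav")
deg_rate, deg = wavfile.read("tts_output.wav")
assert ref_rate == deg_rate == 16000, "wideband PESQ expects 16 kHz audio"

# 'wb' selects wideband PESQ; the result maps roughly onto a MOS-like scale.
score = pesq(ref_rate, ref, deg, "wb")
print(f"PESQ (wideband): {score:.2f}")
```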
4. ABX Testing
ABX testing is a simple yet powerful method for checking whether listeners can tell two systems apart. Listeners are presented with three samples:
- A (text to speech system 1)
- B (text to speech system 2)
- X (a randomly chosen repeat of either A or B)
The listener must decide whether X sounds more like A or B.
- Strengths: ABX is excellent for direct comparisons between two systems. It’s intuitive, easy to run, and works well for testing whether a new model sounds noticeably different from a baseline.
- Limitations: ABX doesn’t provide absolute quality ratings, and it doesn’t say which system listeners like better. It only shows whether listeners can reliably tell the two systems apart, so pair it with an A/B preference test when you need to know which one sounds better.
In text to speech research, ABX is often used alongside A/B preference tests during product development, where developers want to know whether new changes are even noticeable to users.
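To make that concrete, here is a minimal sketch (assuming results have already been tallied as correct or incorrect identifications of X) of checking whether the panel beats chance with a one-sided binomial test:

```python
# Minimal sketch of analyzing ABX results: if listeners cannot tell the
# two systems apart, they should identify X correctly about 50% of the time.
from scipy.stats import binomtest

correct = 68   # trials where the listener matched X to the right system (hypothetical)
total = 100    # total ABX trials across the panel (hypothetical)

test = binomtest(correct, total, p=0.5, alternative="greater")
print(f"Correct identification rate: {correct / total:.0%}")
print(f"p-value vs. chance guessing: {test.pvalue:.4f}")
# A small p-value suggests the difference between the systems is audible;
# it says nothing about which system listeners actually prefer.
```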
MUSHRA vs. MOS for Text to Speech
The MUSHRA vs. MOS debate is one of the most important considerations in text to speech evaluation. Both methods are widely used, but they differ in purpose:
- MOS is best for high-level benchmarking. If a company wants to compare its text to speech system against a competitor or show general quality improvements over time, MOS is simple, efficient, and widely recognized.
- MUSHRA, on the other hand, is best for fine-grained analysis. By using anchors and references, it forces listeners to pay closer attention to differences in audio quality. This makes it particularly valuable for development and research, where small improvements in prosody, pitch, or clarity matter.
In practice: many practitioners use MOS in the early stages to get a baseline, then switch to MUSHRA for detailed testing once systems are close in performance. This layered approach ensures evaluations are both practical and precise.
Best Practices for Text to Speech Practitioners
To get reliable, actionable results from text to speech evaluation:
- Combine methods: Use MOS for benchmarking, MUSHRA for fine-tuning, PESQ/POLQA for scalability, and ABX to confirm that changes are audible (a simple plan combining them is sketched after this list).
- Recruit diverse panels: Listener perception varies by accent, age, and listening experience. A diverse group ensures results reflect real-world audiences.
- Provide context: Evaluate text to speech in the context it will be used (e.g., audiobook vs navigation system). What matters for one scenario may not matter for another.
- Validate with users: At the end of the day, the best measure of quality is whether people can comfortably use the text to speech system for learning, working, or daily life.
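As promised above, here is a hypothetical outline of how the four methods might be layered into a single evaluation plan; the stage names and pairings are illustrative, not a standard workflow:

```python
# Hypothetical layered evaluation plan combining the methods discussed above.
evaluation_plan = [
    {"stage": "nightly regression", "method": "PESQ/POLQA",
     "purpose": "catch large objective-quality regressions automatically"},
    {"stage": "release benchmark", "method": "MOS (1-5 scale)",
     "purpose": "track overall quality against previous releases and competitors"},
    {"stage": "model fine-tuning", "method": "MUSHRA (0-100 scale)",
     "purpose": "separate systems that are close in quality"},
    {"stage": "pre-launch check", "method": "ABX",
     "purpose": "confirm a proposed change is audible before shipping it"},
]

for step in evaluation_plan:
    print(f'{step["stage"]:>18}: {step["method"]:<20} {step["purpose"]}')
```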
Why Speechify Prioritizes Quality in Text to Speech
At Speechify, we know that voice quality makes the difference between a tool that people try once and a tool they rely on daily. That’s why we use a multi-layered evaluation strategy, combining MOS, MUSHRA, PESQ/POLQA, and ABX to measure performance from every angle.
Our process ensures that every new AI voice model is not only technically strong but also comfortable, natural, and engaging for real users. Whether it’s helping a student with dyslexia keep up in school, enabling professionals to multitask with audiobooks, or supporting global learners with multilingual voices, Speechify’s commitment to quality means users can trust the experience.
This dedication reflects our mission: to make text to speech technology inclusive, reliable, and world-class.
Measuring What Matters in Text to Speech
Measuring text to speech quality is both a science and an art. Subjective methods like MOS and MUSHRA capture human impressions, while objective methods like PESQ and POLQA provide scalable insights. ABX tests add discrimination-based comparisons that show whether changes are actually audible, which is critical in product development.
The MUSHRA vs. MOS debate shows that no single test is enough. For practitioners, the best strategy is to combine methods, validate results with diverse users, and always keep real-world accessibility in mind.
With platforms like Speechify leading in quality evaluation and innovation, the future of text to speech isn’t just intelligible—it’s natural, accessible, and built for everyone.