Text to speech voices: how do they work?
Just how do text to speech voices work? We take a quick look at the AI technology that turns words into natural-sounding voices - on the fly!
While the concept of text to speech - that is to say, computer software that can read the words on a computer screen out loud to the user - is nothing new, it certainly seems to be going through something of a revolution over the last few years.
According to one recent study, the text to speech market was valued at an incredible $2 billion in 2020 - due in part to the impact of the still-ongoing COVID-19 pandemic. Not only that, but it is estimated to grow in value to $5 billion by as soon as 2026 - an impressive compound annual growth rate of 14.6%.
Much of this can be attributed to the ways in which text to speech solutions help those with a wide array of different vision impairments. As per the Centers for Disease Control and Prevention, about 12 million people over the age of 40 in the United States have some form of vision impairment. Of that number, one million are totally blind and eight million have vision problems caused by an uncorrected refractive error. That number is up from 4.2 million in 2012.
All of this is to say that text to speech technology has more than proven its worth over the years. Many solutions like Speechify even offer multiple high quality voices for users to choose from depending on their needs. But how do these solutions work and how are there so many voice options available? The answers to questions like those require you to keep a few important things in mind.
The Inner Workings of Text to Speech
Before you get to the actual voices behind text to speech, it's important to come to a better understanding of how these solutions work in the first place.
Text to speech uses artificial intelligence, machine learning and related technologies to convert the written words on a page or screen into audio content that can be read out loud. This includes not only the content of a website or an article, but also text written in applications like Microsoft Word and others.
The audio content itself is generated entirely by the device being used. In addition to working on desktop and laptop computers, text to speech is also available on nearly every smartphone, tablet or other mobile device available on the market today.
In many solutions, the text to speech processing is handled locally on the device itself, which makes text to speech valuable even when no Internet connection is present.
In addition to allowing people with visual issues to access and digest written content, text to speech is also helpful because the pitch and even the pace of the voice can be controlled. If you want to slow something down so that you can better understand it, you can. Likewise, if you want to speed up the voice to get through content faster, you can do that as well.
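To make the pace-control idea concrete, here is a toy sketch that treats audio as a plain list of samples and resamples it to play faster or slower. This is purely illustrative: real players use time-stretching algorithms so that changing the pace does not also change the pitch, whereas naive striding like this shifts both.

```python
# Toy illustration of playback-rate control, assuming audio is a plain
# list of PCM samples. Real solutions use time-stretching so the pitch
# stays constant; naive striding like this raises pitch along with speed.

def change_speed(samples, factor):
    """Return samples resampled to play back `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    out = []
    pos = 0.0
    while pos < len(samples):
        out.append(samples[int(pos)])  # pick every `factor`-th sample
        pos += factor
    return out

audio = list(range(10))          # stand-in for 10 audio samples
fast = change_speed(audio, 2.0)  # twice as fast -> half the samples
slow = change_speed(audio, 0.5)  # half speed -> twice the samples
```

Speeding up to 2x keeps every other sample; slowing to 0.5x repeats each sample, doubling the playback time.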
Text to Speech Voices: Breaking Things Down
When it comes to the actual voice used by these text to speech solutions, it ultimately all comes down to a concept called a speech synthesizer.
What is a Speech Synthesizer?
Speech synthesis is a form of output in which your computer (or other device) reads words aloud in a previously chosen voice. Conceptually, it's not that dissimilar to reading the words on a page yourself or even printing them out - in every case, the computer is outputting the requested information. Only instead of doing so via text alone, it does so via a voice that you can hear through your speakers or headphones.
Generally speaking, speech synthesis follows a number of basic-yet-important steps. The first of these involves the conversion of the text on a page into words.
Step 1: Pre-Processing
At this part of the process, text to speech solutions analyze the content you want to read and convert its letters - which are essentially just symbols - into words. This part of the process is important, as the written word can sometimes be more ambiguous than people realize. Certain words or even phrases can mean multiple things. Likewise, the computer needs to be able to "understand" that an abbreviation like "St." should be read as "Saint" in "St. Louis" but as "Street" in "Main St." - a single spelling whose correct reading can dramatically change with the context of a sentence.
This is where artificial intelligence and machine learning come into play. With AI, text to speech solutions can be "trained" to eliminate this ambiguity as much as possible. This stage of the text to speech voice process is called "pre-processing," as it is happening "behind the scenes" before the application in question ever reads anything out loud.
This is also the phase where the text to speech solution will differentiate between words that may be spelled the same but that sound different depending on how they're used. "Read" is a perfect example of this, because it's possible that you may want to read a book this evening to relax even though you've read that book countless times in the past. Humans can easily differentiate between these two ideas given the context - artificial intelligence is employed on the computing side to achieve much the same result.
Equally difficult at this stage are things like numbers, abbreviations and acronyms. Special characters like the dollar sign are also harder to "translate" than words alone. This is why the pre-processing phase is so important - it helps to make sure that everything that will eventually be read out loud actually makes sense in the context in which it was intended.
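A minimal sketch of this pre-processing (often called text normalization) might look like the following. The abbreviation table and the dollar-sign rule are illustrative stand-ins, not a real TTS front end; production systems use context to decide between readings rather than a fixed lookup.

```python
import re

# Illustrative text-normalization step: expand symbols and abbreviations
# into plain words so they can be pronounced. A real system would pick
# between readings ("St." as "Saint" vs "Street") using context.

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def normalize(text):
    # Expand "$20" -> "20 dollars"; special characters are rarely read literally.
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)
    # Expand the known abbreviations from the (assumed) table above.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

print(normalize("Dr. Smith paid $20 on Main St."))
# -> "Doctor Smith paid 20 dollars on Main Street"
```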
Step 2: Understanding Pronunciation
Once the text has been analyzed and the text to speech solution "understands" what words must be spoken out loud, the next part of the process begins. This is when those words are converted into phonemes - essentially, the solution learns how to appropriately pronounce the words in the text in question.
This is a part of the process that has evolved dramatically over the years. If you ever had the opportunity to use a text to speech solution from the 1990s (or have watched an older movie from the 1970s or 80s that featured a scene with text to speech), you were probably dealing with a computer voice that didn't sound natural. It was immediately identifiable as being generated by a computer and even though you could understand what it was saying, most words were likely pronounced incorrectly.
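In spirit, this grapheme-to-phoneme step can be sketched as a pronunciation lookup. The tiny lexicon and the "tense" hint below are hypothetical simplifications: real systems combine a large pronunciation dictionary (such as CMUdict) with a trained model for unknown words, and infer the right reading from the surrounding sentence rather than an explicit flag.

```python
# Toy grapheme-to-phoneme lookup. The heteronym "read" gets a different
# phoneme sequence depending on context, which we fake with a tense hint.
# Phoneme symbols follow the ARPAbet-style convention.

LEXICON = {
    ("read", "present"): ["R", "IY", "D"],  # rhymes with "reed"
    ("read", "past"):    ["R", "EH", "D"],  # rhymes with "red"
    ("book", None):      ["B", "UH", "K"],
}

def to_phonemes(word, tense=None):
    key = (word.lower(), tense)
    if key in LEXICON:
        return LEXICON[key]
    # Fall back to the context-free entry if one exists.
    return LEXICON.get((word.lower(), None), [])

print(to_phonemes("read", "present"))  # -> ['R', 'IY', 'D']
print(to_phonemes("read", "past"))     # -> ['R', 'EH', 'D']
```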
Step 3: The Conversion to Speech Begins
Once those phonemes have been identified, the text to speech solution moves onto the final part of the process: converting that information into sound that can be played out loud over a device's speakers or headphones.
This is something that happens in a few different ways depending on the solution that you're using. One of those sees a human actor or actress read a list of phonemes out loud, after which those recordings are fed back into the solution itself. Then, once a specific block of text has been scanned by the application, it can match the phonemes that it finds on the page with the phonemes that have been previously recorded, stitching the two together to play back an audio version of the text in a far more natural way than ever before.
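The core of that concatenative approach can be sketched in a few lines: each phoneme maps to a short pre-recorded clip, and the clips are joined in order. The sample lists below are stand-ins for real recorded waveforms; real systems also smooth the joins so the result doesn't sound choppy.

```python
# Sketch of concatenative synthesis: look up a recorded unit for each
# phoneme and append them in order. The short lists stand in for the
# actual recorded audio clips.

UNITS = {
    "K":  [0.1, 0.2],
    "AE": [0.3, 0.4, 0.5],
    "T":  [0.6],
}

def synthesize(phonemes):
    audio = []
    for p in phonemes:
        audio.extend(UNITS[p])  # append the pre-recorded unit
    return audio

print(synthesize(["K", "AE", "T"]))  # "cat" -> [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```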
Some solutions still allow the computer to generate the voice itself. It still operates in much the same way, only the "voice" is not based on previously recorded audio but is simply created by generating specific sound frequencies in the appropriate order.
To that end, it's not entirely dissimilar to the way a music synthesizer might allow a musician to mimic the sounds of instruments using a standard keyboard plugged into a computer. They can play the keyboard like they would the piano, although instead of piano music each key might mimic a different chord on a guitar or sounds from a drum. It's still a computer "understanding" the intent of each key strike and pairing it up with the appropriate sound, albeit in a different context.
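A purely generated voice works more like the synthesizer analogy above: the sound is built by producing frequencies directly rather than replaying recordings. The sketch below renders one sine tone per "phoneme" at an assumed, illustrative frequency - real formant synthesis layers several frequencies with carefully shaped envelopes, so this is only the barest outline of the idea.

```python
import math

# Generate audio from scratch: one sine tone per "phoneme". The
# phoneme-to-frequency table is hypothetical, not real formant data.

SAMPLE_RATE = 16000  # samples per second

def tone(freq_hz, duration_s):
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]

FREQS = {"AA": 700.0, "IY": 300.0}  # illustrative values only

audio = tone(FREQS["AA"], 0.1) + tone(FREQS["IY"], 0.1)
print(len(audio))  # 3200 samples = 0.2 s at 16 kHz
```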
Voice Options and Beyond
Part of the reason why there are so many different voice options available in these voice generator text to speech solutions is that they're not actually as difficult to create as a lot of people assume. The phonemes an AI voice generator needs are quite common across human speech. All it takes is for an actor or actress to sit in front of a microphone and read a short script containing all of the necessary phonemes, at which point those recordings can be fed back into the solution itself.
The AI speech technology will recognize each of the phonemes individually, essentially "breaking" that recording into the sum of its parts and using whichever ones are necessary to accurately generate the text to speech voices necessary when a user is trying to read a website or some other form of content.
Of course, there are many other potential uses for this type of natural sounding voice generator beyond simply helping those with visual impairments. Over the last few years, the public has become very interested in AI speech and voice generation thanks to social media networks like TikTok.
TikTok is actually one of the larger brands that has embraced AI voice generation, allowing users to record videos, put text over those videos and then have speech synthesis read that content out loud. It's a fun way to add an additional layer of immersion to content posted on TikTok and it's one that is only going to get more popular as time goes on.
The Future of Text to Speech Has Arrived
In the end, voice text to speech is an invaluable tool because of what it enables us to do. It allows people with visual issues to enjoy and understand all of the same content that everyone else does, all on their own terms. It can take any blog post, article, document, white paper or other written content and turn it into an easily consumable audio experience, allowing you to enjoy it not just at home but on your commute, while you're at the gym, etc.
Not only does it make our lives more productive, but it also helps to solve a variety of significant problems like those outlined above. Based on all of that, it's easy to see why speech synthesis and AI speech have become so popular over the last few years in particular.
If you'd like to find out more information about text to speech voices, or if you'd just like to learn more about the ways in which such a solution can benefit your life, please don't delay - try Speechify free today.
Speechify is the #1 rated app in the App Store, with the most natural-sounding speech, a polished user experience and plenty of custom voices.
Speechify is available in a few flavors: plans for single users and groups, plus an API for businesses of all sizes.
Tyler Weitzman
Tyler Weitzman is the Co-Founder, Head of Artificial Intelligence & President at Speechify, the #1 text-to-speech app in the world, totaling over 100,000 5-star reviews. Weitzman is a graduate of Stanford University, where he received a BS in mathematics and an MS in Computer Science in the Artificial Intelligence track. He has been selected by Inc. Magazine as a Top 50 Entrepreneur, and he has been featured in Business Insider, TechCrunch, LifeHacker and CBS, among other publications. Weitzman's master's degree research focused on artificial intelligence and text-to-speech; his final paper was titled "CloneBot: Personalized Dialogue-Response Predictions."