Speech synthesis, or the artificial production of the human voice, has come a long way over the last 70 years. Whether you use text-to-speech services to listen to books, study, or proofread your own written work, there's no doubt that this technology has made life easier for people in a variety of professions.
Here, we’ll take a look at how text-to-speech processing works, and how the assistive technology has changed over time.
In the 1700s, Russian professor Christian Kratzenstein created acoustic resonators that mimicked the sound of the human voice. Much later, in 1939, the VODER (Voice Operating Demonstrator) made big headlines at the New York World's Fair when creator Homer Dudley showed crowds how human speech could be created through artificial means. The device was tough to play: Dudley had to control the fundamental frequency using foot pedals.
In the early 1800s, Charles Wheatstone developed a mechanical speech synthesizer, building on Wolfgang von Kempelen's earlier speaking machine. This kick-started a rapid evolution of articulatory synthesis tools and technologies.
It can be tough to pin down exactly what makes a good text-to-speech program, but like many things in life, you know it when you hear it. A high-quality text-to-speech program offers natural-sounding voices with real-life inflection and tone.
Text-to-speech technology can help people who are visually impaired and live with other disabilities get the information they need to thrive at work and to communicate with others. The software also allows students and others with heavy workloads of reading to listen to their information via human speech when they’re on the go. Synthetic speech allows people to get more done in less time, and can be useful in a variety of settings, from video game creation to helping people with language processing differences.
1950s and 60s
In the late 1950s, the first computer-based speech synthesis systems were created. In 1961, John Larry Kelly Jr., a physicist at Bell Labs, used an IBM computer to synthesize speech. His vocoder (voice coder synthesizer) recreated the song Daisy Bell.
At the time Kelly was perfecting his vocoder, Arthur C. Clarke, author of 2001: A Space Odyssey, drew on Kelly's demonstration for his book's screenplay. In one scene, the HAL 9000 computer sings Daisy Bell.
In 1966, linear predictive coding came onto the scene. This form of speech coding began its development under Fumitada Itakura and Shuzo Saito. Bishnu S. Atal and Manfred R. Schroeder also contributed to its development.
In 1975, Itakura developed the line spectral pairs method. This high-compression speech coding method helped Itakura deepen his research into speech analysis and synthesis, identifying weak spots and working out how to improve them.
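Linear predictive coding rests on a simple idea: each speech sample can be approximated as a weighted sum of the few samples just before it. The sketch below estimates those weights with the classic autocorrelation method and Levinson-Durbin recursion; the function names and the toy signal are illustrative, not taken from any particular codec.

```python
def autocorrelation(signal, max_lag):
    """r[k] = sum of signal[i] * signal[i + k] over the analysis window."""
    n = len(signal)
    return [sum(signal[i] * signal[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion: returns predictor coefficients a[1..order]
    (convention x[n] + a1*x[n-1] + ... + a_p*x[n-p] = residual)
    and the residual (prediction error) energy."""
    r = autocorrelation(signal, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this order
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= 1.0 - k * k
    return a[1:], error

# A decaying exponential obeys x[n] = 0.9 * x[n-1] exactly, so an order-1
# predictor recovers the factor: a1 comes out close to -0.9, and the residual
# energy is roughly the energy of the one unpredictable first sample.
signal = [0.9 ** t for t in range(50)]
coeffs, residual = lpc_coefficients(signal, order=1)
```

Real speech coders apply this over short windows with higher orders (around 10 to 16), then transmit the coefficients and a compact residual, which is what made the LPC speech chips of the late 70s practical.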
That same year, MUSA was released. This stand-alone speech synthesis system used an algorithm to read Italian out loud. A version released three years later was able to sing in Italian.
In the 70s, the first articulatory synthesizer modeled on the human vocal tract was developed by Tom Baer, Paul Mermelstein, and Philip Rubin at Haskins Laboratories. The trio used information from the vocal tract models created at Bell Laboratories in the 60s and 70s.
In 1976, Kurzweil Reading Machines for the Blind were introduced. While these devices were far too expensive for the general public, libraries often provided them for people with visual impairments to listen to books.
Linear predictive coding became the starting point for synthesizer chips. Texas Instruments' LPC speech chips and the Speak & Spell toys of the late 1970s both used this technology. These toys were examples of human voice synthesis with accurate intonations, setting them apart from the commonly robotic-sounding synthesized voices of the time. Many handheld electronics with the ability to synthesize speech became popular during this decade, including the Telesensory Systems Speech+ calculator for the blind. The Fidelity Voice Chess Challenger, a chess computer that was able to synthesize speech, was released in 1979.
In the 1980s, speech synthesis began to rock the video game world. In 1980, Sun Electronics released Stratovox, a shooting-style arcade game. Manbiki Shoujo (Shoplifting Girl in English) was the first personal computer game with the ability to synthesize speech. The electronic game Milton was also released in 1980; it was The Milton Bradley Company's first electronic game able to synthesize the human voice.
In 1983, Digital Equipment Corporation released DECtalk, a standalone text-to-speech machine. DECtalk understood phonetic spellings of words, allowing customized pronunciation of unusual words. These phonetic spellings could also include a tone indicator, which DECtalk would use when enunciating the phonetic components. This allowed DECtalk to sing.
In the late 80s, Steve Jobs founded NeXT, and Trillium Sound Research developed a text-to-speech system for its computers. While NeXT itself didn't take off, Apple acquired the company in the 90s.
Earlier versions of synthesized text-to-speech systems sounded distinctly robotic, but that began to change in the late 80s and early 90s. Softer consonants allowed speaking machines to lose the electronic edge and sound more human. In 1990, Ann Syrdal at AT&T Bell Laboratories developed a female speech synthesizer voice, and engineers worked to make voices more natural-sounding throughout the 90s.
In 1999, Microsoft released Narrator, a screen reader solution that is now included in every copy of Microsoft Windows.
Speech synthesis ran into some hiccups during the 2000s, as developers struggled to create agreed-upon standards for synthesized speech. Since speech is highly individual, it’s hard for people around the world to come together and agree on proper pronunciation of phonemes, diphones, intonation, tone, pattern playback, and inflection.
The quality of formant-synthesis audio had also become more of a concern in the 90s, as engineers and researchers noticed that the playback systems used in the lab were often far more advanced than the equipment users had at home. When thinking of speech synthesis, many people think of Stephen Hawking's voice synthesizer, which produced a robotic-sounding voice with little human tone.
In 2005, researchers finally reached some agreement and began to use a common speech dataset, allowing them to work from the same baseline when creating high-level speech synthesis systems.
In 2007, a study showed that listeners can tell whether a speaker is smiling. Researchers continue to explore how to use this kind of information to make speech recognition and speech synthesis software more natural.
Today, speech synthesis products that use speech signals are everywhere, from Siri to Alexa. Electronic speech synthesizers don't just make life easier; they also make life more fun. Whether you're using a TTS system to listen to novels on the go or using apps that make it easier to learn a foreign language, it's likely that you rely on text-to-speech technology on a daily basis.
In coming years, it’s likely that voice synthesis technology will focus on creating a model of the brain to better understand how we record speech data in our minds. Speech technology will also work to better understand the role that emotion plays in speech, and will use this information to create AI voices that are indistinguishable from actual humans.
The Latest In Voice Synthesis Technology: Speechify
When learning about earlier speech synthesis technology, it's amazing to see how far the science has come. Today, apps like Speechify make it easy to convert any text into audio. With just the touch of a button (or tap on an app), Speechify can take websites, documents, and images of text and turn them into natural-sounding speech. Speechify's library syncs across all your devices, making it simple to keep learning and working on the go. Check out the Speechify app in Apple's App Store and on Google Play.
Who invented text-to-speech?
The first text-to-speech system for English was developed by Noriko Umeda and her team at the Electrotechnical Laboratory in Japan in 1968.
What is the purpose of text-to-speech?
Many people use text-to-speech technology. For people who prefer to get their information in audio format, TTS technology can make it simple to get the information necessary to work or learn, without having to spend hours in front of a book. Busy professionals also use TTS technology to stay on top of their work when they’re unable to sit in front of a computer screen. Many types of TTS technology were originally developed for people with visual impairments, and TTS is still a fantastic way for people who struggle to see to get the information that they need.
How do you synthesize speech?
Pieces of recorded speech are stored in a database as units of various sizes. Software selects the best-matching units and concatenates them into audio; from there, a voice is created. Often, the larger a program's output range, the harder it is for the program to maintain vocal clarity.