Because text-to-speech is a fairly new addition to HTML5, its inner workings are often misunderstood. What follows is an explanation of what text-to-speech makes possible, how it works (explained in plain English, don’t worry!) and how ResponsiveVoice can help you.
What is speech synthesis?
Speech synthesis is the artificial reproduction of human speech. A text-to-speech system, then, is a system that converts written language into spoken words through speech synthesis.
How does speech synthesis work?
Text-to-speech systems are usually made of two parts. First comes the front-end, which converts symbols (like numbers or abbreviations) to their written-out counterparts, a step called text normalization or tokenization. It also divides the text into sentences, so that even a text without any punctuation will have the pacing you’d expect in a normal conversation. The front-end then assigns phonetic transcriptions (i.e. representations of sound) to each word. After that, the back-end comes into play by converting these phonetic representations into actual sound. This process can also incorporate variations in voice pitch and talking speed.
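To make the front-end less abstract, here is a minimal sketch of its first two jobs: writing out symbols and splitting text into sentences. It is a toy, not how any real synthesizer is implemented. The abbreviation table, the function names and the 0 to 99 number limit are all assumptions made for illustration, and the phonetic transcription step is left out entirely.

```typescript
// Toy text-to-speech front-end: normalization and sentence splitting only.
// The abbreviation table and all function names are illustrative assumptions.

const ABBREVIATIONS: Array<[string, string]> = [
  ["Dr.", "Doctor"],
  ["Mr.", "Mister"],
  ["approx.", "approximately"],
];

const ONES = [
  "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
  "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
  "sixteen", "seventeen", "eighteen", "nineteen",
];
const TENS = [
  "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
  "eighty", "ninety",
];

// Spell out an integer from 0 to 99.
function numberToWords(n: number): string {
  if (n < 20) return ONES[n];
  const tens = TENS[Math.floor(n / 10)];
  const ones = n % 10;
  return ones === 0 ? tens : `${tens}-${ONES[ones]}`;
}

// Replace known abbreviations and one- or two-digit numbers with words.
// Longer numbers are left untouched in this sketch.
function normalize(text: string): string {
  let out = text;
  for (const [abbr, full] of ABBREVIATIONS) {
    out = out.split(abbr).join(full);
  }
  return out.replace(/\b\d{1,2}\b/g, (digits) => numberToWords(Number(digits)));
}

// Split text into sentences, keeping the closing ., ! or ? with each one.
// Text with no punctuation at all comes back as a single sentence.
function splitSentences(text: string): string[] {
  const pieces = text.match(/[^.!?]+[.!?]*/g) || [];
  return pieces.map((s) => s.trim()).filter((s) => s.length > 0);
}
```

Normalization runs before splitting on purpose: once "Dr." has become "Doctor", its period can no longer be mistaken for the end of a sentence. Calling `splitSentences(normalize("Dr. Smith has 2 cats. They sleep 18 hours a day!"))` gives two sentences, "Doctor Smith has two cats." and "They sleep eighteen hours a day!".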
So you’re basically generating mp3 files and then playing them?
That is incorrect. We’re currently in the third generation of digital text-to-speech systems: generating an audio file was needed in the first and second generations, but it has now been superseded by native speech synthesis (except in a few very specialized cases, which I’ll mention later). Here’s an overview of how the different generations work:
While technologies that convert text into an mp3 file do exist, native text-to-speech synthesizers simply generate sound based on a prior analysis of a piece of text, much like playing a song by following its sheet music. This solution is usually preferred because the sound is generated on the user’s own machine: there’s no file to stream over the internet, so it uses no bandwidth, and no file is created and saved, so it takes up no disk or server space. It also means that native text-to-speech is much more responsive, as there is no need to wait for a file to be generated (which can take quite a while when working with a long piece of text).
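In a browser, native synthesis of this kind is exposed through the standard Web Speech API. The sketch below shows roughly what "no file involved" looks like in practice, using that API's `speechSynthesis` and `SpeechSynthesisUtterance`. The helper names (`VoiceSettings`, `clampSettings`, `speakNatively`) are made up for this example, and it is not ResponsiveVoice's own code.

```typescript
// Minimal sketch of native speech synthesis with the Web Speech API.
// VoiceSettings, clampSettings and speakNatively are hypothetical names.

interface VoiceSettings {
  rate: number;  // talking speed, 1 is normal
  pitch: number; // voice pitch, 1 is normal
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Keep settings inside the ranges the Web Speech API defines:
// rate from 0.1 to 10, pitch from 0 to 2.
function clampSettings(settings: VoiceSettings): VoiceSettings {
  return {
    rate: clamp(settings.rate, 0.1, 10),
    pitch: clamp(settings.pitch, 0, 2),
  };
}

// Queue text to be spoken on the user's own machine.
// Returns false when no synthesizer is available (for example outside
// a browser), and true once the utterance has been queued.
function speakNatively(
  text: string,
  settings: VoiceSettings = { rate: 1, pitch: 1 },
): boolean {
  // Looked up through globalThis so the sketch also loads outside a browser.
  const g = globalThis as any;
  if (!g.speechSynthesis || !g.SpeechSynthesisUtterance) return false;

  const utterance = new g.SpeechSynthesisUtterance(text);
  const { rate, pitch } = clampSettings(settings);
  utterance.rate = rate;
  utterance.pitch = pitch;
  g.speechSynthesis.speak(utterance); // no audio file is created or downloaded
  return true;
}
```

In a page, `speakNatively("Hello world", { rate: 1.2, pitch: 1 })` would hand the text straight to the browser's synthesizer. Note that some browsers only allow speech after the user has interacted with the page, so a call like this usually belongs in a click handler.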
Is native speech synthesis right for me?
Probably, yes.
Services that create an mp3 file are only useful if you actually need the file, e.g. you want to incorporate it in a bigger audio file or a video game, or you want to modify it in some way. In any other case, you’ll do absolutely fine with native speech synthesis. It’s easier to set up and there’s no need to fiddle with files and FTP clients to put your audio online.
But I really need an audio file!
The following services allow you to enter text and then download a spoken audio file of it. Each has its own limitations, and they differ from one another.
- Text2Speech (lots of languages, fairly quick to create the file),
- From text to speech (US/UK English, French, German, Italian, Spanish, Arabic),
- YAKiToMe (lots of languages, but most have a per-word cost),
- NaturalReader (only available in paid versions),
- Listen (English only).
ResponsiveVoice takes you into the future of web speech synthesis: say goodbye to managing MP3 audio files. Text-to-speech is instant, there are no per-word costs, and native TTS can even work without an internet connection.
Even if you have to use MP3s today, we hope this article has opened your ears to what is possible for your future projects and businesses.