Developing apps with the Google Speech APIs

A practical guide to develop advanced and exciting voice applications for Android using open source software

(For more resources related to this topic, see here.)

Speech technology has come of age. If you own a smartphone or tablet you can perform many tasks on your device using voice. For example, you can send a text message, update your calendar, set an alarm, and do many other things with a single spoken command that would take multiple steps to complete using traditional methods such as tapping and selecting. You can also ask the sorts of queries that you would previously have typed into your Google search box and get a spoken response.

For those who wish to develop their own speech-based apps, Google provides APIs for the basic technologies of text-to-speech synthesis (TTS) and automated speech recognition (ASR). Using these APIs developers can create useful interactive voice applications. This article provides a brief overview of the Google APIs and then goes on to show some examples of voice-based apps built around the APIs.

Using the Google text-to-speech synthesis API

TTS has been available on Android devices since Android 1.6 (API Level 4). The components of the Google TTS API (package android.speech.tts) are documented at http://developer.android.com/reference/android/speech/tts/package-summary.html. Interfaces and classes are listed here and further details can be obtained by clicking on these.

Starting the TTS engine involves creating an instance of the TextToSpeech class along with the method that will be executed when the TTS engine is initialized. Checking that TTS has been initialized is done through an interface called OnInitListener. If TTS initialization is complete, the method onInit is invoked.

If TTS has been initialized correctly, the speak method is invoked to speak out some words:

TextToSpeech tts = new TextToSpeech(this, new OnInitListener(){ public void onInit(int status) { if (status == TextToSpeech.SUCCESS) speak(“Hello world”, TextToSpeech.QUEUE_ADD, null); } }

Due to limited storage on some devices, not all languages that are supported may actually be installed on a particular device so it is important to check if a particular language is available before creating the TextToSpeech object. This way, it is possible to download and install the required language-specific resource files if necessary. This is done by sending an Intent with the action ACTION_CHECK_TTS_DATA, which is part of the TextToSpeech.Engine class:

Intent intent = new In-tent(TextToSpeech.Engine.ACTION_CHECK_TTS_DATA); startActivityForResult(intent,TTS_DATA_CHECK);

If the language data has been correctly installed, the onActivityResult handler will receive a CHECK_VOICE_DATA_PASS. If the data is not available, the action ACTION_INSTALL_TTS_DATA will be executed:

Intent installData = new Intent (Engine. ACTION_INSTALL_TTS_DATA); startActivity(installData);

The next figure shows an example of an app using the TTS API.

A potential use-case for this type of app is when the user accesses some text on the Web - for example, a news item, email, or a sports report. This is useful if the user’s eyes and hands are busy, or if there are problems reading the text on the screen. In this example, the app retrieves some text and the user presses the Speak button to hear it. A Stop button is provided in case the user does not wish to hear all of the text.

Using the Google speech recognition API

The components of the Google Speech API (package android.speech) are documented at http://developer.android.com/reference/android/speech/package-summary.html. Interfaces and classes are listed and further details can be obtained by clicking on these.

There are two ways in which speech recognition can be carried out on an Android Device: based solely on a RecognizerIntent, or by creating an instance of SpeechRecognizer.

The following code shows how to start an activity to recognize speech using the first approach:

Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH); // Specify language model intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, languageModel); // Specify how many results to receive. Results listed in order of confidence intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, numberRecoResults); // Start listening startActivityForResult(intent, ASR_CODE);

The app shown below illustrates the following:

  1. The user selects the parameters for speech recognition.
  2. The user presses a button and says some words.
  3. The words recognized are displayed in a list along with their confidence scores.

Multilingual apps

It is important to be able to develop apps in languages other than English. The TTS and ASR engines can be configured to a wide range of languages. However, we cannot expect that all languages will be available or that they are supported on a particular device. Thus, before selecting a language it is necessary to check whether it is one of the supported languages, and if not, to set the currently preferred language.

In order to do this, a RecognizerIntent.ACTION_GET_LANGUAGE_DETAILS ordered broadcast is sent that returns a Bundle from which the information about the preferred language (RecognizerIntent.EXTRA_LANGUAGE_PREFERENCE) and the list of supported languages (RecognizerIntent.EXTRA_SUPPORTED_LANGUAGES) can be extracted. For speech recognition this introduces a new parameter for the intent in which the language is specified that will be used for recognition, as shown in the following code line:

intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, language);

As shown in the next figure, the user is asked for a language and then the app recognizes what the user says and plays back the best recognition result in the language selected.

Creating a Virtual Personal Assistant (VPA)

Many voice-based apps need to do more than simply speak and understand speech. For example, a VPA also needs the ability to engage in dialog with the user and to perform operations such as connecting to web services and activating device functions. One way to enable these additional capabilities is to make use of chatbot technology (see, for example, the Pandorabots web service: http://www.pandorabots.com/).

The following figure shows two VPAs, Jack and Derek, that have been developed in this way.

Jack is a general-purpose VPA, while Derek is a specialized VPA that can answer questions about Type 2 diabetes, such as symptoms, causes, treatment, risks to children, and complications.

Summary

The Google Speech APIs can be used in countless ways to develop interesting and useful voice-based apps. This article has shown some examples. By building on these you will be able to bring the power of voice to your Android apps, making them smarter and more intuitive, and boosting your users' mobile experience.

Resources for Article:


Further resources on this subject:


Books to Consider

comments powered by Disqus