Chapter 5. Speak with Your Application

In the previous chapter, we learned to discover and understand the intent of a user based on utterances. In this chapter, we will add audio capabilities to our applications: we will learn to convert text to speech and speech to text, to identify the person speaking, and to use spoken audio to verify that a person is who they claim to be. Finally, we will briefly touch on how to customize speech recognition to suit your application's usage.

By the end of this chapter, we will have covered the following topics:

  • Converting spoken audio to text and text to spoken audio

  • Recognizing intent from spoken audio, utilizing LUIS

  • Verifying that the speaker is who they claim to be

  • Identifying the speaker

  • Tailoring the recognition API to recognize custom speaking styles and environments

Converting text to audio and vice versa


In Chapter 1, Getting Started with Microsoft Cognitive Services, we utilized a part of the Bing Speech API. We gave the example application the ability to speak sentences to us. We will use the code created in that example now, but we will dive a bit deeper into the details.
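At its core, the text-to-speech part from Chapter 1 boils down to two REST calls: one to exchange the API key for an access token, and one to post an SSML document and receive audio back. The following is only a minimal sketch of that flow; the endpoints, voice name, and output format reflect the Bing Speech API as it was at the time of writing and should be treated as assumptions to verify against the current documentation.

    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    public class TextToSpeechSketch
    {
        // Endpoints as documented for the Bing Speech API at the time of writing.
        private const string TokenUrl = "https://api.cognitive.microsoft.com/sts/v1.0/issueToken";
        private const string SynthesizeUrl = "https://speech.platform.bing.com/synthesize";

        public async Task<byte[]> SynthesizeAsync(string text, string apiKey)
        {
            using (var client = new HttpClient())
            {
                // 1. Trade the subscription key for a short-lived bearer token.
                client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);
                HttpResponseMessage tokenResponse = await client.PostAsync(TokenUrl, null);
                string token = await tokenResponse.Content.ReadAsStringAsync();

                // 2. Post an SSML document describing what to say and with which voice.
                string ssml =
                    "<speak version='1.0' xml:lang='en-US'>" +
                    "<voice xml:lang='en-US' xml:gender='Female' " +
                    "name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>" +
                    text + "</voice></speak>";

                var request = new HttpRequestMessage(HttpMethod.Post, SynthesizeUrl)
                {
                    Content = new StringContent(ssml, Encoding.UTF8, "application/ssml+xml")
                };
                request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
                request.Headers.Add("X-Microsoft-OutputFormat", "riff-16khz-16bit-mono-pcm");

                // 3. The response body is raw audio (a WAV file here), ready to play back.
                HttpResponseMessage response = await client.SendAsync(request);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync();
            }
        }
    }

The returned byte array can then be handed to any audio playback component, for example NAudio.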

We will also go through the other feature of the Bing Speech API: converting spoken audio to text. The idea is that we can speak to the Smart-House application, which will recognize what we are saying. From the textual output, it will use LUIS to get the intent of our sentence. If LUIS needs more information, the application will politely ask for it with audio.

To get started, we want to modify the build definition of the Smart-House application. We need to specify whether we are running it on a 32-bit or a 64-bit OS. To utilize speech-to-text conversion, we want to install the Bing Speech NuGet client package. Search for Microsoft.ProjectOxford.SpeechRecognition, and install either...
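Whichever platform-specific package you install, the speech-to-text flow centers on a microphone recognition client created by the SDK's factory. The sketch below is a minimal example of wiring it up; the key, locale, and handler bodies are placeholders, and the namespace can differ between package versions, so treat it as an outline rather than the exact Smart-House code.

    using System;
    using Microsoft.CognitiveServices.SpeechRecognition; // namespace may vary by package version

    public class SpeechToTextSketch : IDisposable
    {
        private readonly MicrophoneRecognitionClient _micClient;

        public SpeechToTextSketch(string bingApiKey)
        {
            // ShortPhrase mode suits short commands; LongDictation allows longer utterances.
            _micClient = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
                SpeechRecognitionMode.ShortPhrase, "en-US", bingApiKey);

            // Partial results arrive while the user is still speaking.
            _micClient.OnPartialResponseReceived += (s, e) =>
                Console.WriteLine($"Partial: {e.PartialResult}");

            // The final response carries ranked hypotheses of what was said.
            _micClient.OnResponseReceived += (s, e) =>
            {
                if (e.PhraseResponse.Results.Length > 0)
                    Console.WriteLine($"Final: {e.PhraseResponse.Results[0].DisplayText}");
            };

            _micClient.OnConversationError += (s, e) =>
                Console.WriteLine($"Error: {e.SpeechErrorText}");
        }

        public void StartListening() => _micClient.StartMicAndRecognition();
        public void StopListening() => _micClient.EndMicAndRecognition();

        public void Dispose() => _micClient.Dispose();
    }

The factory also offers a CreateMicrophoneClientWithIntent variant, which forwards the recognized text to a LUIS application and raises an OnIntentReceived event; that is one way of wiring up the LUIS step described above.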

Knowing who is speaking


Using the Speaker Recognition API, we can identify who is speaking. By defining one or more speaker profiles, each with corresponding audio samples, we can later identify whether any of them is the person speaking at any given time.

To be able to utilize this feature, we need to go through a few steps:

  1. We add one or more speaker profiles to the service.

  2. We enroll several spoken audio samples for each speaker profile.

  3. We call the service to identify a speaker based on audio input.

    Note

    If you have not already done so, sign up for an API key for the Speaker Recognition API at https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api.

Start by adding a new NuGet package to your Smart-House application. Search for and add Microsoft.ProjectOxford.SpeakerRecognition.

Add a new class called SpeakerIdentification to the Model folder of your project. This class will hold all the functionality related to speaker identification.

Beneath the class, we add another class, containing EventArgs for status updates:

    public...
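The three steps listed earlier map onto a handful of calls on the SDK's identification client. Below is a minimal sketch of that flow; the client, contract, and method names follow the Microsoft.ProjectOxford.SpeakerRecognition package, but the exact signatures, the polling interval, and the audio handling are assumptions to verify against the installed version.

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.ProjectOxford.SpeakerRecognition;
    using Microsoft.ProjectOxford.SpeakerRecognition.Contract.Identification;

    public class SpeakerIdentificationSketch
    {
        private readonly SpeakerIdentificationServiceClient _client;

        public SpeakerIdentificationSketch(string apiKey)
        {
            _client = new SpeakerIdentificationServiceClient(apiKey);
        }

        // Step 1: create a speaker profile for a given locale.
        public async Task<Guid> CreateProfileAsync()
        {
            CreateProfileResponse response = await _client.CreateProfileAsync("en-US");
            return response.ProfileId;
        }

        // Step 2: enroll a spoken sample (WAV, PCM, 16 kHz, 16-bit, mono) for the profile.
        public async Task EnrollAsync(Guid profileId, Stream audio)
        {
            OperationLocation location = await _client.EnrollAsync(audio, profileId);

            // Enrollment runs asynchronously on the service side, so poll until it finishes.
            EnrollmentOperation operation;
            do
            {
                await Task.Delay(TimeSpan.FromSeconds(2));
                operation = await _client.CheckEnrollmentStatusAsync(location);
            } while (operation.Status == Status.Running || operation.Status == Status.NotStarted);
        }

        // Step 3: ask the service which of the enrolled profiles (if any) is speaking.
        public async Task<Guid> IdentifyAsync(Stream audio, Guid[] candidateProfiles)
        {
            OperationLocation location = await _client.IdentifyAsync(audio, candidateProfiles);

            IdentificationOperation operation;
            do
            {
                await Task.Delay(TimeSpan.FromSeconds(2));
                operation = await _client.CheckIdentificationStatusAsync(location);
            } while (operation.Status == Status.Running || operation.Status == Status.NotStarted);

            // An empty Guid in the result means no enrolled speaker matched the audio.
            return operation.ProcessingResult.IdentifiedProfileId;
        }
    }

The status polled in each loop is the kind of information the status-update EventArgs mentioned above would naturally surface to the UI.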

Verifying a person through speech


The process of verifying whether a person is who they claim to be is quite similar to the identification process. To show how it is done, we will create a new example project, as we do not need this functionality in the Smart-House application.

Add the Microsoft.ProjectOxford.SpeakerRecognition and NAudio NuGet packages to the project. We will need the Recording class, which we used earlier, so copy this from the Smart-House application's Model folder.

Open the MainView.xaml file. We need a few elements in the UI for the example to work. Add a Button element for adding speaker profiles, and add two ListBox elements: one will hold the available verification phrases, while the other will list our speaker profiles.

Add Button elements for deleting a profile, starting and stopping enrollment recording, resetting enrollment, and starting/stopping verification recording.

In the ViewModel, you will need to add two ObservableCollection properties: one of type string, the other of type...
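However the ViewModel ends up being wired, the verification flow itself is compact: fetch the allowed phrases, create a profile, enroll recordings of the chosen phrase, and then verify new recordings against the profile. The sketch below assumes the SpeakerVerificationServiceClient from the same NuGet package; method and contract names follow the Project Oxford SDK and should be checked against the installed version, and the recording callback is purely illustrative.

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.ProjectOxford.SpeakerRecognition;
    using Microsoft.ProjectOxford.SpeakerRecognition.Contract.Verification;

    public class SpeakerVerificationSketch
    {
        private readonly SpeakerVerificationServiceClient _client;

        public SpeakerVerificationSketch(string apiKey)
        {
            _client = new SpeakerVerificationServiceClient(apiKey);
        }

        // The service only accepts a fixed set of verification phrases per locale.
        public async Task PrintAvailablePhrasesAsync()
        {
            var phrases = await _client.GetPhrasesAsync("en-US");
            foreach (var phrase in phrases)
                Console.WriteLine(phrase.Phrase);
        }

        // Create a profile and enroll the same phrase until no enrollments remain.
        public async Task<Guid> CreateAndEnrollAsync(Func<Task<Stream>> recordPhraseAsync)
        {
            CreateProfileResponse profile = await _client.CreateProfileAsync("en-US");

            Enrollment enrollment;
            do
            {
                using (Stream audio = await recordPhraseAsync())
                {
                    enrollment = await _client.EnrollAsync(audio, profile.ProfileId);
                }
            } while (enrollment.RemainingEnrollments > 0);

            return profile.ProfileId;
        }

        // Verify a new recording of the phrase against the enrolled profile.
        public async Task<bool> VerifyAsync(Stream audio, Guid profileId)
        {
            Verification verification = await _client.VerifyAsync(audio, profileId);

            // The result is Accept or Reject, with a confidence level attached.
            return verification.Result == Result.Accept;
        }
    }

At the time of writing, the service expected three enrollments of the chosen phrase before a profile could be verified, which is why the enrollment loop keeps recording until no enrollments remain.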

Customizing speech recognition


At the time of writing, the Custom Recognition Intelligent Service (CRIS) is still at the private beta stage. As such, we will not spend a lot of time on this, other than going through some key concepts.

When using speech-recognition systems, several components work together. Two of the more important ones are the acoustic and language models. The first labels short fragments of audio as sound units. The second helps the system decide between words, based on the likelihood of a given word appearing in a particular sequence.

Although Microsoft has done a great job of creating comprehensive acoustic and language models, there may still be times when you need to customize them.

Imagine you have an application that is meant to be used in a factory environment. Using speech recognition there will require acoustic training for that environment, so that recognition can separate speech from the usual factory noise.

Another example is if your application is used by a specific...

Summary


Throughout this chapter, we have focused on speech. We started by looking at how we can convert spoken audio to text and text to spoken audio. Using this, we modified our LUIS implementation so that we could speak commands and have conversations with the Smart-House application. From there, we moved on to see how we can identify a person speaking using the Speaker Recognition API. Using the same API, we also learned how to verify that a person is who they claim to be. Finally, we looked briefly at the core functionality of the Custom Recognition Intelligent Service.

In the following chapter, we will move back to textual APIs, where we will learn how to explore and analyze text in different ways.
