You're reading from Learning Microsoft Cognitive Services, Third Edition

Product type: Book
Published in: Sep 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789800616
Edition: 3rd Edition
Author: Leif Larsen

Leif Larsen is a software engineer based in Norway. After earning a degree in computer engineering, he went on to work with the design and configuration of industrial control systems, mostly in the oil and gas industry. Over the last few years, he has worked as a developer, building and maintaining geographical information systems with .NET technology. Today, he is working with a start-up, developing a brand new SaaS product. In his spare time, he develops mobile apps and explores new technologies to keep up with the fast-paced tech world. You can find out more about him by checking out his blog, "Leif Larsen", and following him on Twitter (@leif_larsen) and LinkedIn (lhlarsen).

Chapter 5. Speaking with Your Application

In the previous chapter, we learned how to discover and understand the intent of a user, based on utterances. In this chapter, we will learn how to add audio capabilities to our applications by converting text to speech and speech to text, and how to identify the person speaking. We will also learn how to use spoken audio to verify that a person is who they claim to be. Finally, we will briefly touch on how to customize speech recognition to suit your application's usage.

By the end of this chapter, we will have covered the following topics:

  • Converting spoken audio to text and text to spoken audio

  • Recognizing intent from spoken audio by utilizing LUIS

  • Verifying that the speaker is who they claim to be

  • Identifying the speaker

  • Tailoring speech recognition to recognize custom speaking styles and environments

Converting text to audio and vice versa


In Chapter 1, Getting Started with Microsoft Cognitive Services, we utilized a part of the Bing Speech API. We gave the example application the ability to say sentences to us. We will now reuse the code that we created in that example and dive a bit deeper into the details.

We will also go through the other feature of the Bing Speech API: converting spoken audio to text. The idea is that we can speak to the smart-house application, which will recognize what we are saying. Using the textual output, the application will use LUIS to determine the intent of our sentence. If LUIS needs more information, the application will politely ask us for more via audio.
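
The flow described above can be sketched as a small loop. The following is a minimal sketch, not the smart-house implementation: the VoiceDialog and IntentResult types and the three delegates are hypothetical stand-ins for speech-to-text, LUIS, and text-to-speech, so no real service is called.

```csharp
using System;

// Hypothetical result type for illustration; it is not part of any LUIS SDK.
public sealed class IntentResult
{
    public string Intent { get; set; }

    // Question to ask the user when more information is needed; null otherwise.
    public string FollowUpQuestion { get; set; }

    public bool NeedsMoreInfo => FollowUpQuestion != null;
}

// Hypothetical dialog loop: listen, resolve the intent, ask again if needed.
public sealed class VoiceDialog
{
    private readonly Func<string> _listen;                 // stand-in for speech-to-text
    private readonly Func<string, IntentResult> _resolve;  // stand-in for LUIS
    private readonly Action<string> _speak;                // stand-in for text-to-speech

    public VoiceDialog(
        Func<string> listen,
        Func<string, IntentResult> resolve,
        Action<string> speak)
    {
        _listen = listen ?? throw new ArgumentNullException(nameof(listen));
        _resolve = resolve ?? throw new ArgumentNullException(nameof(resolve));
        _speak = speak ?? throw new ArgumentNullException(nameof(speak));
    }

    // Returns the resolved intent, or null if it is still incomplete after maxTurns.
    public string Run(int maxTurns = 3)
    {
        for (int turn = 0; turn < maxTurns; turn++)
        {
            string utterance = _listen();
            IntentResult result = _resolve(utterance);

            if (!result.NeedsMoreInfo)
            {
                return result.Intent;
            }

            // Ask for the missing detail via audio, then listen again.
            _speak(result.FollowUpQuestion);
        }

        return null;
    }
}
```

Passing the three steps in as delegates keeps the loop independent of any particular client library. In this sketch, any conversation state (for example, which details have already been supplied) is assumed to live inside the resolve delegate.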

To get started, we want to modify the build definition of the smart-house application. We need to specify whether we are running it on a 32-bit or 64-bit OS. To utilize speech-to-text conversion, we want to install the Bing Speech NuGet client package. Search for Microsoft.ProjectOxford.SpeechRecognition...

Knowing who is speaking


Using the Speaker Recognition API, we can identify who is speaking. By defining one or more speaker profiles with corresponding samples, we can identify whether any of them is speaking at a given time.

To be able to utilize this feature, we need to go through a few steps:

  1. We need to add one or more speaker profiles to the service.

  2. We enroll several spoken samples for each speaker profile.

  3. We call the service to identify a speaker based on audio input.
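
The three steps above can be sketched without calling the service. This is a minimal in-memory sketch under loud assumptions: ToySpeakerRegistry is a hypothetical class, voice samples are represented as made-up feature vectors rather than audio, and matching uses a simple nearest-centroid rule, which is not a description of how the Speaker Recognition API works internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the create/enroll/identify workflow.
public sealed class ToySpeakerRegistry
{
    private readonly Dictionary<Guid, List<double[]>> _samples =
        new Dictionary<Guid, List<double[]>>();

    // Step 1: add a speaker profile and return its ID.
    public Guid CreateProfile()
    {
        Guid id = Guid.NewGuid();
        _samples[id] = new List<double[]>();
        return id;
    }

    // Step 2: enroll one sample (here, an invented feature vector) for a profile.
    public void Enroll(Guid profileId, double[] features)
    {
        if (!_samples.ContainsKey(profileId))
        {
            throw new ArgumentException("Unknown profile.", nameof(profileId));
        }
        _samples[profileId].Add(features);
    }

    // Step 3: return the closest enrolled profile, or null if none is
    // within maxDistance of the input.
    public Guid? Identify(double[] features, double maxDistance)
    {
        Guid? best = null;
        double bestDistance = double.MaxValue;

        foreach (KeyValuePair<Guid, List<double[]>> entry in _samples)
        {
            if (entry.Value.Count == 0) continue; // nothing enrolled yet

            double distance = Distance(Centroid(entry.Value), features);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = entry.Key;
            }
        }

        return bestDistance <= maxDistance ? best : null;
    }

    // Average of all enrolled vectors for one profile.
    private static double[] Centroid(List<double[]> samples)
    {
        var centroid = new double[samples[0].Length];
        foreach (double[] sample in samples)
        {
            for (int i = 0; i < centroid.Length; i++)
            {
                centroid[i] += sample[i] / samples.Count;
            }
        }
        return centroid;
    }

    // Euclidean distance between two vectors of equal length.
    private static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.Sqrt(sum);
    }
}
```

Returning null when no profile is close enough mirrors the idea that an identification call may find that none of the enrolled speakers is talking.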

If you have not already done so, sign up for an API key for the Speaker Recognition API at https://portal.azure.com.

Start by adding a new NuGet package to your smart-house application. Search for and add Microsoft.ProjectOxford.SpeakerRecognition.

Add a new class called SpeakerIdentification to the Model folder of your project. This class will hold all of the functionality related to speaker identification.

Beneath the class, we will add another class, containing EventArgs for status updates:

    public class SpeakerIdentificationStatusUpdateEventArgs...

Verifying a person through speech


The process of verifying whether a person is who they claim to be is quite similar to the identification process. To show how it is done, we will create a new example project, as we do not need this functionality in our smart-house application.

Add the Microsoft.ProjectOxford.SpeakerRecognition and NAudio NuGet packages to the project. We will need the Recording class that we used earlier, so copy this from the smart-house application's Model folder.

Open the MainView.xaml file. We need a few elements in the UI for the example to work. Add a Button element to add speaker profiles. Add two ListBox elements. One will hold the available verification phrases, while the other will list our speaker profiles.

Add Button elements for deleting a profile, starting and stopping enrollment recording, resetting enrollment, and starting/stopping verification recording.

In the ViewModel, you will need to add two ObservableCollection properties: one of type string, the other of type...

Customizing speech recognition


A speech recognition system consists of several components working together. Two of the more important ones are the acoustic model and the language model. The acoustic model labels short fragments of audio as sound units. The language model helps the system decide on the words, based on the likelihood of a given word appearing in certain sequences.
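
To make the role of the language model concrete, here is a toy sketch. It assumes a hypothetical ToyBigramModel class that counts adjacent word pairs in training text and uses those counts to choose between candidate transcriptions that sound alike. Real language models are far more sophisticated; the scoring rule here is an assumption chosen for brevity.

```csharp
using System;
using System.Collections.Generic;

// Toy bigram language model for illustration only.
public sealed class ToyBigramModel
{
    private readonly Dictionary<string, int> _pairCounts =
        new Dictionary<string, int>();

    // Counts every pair of adjacent words in a training sentence.
    public void Train(string sentence)
    {
        string[] words = Tokenize(sentence);
        for (int i = 0; i + 1 < words.Length; i++)
        {
            string pair = words[i] + " " + words[i + 1];
            _pairCounts.TryGetValue(pair, out int count); // count is 0 if unseen
            _pairCounts[pair] = count + 1;
        }
    }

    // Higher scores mean the word sequence looks more like the training text.
    public double Score(string sentence)
    {
        string[] words = Tokenize(sentence);
        double score = 0;
        for (int i = 0; i + 1 < words.Length; i++)
        {
            _pairCounts.TryGetValue(words[i] + " " + words[i + 1], out int count);
            score += Math.Log(1 + count); // unseen pairs contribute 0
        }
        return score;
    }

    // Picks the candidate with the highest score (the first one wins ties).
    public string PickBest(IEnumerable<string> candidates)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (string candidate in candidates)
        {
            double score = Score(candidate);
            if (score > bestScore)
            {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    // Lowercases the sentence and splits it on spaces.
    private static string[] Tokenize(string sentence)
    {
        return sentence.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

After training on "we use dot net every day", PickBest prefers "we use dot net" over "we use dot nut", because the pair "dot net" was seen in training and "dot nut" was not. This is the intuition behind adapting a language model to the vocabulary of a specific domain.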

Although Microsoft has done a great job of creating comprehensive acoustic and language models, there may still be times when you need to customize these models.

Imagine that you have an application that is supposed to be used in a factory environment. Using speech recognition there will require training the acoustic model on that environment, so that the recognizer can separate speech from the usual factory noise.

Another example is if your application is used by a specific group of people, say, an application for search, where programming is the main topic. You would typically use words such as object-oriented, dot net, or debugging. This...

Translating speech on the fly


Using the Translator Speech API, you can add automatic end-to-end translation for speech. With this API, you submit an audio stream of speech and retrieve both a textual and an audio version of the translated text. The API uses silence detection to determine when speech has ended, and streams the results back once a pause is detected.
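
How the API detects the end of speech is not described here. As a rough illustration of the general idea, the following sketch assumes a simple energy-based approach: audio is split into fixed-size frames, and speech is considered finished once a run of quiet frames follows at least one loud frame. The ToyEndpointer class, its parameters, and the threshold rule are all hypothetical.

```csharp
using System;

// Toy energy-based end-of-speech detector; not the Translator Speech API's method.
public static class ToyEndpointer
{
    // Returns the index of the frame at which the required run of silent frames
    // completes after speech was heard, or -1 if no such pause is found.
    public static int FindEndOfSpeech(
        short[] samples,
        int frameSize,
        double rmsThreshold,
        int silentFramesRequired)
    {
        if (frameSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(frameSize));
        }

        bool heardSpeech = false;
        int silentRun = 0;
        int frameCount = samples.Length / frameSize; // a trailing partial frame is ignored

        for (int frame = 0; frame < frameCount; frame++)
        {
            double rms = Rms(samples, frame * frameSize, frameSize);

            if (rms >= rmsThreshold)
            {
                // Loud frame: speech is (still) ongoing.
                heardSpeech = true;
                silentRun = 0;
            }
            else if (heardSpeech)
            {
                // Quiet frame after speech: count towards the pause.
                silentRun++;
                if (silentRun >= silentFramesRequired)
                {
                    return frame;
                }
            }
        }

        return -1;
    }

    // Root mean square of one frame of 16-bit PCM samples.
    private static double Rms(short[] samples, int start, int length)
    {
        double sumOfSquares = 0;
        for (int i = start; i < start + length; i++)
        {
            sumOfSquares += (double)samples[i] * samples[i];
        }
        return Math.Sqrt(sumOfSquares / length);
    }
}
```

Requiring several consecutive quiet frames, rather than a single one, avoids cutting the speaker off during the short gaps that occur naturally between words.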

For a comprehensive list of supported languages, please visit the following site: https://www.microsoft.com/en-us/translator/business/languages/.

The result received from the API contains a stream of audio- and text-based results. These include the source text in its original language and the translation in the target language.

For a thorough example of how to use the Translator Speech API, please see the following sample on GitHub: https://github.com/MicrosoftTranslator/SpeechTranslator.

Summary


Throughout this chapter, we have focused on speech. We started by looking at how to convert spoken audio to text and text to spoken audio. Using this, we modified our LUIS implementation so that we can speak commands to, and have conversations with, the smart-house application. From there, we moved on to identifying the person speaking using the Speaker Recognition API. Using the same API, we also learned how to verify that a person is who they claim to be. We then looked briefly at the core functionality of the Custom Speech Service, and finished with a short introduction to the Translator Speech API.

In the following chapter, we will move back to textual APIs, where we will learn how to explore and analyze text in different ways.
