
You're reading from Artificial Intelligence with Python - Second Edition

Product type: Book
Published in: Jan 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839219535
Edition: 2nd Edition

Author: Prateek Joshi

Prateek Joshi is the founder of Plutoshift and a published author of 9 books on Artificial Intelligence. He has been featured on Forbes 30 Under 30, NBC, Bloomberg, CNBC, TechCrunch, and The Business Journals. He has been an invited speaker at conferences such as TEDx, Global Big Data Conference, Machine Learning Developers Conference, and Silicon Valley Deep Learning. Apart from Artificial Intelligence, some of the topics that excite him are number theory, cryptography, and quantum computing. His greater goal is to make Artificial Intelligence accessible to everyone so that it can impact billions of people around the world.

Building a Speech Recognizer

In this chapter, we are going to learn about speech recognition. We will discuss how to work with speech signals and how to visualize various audio signals. We will then use various signal-processing techniques to build a speech recognition system.

By the end of this chapter, you will know more about:

  • Working with speech signals
  • Visualizing audio signals
  • Transforming audio signals to the frequency domain
  • Generating audio signals
  • Synthesizing tones
  • Extracting speech features
  • Recognizing spoken words

We'll begin by discussing how we can work with speech signals.

Working with speech signals

Speech recognition is the process of understanding the words that are spoken by humans. The speech signals are captured using a microphone and the system tries to understand the words that are being captured. Speech recognition is used extensively in human-computer interaction, smartphones, speech transcription, biometric systems, security, and more.

It is important to understand the nature of speech signals before analyzing them. Speech signals are complex mixtures of many underlying signals; emotion, accent, language, and background noise all contribute to this complexity.

Because of this complexity, it is difficult to define a robust set of rules for analyzing speech signals. Humans, in contrast, understand speech with relative ease despite all of these variations. For machines to do the same, we need to help them understand speech the same way...

Visualizing audio signals

Let's see how to visualize an audio signal. We will learn how to read an audio signal from a file and work with it, which will help us understand how an audio signal is structured. When audio is recorded with a microphone, the actual sound wave is sampled and a digitized version is stored. Real audio signals are continuous-valued waves, which means we cannot store them as they are; we need to sample the signal at a certain frequency and convert it into discrete numerical form.

Most commonly, speech signals are sampled at 44,100 Hz. This means that each second of the signal is broken down into 44,100 parts, and the value at each of these timestamps is stored in the output file; that is, we save the value of the audio signal every 1/44,100 seconds. In this case, we say that the sampling frequency of the audio signal is 44,100 Hz. With a sufficiently high sampling frequency, the audio signal will appear continuous when humans...
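To make the arithmetic concrete, here is a small NumPy sketch of sampling (the 440 Hz tone and 2-second duration are arbitrary choices for illustration, not values from the book):

```python
import numpy as np

sampling_freq = 44100  # samples per second (Hz)
duration = 2           # seconds

# Discrete time axis: one timestamp every 1/44,100 seconds
t = np.linspace(0, duration, duration * sampling_freq, endpoint=False)

# A 440 Hz sine wave evaluated at those timestamps
signal = np.sin(2 * np.pi * 440 * t)

print(len(signal))   # 88200 samples for 2 seconds
print(t[1] - t[0])   # spacing between samples: 1/44100 of a second
```

Notice that the stored signal is just an array of numbers; the "continuity" we hear is an artifact of how densely it was sampled.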

Transforming audio signals to the frequency domain

In order to analyze audio signals, we need to understand the underlying frequency components. This gives us insights into how to extract meaningful information from this signal. Audio signals are composed of a mixture of sine waves of varying frequencies, phases, and amplitudes.

If we dissect the frequency components, we can identify many characteristics of the signal. Any given audio signal is characterized by its distribution over the frequency spectrum. To convert a time domain signal into the frequency domain, we use a mathematical tool such as the Fourier Transform. If you need a quick refresher on the Fourier Transform, check out this link: http://www.thefouriertransform.com. Let's see how to transform an audio signal from the time domain to the frequency domain.

Create a new Python file and import the following packages:

import numpy as np
import matplotlib.pyplot as plt 
from scipy.io import wavfile

...
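The file-reading code is truncated above. As a self-contained sketch of the same idea, the following generates a two-tone signal in memory and uses NumPy's FFT to recover its frequency content (the 440 Hz and 1,000 Hz components are illustrative choices, not from the book):

```python
import numpy as np

sampling_freq = 44100
duration = 1.0
t = np.linspace(0, duration, int(duration * sampling_freq), endpoint=False)

# Mixture of two sinusoids: 440 Hz and 1000 Hz
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Real-input FFT gives the frequency-domain representation
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sampling_freq)

# The two largest peaks sit at the component frequencies
peaks = sorted(freqs[np.argsort(spectrum)[-2:]])
print(peaks)  # peaks near 440 Hz and 1000 Hz
```

Plotting `spectrum` against `freqs` with matplotlib would show two sharp spikes, one per sinusoid, which is exactly the frequency-domain view this section describes.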

Generating audio signals

Now that we know how audio signals work, let's see how to generate one. We can use the NumPy package for this. Since audio signals are mixtures of sinusoids, we can use sinusoids to generate an audio signal with some predefined parameters.

Create a new Python file and import the following packages:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write

Define the output audio file's name:

# Output file where the audio will be saved
output_file = 'generated_audio.wav'

Specify the audio parameters, such as duration, sampling frequency, tone frequency, minimum value, and maximum value:

# Specify audio parameters
duration = 4  # in seconds
sampling_freq = 44100  # in Hz
tone_freq = 784  # frequency of the generated tone, in Hz
min_val = -4 * np.pi  # bounds of the range over which the signal is generated
max_val = 4 * np.pi

Generate the audio signal using the defined parameters:

# Generate the audio signal
t = np.linspace...
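The generation code is truncated above. A plausible self-contained completion using the parameters just defined is shown below; the noise term and the 16-bit scaling are my assumptions for a working sketch, not necessarily the book's exact code:

```python
import numpy as np
from scipy.io.wavfile import write

# Output file where the audio will be saved
output_file = 'generated_audio.wav'

# Audio parameters (as defined in the text above)
duration = 4  # in seconds
sampling_freq = 44100  # in Hz
tone_freq = 784
min_val = -4 * np.pi
max_val = 4 * np.pi

# Time axis: one point per output sample
t = np.linspace(min_val, max_val, duration * sampling_freq)

# Sinusoid at the tone frequency, plus a little random noise
signal = np.sin(2 * np.pi * tone_freq * t)
signal += 0.5 * np.random.rand(duration * sampling_freq)

# Normalize, scale to the 16-bit integer range, and save as a WAV file
scaling_factor = np.power(2, 15) - 1
signal_normalized = signal / np.max(np.abs(signal))
signal_scaled = np.int16(signal_normalized * scaling_factor)
write(output_file, sampling_freq, signal_scaled)
```

The normalization step matters: WAV files commonly store 16-bit integers, so the floating-point waveform must be mapped into the range [-32767, 32767] before writing.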

Synthesizing tones to generate music

The previous section described how to generate a simple tone, but it is not very meaningful: it is just a single frequency throughout the signal. Let's use the same principle to synthesize music by stitching different tones together. We will use standard tones such as A, C, G, and F. To see the frequency mapping for these standard tones, check out this link: http://www.phy.mtu.edu/~suits/notefreqs.html.

Let's use this information to generate a musical signal.

Create a new Python file and import the following packages:

import json
import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write

Define a function to generate a tone based on the input parameters:

# Synthesize the tone based on the input parameters
def tone_synthesizer(freq, duration, amplitude=1.0, sampling_freq=44100):
    # Construct the time axis
    time_axis = np.linspace(0, duration...
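The function above is truncated. A self-contained sketch of the same idea follows; the note-to-frequency mapping is taken from the standard table linked above, and the particular note sequence and durations are illustrative choices of mine, not the book's:

```python
import numpy as np

# Frequencies (Hz) for a few standard tones, per the note-frequency
# table linked above; treat this small mapping as illustrative
tone_freq_map = {'A': 440.0, 'C': 523.25, 'E': 659.25, 'G': 783.99}

# Synthesize a tone based on the input parameters
def tone_synthesizer(freq, duration, amplitude=1.0, sampling_freq=44100):
    # Construct the time axis for the requested duration
    time_axis = np.linspace(0, duration, int(duration * sampling_freq),
                            endpoint=False)
    # Sinusoid scaled into the 16-bit sample range
    signal = amplitude * np.sin(2 * np.pi * freq * time_axis)
    return np.int16(signal * (2 ** 15 - 1))

# Stitch a short sequence of (note, duration) pairs into one signal
sequence = [('G', 0.4), ('E', 0.4), ('C', 0.8)]
music = np.concatenate([tone_synthesizer(tone_freq_map[note], dur)
                        for note, dur in sequence])
print(len(music), music.dtype)
```

The resulting `music` array can be written to disk with `scipy.io.wavfile.write`, exactly as in the previous section.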

Extracting speech features

We learned how to convert a time domain signal into the frequency domain. Frequency domain features are used extensively in speech recognition systems. The concepts we discussed earlier are an introduction to the idea; real-world frequency domain features are a bit more complex. Once we convert a signal into the frequency domain, we need to make it usable in the form of a feature vector. This is where the concept of Mel Frequency Cepstral Coefficients (MFCCs) becomes relevant. MFCCs are used to extract frequency domain features from a given audio signal.

In order to extract the frequency features from an audio signal, MFCC first extracts the power spectrum. It then uses filter banks and a Discrete Cosine Transform (DCT) to extract the features. If you are interested in exploring MFCCs further, check out this link:

http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral...
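To make the pipeline concrete, here is a from-scratch sketch of the standard MFCC steps (framing, power spectrum, mel filter bank, log, DCT) using only NumPy and SciPy. The frame sizes, filter counts, and the 300 Hz test tone are common illustrative defaults, not the book's code; in practice you would use a dedicated library rather than this sketch:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700.0)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595.0) - 1)

def mfcc_sketch(signal, sampling_freq=16000, frame_len=400, frame_step=160,
                num_filters=26, num_ceps=13):
    # 1. Split the signal into overlapping, windowed frames
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of each frame
    nfft = 512
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # 3. Triangular mel-spaced filter bank
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sampling_freq / 2),
                             num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sampling_freq).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / (right - center)

    # 4. Log filter-bank energies, then DCT to decorrelate -> MFCCs
    energies = np.log(power @ fbank.T + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :num_ceps]

# One second of a 300 Hz tone as a test input
sig = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)
features = mfcc_sketch(sig)
print(features.shape)  # one 13-coefficient feature vector per frame
```

The output is a matrix with one row per frame, which is exactly the feature-vector form that downstream models such as HMMs consume.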

Recognizing spoken words

Now that we have learned the techniques for analyzing speech signals, let's go ahead and see how to recognize spoken words. Speech recognition systems take audio signals as input and recognize the words being spoken. We will use Hidden Markov Models (HMMs) for this task.

As we discussed in the previous chapter, HMMs are great at analyzing sequential data, and an audio signal is a time series, which is a form of sequential data. The assumption is that the outputs are generated by the system passing through a series of hidden states. Our goal is to find out what these hidden states are so that we can identify the words in our signal. If you are interested in digging deeper, check out this link: https://web.stanford.edu/~jurafsky/slp3/A.pdf.
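To make the "hidden states" idea concrete, here is a minimal Viterbi decoder for a toy HMM. All the probabilities are made up for illustration; libraries such as hmmlearn perform this kind of decoding internally on real acoustic features:

```python
import numpy as np

# Toy HMM: 2 hidden states, 3 possible observation symbols
start = np.array([0.6, 0.4])        # initial state distribution
trans = np.array([[0.7, 0.3],       # state transition probabilities
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],   # emission probabilities per state
                 [0.1, 0.3, 0.6]])

def viterbi(obs):
    # Dynamic programming over log-probabilities to avoid underflow
    v = np.log(start) + np.log(emit[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = v[:, None] + np.log(trans)  # scores[i, j]: best path i -> j
        back.append(np.argmax(scores, axis=0))
        v = np.max(scores, axis=0) + np.log(emit[:, o])
    # Trace back the most likely hidden-state sequence
    path = [int(np.argmax(v))]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2, 2]))  # [0, 0, 1, 1]
```

Given a sequence of observations, the decoder recovers the most likely sequence of hidden states; in a speech recognizer, one such model is trained per word, and the model assigning the highest likelihood to an utterance wins.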

We will be using a package called hmmlearn to build our speech recognition system. You can learn more about it here: http://hmmlearn.readthedocs.org/en/latest.

You can install the package by...

Summary

In this chapter, we learned about speech recognition. We discussed how to work with speech signals and the associated concepts. We learned how to visualize audio signals. We talked about how to transform time domain audio signals into the frequency domain using Fourier Transforms. We discussed how to generate audio signals using predefined parameters.

We then used this concept to synthesize music by stitching tones together. We talked about MFCCs and how they are used in the real world. We understood how to extract frequency features from speech. We learned how to use all these techniques to build a speech recognition system. In the next chapter, we will discuss natural language processing and how to use it to analyze text data by modeling and classifying it.
