
You're reading from Mastering Azure Machine Learning

Product type: Book
Published in: Apr 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789807554
Edition: 1st Edition
Authors (2):
Christoph Körner

Christoph Körner previously worked as a cloud solution architect for Microsoft, specializing in Azure-based big data and machine learning solutions, where he was responsible for designing end-to-end machine learning and data science platforms. He currently works for a large cloud provider on highly scalable distributed in-memory database services. Christoph has authored four books: Deep Learning in the Browser for Bleeding Edge Press, as well as Mastering Azure Machine Learning (first edition), Learning Responsive Data Visualization, and Data Visualization with D3 and AngularJS for Packt Publishing.

Kaijisse Waaijer

Kaijisse Waaijer is an experienced technologist specializing in data platforms, machine learning, and the Internet of Things. Kaijisse currently works for Microsoft EMEA as a data platform consultant specializing in data science, machine learning, and big data. She works constantly with customers across multiple industries as their trusted tech advisor, helping them optimize their organizational data to create better outcomes and business insights that drive value using Microsoft technologies. Her true passion lies in trading systems automation and in applying deep learning and neural networks to achieve advanced levels of prediction and automation.


6. Advanced feature extraction with NLP

In the previous chapters, we learned about many standard transformation and preprocessing approaches within the Azure Machine Learning (ML) service and Azure Machine Learning pipelines. In this chapter, we want to go one step further to extract features from textual and categorical data—a problem that users often face when training ML models.

This chapter will describe the foundations of feature extraction with Natural Language Processing (NLP). This will help you to practically implement semantic embeddings using NLP for your ML pipelines.

First, we will take a look at the differences between textual data and categorical data, and between nominal and ordinal categories. This classification will help you to decide the best feature extraction and transformation technique for each feature type. Later, we will look at the most common transformations for categorical values, namely label encoding and one-hot encoding. Both techniques will be compared and tested to...

Understanding categorical data

Categorical data comes in many forms, shapes, and meanings. It is extremely important to understand what type of data you are dealing with—is it a string, text, or numeric value disguised as a categorical value? This information is essential for data preprocessing, feature extraction, and model selection.

First, we will take a look at the different types of categorical data, namely ordinal, nominal, and text. Depending on the type, you can use different methods to extract information or other valuable data from it. Keep in mind that categorical data is ubiquitous, whether in an ID column, a nominal category, an ordinal category, or a free-text field. It's worth mentioning that the more information you have about the data, the easier the preprocessing is.

Next, we will actually preprocess the ordinal and nominal categorical data by transforming it into numerical values. This is a required step when you want to use an ML...
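As an illustration of the two transformations named above, here is a minimal pure-Python sketch of label encoding and one-hot encoding (the data is made up; the chapter itself applies these ideas with Azure ML tooling):

```python
# Toy nominal feature: T-shirt sizes.
sizes = ["S", "M", "L", "M", "S"]

# Label encoding: map each category to an integer. This is fine for
# ordinal data, but imposes an artificial order on nominal data.
categories = sorted(set(sizes))            # ['L', 'M', 'S']
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[s] for s in sizes]     # [2, 1, 0, 1, 2]

# One-hot encoding: one binary column per category, no implied order.
one_hot = [[1 if s == c else 0 for c in categories] for s in sizes]
```

Note how label encoding implicitly claims L < M < S here, which is why one-hot encoding is usually preferred for nominal features.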

Building a simple bag-of-words model

In this section, we will look at a surprisingly simple concept to tackle the shortcomings of label encoding for textual data: the bag-of-words concept, which will build a foundation for a simple NLP pipeline. Don't worry if these techniques look too simple as you read through them; we will gradually build on top of them with tweaks, optimizations, and improvements to build a modern NLP pipeline.

A naive bag-of-words model using counting

The main concept that we will build in this section is the bag-of-words model. The idea is very simple: model any document as the collection of words that appear in it, together with the frequency of each word. Hence, we throw away sentence structure, word order, punctuation, and so on, and reduce each document to a raw word count. We can then vectorize this word count into a numeric vector representation, which can then be used for ML, analysis, document comparisons, and much...
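The counting step described above can be sketched in a few lines of plain Python (a library such as scikit-learn would normally do this; the two example documents are made up):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Build a shared vocabulary over the whole corpus.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def vectorize(doc):
    """Map a document to a count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

vectors = [vectorize(d) for d in docs]
# Each document is now a fixed-length numeric vector, with word order lost.
```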

Leveraging term importance and semantics

Everything we have done up to now has been relatively simple and based on word stems, or so-called tokens. The bag-of-words model was nothing but a dictionary of tokens, counting the occurrences of each token per document. In this section, we will take a look at a common technique to further improve matching between documents using n-gram and skip-gram combinations of terms.
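As a sketch of the two term combinations just mentioned, the following illustrates contiguous n-grams and (for the bigram case) skip-grams, where up to k tokens may be skipped between the two words; the function names and example sentence are illustrative only:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    """Ordered word pairs allowing up to k skipped tokens in between."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + k, len(tokens)))]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
skips = skip_bigrams(tokens, 1)   # adds ('the', 'brown'), ('quick', 'fox')
```

With k = 0, skip_bigrams reduces to ordinary bigrams; every increase in k adds more pairs, which is exactly the dictionary explosion discussed next.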

Combining terms in multiple ways will explode your dictionary, which becomes a problem with a large corpus of, for example, 10 million words. Hence, we will look at a common preprocessing technique to reduce the dimensionality of a large dictionary: Singular Value Decomposition (SVD).
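The reduction works as follows, sketched here with NumPy on a toy document-term count matrix (real pipelines use far larger, sparse matrices and a truncated SVD solver):

```python
import numpy as np

# Toy document-term count matrix: rows = documents, columns = vocabulary terms.
X = np.array([[2, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 2, 1]], dtype=float)

# Truncated SVD (the idea behind latent semantic analysis):
# keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]   # documents projected into a k-dim latent space
# X_reduced has shape (3, 2): 4 vocabulary dimensions reduced to 2.
```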

While this approach is now a lot more complicated, it is still based on a bag-of-words model that, in practice, already works well on a large corpus. But, of course, we can do better and try to understand the importance of words. Therefore, we will tackle another popular techniqu...
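One popular term-importance weighting, which this chapter's summary confirms is covered here, is tf-idf. A minimal sketch of its classic tf × log(N/df) formulation follows (real implementations differ in smoothing and normalization; the documents are made up):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "purred"]]
N = len(docs)

# Document frequency: the number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)   # relative frequency within the document
    idf = math.log(N / df[term])      # rarity across the corpus
    return tf * idf

# "the" appears in every document, so its weight collapses to zero,
# while rarer terms like "dog" score higher than the more common "cat".
```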

Implementing end-to-end language models

In the previous sections, we trained and concatenated multiple pieces to implement a final algorithm, where most of the individual steps need to be trained or configured as well: lemmatization relies on a dictionary of conversion rules, stop words are stored in a dictionary, stemming needs rules for each language, the word embedding needs to be trained, and tf-idf and SVD are computed only on your training data but independently of each other.

This is a problem similar to the traditional computer vision approach that we will discuss in more depth in Chapter 8, Training deep neural networks on Azure, where many classic algorithms are combined into a pipeline of feature extractors and classifiers. Similar to the breakthroughs of end-to-end models trained via gradient descent and backpropagation in computer vision, deep neural networks, especially sequence-to-sequence models, replaced the classical approach a few years ago.

First, we will...

Summary

In this chapter, you learned how to preprocess textual and categorical (nominal and ordinal) data using state-of-the-art NLP techniques.

You can now build a classical NLP pipeline with stop-word removal, lemmatization and stemming, n-grams, and term-occurrence counting using a bag-of-words model. We used SVD to reduce the dimensionality of the resulting feature vector and to generate a lower-dimensional topic encoding. One important tweak to the count-based bag-of-words model is to compare the relative term frequencies of a document. You learned about the tf-idf function and can use it to compute the importance of a word in a document relative to the corpus.

In the sections that followed, we looked at Word2Vec and GloVe, pre-trained dictionaries of numeric word embeddings. You can now easily reuse a pre-trained word embedding for commercial NLP applications, with great improvements in accuracy due to the semantic embedding of words.
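A common way such a pre-trained table is reused is to average the word vectors of a document. The sketch below uses tiny made-up vectors as a stand-in for a real GloVe or Word2Vec table (which maps hundreds of thousands of tokens to 50-300 dimensions):

```python
import numpy as np

# Toy stand-in for a pre-trained embedding table such as GloVe.
embedding = {
    "good":  np.array([0.8, 0.1]),
    "movie": np.array([0.2, 0.9]),
    "great": np.array([0.9, 0.2]),
}

def embed(sentence):
    """Mean of the known word vectors: a simple document embedding."""
    vecs = [embedding[t] for t in sentence.split() if t in embedding]
    return np.mean(vecs, axis=0)

v = embed("good movie")   # a single 2-dim vector representing the sentence
```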

Finally, we finished the chapter by looking...
