You're reading from  fastText Quick Start Guide

Product type: Book
Published in: Jul 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789130997
Edition: 1st Edition
Author (1)
Joydeep Bhattacharjee

Joydeep Bhattacharjee is a Principal Engineer who works for Nineleaps Technology Solutions. After graduating from National Institute of Technology at Silchar, he started working in the software industry, where he stumbled upon Python. Through Python, he stumbled upon machine learning. Now he primarily develops intelligent systems that can parse and process data to solve challenging problems at work. He believes in sharing knowledge and loves mentoring in machine learning. He also maintains a machine learning blog on Medium.

Machine Learning and Deep Learning Models

In almost all of the applications that we have been discussing up to now, the implicit assumption has been that you are creating a new machine learning NLP pipeline. Now, that may not always be the case. If you are already working on an established platform, fastText may also be a good addition to make the pipeline better.

This chapter will give you some methods and recipes for integrating fastText with popular frameworks such as scikit-learn, Keras, TensorFlow, and PyTorch. We will look at how fastText word embeddings can be combined with other deep neural architectures, such as convolutional neural networks (CNNs) or attention networks, to solve various NLP problems.

The topics covered in this chapter are as follows:

  • Scikit-learn and fastText
  • Embeddings
  • Keras
  • Embedding layer in Keras
  • Convolutional neural network...

Scikit-learn and fastText

In this section, we will be talking about how to integrate fastText into your statistical models. The most common and popular library for statistical machine learning is scikit-learn, so we will focus on that.

scikit-learn owes much of its popularity to its simple, uniform API. The typical flow is as follows:

  1. Convert your data into matrix format.
  2. Create an instance of the predictor class.
  3. Using the instance, run the fit method on the data.
  4. Once the model is fitted, run predict on it.

This means that you can create a custom classifier by defining the fit and predict methods.
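The four steps above, and the idea of a custom classifier, can be sketched as follows. This is a minimal illustration rather than the book's implementation: CentroidClassifier is a hypothetical name, and the tiny feature matrix is made up. A full scikit-learn estimator would normally also inherit from sklearn.base.BaseEstimator and ClassifierMixin; that is left out here so the sketch only depends on NumPy.

```python
import numpy as np


class CentroidClassifier:
    """Hypothetical classifier exposing scikit-learn-style fit/predict."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Store one mean vector (centroid) per class label.
        self.classes_ = np.unique(y)
        self.centroids_ = np.vstack(
            [X[y == c].mean(axis=0) for c in self.classes_]
        )
        return self  # returning self allows clf.fit(X, y).predict(...)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Distances from every row to every centroid: (n_samples, n_classes).
        dists = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        return self.classes_[dists.argmin(axis=1)]


# 1. Data in matrix format (made-up 2-D features, e.g. averaged word vectors).
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array(["sports", "sports", "politics", "politics"])

# 2. Create an instance, 3. fit it, 4. predict on new rows.
clf = CentroidClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(np.array([[0.85, 0.15], [0.05, 0.95]]))
```

Because the class only relies on the fit and predict convention, the same shape could wrap a fastText model instead of the centroid logic used here.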

Custom classifiers for fastText

Since we...

Embeddings

As you have seen, when you need to work with text in machine learning, you need to convert the text into numerical values. The logic is the same in neural architectures as well. In neural networks, you implement this using the embeddings layer. All modern deep learning libraries provide an embeddings API for use.

The embeddings layer is a useful and versatile layer used for various purposes:

  • It can be used to learn word embeddings to be used in an application later
  • It can be used with a larger model where the embeddings are also tuned as part of the model
  • It can be used to load a pretrained word embedding

The third point is the focus of this section. The idea is to use fastText to create better embeddings, which can then be injected into your model using this embedding layer. Normally the embeddings layer is initialized with random weights...
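Loading a pretrained embedding usually comes down to building a matrix whose row i holds the vector for the token with ID i. The sketch below shows one way to do that. The pretrained dictionary is a made-up stand-in for vectors you would read from a fastText .vec file, and word_index is a hypothetical vocabulary; tokens without a pretrained vector keep a small random initialization.

```python
import numpy as np

EMBEDDING_DIM = 3  # real fastText vectors are typically much wider

# Made-up vectors standing in for ones exported from a fastText model.
pretrained = {
    "cat": np.array([0.1, 0.2, 0.3]),
    "dog": np.array([0.4, 0.5, 0.6]),
}

# Hypothetical vocabulary: token -> integer ID (0 is reserved for padding).
word_index = {"cat": 1, "dog": 2, "axolotl": 3}

rng = np.random.default_rng(0)
num_words = len(word_index) + 1
# Start from small random values, as an untrained embedding layer would.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(num_words, EMBEDDING_DIM))
embedding_matrix[0] = 0.0  # padding row

for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector  # overwrite with the pretrained vector

# An embedding lookup is then just row indexing by token ID.
sentence_ids = [1, 2, 3]
sentence_vectors = embedding_matrix[sentence_ids]
```

The resulting embedding_matrix is the kind of array that the framework-specific embedding layers in the following sections accept as their initial weights.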

Keras

Keras is a popular high-level neural network API. It supports TensorFlow, CNTK, and Theano as backends. Because its API is user-friendly, many people use it in lieu of the base libraries.

Embedding layer in Keras

The embedding layer will be the first hidden layer of the Keras network, and you will need to specify three arguments: input dimension, output dimension, and input length. Since we will be using fastText to make our model better, we will also need to pass the weights parameter with the embedding matrix and set the trainable argument to False:

embedding_layer = Embedding(num_words,
EMBEDDING_DIM,
weights=[embedding_matrix],
...
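
The snippet above is truncated in this excerpt. As a rough, self-contained sketch of the same idea (not the book's exact code), the following freezes a small made-up embedding_matrix inside an Embedding layer. It assumes TensorFlow 2.x with its bundled Keras, and it loads the matrix through embeddings_initializer with a Constant initializer rather than the weights argument, which is a swapped-in alternative that tends to behave the same across Keras versions.

```python
import numpy as np
from tensorflow import keras

# Made-up matrix standing in for fastText vectors: 4 tokens, 3 dimensions.
embedding_matrix = np.array(
    [[0.0, 0.0, 0.0],
     [0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9]],
    dtype="float32",
)
num_words, EMBEDDING_DIM = embedding_matrix.shape

embedding_layer = keras.layers.Embedding(
    num_words,
    EMBEDDING_DIM,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors fixed during training
)

# One document of three token IDs -> output of shape (1, 3, EMBEDDING_DIM).
token_ids = np.array([[1, 3, 2]])
embedded = np.asarray(embedding_layer(token_ids))
```

Setting trainable=False means the optimizer never updates the pretrained vectors; switching it to True would fine-tune them along with the rest of the model.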

TensorFlow

TensorFlow is a computation library developed by Google. It is quite popular and is used by many companies to create their neural network models. The logic behind augmenting TensorFlow models using fastText is the same as what you have seen for Keras.

Word embeddings in TensorFlow

To use word embeddings in TensorFlow, you first give every token in your list of documents a unique ID, so that each document becomes a vector of these IDs, and then create an embeddings matrix indexed by those IDs. Now, let's say you have an embedding in a NumPy array called word_embedding, with vocab_size rows and embedding_dim columns, and you want to create a tensor W. Taking a specific example, the sentence "I have a cat." can...
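
As a small sketch of the lookup described here, assuming TensorFlow 2.x with eager execution (the book's own code is not shown in this excerpt and may differ): the values in word_embedding and the token IDs chosen for the example sentence are made up for illustration.

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 5, 3
# Made-up values standing in for pretrained fastText vectors.
word_embedding = np.arange(
    vocab_size * embedding_dim, dtype=np.float32
).reshape(vocab_size, embedding_dim)

# W holds the pretrained vectors; trainable=False keeps them fixed.
W = tf.Variable(word_embedding, trainable=False, name="W")

# Hypothetical IDs for the tokens "I", "have", "a", "cat".
sentence_ids = tf.constant([1, 2, 3, 4])

# Row lookup: one embedding_dim-sized vector per token ID.
sentence_vectors = tf.nn.embedding_lookup(W, sentence_ids)
```

The result has one row per token, so a four-token sentence with embedding_dim columns becomes a 4 x embedding_dim tensor that the rest of the model can consume.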

PyTorch

Following the same logic as the previous two libraries, you can use the torch.nn.EmbeddingBag class to inject the pretrained embeddings. There is a small drawback though. Keras and TensorFlow assume that your tensors are implemented as NumPy arrays, whereas PyTorch implements its own torch tensor. Generally, this is not an issue, but it means that you will need to write your own text conversion and tokenizing pipelines. To avoid all this rewriting and reinvention of the wheel, you can use the torchtext library.
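
A minimal sketch of injecting pretrained vectors through torch.nn.EmbeddingBag might look like the following. The pretrained array is made up and stands in for vectors exported from fastText; the token IDs are hypothetical and would normally come from your own tokenizing pipeline or from torchtext.

```python
import numpy as np
import torch

# Made-up matrix standing in for fastText vectors: 4 tokens, 2 dimensions.
pretrained = np.array(
    [[0.0, 0.0],
     [1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]],
    dtype=np.float32,
)

# Convert the NumPy array to a torch tensor before handing it to PyTorch.
weights = torch.from_numpy(pretrained)

# freeze=True keeps the pretrained vectors fixed; mode="mean" averages each bag.
embedding_bag = torch.nn.EmbeddingBag.from_pretrained(
    weights, freeze=True, mode="mean"
)

# Two documents ("bags") of token IDs, each of length 2.
token_ids = torch.tensor([[1, 2], [2, 3]])
doc_vectors = embedding_bag(token_ids)  # one averaged vector per document
```

With mode="mean", each document is reduced to the average of its token vectors, which gives a single fixed-size vector per document regardless of its length.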

The torchtext library

torchtext is an excellent library that takes care of most of the preprocessing steps that you need to build...

Summary

In this chapter, we took a look at how to integrate fastText word vectors into either linear machine learning models or deep learning models created in Keras, TensorFlow, and PyTorch. You also saw how word vectors can be easily assimilated into existing neural architectures that you might be using in your business application. If you are initializing the embeddings from random values, I would highly recommend that you try to initialize them using fastText values, and then see whether there are performance improvements in your model.
