You're reading from  fastText Quick Start Guide

Product type: Book
Published in: Jul 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789130997
Edition: 1st Edition
Author (1)
Joydeep Bhattacharjee

Joydeep Bhattacharjee is a Principal Engineer who works for Nineleaps Technology Solutions. After graduating from National Institute of Technology at Silchar, he started working in the software industry, where he stumbled upon Python. Through Python, he stumbled upon machine learning. Now he primarily develops intelligent systems that can parse and process data to solve challenging problems at work. He believes in sharing knowledge and loves mentoring in machine learning. He also maintains a machine learning blog on Medium.

Sentence Classification in FastText

In this chapter, we will cover the following topics:

  • Sentence classification
  • fastText supervised learning:
    • Architecture
    • Hierarchical softmax architecture
    • N-gram features and the hashing trick:
      • The Fowler-Noll-Vo (FNV) hash
    • Word embeddings and their use in sentence classification
  • fastText model quantization:
    • Compression:
      • Quantization
      • Vector quantization:
        • Finding the codebook for high-dimensional spaces
      • Product quantization
      • Additional steps

Sentence classification

Sentence classification deals with understanding text written in natural language and determining the classes that it may belong to. In text classification problems, you have a set of documents d drawn from a corpus X (which contains all the documents) and a finite set of classes C = {c1, c2, ..., cn}. Classes are also called categories or labels. To train a model, you need two things: a classifier, which is generally a well-tested algorithm (this is not strictly necessary, but this chapter discusses the well-tested algorithm used in fastText), and a corpus in which each document is labeled with the classes that it belongs to.
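To make the setup above concrete, here is a minimal sketch of a labeled corpus in plain Python. The documents and the labels `spam` and `ham` are hypothetical and do not come from the book. The helper `to_fasttext_line` is also an illustrative name; it writes each document in the one-line-per-document form that fastText's supervised mode reads, where the label is marked by the default `__label__` prefix.

```python
# Hypothetical toy corpus: (label, document) pairs made up for illustration.
corpus = [
    ("spam", "win a free prize now"),
    ("ham", "meeting moved to friday"),
    ("spam", "claim your free reward today"),
]


def to_fasttext_line(label, text, prefix="__label__"):
    """Format one document as '<prefix><label> <text>' on a single line."""
    return f"{prefix}{label} {text}"


# The finite class set C is the set of distinct labels seen in the corpus.
classes = sorted({label for label, _ in corpus})

# One training line per document, label first.
lines = [to_fasttext_line(label, text) for label, text in corpus]

print(classes)
for line in lines:
    print(line)
```

Writing `lines` to a text file, one entry per line, would give a training file in this format. A document that belongs to several classes would simply carry several prefixed labels on its line.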

Text classification has many practical uses, such as the following:

  • Creating spam classifiers in email
  • Page ranking and indexing in search engines
  • Sentiment...

fastText supervised learning

A fastText classifier is built on top of a linear classifier, specifically a bag-of-words (BoW) classifier. In this section, you will get to know the architecture of the fastText classifier and how it works.
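As a rough illustration of what a bag-of-words linear classifier does, the sketch below averages the vectors of the words in a text, scores each label with a dot product, and normalizes the scores with a softmax. It is not fastText's implementation: the two-dimensional vectors, the vocabulary, and the labels are hypothetical values chosen by hand, whereas in a real model they would be learned during training.

```python
import math

DIM = 2  # Toy dimensionality, chosen only for readability.

# Hypothetical hand-picked parameters; a trained model would learn these.
word_vectors = {
    "great": [0.9, 0.1],
    "awful": [0.1, 0.9],
    "movie": [0.5, 0.5],
}
label_vectors = {
    "positive": [1.0, 0.0],
    "negative": [0.0, 1.0],
}


def text_vector(tokens):
    """Average the vectors of known tokens; word order is ignored (BoW)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * DIM
    return [sum(v[i] for v in known) / len(known) for i in range(DIM)]


def predict(tokens):
    """Score each label with a dot product, then apply a softmax."""
    hidden = text_vector(tokens)
    scores = {
        label: sum(h * w for h, w in zip(hidden, vec))
        for label, vec in label_vectors.items()
    }
    total = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / total for label, s in scores.items()}


probs = predict("great movie".split())
best_label = max(probs, key=probs.get)
print(best_label, probs)
```

Because the text vector is an average, `predict(["movie", "great"])` gives the same probabilities as `predict(["great", "movie"])`. Ignoring word order in this way is what makes the model a bag-of-words classifier.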

Architecture

You can think of each piece of text and each label as a vector in space. The coordinates of these vectors are what training adjusts, so that the vector for a text and the vector for its associated label end up close together in space:

Vector representation of the text

In this example, which is shown in 2D space, you have texts such as "Nigerian Tommy Thompson is also a relative newcomer to the wrestling scene" and "James...

fastText model quantization

Thanks to the efforts of the Facebook AI Research team, there is a way to get vastly smaller models (in terms of the space that they take up on disk), as you saw in the Model quantization section in Chapter 2, Creating Models Using FastText Command Line. Models that take up hundreds of MBs can be quantized to only a couple of MBs. For example, for the DBpedia model released by Facebook, which is available at https://fasttext.cc/docs/en/supervised-models.html, the regular model (the BIN file) is 427 MB, while the smaller model (the FTZ file) is only 1.7 MB.

This reduction in size is achieved by throwing out some of the information that is encoded in the BIN files (or the bigger model). The problem that needs to be solved here is how to keep information that is important and how to identify information...
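The idea of trading a little information for a much smaller file can be illustrated with plain scalar quantization. This is a simplified stand-in and not fastText's actual routine, which relies on the product quantization listed in this chapter's outline. The weights below are hypothetical values, and the helper names `quantize` and `dequantize` are chosen only for this sketch.

```python
import struct

# Hypothetical weights standing in for a handful of model parameters.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]


def quantize(values, levels=256):
    """Map each float to an integer code in [0, levels - 1]."""
    lo, hi = min(values), max(values)
    # Fall back to a step of 1.0 if all values are identical.
    step = (hi - lo) / (levels - 1) or 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate floats from the integer codes."""
    return [lo + c * step for c in codes]


codes, lo, step = quantize(weights)
restored = dequantize(codes, lo, step)

# Storage cost: 4 bytes per 32-bit float versus 1 byte per 8-bit code.
full_size = len(struct.pack(f"{len(weights)}f", *weights))
small_size = len(bytes(codes))

# The information thrown away shows up as a small reconstruction error.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(full_size, small_size, max_error)
```

Here the five weights shrink from 20 bytes to 5, and each restored value differs from the original by at most half a quantization step. That bounded error is the price paid for the smaller representation.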

Summary

With this chapter, you have completed a deep dive into the theory behind how the fastText model is designed and implemented, the benefits, and the things that you need to consider while implementing it in your ML pipeline.

The next part of the book is about implementation and deployment, starting in the next chapter with how to use fastText in a Python environment.
