You're reading from  scikit-learn Cookbook - Second Edition

Product type: Book
Published in: Nov 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787286382
Edition: 2nd Edition
Author: Trent Hauck

Trent Hauck is a data scientist living and working in the Seattle area. He grew up in Wichita, Kansas and received his undergraduate and graduate degrees from the University of Kansas. He is the author of the book Instant Data Intensive Apps with pandas How-to, Packt Publishing—a book that can get you up to speed quickly with pandas and other associated technologies.

Text and Multiclass Classification with scikit-learn

This chapter will cover the following recipes:

  • Using LDA for classification
  • Working with QDA – a nonlinear LDA
  • Using SGD for classification
  • Classifying documents with Naive Bayes
  • Label propagation with semi-supervised learning

Using LDA for classification

Linear discriminant analysis (LDA) attempts to fit a linear combination of features to predict an outcome variable. LDA is also often used as a preprocessing step for dimensionality reduction. We'll walk through both uses in this recipe.

Getting ready

In this recipe, we will do the following:

  1. Grab stock data from Google.
  2. Rearrange it in a shape we're comfortable with.
  3. Create an LDA object to fit and predict the class labels.
  4. Give an example of how to use LDA for dimensionality reduction.

Before starting on step 1 and grabbing stock data from Google, install the pandas-datareader package, which provides the latest stock reader for pandas. Do so at an Anaconda command line by typing this:

conda install -c anaconda pandas-datareader

Note...
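
Steps 3 and 4 can be sketched without the stock data. The following is a minimal illustration, assuming a synthetic three-class dataset from make_classification as a stand-in for the rearranged stock data (the dataset, its sizes, and the variable names are assumptions, not part of the recipe). It fits an LDA object, predicts class labels, and then uses LDA to project the features onto two components.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the stock data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Classification: fit LDA and predict the class labels.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
predictions = lda.predict(X)
train_accuracy = (predictions == y).mean()
print("training accuracy:", train_accuracy)

# Dimensionality reduction: LDA can keep at most n_classes - 1 components,
# so three classes allow a projection onto two dimensions.
X_reduced = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```

The accuracy here is measured on the training data, so it is optimistic; a held-out split would give a fairer estimate.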

Working with QDA – a nonlinear LDA

Quadratic discriminant analysis (QDA) generalizes a common technique, much as quadratic regression does. It simply lets a more complex model fit the data, though, as with all things, allowing complexity to creep in makes our lives more difficult.

Getting ready

We will expand on the last recipe and look at QDA via the QDA object.

LDA assumes that every class shares the same covariance matrix. Here we will relax that assumption and let each class have its own covariance.

How to do it...

  1. QDA is aptly a member of the qda module. Use the following commands to use...
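
As a rough sketch of the idea, the snippet below fits LDA and QDA on the same data and compares them. It assumes the QuadraticDiscriminantAnalysis class in sklearn.discriminant_analysis, which is where recent scikit-learn releases expose QDA; the import path used in the recipe itself may differ. The synthetic dataset and variable names are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption for illustration only).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA shares one covariance matrix across classes; QDA estimates one per class.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_train, y_train)

print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))

# With store_covariance=True, QDA exposes one covariance matrix per class.
print("covariance matrices stored by QDA:", len(qda.covariance_))
```

Whether QDA beats LDA depends on the data: the extra per-class covariance parameters help only when the classes really do differ in shape.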

Using SGD for classification

Stochastic gradient descent (SGD) is a fundamental technique used to fit a model for regression. There are natural connections between SGD for classification and SGD for regression.

Getting ready

In regression, we minimized a cost function that penalized bad choices on a continuous scale; for classification, we'll minimize a cost function that penalizes errors across two (or more) discrete cases.

How to do it...

  1. First, let's create some very basic data:
     from sklearn import datasets
     X, y = datasets.make_classification(n_samples=500)
  2. Split the...
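
The remaining steps are not shown here, so the following is only a sketch of one plausible continuation: it splits the data with train_test_split and fits an SGDClassifier with hinge loss. The split ratio, the loss, the random seeds, and the variable names are all assumptions rather than the recipe's own choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Same kind of basic data as above; random_state is added for repeatability.
X, y = make_classification(n_samples=500, random_state=0)

# Hold out a quarter of the samples for testing (an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hinge loss gives a linear SVM-style classifier trained by SGD.
sgd_clf = SGDClassifier(loss="hinge", random_state=0)
sgd_clf.fit(X_train, y_train)

test_accuracy = sgd_clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```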

Classifying documents with Naive Bayes

Naive Bayes is a really interesting model. It's somewhat similar to k-nearest neighbors (KNN) in that it makes assumptions that might oversimplify reality, yet it still performs well in many cases.

Getting ready

In this recipe, we'll use Naive Bayes to do document classification with sklearn. An example I have personal experience with is taking the words that make up an account descriptor in accounting, such as accounts payable, and determining whether the account belongs to the income statement, cash flow statement, or balance sheet.

The basic idea is to use the word frequency from a labeled training corpus to learn the classifications of the documents. Then, we can turn it on a test set and attempt...
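
To make the word-frequency idea concrete, here is a minimal sketch using CountVectorizer and MultinomialNB. The tiny corpus of account descriptors and their labels is hypothetical, invented only to mirror the accounting example; it is not the data the recipe uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled account descriptors (illustration only).
train_docs = [
    "accounts payable",
    "accounts receivable",
    "sales revenue",
    "interest expense",
    "cash paid to suppliers",
    "cash received from customers",
]
train_labels = [
    "balance sheet",
    "balance sheet",
    "income statement",
    "income statement",
    "cash flow statement",
    "cash flow statement",
]

# Turn each document into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Learn per-class word frequencies from the labeled documents.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classify an unseen descriptor; words outside the vocabulary are ignored.
test_docs = ["accounts payable trade"]
predicted = clf.predict(vectorizer.transform(test_docs))
print(predicted[0])
```

Note that the vectorizer is fit only on the training documents and then reused, unchanged, to transform the new ones.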

Label propagation with semi-supervised learning

Label propagation is a semi-supervised technique that makes use of labeled and unlabeled data to learn about unlabeled data. Quite often, data that will benefit from a classification algorithm is difficult to label. For example, labeling data might be very expensive, so only a subset is cost-effective to manually label. That said, there does seem to be slow but growing support for companies to hire taxonomists.
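
A small sketch of the technique, assuming the iris dataset and scikit-learn's LabelPropagation class: about 70 percent of the labels are hidden by setting them to -1, which is how scikit-learn marks unlabeled samples, and the model infers them from the remaining labeled points. The dataset, the hidden fraction, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Hide roughly 70% of the labels; -1 marks a sample as unlabeled.
rng = np.random.RandomState(0)
unlabeled_mask = rng.rand(len(y)) < 0.7
y_partial = y.copy()
y_partial[unlabeled_mask] = -1

# Fit on both labeled and unlabeled samples.
lp = LabelPropagation()
lp.fit(X, y_partial)

# transduction_ holds the label assigned to every training sample.
inferred = lp.transduction_[unlabeled_mask]
accuracy_on_hidden = (inferred == y[unlabeled_mask]).mean()
print("accuracy on hidden labels:", accuracy_on_hidden)
```

Because the true labels were only hidden, not missing, the inferred labels can be checked against them; with genuinely unlabeled data that check is not available.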

Getting ready

Another problem area is censored data. You can imagine a case where the frontier of time will affect your ability to gather labeled data. Say, for instance, you took measurements of patients and gave them an experimental drug. In some cases, you are able...
