This chapter will cover the following recipes:
- Using LDA for classification
- Working with QDA – a nonlinear LDA
- Using SGD for classification
- Classifying documents with Naive Bayes
- Label propagation with semi-supervised learning
Linear discriminant analysis (LDA) attempts to fit a linear combination of features to predict an outcome variable. Besides being a classifier in its own right, LDA is often used as a pre-processing step to reduce dimensionality. We'll walk through both uses in this recipe.
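The two uses of LDA described above can be sketched as follows. This is a minimal illustration, not the recipe's own code: it assumes a synthetic dataset from `make_classification` in place of the stock data, and the variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the recipe itself works with stock data).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Use 1: LDA as a classifier. score() returns mean accuracy on held-out data.
lda_accuracy = lda.score(X_test, y_test)

# Use 2: LDA as a pre-processing step. With two classes, the projection has
# at most n_classes - 1 = 1 component.
X_train_projected = lda.transform(X_train)

print("held-out accuracy:", lda_accuracy)
print("projected shape:", X_train_projected.shape)
```

The projected array could then be passed to another estimator in place of the original features.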
In this recipe, we will do the following:
Before starting on step 1 and grabbing stock data from Google, install pandas-datareader, the package that provides the latest stock reader for pandas. Do so at an Anaconda command line by typing this:
conda install -c anaconda pandas-datareader
Note...
Quadratic discriminant analysis (QDA) generalizes LDA in much the same way that quadratic regression generalizes linear regression. It simply allows a more complex model to fit the data, though, as always, letting complexity creep in makes our lives more difficult.
We will expand on the last recipe and look at QDA via the QDA object.
With LDA, we made an assumption about the covariance of the model: every class shares the same covariance. Here we will relax that assumption.
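To make the relaxed covariance assumption concrete, here is a hedged sketch that fits both models on the same data and compares what each one stores. It assumes a synthetic dataset and a current scikit-learn release, where the QDA object is exposed as `QuadraticDiscriminantAnalysis`; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic data (an assumption for illustration). n_redundant=0 avoids
# collinear features, which QDA handles poorly.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

# LDA estimates a single covariance matrix shared by all classes.
shared_cov_shape = lda.covariance_.shape

# QDA estimates one covariance matrix per class.
per_class_cov_shapes = [cov.shape for cov in qda.covariance_]

print("LDA covariance:", shared_cov_shape)
print("QDA covariances:", per_class_cov_shapes)
```

Because QDA estimates a separate matrix for each class, it has more parameters to fit, which is the added complexity mentioned above.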
Stochastic gradient descent (SGD) is a fundamental technique used to fit a model for regression. There are natural connections between SGD for classification and SGD for regression.
In regression, we minimized a cost function that penalized for bad choices on a continuous scale, but for classification, we'll minimize a cost function that penalizes for two (or more) cases.
from sklearn import datasets
# Generate a 500-sample classification dataset
X, y = datasets.make_classification(n_samples=500)
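Continuing from the dataset above, one way to fit a classifier with SGD is scikit-learn's `SGDClassifier`. The sketch below is an assumption about how the recipe might proceed, not its actual code; the train/test split, the hinge loss, and the `random_state` values are choices made here for a reproducible example.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Same generator as in the text, with a fixed seed added for repeatability.
X, y = datasets.make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# loss="hinge" is a classification cost function (a linear SVM objective);
# it penalizes examples on the wrong side of the decision margin.
sgd_clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
sgd_clf.fit(X_train, y_train)

sgd_accuracy = sgd_clf.score(X_test, y_test)
print("held-out accuracy:", sgd_accuracy)
```

Swapping the `loss` argument changes the cost function being minimized while the fitting procedure stays the same, which is the connection between the regression and classification cases.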
Naive Bayes is a really interesting model. It's somewhat similar to KNN in the sense that it makes some assumptions that might oversimplify reality, but still it performs well in many cases.
In this recipe, we'll use Naive Bayes to do document classification with sklearn. An example I have personal experience with is taking the words that make up an account descriptor in accounting, such as accounts payable, and determining whether the account belongs to the income statement, cash flow statement, or balance sheet.
The basic idea is to use the word frequency from a labeled training corpus to learn the classifications of the documents. Then, we can turn it on a test set and attempt...
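The word-frequency idea can be sketched with a bag-of-words model feeding a multinomial Naive Bayes classifier. The tiny corpus of account descriptors below is made up purely for illustration, and the pipeline is one reasonable arrangement rather than the recipe's own code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled corpus (invented for this example, not real data).
train_docs = [
    "accounts payable",
    "accounts receivable",
    "inventory on hand",
    "long term debt",
    "sales revenue",
    "cost of goods sold",
    "interest expense",
    "salary expense",
]
train_labels = ["balance sheet"] * 4 + ["income statement"] * 4

# CountVectorizer turns each document into word counts;
# MultinomialNB learns per-class word frequencies from those counts.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(train_docs, train_labels)

# Classify descriptors the model has not seen before.
new_docs = ["notes payable", "rent expense"]
predicted = nb_model.predict(new_docs)
for doc, label in zip(new_docs, predicted):
    print(doc, "->", label)
```

Words that never appeared in the training corpus, such as "notes" or "rent", are ignored by the vectorizer, so the prediction rests on the words the model has counts for.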
Label propagation is a semi-supervised technique that makes use of both labeled and unlabeled data to learn about the unlabeled data. Quite often, data that would benefit from a classification algorithm is difficult to label. For example, labeling data might be very expensive, so only a subset is cost-effective to label manually. That said, there does seem to be slow but growing support for companies to hire taxonomists.
Another problem area is censored data. You can imagine a case where the frontier of time will affect your ability to gather labeled data. Say, for instance, you took measurements of patients and gave them an experimental drug. In some cases, you are able...
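As a rough sketch of label propagation in scikit-learn, the example below hides most of the labels in a synthetic dataset and lets `LabelPropagation` infer them. The dataset, the 70% unlabeled fraction, and the k-nearest-neighbors kernel settings are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Hide roughly 70% of the labels. scikit-learn marks unlabeled points with -1.
rng = np.random.RandomState(0)
unlabeled_mask = rng.rand(len(y)) < 0.7
y_partial = y.copy()
y_partial[unlabeled_mask] = -1

# Labels spread from labeled points to unlabeled ones through a
# nearest-neighbor graph built over all samples.
label_prop = LabelPropagation(kernel="knn", n_neighbors=7)
label_prop.fit(X, y_partial)

# transduction_ holds the label assigned to every sample, labeled or not.
inferred = label_prop.transduction_[unlabeled_mask]
agreement = (inferred == y[unlabeled_mask]).mean()
print("agreement with hidden labels:", agreement)
```

Here the hidden labels are available for checking only because the data is synthetic; with genuinely unlabeled data there would be nothing to compare against.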