Dimensionality reduction

Let's talk about a scenario in which we have been given a dataset from a bank containing features that pertain to its customers: income, age, gender, payment behavior, and so on. Once you take a look at the data's dimensions, you realize that there are 850 features. You are supposed to build a model to predict which customers will default if given a loan. Would you take all of these features and build the model?

The answer should be a clear no. The more features a dataset has, the more likely it is that the model will overfit. Although having fewer features doesn't guarantee that overfitting won't take place, it reduces the chance of it. Not a bad deal, right?

Dimensionality reduction is one way to deal with this. It means reducing the number of dimensions in the feature space.

There are two ways this can be achieved:

  • Feature elimination: This is a process in which features that add no value to the model are rejected. Doing this makes the model quite simple. We know from Occam's razor that we should strive for simplicity when it comes to building models. However, this step may result in a loss of information, since a combination of the discarded variables may have had an impact on the model.
  • Feature extraction: This is a process in which we create new independent variables that are combinations of the existing variables. Based on the impact of these new variables, we either keep or drop them. A short sketch contrasting the two approaches follows this list.

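To make the distinction concrete, here is a minimal sketch using scikit-learn: `VarianceThreshold` illustrates feature elimination (dropping a near-constant column), while `PCA` illustrates feature extraction (building new variables as combinations of the old ones). The synthetic dataset and the threshold value are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))            # five synthetic features
X[:, 4] = 0.001 * rng.normal(size=100)   # make one feature near-constant

# Feature elimination: drop features whose variance falls below a threshold
eliminator = VarianceThreshold(threshold=0.01)
X_reduced = eliminator.fit_transform(X)
print(X_reduced.shape)    # (100, 4): the near-constant column is gone

# Feature extraction: build new variables as linear combinations of all five
extractor = PCA(n_components=2)
X_extracted = extractor.fit_transform(X)
print(X_extracted.shape)  # (100, 2): two new composite features
```

Note that elimination discards columns outright, whereas extraction keeps a little of every original variable inside each new component.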
Principal component analysis (PCA) is a feature extraction technique that takes all of the variables into account and forms linear combinations of them. The least important of these combinations can then be dropped, while the most valuable parts of all the original variables are retained.

The newly formed variables (components) are independent of each other, which can be a boon for the model-building process when the data is linearly separable, since linear models carry the underlying assumption that variables are independent of each other.

To understand the functionality of PCA, we have to become familiar with a few terms:

  • Variance: This is the average squared deviation from the mean. It is also called the spread, and it measures the variability of the data:

$$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Here, $\bar{x}$ is the mean.

  • Covariance: This is a measure of the degree to which two variables move in the same direction:

$$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here, $\bar{x}$ and $\bar{y}$ are the means of the two variables. Both definitions are checked numerically below.
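Here is a quick numeric check of both formulas on made-up data, using the same $\frac{1}{n}$ convention as the definitions above:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Variance: average squared deviation from the mean (same as np.var(x))
var_x = np.mean((x - x.mean()) ** 2)

# Covariance: average product of the two variables' deviations
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(var_x)   # 5.0
print(cov_xy)  # 2.75 (positive: x and y tend to move together)
```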
In PCA, we find the pattern in the data as follows: if the dataset, represented in n dimensions, exhibits high covariance, we represent it with linear combinations of those same n dimensions. These combinations are orthogonal to each other, which is why they are independent of each other. Moreover, the combinations are ordered by variance: the one capturing the most variance comes first.

Let's go over how PCA works by talking about the following steps:

  1. Let's split our dataset into Y and X sets, and just focus on X.
  2. The matrix X is standardized so that each feature has a mean of 0 and a standard deviation of 1. Let's call the new matrix Z.
  3. Let's work on Z now. We take its transpose and multiply the transposed matrix by Z. By doing this, we get our covariance matrix:

$$\text{Covariance matrix} = Z^{T}Z$$

  4. Now, we need to calculate the eigenvalues and their corresponding eigenvectors of $Z^{T}Z$. Typically, this is done by the eigendecomposition of the covariance matrix into $PDP^{-1}$, where P is the matrix of eigenvectors and D is the diagonal matrix with the eigenvalues on the diagonal and zeros everywhere else.
  5. Take the eigenvalues λ₁, λ₂, …, λp and sort them from largest to smallest, sorting the eigenvectors in P accordingly. Call this sorted matrix of eigenvectors P*.
  6. Calculate Z* = ZP*. This new matrix, Z*, is a centered/standardized version of X, but now each observation is a combination of the original variables, where the weights are determined by the eigenvectors. As a bonus, because the eigenvectors in P* are independent of one another, the columns of Z* are independent of one another. These steps are sketched in code below.
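Here is a minimal NumPy sketch of these six steps on made-up data. `np.linalg.eigh` is used because $Z^{T}Z$ is symmetric, and the variable names (`Z`, `P_star`, `Z_star`) simply mirror the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: focus on the feature matrix X (the targets Y are set aside);
# mixing columns makes the synthetic features correlated
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Step 2: standardize X to zero mean and unit standard deviation -> Z
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: form the covariance matrix Z^T Z
cov = Z.T @ Z

# Step 4: eigendecomposition (eigh is suited to symmetric matrices)
eigenvalues, P = np.linalg.eigh(cov)

# Step 5: sort eigenvalues from largest to smallest, and the
# eigenvectors in P accordingly, giving P*
order = np.argsort(eigenvalues)[::-1]
P_star = P[:, order]

# Step 6: project the standardized data onto the sorted eigenvectors
Z_star = Z @ P_star

# The columns of Z_star are uncorrelated: off-diagonal entries are ~0,
# and the diagonal holds the sorted eigenvalues
print(np.round(Z_star.T @ Z_star, 6))
```

Dropping the trailing columns of `Z_star` then yields the reduced-dimension dataset, with the retained columns capturing the most variance.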