Learning Predictive Analytics with R


Product type Book
Published in Sep 2015
Publisher Packt
ISBN-13 9781782169352
Pages 332
Edition 1st Edition
Author (1):
Eric Mayor

Table of Contents (23 chapters)

Learning Predictive Analytics with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Setting GNU R for Predictive Analytics
Visualizing and Manipulating Data Using R
Data Visualization with Lattice
Cluster Analysis
Agglomerative Clustering Using hclust()
Dimensionality Reduction with Principal Component Analysis
Exploring Association Rules with Apriori
Probability Distributions, Covariance, and Correlation
Linear Regression
Classification with k-Nearest Neighbors and Naïve Bayes
Classification Trees
Multilevel Analyses
Text Analytics with R
Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
Exercises and Solutions
Further Reading and References
Index

Chapter 6. Dimensionality Reduction with Principal Component Analysis

Nowadays, accessing data is easier and cheaper than ever before. This has led to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not a trivial task, as its sheer quantity often makes analysis difficult or impractical: the data is frequently too large to fit in a machine's available memory, and the available computational power is often insufficient to analyze it in a reasonable time frame. One solution is to turn to technologies designed to handle high-dimensional data (Big Data). These solutions typically spread the analysis across the memory and computing power of several machines (computer clusters). However, most organizations do not have such an infrastructure. A more practical solution is therefore to reduce the dimensionality of the data while keeping the essential information intact.

Another reason to reduce dimensionality is that, in some cases, there...

The inner workings of Principal Component Analysis


Principal Component Analysis aims at finding the dimensions (principal components) that explain most of the variance in a dataset. Once these components are found, a principal component score is computed for each row on each principal component. Remember the example of the questionnaire data we discussed in the preceding section: these scores can be understood as summaries (combinations) of the attributes that compose the data frame.

PCA produces the principal components by computing the eigenvalues of the covariance matrix of the dataset. There is one eigenvalue for each attribute, that is, for each row (and column) of the covariance matrix. The eigenvectors are also required in order to compute the principal component scores. The eigenvalues and eigenvectors satisfy the following equations, where A is the covariance matrix of interest, I is the identity matrix, λ is an eigenvalue, and v is the corresponding eigenvector: the eigenvalues are the solutions λ of det(A - λI) = 0, and each eigenvector v then satisfies Av = λv.
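The relationship between the covariance matrix, its eigendecomposition, and the principal component scores can be sketched in a few lines of base R. This is a minimal illustration on toy data (the variable names and simulated data are ours, not the book's); it checks that the scores obtained from the eigenvectors agree with R's built-in prcomp() function:

```r
# Toy illustration: PCA via the eigendecomposition of the covariance matrix
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)          # 100 observations, 2 attributes
x[, 2] <- x[, 1] + rnorm(100, sd = 0.3)    # make the columns correlated

A <- cov(x)       # covariance matrix
e <- eigen(A)     # eigenvalues (variances of the components) and eigenvectors
e$values          # one eigenvalue per attribute, in decreasing order

# Principal component scores: project the centered data on the eigenvectors
scores <- scale(x, center = TRUE, scale = FALSE) %*% e$vectors

# prcomp() gives the same scores (up to the sign of each component)
p <- prcomp(x)
all.equal(abs(unname(scores)), abs(unname(p$x)))
```

Note that the eigenvalues sum to the total variance in the data (the trace of the covariance matrix), which is why each eigenvalue can be read as the share of variance explained by its component.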

What is important to understand for...

Learning PCA in R


In this section, we will learn more about how to use PCA to obtain knowledge from data, and ultimately to reduce the number of attributes (also called features). The first dataset we will use is the msq dataset from the psych package. The motivational state questionnaire (msq) dataset is composed of 92 attributes, of which 72 are ratings of adjectives by 3,896 participants describing their mood. We will only use these 72 attributes for our current purpose, which is exploring the structure of the questionnaire. We will therefore start by installing and loading the package and the data, and assigning the 72 attributes of interest to an object called motiv:

# Install and load the psych package, then load the msq data
install.packages("psych")
library(psych)
data(msq)

# Keep only the 72 adjective-rating attributes
motiv <- msq[, 1:72]

Dealing with missing values

Missing values are a common problem in real-life datasets, such as the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing...

Summary


In this chapter, we examined how PCA works. We briefly discussed how to deal with a dataset in which many values are missing on some attributes. We examined how to determine the adequate number of components and the proportion of variance they explain. We also saw how to give meaningful names to the components. Finally, we began examining linear relationships between attributes using correlations. In the next chapter, we will discuss association rules with apriori.
