Learning Predictive Analytics with R


Product type Book
Published in Sep 2015
Publisher Packt
ISBN-13 9781782169352
Pages 332
Edition 1st Edition
Author (1):
Eric Mayor

Table of Contents (23 chapters)

Learning Predictive Analytics with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Setting GNU R for Predictive Analytics
Visualizing and Manipulating Data Using R
Data Visualization with Lattice
Cluster Analysis
Agglomerative Clustering Using hclust()
Dimensionality Reduction with Principal Component Analysis
Exploring Association Rules with Apriori
Probability Distributions, Covariance, and Correlation
Linear Regression
Classification with k-Nearest Neighbors and Naïve Bayes
Classification Trees
Multilevel Analyses
Text Analytics with R
Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML
Exercises and Solutions
Further Reading and References
Index

Chapter 6. Dimensionality Reduction with Principal Component Analysis

Nowadays, accessing data is easier and cheaper than ever before. This has led to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not a trivial task, as its sheer quantity often makes analysis difficult or impractical: the data is frequently too large to fit in a machine's available memory, and the available computational power is often insufficient to analyze it in a reasonable time frame. One solution is to turn to technologies designed to handle high-dimensional data (Big Data). These solutions typically spread the analysis across the memory and computing power of several machines (computer clusters). However, most organizations do not have such an infrastructure. A more practical solution is therefore to reduce the dimensionality of the data while keeping the essential information intact.

Another reason to reduce dimensionality is that, in some cases, there...

The inner workings of Principal Component Analysis


Principal Component Analysis aims at finding the dimensions (principal components) that explain most of the variance in a dataset. Once these components are found, a principal component score is computed for each row on each principal component. Remember the example of the questionnaire data we discussed in the preceding section: these scores can be understood as summaries (combinations) of the attributes that compose the data frame.

PCA produces the principal components by computing the eigenvalues of the covariance matrix of the dataset. There is one eigenvalue for each attribute, that is, for each row (and column) of the covariance matrix. The eigenvectors are also required in order to compute the principal component scores. The eigenvalues and eigenvectors satisfy the following equations, where A is the covariance matrix of interest, I is the identity matrix, λ is an eigenvalue, and v is the corresponding eigenvector: the eigenvalues are the solutions λ of det(A - λI) = 0, and each eigenvector v then satisfies Av = λv.
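The relationship between the covariance matrix, its eigendecomposition, and the principal component scores can be sketched in a few lines of base R. This is a minimal illustration on toy data (the variable names and simulated data are ours, not the book's); it checks that the scores obtained from the eigenvectors agree with R's built-in prcomp() function:

```r
# Toy illustration: PCA via the eigendecomposition of the covariance matrix
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)          # 100 observations, 2 attributes
x[, 2] <- x[, 1] + rnorm(100, sd = 0.3)    # make the columns correlated

A <- cov(x)       # covariance matrix
e <- eigen(A)     # eigenvalues (variances of the components) and eigenvectors
e$values          # one eigenvalue per attribute, in decreasing order

# Principal component scores: project the centered data on the eigenvectors
scores <- scale(x, center = TRUE, scale = FALSE) %*% e$vectors

# prcomp() gives the same scores (up to the sign of each component)
p <- prcomp(x)
all.equal(abs(unname(scores)), abs(unname(p$x)))
```

Note that the eigenvalues sum to the total variance in the data (the trace of the covariance matrix), which is why each eigenvalue can be read as the share of variance explained by its component.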

What is important to understand for...

Learning PCA in R


In this section, we will learn more about how to use PCA to obtain knowledge from data, and ultimately to reduce the number of attributes (also called features). The first dataset we will use is the msq dataset from the psych package. The motivational state questionnaire (msq) dataset is composed of 92 attributes, of which 72 are ratings of adjectives by 3,896 participants describing their mood. We will only use these 72 attributes for our current purpose, which is exploring the structure of the questionnaire. We will therefore start by installing and loading the package and the data, and assigning the 72 attributes of interest to an object called motiv:

# Install and load the psych package, then load the msq data
install.packages("psych")
library(psych)
data(msq)

# Keep only the 72 adjective-rating attributes
motiv <- msq[, 1:72]

Dealing with missing values

Missing values are a common problem in real-life datasets, such as the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing...

Summary


In this chapter, we examined how PCA works. We briefly discussed how to deal with a dataset in which many values are missing on some attributes. We examined how to determine the adequate number of components and the proportion of variance they explain. We also saw how to give meaningful names to the components. Finally, we began examining linear relationships between attributes using correlations. In the next chapter, we will discuss association rules with apriori.
