Machine Learning with R


eBook: $32.99  $28.04 (save 15%)
Formats: PDF, PacktLib, ePub and Mobi
Print + free eBook + free PacktLib access to the book: $87.98  $54.99 (save 37%)
Free shipping to the UK, US, Europe and selected countries in Asia.
  • Harness the power of R for statistical computing and data science
  • Use R to apply common machine learning algorithms with real-world applications
  • Prepare, examine, and visualize data for analysis
  • Understand how to choose between machine learning models
  • Packed with clear instructions to explore, forecast, and classify data

Book Details

Language : English
Paperback : 396 pages [ 235mm x 191mm ]
Release Date : October 2013
ISBN : 1782162143
ISBN 13 : 9781782162148
Author(s) : Brett Lantz
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source

Table of Contents

Preface
Chapter 1: Introducing Machine Learning
Chapter 2: Managing and Understanding Data
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
Chapter 6: Forecasting Numeric Data – Regression Methods
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
Chapter 9: Finding Groups of Data – Clustering with k-means
Chapter 10: Evaluating Model Performance
Chapter 11: Improving Model Performance
Chapter 12: Specialized Machine Learning Topics
Index
  • Chapter 1: Introducing Machine Learning
    • The origins of machine learning
    • Uses and abuses of machine learning
      • Ethical considerations
    • How do machines learn?
      • Abstraction and knowledge representation
      • Generalization
      • Assessing the success of learning
    • Steps to apply machine learning to your data
    • Choosing a machine learning algorithm
      • Thinking about the input data
      • Thinking about types of machine learning algorithms
      • Matching your data to an appropriate algorithm
    • Using R for machine learning
      • Installing and loading R packages
        • Installing an R package
        • Installing a package using the point-and-click interface
        • Loading an R package
    • Summary
  • Chapter 2: Managing and Understanding Data
    • R data structures
      • Vectors
      • Factors
      • Lists
      • Data frames
      • Matrixes and arrays
    • Managing data with R
      • Saving and loading R data structures
      • Importing and saving data from CSV files
      • Importing data from SQL databases
    • Exploring and understanding data
      • Exploring the structure of data
      • Exploring numeric variables
        • Measuring the central tendency – mean and median
        • Measuring spread – quartiles and the five-number summary
        • Visualizing numeric variables – boxplots
        • Visualizing numeric variables – histograms
        • Understanding numeric data – uniform and normal distributions
        • Measuring spread – variance and standard deviation
      • Exploring categorical variables
        • Measuring the central tendency – the mode
      • Exploring relationships between variables
        • Visualizing relationships – scatterplots
        • Examining relationships – two-way cross-tabulations
    • Summary
  • Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
    • Understanding classification using nearest neighbors
      • The kNN algorithm
        • Calculating distance
        • Choosing an appropriate k
        • Preparing data for use with kNN
      • Why is the kNN algorithm lazy?
    • Diagnosing breast cancer with the kNN algorithm
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Transformation – normalizing numeric data
        • Data preparation – creating training and test datasets
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
        • Transformation – z-score standardization
        • Testing alternative values of k
    • Summary
  • Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
    • Understanding naive Bayes
      • Basic concepts of Bayesian methods
        • Probability
        • Joint probability
        • Conditional probability with Bayes' theorem
      • The naive Bayes algorithm
        • The naive Bayes classification
        • The Laplace estimator
        • Using numeric features with naive Bayes
    • Example – filtering mobile phone spam with the naive Bayes algorithm
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Data preparation – processing text data for analysis
        • Data preparation – creating training and test datasets
        • Visualizing text data – word clouds
        • Data preparation – creating indicator features for frequent words
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
    • Summary
  • Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
    • Understanding decision trees
      • Divide and conquer
      • The C5.0 decision tree algorithm
        • Choosing the best split
        • Pruning the decision tree
    • Example – identifying risky bank loans using C5.0 decision trees
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Data preparation – creating random training and test datasets
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
        • Boosting the accuracy of decision trees
        • Making some mistakes more costly than others
    • Understanding classification rules
      • Separate and conquer
      • The One Rule algorithm
      • The RIPPER algorithm
      • Rules from decision trees
    • Example – identifying poisonous mushrooms with rule learners
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
    • Summary
  • Chapter 6: Forecasting Numeric Data – Regression Methods
    • Understanding regression
      • Simple linear regression
      • Ordinary least squares estimation
      • Correlations
      • Multiple linear regression
    • Example – predicting medical expenses using linear regression
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Exploring relationships among features – the correlation matrix
        • Visualizing relationships among features – the scatterplot matrix
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
        • Model specification – adding non-linear relationships
        • Transformation – converting a numeric variable to a binary indicator
        • Model specification – adding interaction effects
        • Putting it all together – an improved regression model
    • Understanding regression trees and model trees
      • Adding regression to trees
    • Example – estimating the quality of wines with regression trees and model trees
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
      • Step 3 – training a model on the data
        • Visualizing decision trees
      • Step 4 – evaluating model performance
        • Measuring performance with mean absolute error
      • Step 5 – improving model performance
    • Summary
  • Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
    • Understanding neural networks
      • From biological to artificial neurons
      • Activation functions
      • Network topology
        • The number of layers
        • The direction of information travel
        • The number of nodes in each layer
      • Training neural networks with backpropagation
    • Modeling the strength of concrete with ANNs
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
    • Understanding Support Vector Machines
      • Classification with hyperplanes
      • Finding the maximum margin
        • The case of linearly separable data
        • The case of non-linearly separable data
      • Using kernels for non-linear spaces
    • Performing OCR with SVMs
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
    • Summary
  • Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
    • Understanding association rules
      • The Apriori algorithm for association rule learning
        • Measuring rule interest – support and confidence
        • Building a set of rules with the Apriori principle
    • Example – identifying frequently purchased groceries with association rules
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Data preparation – creating a sparse matrix for transaction data
        • Visualizing item support – item frequency plots
        • Visualizing transaction data – plotting the sparse matrix
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
        • Sorting the set of association rules
        • Taking subsets of association rules
        • Saving association rules to a file or data frame
    • Summary
  • Chapter 9: Finding Groups of Data – Clustering with k-means
    • Understanding clustering
      • Clustering as a machine learning task
      • The k-means algorithm for clustering
        • Using distance to assign and update clusters
        • Choosing the appropriate number of clusters
    • Finding teen market segments using k-means clustering
      • Step 1 – collecting data
      • Step 2 – exploring and preparing the data
        • Data preparation – dummy coding missing values
        • Data preparation – imputing missing values
      • Step 3 – training a model on the data
      • Step 4 – evaluating model performance
      • Step 5 – improving model performance
    • Summary
  • Chapter 10: Evaluating Model Performance
    • Measuring performance for classification
      • Working with classification prediction data in R
      • A closer look at confusion matrices
      • Using confusion matrices to measure performance
      • Beyond accuracy – other measures of performance
        • The kappa statistic
        • Sensitivity and specificity
        • Precision and recall
        • The F-measure
      • Visualizing performance tradeoffs
        • ROC curves
    • Estimating future performance
      • The holdout method
      • Cross-validation
      • Bootstrap sampling
    • Summary
  • Chapter 11: Improving Model Performance
    • Tuning stock models for better performance
      • Using caret for automated parameter tuning
        • Creating a simple tuned model
        • Customizing the tuning process
    • Improving model performance with meta-learning
      • Understanding ensembles
      • Bagging
      • Boosting
      • Random forests
        • Training random forests
        • Evaluating random forest performance
    • Summary
  • Chapter 12: Specialized Machine Learning Topics
    • Working with specialized data
      • Getting data from the Web with the RCurl package
      • Reading and writing XML with the XML package
      • Reading and writing JSON with the rjson package
      • Reading and writing Microsoft Excel spreadsheets using xlsx
      • Working with bioinformatics data
      • Working with social network data and graph data
    • Improving the performance of R
      • Managing very large datasets
        • Making data frames faster with data.table
        • Creating disk-based data frames with ff
        • Using massive matrices with bigmemory
      • Learning faster with parallel computing
        • Measuring execution time
        • Working in parallel with foreach
        • Using a multitasking operating system with multicore
        • Networking multiple workstations with snow and snowfall
        • Parallel cloud computing with MapReduce and Hadoop
      • GPU computing
      • Deploying optimized learning algorithms
        • Building bigger regression models with biglm
        • Growing bigger and faster random forests with bigrf
        • Training and evaluating models in parallel with caret
    • Summary

                          Brett Lantz

                          Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data.

                          Code Downloads

                          Download the code and support files for this book.


                          Submit Errata

Please let us know if you have found any errors not already on this list by completing our errata submission form. Our editors will check them and add them to the list. Thank you.


                          Errata

                          - 9 submitted: last submission 30 Jul 2014

                          Errata type: grammar | Page number: 39

                          The load() command will recreate any data structures already saved that were to an .RData file.
                          Should be
                          The load() command will recreate any data structures that were already saved to an .RData file.
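For readers following along, the corrected behavior is easy to demonstrate with a short base-R sketch (the object name and file path here are illustrative, not from the book):

```r
# save() writes R objects to an .RData file; load() recreates them by name
income <- c(35000, 45000, 55000)
f <- tempfile(fileext = ".RData")
save(income, file = f)

rm(income)      # remove the object from the workspace
load(f)         # recreates the data structure that was saved to the file
income          # [1] 35000 45000 55000
```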

                          Errata type: Technical | Page number: 45

                          For example, to calculate the mean income in a group of three people with incomes of $35,000, $45,000, and $55,000 we could type:
                          Should be
                          For example, to calculate the mean income in a group of three people with incomes of $36,000, $44,000, and $56,000 we could type:
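With the corrected figures, the calculation looks like this in base R (a sketch; the book's own code for this passage is not reproduced here):

```r
# Mean income of three people earning $36,000, $44,000, and $56,000
income <- mean(c(36000, 44000, 56000))
income    # (36000 + 44000 + 56000) / 3 = 45333.33
```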

                          Errata type: Typo | Page number: 51

                          For example, recall that the IQR for the price variable was 3909...
                          Should be
                          For example, recall that the IQR for the price variable was 3905...
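The IQR itself is computed in R with the IQR() function; since the book's price data isn't reproduced here, a toy vector illustrates the idea:

```r
# IQR() is the difference between the 75th and 25th percentiles
prices <- c(2000, 3500, 5000, 6500, 8000)    # illustrative values only
quantile(prices, probs = c(0.25, 0.75))      # 3500 and 6500
IQR(prices)                                  # 6500 - 3500 = 3000
```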

                          Errata type: Grammar | Page number: 14

                          Toward this end, the algorithm will employ heuristics, or educated guesses about the where to find the most important concepts.
                          Should be
                          Toward this end, the algorithm will employ heuristics, or educated guesses about where to find the most important concepts.

                          Errata type: Grammar | Page number: 105

                          If we print() the corpus we just created, we will see that it ...
                          Should be
If we print the corpus we just created using the print() function, ...

                          Errata type: Typo | Page number: 27

                          spam message might use the phrase "free ringtones."
                          Should be
                          spam message might use the phrase "free ringtones".

                          Errata type: Typo | Page number: 80

                          Understanding Data, we will split the wcbd_n data frame into the wbcd_train and wbcd_test data frames:
                          Should be
                          Understanding Data, we will split the wbcd_n data frame into the wbcd_train and wbcd_test data frames:

                          Errata type: code | Page number: 41

                          > mydb <- odbcConnect("my_dsn", uid = "my_username" pwd = "my_password")
                          Should be
                          > mydb <- odbcConnect("my_dsn", uid = "my_username", pwd = "my_password")

                          Errata type: Technical | Page number: 93

                          The line : We know that 20 percent of all messages were spam (the left circle), and 5 percent of all messages contained spam (the right circle).
                          Should be
                          We know that 20 percent of all messages were spam (the left circle) and 5 percent of all messages contained the word Viagra (the right circle).
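The corrected sentence describes the two circles of a Venn diagram; plugging numbers into Bayes' theorem shows how such figures combine. Note that the 4 percent joint probability below is an assumption made for illustration, not a value from the book:

```r
# P(spam) = 20%, P(Viagra) = 5%, plus an ASSUMED joint probability of 4%
p_spam   <- 0.20
p_viagra <- 0.05
p_both   <- 0.04                           # hypothetical: spam AND contains "Viagra"
p_spam_given_viagra <- p_both / p_viagra   # Bayes: P(spam | Viagra) = P(both) / P(Viagra)
p_spam_given_viagra                        # 0.8
```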

Errata type: Technical | Page number: 132

                          The line: To better understand how this function works, note that order(c(0.5, 0.25, 0.75, 0.1)) returns the sequence 4 1 2 3 because the smallest number (0.1) appears fourth, the second smallest (0.25) appears first, and so on.
                          Should be
                          To better understand how this function works, note that order(c(0.5, 0.25, 0.75, 0.1)) returns the sequence 4 2 1 3 because the smallest number (0.1) appears fourth, the second smallest (0.25) appears second, and so on.
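The corrected behavior is easy to verify at the R console:

```r
# order() returns the positions that would sort the vector in ascending order
x <- c(0.5, 0.25, 0.75, 0.1)
order(x)       # 4 2 1 3: the smallest value (0.1) sits in position 4
x[order(x)]    # 0.10 0.25 0.50 0.75
```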

                          Errata type: Technical | Page number: 194

                          The line: Install the rpart package using the install.packages(rpart) command. It can then be loaded into your R session using the command library("rpart").
                          Should be
                          Install the rpart package using the install.packages("rpart") command. It can then be loaded into your R session using the command library(rpart).

Errata type: Technical | Page number: 178

                          In Line 4:
                          type install packages("psych")
                          Should be:
                          type install.packages("psych")

                          Errata type: Technical | Page number: 258

In the table on the page, the example code should be:
groceryrules <- apriori(groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
inspect(groceryrules[1:3])

                          Errata type: Technical | Page number: 232

The kernel function provided below the table is incorrect. (The original and corrected formulas appeared as images in the errata entry and are not reproduced here.)
                          Errata type: Technical | Page number: 155

                          The columns in the table indicate the true class of the mushroom while the rows in the table indicate the predicted values.

                          Should be:
                          The rows in the table indicate the true class of the mushroom while the columns in the table indicate the predicted values.

                          The 120 values in the lower-left corner indicate mushrooms that are actually edible but were classified as poisonous. On the other hand, there were zero mushrooms that were poisonous but erroneously classified as edible.

                          Should be:

                          The 120 values in the lower-left corner indicate mushrooms that are actually poisonous but were classified as edible. On the other hand, there were zero mushrooms that were edible but erroneously classified as poisonous.

Based on this information, it seems that our 1R rule actually plays it safe—if you avoid unappetizing smells when foraging for mushrooms, you will avoid eating any poisonous mushrooms. However, you might pass up some mushrooms that are actually edible. Considering that the learner utilized only a single feature, we did quite well; the publisher of the next field guide to mushrooms should be very happy. Still, let's see if we can add a few more rules and develop an even better classifier.

                          Should be:
                          According to our 1R rule, if you avoid unappetizing smells when foraging for mushrooms, you will almost always avoid a trip to the hospital. Considering the simplicity of this rule, the accuracy is quite surprising. However, the publisher of the field guide to mushrooms may not be happy about the fact that some of its readers may fall ill, not to mention the possibility of a lawsuit! Let's see if we can add a few more rules and develop an even better classifier.
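The corrected row/column orientation matches how base R's table() function builds a cross-tabulation: its first argument becomes the rows. A miniature example with made-up mushrooms (not the book's data):

```r
# With table(actual, predicted), rows are the true class, columns the prediction
actual    <- factor(c("edible", "edible", "poisonous", "poisonous", "poisonous"))
predicted <- factor(c("edible", "edible", "edible",    "poisonous", "poisonous"))
tab <- table(actual, predicted)
tab
# The "poisonous" row / "edible" column cell counts truly poisonous mushrooms
# that were classified as edible (1 in this toy example)
tab["poisonous", "edible"]
```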

                          Errata type: Technical | Page number: 320

Regarding the k-fold cross-validation, the book creates the training dataset with credit[folds$Fold01,] and the test dataset with credit[-folds$Fold01,]. These assignments should be swapped, since the training data must be 90 percent and the test data 10 percent in each of the k iterations. Similarly, in the function(x) used to create cv_results, the test and train assignments should be the opposite of those given in the book.
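The corrected indexing can be sketched in base R. The credit and folds objects below are artificial stand-ins (the book builds the folds with caret's createFolds() function):

```r
# A fold lists the rows held OUT for testing; negative indexing keeps the rest
credit <- data.frame(id = 1:100, amount = seq(100, 10000, length.out = 100))
folds  <- list(Fold01 = 1:10)            # stand-in for createFolds() output
credit_test  <- credit[folds$Fold01, ]   # 10 rows: the 10% test set
credit_train <- credit[-folds$Fold01, ]  # 90 rows: the 90% training set
c(train = nrow(credit_train), test = nrow(credit_test))
```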

                          Errata type: Grammar | Page number: 153

                          I you would like to model the relationship between the class y and predictors x1 and
                          x2, you would write the formula as: y ~ x1 + x2.

                          Should be:

                          If you would like to model the relationship between the class y and predictors x1 and
                          x2, you would write the formula as: y ~ x1 + x2.

                          Errata type: Technical | Page number: 133

                          If your results do not match exactly with the previous ones,
                          ensure that you run the command set.seed(214805)
                          immediately prior to creating the  credit_rand data frame.

                          Should be:

                          If your results do not match exactly with the previous ones,
                          ensure that you run the command set.seed(12345)
                          immediately prior to creating the  credit_rand data frame.
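Why the exact seed value matters can be seen in a couple of lines: the same seed reproduces the same random ordering, which is what keeps your results in sync with the book's:

```r
# Identical seeds produce identical "random" shuffles
set.seed(12345)
first_run <- sample(10)
set.seed(12345)
second_run <- sample(10)
identical(first_run, second_run)   # TRUE
```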

                          Errata type: Technical | Chapter 4

                          The "tm" package used in this chapter has been updated since the book was written. For version 0.5-9 of the tm package, the code is correct. However, for the most recent 0.5-10 version of tm (2014-01-13), the Dictionary() function is no longer needed.

                          The code can therefore be revised:
                          > sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))
                          Should be:
                          > sms_dict <- findFreqTerms(sms_dtm_train, 5) 

                          Errata type: Code | Page number: 112

                          x <- factor(x, levels = c(0, 1), labels = c(""No"", ""Yes""))

                          should be:

                          x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
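The corrected call runs cleanly in base R; a quick check with a toy 0/1 vector:

```r
# Map a 0/1 indicator onto "No"/"Yes" factor labels
x <- c(0, 1, 1, 0)
x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
x          # No Yes Yes No
table(x)   # No: 2, Yes: 2
```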

                          Errata type: typo | Page number: 141

                          classiifed

                          Should be:

                          classified

                          Sample chapters

                          You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.


                          What you will learn from this book

                          • Understand the basic terminology of machine learning and how to differentiate among various machine learning approaches
                          • Use R to prepare data for machine learning
                          • Explore and visualize data with R
                          • Classify data using nearest neighbor methods
                          • Learn about Bayesian methods for classifying data
                          • Predict values using decision trees, rules, and support vector machines
                          • Forecast numeric values using linear regression
                          • Model data using neural networks
                          • Find patterns in data using association rules for market basket analysis
                          • Group data into clusters for segmentation
                          • Evaluate and improve the performance of machine learning models
                          • Learn specialized machine learning techniques for text mining, social network data, and “big” data

                          In Detail

                          Machine learning, at its core, is concerned with transforming data into actionable knowledge. This fact makes machine learning well-suited to the present-day era of "big data" and "data science". Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start applying machine learning. Whether you are new to data science or a veteran, machine learning with R offers a powerful set of methods for quickly and easily gaining insight from your data.

"Machine Learning with R" is a practical tutorial that uses hands-on examples to step through real-world applications of machine learning. Without shying away from the technical details, we will explore machine learning with R using clear and practical examples, well suited both to beginners and to readers with prior experience, showing how R can be used to answer practical questions about your data.

                          How can we use machine learning to transform data into action? Using practical examples, we will explore how to prepare data for analysis, choose a machine learning method, and measure the success of the process.

                          We will learn how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.

                          "Machine Learning with R" will provide you with the analytical tools you need to quickly gain insight from complex data.

                          Approach

Written as a tutorial to explore and understand the power of R for machine learning, this practical guide covers all of the need-to-know topics in a systematic way. For each machine learning approach, each step in the process is detailed, from preparing the data for analysis to evaluating the results. These steps will build the knowledge you need to apply the methods to your own data science tasks.

                          Who this book is for

Intended for those who want to learn how to use R's machine learning capabilities and gain insight from their data. Perhaps you already know a bit about machine learning but have never used R; or perhaps you know a little R but are new to machine learning. In either case, this book will get you up and running quickly. It is helpful to have a bit of familiarity with basic programming concepts, but no prior experience is required.
