Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Machine Learning with R - Third Edition

You're reading from  Machine Learning with R - Third Edition

Product type Book
Published in Apr 2019
Publisher Packt
ISBN-13 9781788295864
Pages 458 pages
Edition 3rd Edition
Languages
Author (1):
Brett Lantz Brett Lantz
Profile icon Brett Lantz

Table of Contents (18) Chapters

Machine Learning with R - Third Edition
Contributors
Preface
Other Books You May Enjoy
Leave a review - let other readers know what you think
Introducing Machine Learning Managing and Understanding Data Lazy Learning – Classification Using Nearest Neighbors Probabilistic Learning – Classification Using Naive Bayes Divide and Conquer – Classification Using Decision Trees and Rules Forecasting Numeric Data – Regression Methods Black Box Methods – Neural Networks and Support Vector Machines Finding Patterns – Market Basket Analysis Using Association Rules Finding Groups of Data – Clustering with k-means Evaluating Model Performance Improving Model Performance Specialized Machine Learning Topics Index

Chapter 6. Forecasting Numeric Data – Regression Methods

Mathematical relationships help us to make sense of many aspects of everyday life. For example, body weight is a function of one's calorie intake; income is often related to education and job experience; and poll numbers help to estimate a presidential candidate's odds of being re-elected.

When such patterns are formulated with numbers, we gain additional clarity. For example, an additional 250 kilocalories consumed daily may result in nearly a kilogram of weight gain per month; each year of job experience may be worth an additional $1,000 in yearly salary; and a president is more likely to be re-elected when the economy is strong. Obviously, these equations do not perfectly fit every situation, but we expect that they are reasonably correct most of the time.

This chapter extends our machine learning toolkit by going beyond the classification methods covered previously and introducing techniques for estimating relationships among numeric...

Understanding regression


Regression involves specifying the relationship between a single numeric dependent variable (the value to be predicted) and one or more numeric independent variables (the predictors). As the name implies, the dependent variable depends upon the value of the independent variable or variables. The simplest forms of regression assume that the relationship between the independent and dependent variables follows a straight line.

Note

The origin of the term "regression" to describe the process of fitting lines to data is rooted in a study of genetics by Sir Francis Galton in the late 19th century. He discovered that fathers who were extremely short or tall tended to have sons whose heights were closer to the average height. He called this phenomenon "regression to the mean."

You might recall from basic algebra that lines can be defined in a slope-intercept form similar to y = a + bx. In this form, the letter y indicates the dependent variable and x indicates the independent...

Example – predicting medical expenses using linear regression


In order for a health insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care to its beneficiaries. Consequently, insurers invest a great deal of time and money to develop models that accurately forecast medical expenses for the insured population.

Medical expenses are difficult to estimate because the costliest conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.

The goal of this analysis is to use patient data to forecast the average medical care expenses for such population segments. These estimates could be used to create actuarial tables that set the price of yearly premiums higher or lower according to the expected treatment costs.

Step 1 – collecting data

For this analysis, we will...

Understanding regression trees and model trees


If you recall from Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, a decision tree builds a model, much like a flowchart, in which decision nodes, leaf nodes, and branches define a series of decisions that are used to classify examples. Such trees can also be used for numeric prediction by making only small adjustments to the tree growing algorithm. In this section, we will consider the ways in which trees for numeric prediction differ from trees used for classification.

Trees for numeric prediction fall into two categories. The first, known as regression trees, were introduced in the 1980s as part of the seminal classification and regression tree (CART) algorithm. Despite the name, regression trees do not use linear regression methods as described earlier in this chapter; rather, they make predictions based on the average value of examples that reach a leaf.

Note

The CART algorithm is described in detail in Classification...

Example – estimating the quality of wines with regression trees and model trees


Winemaking is a challenging and competitive business that offers the potential for great profit. However, there are numerous factors that contribute to the profitability of a winery. As an agricultural product, variables as diverse as the weather and the growing environment impact the quality of a varietal. The bottling and manufacturing can also affect the flavor for better or worse. Even the way the product is marketed, from the bottle design to the price point, can affect the customer's perception of taste.

As a consequence, the winemaking industry has invested heavily in data collection and machine learning methods that may assist with the decision science of winemaking. For example, machine learning has been used to discover key differences in the chemical composition of wines from different regions, and to identify the chemical factors that lead a wine to taste sweeter.

More recently, machine learning has...

Summary


In this chapter, we studied two methods for modeling numeric data. The first method, linear regression, involves fitting straight lines to data. The second method uses decision trees for numeric prediction. The latter comes in two forms: regression trees, which use the average value of examples at leaf nodes to make numeric predictions, and model trees, which build a regression model at each leaf node in a hybrid approach that is, in some ways, the best of both worlds.

We began to understand the utility of regression modeling by using it to investigate the causes of the Challenger space shuttle disaster. We then used linear regression modeling to calculate the expected medical costs for various segments of the population. Because the relationship between the features and the target variable are well-described by the estimated regression model, we were able to identify certain demographics, such as smokers and the obese, who may need to be charged higher insurance rates to cover the...

lock icon The rest of the chapter is locked
You have been reading a chapter from
Machine Learning with R - Third Edition
Published in: Apr 2019 Publisher: Packt ISBN-13: 9781788295864
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}