
You're reading from Mastering Predictive Analytics with Python

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785882715
Edition: 1st Edition

Author: Joseph Babcock

Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.

Chapter 4. Connecting the Dots with Models – Regression Methods

The trend line is a common feature of many business analyses. How much do purchases increase when ads are shown more often on a homepage? What is the average rating of videos on social media based on user age? What is the likelihood that a customer will buy a second product from your website if they bought their first more than 6 months ago? These sorts of questions can be answered by drawing a line representing the average change in our response (for example, purchases or ratings) as we vary the input (for example, user age or number of past purchases) based on historical data, and using it to extrapolate the response for future data (where we only know the input, but not yet the output). Calculating this line is termed regression, based on the hypothesis that our observations are scattered around the true relationship between the two variables, and that on average future observations will regress toward (approach) the trend line between input...
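To make this concrete, here is a minimal sketch of fitting a trend line to historical data and extrapolating it to a future input. The numbers (ad impressions and purchases) are made up for illustration; numpy's polyfit is used as the least-squares fitter:

```python
import numpy as np

# Hypothetical historical data: ad impressions shown (input) and purchases (response)
impressions = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
purchases = np.array([12.0, 25.0, 31.0, 44.0, 52.0])

# Fit a degree-1 polynomial (a straight trend line) by least squares
slope, intercept = np.polyfit(impressions, purchases, deg=1)

# Extrapolate the average response for a future input we have not yet observed
predicted = slope * 60.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))  # 0.99 3.1 62.5
```

The slope answers the first question above directly: each additional impression is associated, on average, with about one additional purchase in this toy data.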

Linear regression


The classical approach to fitting such a line is the method of Ordinary Least Squares (OLS).

We will start with the simplest model of linear regression, where we simply try to fit the best straight line through the data points we have available. Recall that the formula for linear regression is:

y = Xβ + ε

Where y is a vector of the n responses we are trying to predict, X is a vector of our input variable, also of length n, β is the slope (how much the response y increases for each 1-unit increase in the value of X), and ε is a vector of errors scattered around the line. However, we rarely have only a single input; rather, X will represent a set of input variables, and the response y is a linear combination of these inputs. In this case, known as multiple linear regression, X is a matrix of n rows (observations) and m columns (features), and β is a vector of slopes or coefficients which, when multiplied by the features, gives the output. In essence, it is just the trend line incorporating many inputs, but it will also allow us to compare the magnitude of the effect of different inputs on the...
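As a sketch of how the multiple regression coefficients can be estimated, the example below solves the least-squares problem directly with numpy. The design matrix and coefficients are fabricated so that the true answer is known (y is built noise-free from β = [1, 2, 3], with a leading column of ones acting as the intercept):

```python
import numpy as np

# Hypothetical design matrix X: n = 4 observations, m = 2 features,
# plus a leading column of ones for the intercept term
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 3.0]])

# Response constructed (for illustration) as y = 1 + 2*x1 + 3*x2, with no noise
y = X @ np.array([1.0, 2.0, 3.0])

# OLS: choose beta to minimize ||y - X beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1. 2. 3.]
```

Because the vector β puts each input's effect on a common footing, inspecting its entries is how we compare the magnitude of different inputs' effects on the response.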

Tree methods


In many datasets, the relationship between our inputs and output may not be a straight line. For example, consider the relationship between hour of the day and probability of posting on social media. If you were to draw a plot of this probability, it would likely increase during the evening and lunch break, and decline during the night, morning and workday, forming a sinusoidal wave pattern. A linear model cannot represent this kind of relationship, as the value of the response does not strictly increase or decrease with the hour of the day. What models, then, could we use to capture this relationship? In the specific case of time series models we could use approaches such as the Kalman filter described above, using the components of the structural time series equation to represent the cyclical 24-hour pattern of social media activity. In the following section we examine more general approaches that will apply both to time series data and to more general non-linear relationships...
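To illustrate the core mechanic behind tree methods without relying on any particular library, the sketch below uses a made-up hourly posting probability and finds the single threshold split that minimizes squared error; this is one node (a depth-1 regression tree), and deeper trees simply repeat the same search within each resulting segment:

```python
import numpy as np

# Made-up non-linear relationship: posting probability over the hours of the day,
# rising toward midday and falling overnight (non-monotonic in the input)
hours = np.arange(24, dtype=float)
prob = 0.2 + 0.15 * np.sin((hours - 6.0) * np.pi / 12.0)

def best_split(x, y):
    """Return the threshold minimizing summed squared error when predicting
    the mean of y on each side of the split (one regression-tree node)."""
    best_t, best_sse = None, np.inf
    for t in x[1:]:
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t, best_sse

threshold, sse = best_split(hours, prob)

# Even a single split improves on the one global mean a constant model would use
total_sse = ((prob - prob.mean()) ** 2).sum()
print(threshold, sse < total_sse)
```

Unlike a straight line, the piecewise-constant predictions of a tree are free to go up and then back down across the input range, which is exactly what the hour-of-day pattern requires.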

Scaling out with PySpark – predicting year of song release


To close, let us look at another example using PySpark. With this dataset, which is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), the goal is to predict the year of a song's release based on the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As before in our clustering example, we also define a parsing function with a try…except block so that we do not fail on a single error in a large dataset:

>>> def parse_line(l):
...     try:
...         return l.split(",")
...     except:
...         print("error in processing {0}".format(l))

We then use this function to map each line to the parsed format, which splits the comma...
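As a plain-Python stand-in for what the map and subsequent filter stages do (no Spark required; the raw rows here are invented, and None simulates a malformed record), the sketch below applies the parsing function to each row and drops records where parsing failed:

```python
# Plain-Python analogue of rdd.map(parse_line) followed by a filter on failures
def parse_line(l):
    try:
        return l.split(",")
    except:
        print("error in processing {0}".format(l))

# Hypothetical raw rows: "year,feature1,feature2"
lines = ["2001,0.884,0.610", "1999,0.129,0.457", None]

parsed = [parse_line(l) for l in lines]             # like rdd.map(parse_line)
clean = [row for row in parsed if row is not None]  # like rdd.filter(...)
print(len(clean))  # 2
```

Because parse_line returns None on failure rather than raising, a single bad record logs an error and is filtered out instead of aborting the whole distributed job.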

Summary


In this chapter, we examined the fitting of several regression models, including transforming input variables to the correct scale and accounting for categorical features correctly. In interpreting the coefficients of these models, we examined cases where the classical assumptions of linear regression are fulfilled as well as cases where they are violated. In the latter cases, we examined generalized linear models, GEE, mixed effects models, and time series models as alternative choices for our analyses. In the process of trying to improve the accuracy of our regression model, we fit both simple and regularized linear models. We also examined the use of tree-based regression models and how to optimize parameter choices in fitting them. Finally, we examined an example of using a random forest in PySpark, which can be applied to larger datasets.

In the next chapter, we will examine data that has a discrete categorical outcome, instead of a continuous response. In the process, we will examine in more detail how...
