The trend line is a common feature of many business analyses. How much do purchases increase when ads are shown more often on a homepage? What is the average rating of videos on social media based on user age? What is the likelihood that a customer will buy a second product from your website if they bought their first more than 6 months ago? These sorts of questions can be answered by drawing a line representing the average change in our response (for example, purchases or ratings) as we vary the input (for example, user age or number of past purchases) based on historical data, and using it to extrapolate the response for future data (where we know only the input, not yet the output). Calculating this line is termed regression, based on the hypothesis that our observations are scattered around the true relationship between the two variables, and that on average future observations will regress toward (approach) the trend line between input...
You're reading from Mastering Predictive Analytics with Python
Ordinary Least Squares (OLS).
We will start with the simplest model of linear regression, in which we try to fit the best straight line through the data points we have available. Recall that the formula for linear regression is:
y = Xβ
Where y is a vector of n responses we are trying to predict, X is a vector of our input variable, also of length n, and β is the slope (how much the response y increases for each 1-unit increase in the value of X). However, we rarely have only a single input; rather, X will represent a set of input variables, and the response y is a linear combination of these inputs. In this case, known as multiple linear regression, X is a matrix of n rows (observations) and m columns (features), and β is a vector of slopes or coefficients which, when multiplied by the features, gives the output. In essence, it is just the trend line incorporating many inputs, but will also allow us to compare the magnitude of the effect of different inputs on the...
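The multiple linear regression setup described here can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the names fit_ols, true_beta, and beta_hat are hypothetical, and there is no intercept term, so the fit follows y = Xβ directly.

```python
import numpy as np

def fit_ols(X, y):
    """Return the least-squares estimate of beta for y ≈ X @ beta."""
    beta, _residuals, _rank, _singular_values = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic data (an assumption for illustration only):
# n = 200 observations (rows) and m = 3 features (columns).
rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_beta = np.array([2.0, -1.0, 0.5])             # hypothetical "true" slopes
y = X @ true_beta + rng.normal(scale=0.1, size=n)  # responses scattered around X @ beta

beta_hat = fit_ols(X, y)
print(beta_hat)  # one estimated coefficient per feature
```

Each entry of beta_hat estimates how much y changes per 1-unit increase in the corresponding column of X, which is what allows the effects of different inputs to be compared when the features are on comparable scales.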
In many datasets, the relationship between our inputs and output may not be a straight line. For example, consider the relationship between hour of the day and probability of posting on social media. If you were to draw a plot of this probability, it would likely increase during the evening and lunch break, and decline during the night, morning and workday, forming a sinusoidal wave pattern. A straight line fit to the hour of the day cannot represent this kind of relationship, as the value of the response does not consistently increase or decrease with the hour. What models, then, could we use to capture this relationship? In the specific case of time series models we could use approaches such as the Kalman filter described above, using the components of the structural time series equation to represent the cyclical 24-hour pattern of social media activity. In the following section we examine more general approaches that will apply both to time series data and to more general non-linear relationships...
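To make the limitation concrete, the sketch below compares a straight-line fit on the raw hour with a fit on sine and cosine transformations of the hour. Note that this is a simple feature-transformation trick chosen for illustration, not the Kalman filter or the approaches covered in the following section. The data is synthetic, with a single assumed evening peak, and the names r_squared, r2_line, and r2_cycle are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Fit y ≈ X @ beta by least squares and return the R^2 of that fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
hour = np.tile(np.arange(24), 20).astype(float)  # 20 synthetic days of hourly data

# Hypothetical posting probability: one 24-hour cycle peaking at 18:00, plus noise.
y = 0.5 + 0.3 * np.sin(2 * np.pi * (hour - 12) / 24)
y = y + rng.normal(scale=0.05, size=hour.size)

ones = np.ones_like(hour)
angle = 2 * np.pi * hour / 24

X_line = np.column_stack([ones, hour])                          # intercept + raw hour
X_cycle = np.column_stack([ones, np.sin(angle), np.cos(angle)]) # intercept + cyclic terms

r2_line = r_squared(X_line, y)
r2_cycle = r_squared(X_cycle, y)
print("straight line R^2:", round(r2_line, 3))
print("sine/cosine   R^2:", round(r2_cycle, 3))
```

The second model is still linear in its coefficients; only the inputs have been transformed. A pattern with several peaks per day, such as both lunch and evening, would need additional terms or a more flexible model.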
To close, let us look at another example using PySpark. With this dataset, which is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), the goal is to predict the year of a song's release based on the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As before in our clustering example, we also define a parsing function with a try…except block so that we do not fail on a single error in a large dataset:
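>>> def parse_line(l):
...     # Split one comma-separated record into a list of string fields
...     try:
...         return l.split(",")
...     except Exception:
...         # Report the bad record instead of failing the whole job
...         print("error in processing {0}".format(l))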
We then use this function to map each line to the parsed format, which splits the comma...
In this chapter, we examined the fitting of several regression models, including transforming input variables to the correct scale and accounting for categorical features correctly. In interpreting the coefficients of these models, we examined cases where the classical assumptions of linear regression hold and cases where they are violated. In the latter cases, we examined generalized linear models, GEE, mixed effects models, and time series models as alternative choices for our analyses. In the process of trying to improve the accuracy of our regression model, we fit both simple and regularized linear models. We also examined the use of tree-based regression models and how to optimize parameter choices in fitting them. Finally, we examined an example of using a random forest in PySpark, which can be applied to larger datasets.
In the next chapter, we will examine data that has a discrete categorical outcome, instead of a continuous response. In the process, we will examine in more detail how...