Packt+ | Advance your knowledge in tech

You're reading from Mastering Predictive Analytics with Python

Product typeBook

Published inAug 2016

Reading LevelIntermediate

Publisher

ISBN-139781785882715

Edition1st Edition

Languages

Python

Concepts

Predictive Analytics

Author (1)

Joseph Babcock

Chapter 4. Connecting the Dots with Models – Regression Methods

The trend line is a common feature of many business analyses. How much do purchases increase when ads are shown more often on a homepage? What is the average rating of videos on social media based on user age? What is the likelihood that a customer will buy a second product from your website if they bought their first more than 6 months ago? These sorts of questions can be answered by drawing a line representing the average change in our response (for example, purchases or ratings) as we vary the input (for example, user age or amount of past purchases) based on historical data, and using it to extrapolate the response for future data (where we only know the input, but not output yet). Calculating this line is termed regression, based on the hypothesis that our observations are scattered around the true relationship between the two variables, and on average future observations will regress (approach) the trend line between input...

Linear regression

Ordinary Least Squares (OLS).

We will start with the simplest model of linear regression, where we will simply try to fit the best straight line through the data points we have available. Recall that the formula for linear regression is:

Where y is a vector of n responses we are trying to predict, X is a vector of our input variable also of length n, and β is the slope response (how much the response y increases for each 1-unit increase in the value of X). However, we rarely have only a single input; rather, X will represent a set of input variables, and the response y is a linear combination of these inputs. In this case, known as multiple linear regression, X is a matrix of n rows (observations) and m columns (features), and β is a vector set of slopes or coefficients which, when multiplied by the features, gives the output. In essence, it is just the trend line incorporating many inputs, but will also allow us to compare the magnitude effect of different inputs on the...

Tree methods

In many datasets, the relationship between our inputs and output may not be a straight line. For example, consider the relationship between hour of the day and probability of posting on social media. If you were to draw a plot of this probability, it would likely increase during the evening and lunch break, and decline during the night, morning and workday, forming a sinusoidal wave pattern. A linear model cannot represent this kind of relationship, as the value of the response does not strictly increase or decrease with the hour of the day. What models, then, could we use to capture this relationship? In the specific case of time series models we could use approaches such as the Kalman filter described above, using the components of the structural time series equation to represent the cyclical 24-hour pattern of social media activity. In the following section we examine more general approaches that will apply both to time series data and to more general non-linear relationships...

Scaling out with PySpark – predicting year of song release

To close, let us look at another example using PySpark. With this dataset, which is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), the goal is to predict the year of a song's release based on the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As before in our clustering example, we also define a parsing function with a try…catch block so that we do not fail on a single error in a large dataset:

>>> def parse_line(l):
…      try:
…            return l.split(",")
…    except:
…         print("error in processing {0}".format(l))

We then use this function to map each line to the parsed format, which splits the comma...

Summary

In this chapter, we examined the fitting of several regression models, including transforming input variables to the correct scale and accounting for categorical features correctly. In interpreting the coefficients of these models, we examined both cases where the classical assumptions of linear regression are fulfilled and broken. In the latter cases, we examined generalized linear models, GEE, mixed effects models, and time series models as alternative choices for our analyses. In the process of trying to improve the accuracy of our regression model, we fit both simple and regularized linear models. We also examined the use of tree-based regression models and how to optimize parameter choices in fitting them. Finally, we examined an example of using random forest in PySpark, which can be applied to larger datasets.

In the next chapter, we will examine data that has a discrete categorical outcome, instead of a continuous response. In the process, we will examine in more detail how...

The rest of the chapter is locked

You have been reading a chapter from

Mastering Predictive Analytics with Python

Published in: Aug 2016Publisher: ISBN-13: 9781785882715

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Joseph Babcock

Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.
Read more about Joseph Babcock

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages