You're reading from  Mastering Predictive Analytics with Python

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785882715
Edition: 1st Edition
Author (1)
Joseph Babcock

Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.

Scaling out with PySpark – predicting year of song release


To close, let us look at another example using PySpark. The dataset is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), and the goal is to predict the year of a song's release from the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As in our earlier clustering example, we also define a parsing function with a try...except block, so that a single malformed line does not cause the whole job to fail on a large dataset:

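>>> def parse_line(l):
...     # Split one comma-separated record into a list of string fields
...     try:
...         return l.split(",")
...     except Exception:
...         # Report the bad line; the function then returns None for it
...         print("error in processing {0}".format(l))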
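To see how the parsing function behaves on good and bad input, here is a minimal sketch that runs it locally, without Spark. The sample records are made up for illustration and are not taken from the dataset, the list comprehension is only a plain-Python stand-in for applying the function to every line, and the final filtering step is an assumption about how one might discard lines that failed to parse.

```python
def parse_line(line):
    """Split one comma-separated record; return None if it cannot be parsed."""
    try:
        return line.split(",")
    except Exception:
        print("error in processing {0}".format(line))
        return None


# Hypothetical sample records (made up for illustration, not from the dataset).
# The None entry simulates a record that cannot be split.
sample_lines = ["2001,0.5,1.25", None, "1987,0.1,0.75"]

# Plain-Python stand-in for applying the function to every line.
parsed = [parse_line(line) for line in sample_lines]

# Drop the records that failed to parse (assumed cleanup step).
clean = [fields for fields in parsed if fields is not None]
print(clean)
```

Note that the fields come back as strings, so any numeric conversion would still need to happen in a later step.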

We then use this function to map each line to the parsed format, which splits the comma...
