Chapter 6. Introducing the ML Package
In the previous chapter, we worked with the MLlib package in Spark, which operates strictly on RDDs. In this chapter, we move to the ML package, which operates strictly on DataFrames. According to the Spark documentation, the DataFrame-based set of models contained in the spark.ml package is now the primary machine learning API for Spark.
So, let's get to it!
Note
In this chapter, we will reuse a portion of the dataset we played with in the previous chapter. The data can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_transformed.csv.gz.
In this chapter, you will learn how to do the following:
Prepare transformers, estimators, and pipelines
Predict the chances of infant survival using models available in the ML package
Evaluate the performance of the model
Perform hyperparameter tuning
Use other machine-learning models available in the package