CrossValidation and hyperparameter tuning
We will look at one example each of CrossValidation and hyperparameter tuning, starting with CrossValidation.
CrossValidation
As stated before, we've used the default parameters of the machine learning algorithm, and we don't know if they are a good choice. In addition, instead of simply splitting our data into training and testing sets, or training, testing, and validation sets, CrossValidation can be a better choice because it ensures that every data point is eventually used for both training and evaluation.
Note
CrossValidation splits the complete available training data into a number of folds; this parameter k can be specified. The whole Pipeline is then run once per fold: in each run, one fold is held out for evaluation and a machine learning model is trained on the remaining k-1 folds, so k models are trained in total. Finally, the evaluation results of these models are combined, typically by averaging the per-fold scores, which gives a more robust estimate of performance than a single train/test split.
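The procedure can be sketched in plain Python. This is a minimal illustration of k-fold splitting and score averaging; the helper names (`k_fold_indices`, `cross_validate`) and the toy "model" used in the usage example are assumptions for demonstration, not part of any particular library.

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k contiguous folds.

    The first n_samples % k folds get one extra element so that
    every sample lands in exactly one fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(data, k, train_fn, score_fn):
    """Train k models, evaluate each on the fold it did not see,
    and return the average of the k evaluation scores."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        # Training data is everything outside the held-out fold.
        train = [data[j] for f, fold in enumerate(folds)
                 if f != i for j in fold]
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / k


# Usage with a toy "model": the mean of the training values,
# scored by negative mean absolute error on the held-out fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_score = cross_validate(
    data, k=3,
    train_fn=lambda train: sum(train) / len(train),
    score_fn=lambda m, test: -sum(abs(m - t) for t in test) / len(test),
)
```

Note that every sample appears in exactly one held-out fold, which is precisely what makes CrossValidation use all available data for evaluation.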
The following figure illustrates ten-fold CrossValidation: