You're reading from  Learning Spark SQL

Product type: Book
Published in: Sep 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785888359
Edition: 1st Edition

Implementing a Spark ML classification model


The first step in implementing a machine learning model is to perform exploratory data analysis (EDA) on the input data. This analysis typically involves visualizing the data with tools such as Zeppelin, assessing feature types (numeric or categorical), computing basic statistics, covariances, and correlation coefficients, creating pivot tables, and so on (for more details on EDA, see Chapter 3, Using Spark SQL for Data Exploration).
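To make the EDA quantities concrete, here is a minimal sketch in plain Python (standard library only). It is not Spark code and not the book's example: the toy rows, the column names, and the helper functions `feature_type`, `sample_covariance`, and `pearson_correlation` are all hypothetical and exist only to show what is being computed.

```python
import math
from collections import Counter

# Hypothetical toy rows; the column names and values are made up for illustration.
rows = [
    {"age": 20, "income": 20.0, "segment": "a"},
    {"age": 30, "income": 40.0, "segment": "b"},
    {"age": 40, "income": 60.0, "segment": "a"},
    {"age": 50, "income": 80.0, "segment": "b"},
]

def feature_type(values):
    """Classify a column as 'numeric' or 'categorical' from its Python types."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if numeric else "categorical"

def mean(xs):
    return sum(xs) / len(xs)

def sample_covariance(xs, ys):
    """Sample covariance, using the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_covariance(xs, ys) / math.sqrt(
        sample_covariance(xs, xs) * sample_covariance(ys, ys)
    )

# Turn the list of rows into a dict of columns.
columns = {name: [row[name] for row in rows] for name in rows[0]}

types = {name: feature_type(values) for name, values in columns.items()}
cov = sample_covariance(columns["age"], columns["income"])
corr = pearson_correlation(columns["age"], columns["income"])
segment_counts = Counter(columns["segment"])  # row count per category value

print(types)
print(round(cov, 2), round(corr, 2), dict(segment_counts))
```

In Spark, these quantities would normally come from DataFrame operations rather than hand-written loops. The sketch only shows the arithmetic behind them on a dataset small enough to check by hand.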

The next step involves data pre-processing and/or data munging operations. In almost all cases, real-world input data is not of high enough quality to be used in a model directly. Several transformations are required to convert the features from the source format to the final variables; for example, a categorical feature may need to be transformed into one binary variable per categorical value using one-hot encoding (for more details on data munging, see Chapter 4, Using Spark SQL for Data Munging).
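The one-hot encoding mentioned above can be sketched in a few lines of plain Python. This illustrates the transformation itself, not Spark's encoder: the function name `one_hot_encode` and the sample values are hypothetical, and the column order (sorted category names) is an arbitrary choice made for this sketch.

```python
def one_hot_encode(values):
    """Return the category list and one binary indicator row per input value."""
    categories = sorted(set(values))  # fixed column order, assumed for this sketch
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded

# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each input value maps to a row with exactly one 1, in the column for its category. Library implementations often differ in detail, for example by emitting sparse vectors or dropping one category to avoid redundant columns, so treat this as the idea, not as a drop-in replacement.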

Next is the feature...
