Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
In this recipe, we will explore Spark's implementation of expectation maximization (EM) via GaussianMixture(), which estimates the maximum-likelihood parameters of the model from a set of input feature vectors. The model assumes a Gaussian mixture in which each point is drawn from one of K Gaussian sub-distributions (the cluster memberships).
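Concretely, the density EM fits is the standard weighted sum of K multivariate Gaussians (textbook notation, not taken from the recipe itself):

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1

EM alternates between computing each point's soft responsibility for every component (E-step) and re-estimating the weights \pi_k, means \mu_k, and covariances \Sigma_k from those responsibilities (M-step), until the log-likelihood converges.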
How to do it...
Start a new project in IntelliJ or an IDE of your choice. Make sure the necessary JAR files are included.
Set up the package location where the program will reside:
package spark.ml.cookbook.chapter8
Import the necessary packages for vector and matrix manipulation:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession
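The Logger import is typically used right after the imports to cut Spark's verbose INFO output down to errors. The following two lines are a common pattern in recipes like this one; they are an assumption here, since the original step is not shown:

// Reduce Spark's console logging so the recipe's output stays readable.
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)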
Create Spark's session object:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myGaussianMixture")
  .config("spark.sql.warehouse.dir", ...) // warehouse path truncated in the source; supply a local directory
  .getOrCreate()
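The recipe's remaining steps (loading data, training, and inspecting the model) are not shown above. As a minimal sketch of how the imported pieces fit together, assuming a hypothetical toy two-cluster dataset and K=2 (neither taken from the book):

// Toy two-cluster data (hypothetical; not the recipe's actual input).
val data = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1), Vectors.dense(0.9, 1.9),
  Vectors.dense(9.0, 8.0), Vectors.dense(9.2, 8.1), Vectors.dense(8.9, 7.8)
))

// Fit a 2-component Gaussian mixture with EM.
val model = new GaussianMixture()
  .setK(2)               // number of Gaussian sub-distributions
  .setMaxIterations(100) // cap on EM iterations
  .run(data)

// Print the learned mixing weight, mean, and covariance of each component.
model.weights.zip(model.gaussians).zipWithIndex.foreach { case ((w, g), i) =>
  println(s"Component $i: weight=$w, mu=${g.mu}, sigma=\n${g.sigma}")
}

spark.stop()

The key call is run(), which returns a GaussianMixtureModel exposing weights, gaussians (each carrying mu and sigma), and a predict() method for assigning new points to components.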