Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data
In this recipe, we will explore Spark's implementation of expectation maximization (EM) via GaussianMixture(), which estimates the maximum-likelihood parameters of the model from a set of input feature vectors. The model assumes a Gaussian mixture in which each point is drawn from one of K Gaussian sub-distributions (the cluster memberships).
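Concretely, the density EM fits is the standard weighted sum of K multivariate Gaussians (textbook notation, not taken from the recipe itself):

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1

EM alternates between computing each point's soft responsibility for every component (E-step) and re-estimating the weights \pi_k, means \mu_k, and covariances \Sigma_k from those responsibilities (M-step), until the log-likelihood converges.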
How to do it...
Start a new project in IntelliJ or an IDE of your choice. Make sure the necessary JAR files are included.
Set up the package location where the program will reside:
package spark.ml.cookbook.chapter8
Import the necessary packages for vector and matrix manipulation:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession
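The Logger import is typically used right after the imports to cut Spark's verbose INFO output down to errors. The following two lines are a common pattern in recipes like this one; they are an assumption here, since the original step is not shown:

// Reduce Spark's console logging so the recipe's output stays readable.
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)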
Create Spark's session object:
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("myGaussianMixture")
  .config("spark.sql.warehouse.dir", ...) // warehouse path truncated in the source; supply a local directory
  .getOrCreate()
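The recipe's remaining steps (loading data, training, and inspecting the model) are not shown above. As a minimal sketch of how the imported pieces fit together, assuming a hypothetical toy two-cluster dataset and K=2 (neither taken from the book):

// Toy two-cluster data (hypothetical; not the recipe's actual input).
val data = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 2.0), Vectors.dense(1.1, 2.1), Vectors.dense(0.9, 1.9),
  Vectors.dense(9.0, 8.0), Vectors.dense(9.2, 8.1), Vectors.dense(8.9, 7.8)
))

// Fit a 2-component Gaussian mixture with EM.
val model = new GaussianMixture()
  .setK(2)               // number of Gaussian sub-distributions
  .setMaxIterations(100) // cap on EM iterations
  .run(data)

// Print the learned mixing weight, mean, and covariance of each component.
model.weights.zip(model.gaussians).zipWithIndex.foreach { case ((w, g), i) =>
  println(s"Component $i: weight=$w, mu=${g.mu}, sigma=\n${g.sigma}")
}

spark.stop()

The key call is run(), which returns a GaussianMixtureModel exposing weights, gaussians (each carrying mu and sigma), and a predict() method for assigning new points to components.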