Apache Spark covers a wide spectrum of machine learning algorithms. In Spark 2.0.0, these algorithms live in the org.apache.spark.ml package for Scala and Java, and in the pyspark.ml package for Python.
Tip
Prior to Spark 2.0, the machine learning libraries were centered on the RDD-based org.apache.spark.mllib and pyspark.mllib packages. As of 2.0, those MLlib APIs are in maintenance mode, so you should use the DataFrame-based ML APIs instead. In this chapter, we will do so, with clarifying notes wherever needed.
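The package split can be seen directly in the import paths. A minimal Scala sketch (the specific classes chosen here are just illustrative examples from each package):

```scala
// New DataFrame-based API (preferred since 2.0): org.apache.spark.ml
import org.apache.spark.ml.feature.VectorAssembler

// Old RDD-based API (maintenance mode since 2.0): org.apache.spark.mllib
import org.apache.spark.mllib.stat.Statistics
```

Note that a few features discussed below, such as RDD-based summary statistics and stratified sampling, still come from the mllib side in Spark 2.0.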
The following table summarizes the machine learning algorithms and data transformation features available in Spark 2.0.0:
| Algorithm | Feature | Notes |
| --- | --- | --- |
| Basic statistics | Summary statistics | Here mean, stdev, count, max, min, and numNonzeros are all part of MultivariateStatisticalSummary, returned by Statistics.colStats |
| Basic statistics | Correlations and covariance | Here, sql.functions such as corr, covar_pop, and covar_samp are invoked on DataFrame columns; DataFrame.stat offers corr and cov as well |
| Basic statistics | Stratified sampling | This provides two methods, sampleByKey and sampleByKeyExact, on key-value RDDs |
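The basic statistics features in the table above can be sketched in a few lines of Scala. This is a minimal sketch assuming an existing SparkContext named sc (for example, spark.sparkContext); the sample values are made up for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Summary statistics over an RDD of vectors:
// colStats returns a MultivariateStatisticalSummary with
// mean, variance, count, max, min, and numNonzeros per column.
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))
val summary = Statistics.colStats(observations)
println(summary.mean)         // per-column means
println(summary.variance)     // per-column variances (stdev = sqrt)
println(summary.numNonzeros)  // per-column non-zero counts

// Correlation between two RDDs of doubles (Pearson by default;
// "spearman" is also supported).
val x = sc.parallelize(Seq(1.0, 2.0, 3.0))
val y = sc.parallelize(Seq(10.0, 20.0, 30.0))
println(Statistics.corr(x, y, "pearson"))

// Stratified sampling on a key-value RDD: one sampling fraction
// per key. sampleByKeyExact costs more but hits the requested
// sample sizes exactly.
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val fractions = Map("a" -> 0.5, "b" -> 1.0)
val approx = data.sampleByKey(withReplacement = false, fractions)
val exact  = data.sampleByKeyExact(withReplacement = false, fractions)
```

In Spark 2.0 these particular helpers are still part of the RDD-based mllib API, which is why the imports above come from org.apache.spark.mllib rather than org.apache.spark.ml.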