You're reading from Frank Kane's Taming Big Data with Apache Spark and Python

Product type: Book
Published in: Jun 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787287945
Edition: 1st Edition
Author (1)
Frank Kane

Frank Kane has spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers all the time. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaches others about big data analysis.

Introducing MLlib


If you're doing any real data science, data mining, or machine learning work with Spark, you're going to find the MLlib library very helpful. MLlib (machine learning library) is built on top of Spark and comes as part of the Spark package. It contains a number of useful machine learning and data mining algorithms, along with helper functions that you might find handy. Let's review what some of those are. When we're done, we'll actually use MLlib to generate movie recommendations for users using the MovieLens dataset again.

MLlib capabilities

The following is a list of the features MLlib offers. The library has built-in support for each of these techniques:

  • Feature extraction
    • Term Frequency / Inverse Document Frequency (TF-IDF), useful for search
  • Basic statistics
    • Chi-squared test, Pearson or Spearman correlation, min, max, mean, and variance
  • Linear regression and logistic regression
  • Support Vector Machines
  • Naïve Bayes classifier
  • Decision trees
  • K-Means clustering
  • Principal...
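
To make the first item in the list above more concrete, here is a minimal plain-Python sketch of what a TF-IDF score measures. It does not use Spark at all; MLlib has its own distributed implementation (the `HashingTF` and `IDF` classes in `pyspark.mllib.feature`). The function name `tf_idf`, the sample documents, and the whitespace tokenizer are all assumptions made for illustration, and the smoothed logarithm used here is one common choice of IDF formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (illustrative helper)."""
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # raw term frequency per document
        scores.append({
            # Smoothed IDF: terms found in many documents get a small weight.
            term: count * math.log((num_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return scores


# Hypothetical sample corpus.
docs = [
    "spark makes big data simple",
    "spark runs machine learning on big data",
    "movie recommendations with machine learning",
]
scores = tf_idf(docs)

# Print the three highest-scoring terms of the first document.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(score, 3))
```

A term such as "simple", which appears in only one document, scores higher than "spark", which appears in two. That is why TF-IDF is useful for search: it favors the words that distinguish a document from the rest of the collection.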