In this section, we will have a quick look at Apache Mahout.
Do you know how Mahout got its name?
As you can see in the logo, a mahout is a person who rides and drives an elephant. Hadoop's logo is an elephant, so the name hints at Mahout's goal: to drive Hadoop in the right manner.
The following are the features of Mahout:
It is a project of the Apache Software Foundation
It is a scalable machine learning library
It mainly contains clustering, classification, and recommendation (collaborative filtering) algorithms
Machine learning algorithms can be executed sequentially (in-memory mode) or in distributed mode (with MapReduce enabled)
Most of the algorithms are implemented using the MapReduce paradigm
It runs on top of the Hadoop framework for scaling
Data is stored in HDFS (data storage) or in memory
It is a Java library (no user interface!)
The latest released version is 0.9, and 1.0 is coming soon
It is not a domain-specific but a general purpose library
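To make the recommendation (collaborative filtering) part of the list concrete, here is a minimal, self-contained sketch of user-based collaborative filtering in plain Java. This is only the idea behind Mahout's recommenders, not Mahout's actual API; the class and method names are illustrative.

```java
import java.util.*;

// Illustrative user-based collaborative filtering (NOT Mahout's API).
// Ratings are stored as: userId -> (itemId -> rating).
public class UserBasedCF {

    // Cosine similarity between two users, computed over co-rated items.
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double rb = b.get(e.getKey());
            if (rb != null) {
                dot += e.getValue() * rb;
                normA += e.getValue() * e.getValue();
                normB += rb * rb;
            }
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Predict a user's rating for an item as the similarity-weighted
    // average of the other users' ratings for that item.
    static double predict(Map<Integer, Map<Integer, Double>> ratings,
                          int user, int item) {
        Map<Integer, Double> target = ratings.get(user);
        double num = 0, den = 0;
        for (Map.Entry<Integer, Map<Integer, Double>> e : ratings.entrySet()) {
            if (e.getKey() == user) continue;
            Double r = e.getValue().get(item);
            if (r == null) continue;
            double sim = similarity(target, e.getValue());
            num += sim * r;
            den += Math.abs(sim);
        }
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        ratings.put(1, new HashMap<>(Map.of(10, 5.0, 11, 3.0)));
        ratings.put(2, new HashMap<>(Map.of(10, 5.0, 11, 3.0, 12, 4.0)));
        ratings.put(3, new HashMap<>(Map.of(10, 1.0, 12, 2.0)));
        // Estimate how user 1 would rate item 12 from users 2 and 3.
        System.out.println(predict(ratings, 1, 12));
    }
}
```

Mahout implements the same pattern (data model, similarity measure, recommender) at scale, either in memory or over Hadoop.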
Note
For those of you who are curious: what are the problems that Mahout is trying to solve? They are the following:
The amount of available data is growing drastically.
Computer hardware has shifted toward delivering performance through multicore processors. Machine learning algorithms are computationally expensive, yet there was no framework capable of harnessing the power of multicore machines to gain better performance.
A parallel programming framework was needed to speed up machine learning algorithms.
Mahout provides a general parallelization approach for machine learning algorithms (the parallelization method is not algorithm-specific).
No specialized optimizations are required to improve the performance of each algorithm; you just need to add some more cores.
Speedup is linear in the number of cores.
Each algorithm, such as Naïve Bayes, K-Means, and Expectation-maximization, is expressed in the summation form. (I will explain this in detail in future chapters.)
For more information, please read Map-Reduce for Machine Learning on Multicore, which can be found at http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf.
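The "summation form" idea from that paper can be sketched in a few lines: a statistic such as a mean (or a gradient) is written as a sum over the data, so each mapper computes a partial sum over its own split and a reducer combines the partials. The sketch below is a plain-Java illustration under that assumption; in Mahout the splits would come from HDFS blocks rather than in-memory arrays.

```java
import java.util.*;

// Sketch of the "summation form": a global statistic decomposes into
// per-partition partial sums (the map step) plus one combine (the reduce step).
public class SummationForm {

    // Map step: each worker sums its own slice of the data.
    static double[] partialSumAndCount(double[] slice) {
        double sum = 0;
        for (double x : slice) sum += x;
        return new double[] { sum, slice.length };
    }

    // Reduce step: combine the partial results into the global mean.
    static double combineMean(List<double[]> partials) {
        double sum = 0, count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // Two "splits" of the same dataset, as if read from separate HDFS blocks.
        List<double[]> partials = new ArrayList<>();
        partials.add(partialSumAndCount(new double[] { 1, 2, 3 }));
        partials.add(partialSumAndCount(new double[] { 4, 5 }));
        System.out.println(combineMean(partials)); // prints 3.0, the mean of 1..5
    }
}
```

Because each partial result is small and the combine step is cheap, adding more cores (or mappers) splits the expensive summation further, which is where the linear speedup comes from.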