Apache Mahout Cookbook


eBook: $22.94 (list price $26.99, save 15%)
Formats: PDF, PacktLib, ePub and Mobi
Print + free eBook + free PacktLib access to the book: $44.99 (combined value $71.98, print cover $44.99, save 37%)
Free shipping to the UK, US, Europe and selected countries in Asia.
  • Learn how to set up a Mahout development environment
  • Start testing Mahout in a standalone Hadoop cluster
  • Learn to find stock market direction using logistic regression
  • Over 35 recipes with real-world examples to help both skilled and non-skilled developers get the hang of the different features of Mahout

Book Details

Language : English
Paperback : 250 pages [ 235mm x 191mm ]
Release Date : December 2013
ISBN : 1849518025
ISBN 13 : 9781849518024
Author(s) : Piero Giacomelli
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Cookbooks, Open Source


Table of Contents

Preface
Chapter 1: Mahout is Not So Difficult!
Chapter 2: Using Sequence Files – When and Why?
Chapter 3: Integrating Mahout with an External Datasource
Chapter 4: Implementing the Naïve Bayes classifier in Mahout
Chapter 5: Stock Market Forecasting with Mahout
Chapter 6: Canopy Clustering in Mahout
Chapter 7: Spectral Clustering in Mahout
Chapter 8: K-means Clustering
Chapter 9: Soft Computing with Mahout
Chapter 10: Implementing the Genetic Algorithm in Mahout
Index
  • Chapter 5: Stock Market Forecasting with Mahout
    • Introduction
    • Preparing data for logistic regression
    • Predicting GOOG movements using logistic regression
    • Using adaptive logistic regression in Java code
    • Using logistic regression on large-scale datasets
    • Using Random Forest to forecast market movements
  • Chapter 6: Canopy Clustering in Mahout
    • Introduction
    • Command-line-based Canopy clustering
    • Command-line-based Canopy clustering with parameters
    • Using Canopy clustering from the Java code
    • Coding your own cluster distance evaluation
  • Chapter 7: Spectral Clustering in Mahout
    • Introduction
    • Using EigenCuts from the command line
    • Using EigenCuts from Java code
    • Creating a similarity matrix from raw data
    • Using spectral clustering with image segmentation
  • Chapter 8: K-means Clustering
    • Introduction
    • Using K-means clustering from Java code
    • Clustering traffic accidents using K-means
    • K-means clustering using MapReduce
    • Using K-means clustering from the command line
  • Chapter 9: Soft Computing with Mahout
    • Introduction
    • Frequent Pattern Mining with Mahout
    • Creating metrics for Frequent Pattern Mining
    • Using Frequent Pattern Mining from Java code
    • Using LDA for creating topics

Piero Giacomelli

Piero Giacomelli started playing with computers back in 1986, when he received his first PC (a Commodore 64). Despite his love for computers, he graduated in Mathematics. He entered the professional software industry in 1997 and started using Java.

He has been involved in a lot of software projects using Java, .NET, and PHP. He is not only a great fan of JBoss and Apache technologies, but also uses Microsoft technologies without moral issues.

He has worked in many different industrial sectors, such as aerospace, ISP, textile and plastic manufacturing, and e-health association, both as a software developer and as an IT manager. He has also been involved in many EU research-funded projects in FP7 EU programs, such as CHRONIOUS, I-DONT-FALL, FEARLESS, and CHROMED.

In recent years, he has published some papers in scientific journals and has received two best paper awards from the International Academy, Research and Industry Association (IARIA).

In 2012, he published HornetQ Messaging Developer's Guide, Packt Publishing, which is a standard reference book for the HornetQ framework.

He is married with two kids, and in his spare time he regresses to his childhood to play with toys and his kids.

Code Downloads

Download the code and support files for this book.


Submit Errata

Please let us know if you have found any errors not listed here by completing our errata submission form. Our editors will check them and add them to this list. Thank you.


Errata

- 3 submitted: last submission 26 Jun 2014

Page no: 200 | Errata type: Code

The given sentence and command is:


Then, from this sequence file, we need to calculate the weight as we did in Chapter 5, Stock
Market Forecasting with Mahout, to have the sequence file vector's point ready to be analyzed
by the LDA. So our last preprocessing step gives the following commands:

mahout seq2sparse -i $WORK_DIR/sequencefiles/ -o $WORK_DIR/vectors/  -wt
tf
mahout seq2sparse -i $WORK_DIR/sequencefiles/ -o $WORK_DIR/vectors/  -wt
tf

It should be:


Then, from this sequence file, we need to calculate the weight as we did in Chapter 5, Stock
Market Forecasting with Mahout, to have the sequence file vector's point ready to be analyzed
by the LDA. So our last preprocessing step gives us the following command:

mahout seq2sparse -i $WORK_DIR/sequencefiles/ -o $WORK_DIR/vectors/ -wt tf

 

Page no: 47 | Errata type: Grammar

On this page, in the last paragraph, the sentence "And in many cases, considering that the data is comes out from our Mahout analysis..."

should be "And in many cases, considering that the data comes from our Mahout analysis..."

Page no: 180 | Errata type: Technical

On this page, after the screenshot, the interpretation of the "retail.dat" file is wrong. Here, each row is a transaction and each column is an article identifier.

Hence the line "So, for example, the first row states that the buyer has to buy 30 pieces of item 0, 31 of item 1, and so on."

should be

"So, for example, the first row states that the buyer made a transaction comprising 26 items, whose IDs are 0, 1, 2, and so on."
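
The corrected reading of the file (one transaction per row, one article identifier per column) can be illustrated with a minimal sketch. It assumes the identifiers are integers separated by whitespace, and it uses made-up rows rather than the actual contents of "retail.dat"; the function name `parse_transactions` is hypothetical and not part of Mahout.

```python
# Hypothetical sample in the transaction-per-row layout described above.
# These rows are invented for illustration; they are not taken from retail.dat.
SAMPLE = """\
0 1 2 3
1 4 5
2 3 6 7 8
"""


def parse_transactions(text):
    """Return one list of integer item IDs per non-empty row.

    Assumption: IDs are whitespace-separated integers.
    """
    return [
        [int(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


transactions = parse_transactions(SAMPLE)
for number, items in enumerate(transactions, start=1):
    # Each row lists which items were bought together, not quantities.
    print(f"transaction {number}: {len(items)} items, IDs {items}")
```

Read this way, the length of a row is the number of items in that transaction, and the values are item IDs rather than quantities.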

Sample chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.


What you will learn from this book

  • Configure a full development environment for Mahout from scratch with NetBeans and Maven
  • Handle sequence files for better performance
  • Query and store results in an RDBMS with Sqoop
  • Use logistic regression to predict the next step
  • Understand text mining of raw data with Naïve Bayes
  • Create and understand clusters
  • Customize Mahout to evaluate different cluster algorithms
  • Use the MapReduce approach to solve real-world data mining problems

In Detail

The rise of the Internet and social networks has created new demand for software that can analyze large datasets, some scaling up to 10 billion rows. Apache Hadoop was created to handle such heavy computational tasks, and Mahout gained recognition for providing data mining classification algorithms that can be used with datasets of this kind.

"Apache Mahout Cookbook" provides a fresh, scope-oriented approach to the Mahout world for beginners and advanced users alike. The book gives insight into how to write different data mining algorithms for use in the Hadoop environment and how to choose the one best suited to the task at hand.

"Apache Mahout Cookbook" looks at the various Mahout algorithms available and gives the reader a fresh, solution-centered approach to solving different data mining tasks. The recipes start easy and get progressively more complicated. A step-by-step approach guides the developer through the different tasks involved in mining a huge dataset. You will also learn how to code your own Mahout data mining algorithm and determine the best one for a particular task. In addition, a whole chapter is dedicated to loading data into Mahout from an external RDBMS. A lot of attention is also given to using your data mining algorithm inside your own code so that it can run in a Hadoop environment. Theoretical aspects of the algorithms are covered for information purposes, but every chapter is written to let the developer get into the code as quickly and smoothly as possible. To that end, every recipe provides code that can be reused with Maven, alongside the Maven-based Mahout source code.

By the end of this book, you will be able to code your own procedures for various data mining tasks with different algorithms, and to evaluate and choose the best ones for your tasks.

Approach

"Apache Mahout Cookbook" uses over 35 recipes packed with illustrations and real-world examples to help beginners as well as advanced programmers get acquainted with the features of Mahout.

Who this book is for

"Apache Mahout Cookbook" is great for developers who want to have a fresh and fast introduction to Mahout coding. No previous knowledge of Mahout is required, and even skilled developers or system administrators will benefit from the various recipes presented.
