Chapter 9. Deep Learning on Spark

In this chapter, we'll cover the following recipes:

  • Installing CaffeOnSpark

  • Working with CaffeOnSpark

  • Running a feed-forward neural network with DeepLearning4j over Spark

  • Running an RBM with DeepLearning4j over Spark

  • Running a CNN for learning MNIST with DeepLearning4j over Spark

  • Installing TensorFlow

  • Working with Spark TensorFlow

Introduction


Deep learning is a new area of machine learning, introduced with the objective of moving machine learning closer to one of its original goals: artificial intelligence (AI). It is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud-detection applications in finance.

Deep learning is the implementation of neural networks with more than a single hidden layer of neurons. Deep architectures vary considerably, with different implementations optimized for different tasks or goals. To get familiar with the fundamentals of neural networks, see http://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/ and http://neuralnetworksanddeeplearning.com/. Deep networks use many layers of non-linear information processing that are hierarchical in nature.
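
To make the idea of stacked, non-linear layers concrete, here is a minimal Scala sketch of a forward pass through two hidden layers. It uses plain arrays and no framework; the weights, sizes, and the choice of ReLU activation are illustrative assumptions:

          object ForwardPassSketch {
            // Rectified linear unit: the non-linearity applied at each layer.
            def relu(x: Double): Double = math.max(0.0, x)

            // One layer: output(i) = relu(dot(w(i), x) + b(i))
            def layer(w: Array[Array[Double]], b: Array[Double], x: Array[Double]): Array[Double] =
              w.zip(b).map { case (row, bias) =>
                relu(row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias)
              }

            def main(args: Array[String]): Unit = {
              val input = Array(0.5, -0.2, 0.1)
              // Two hidden layers: each layer's output feeds the next layer's input,
              // so later layers operate on increasingly abstract representations.
              val w1 = Array(Array(0.1, 0.4, -0.3), Array(0.2, -0.1, 0.5))
              val b1 = Array(0.0, 0.1)
              val w2 = Array(Array(0.3, -0.2), Array(-0.4, 0.1))
              val b2 = Array(0.05, 0.0)
              val h1 = layer(w1, b1, input)
              val h2 = layer(w2, b2, h1)
              println(h2.mkString(", "))
            }
          }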

The deep models are capable of extracting useful, high-level, structured representations...

Installing CaffeOnSpark


Caffe is a fully open source deep learning framework that provides access to deep architectures. The code is written in C++, with CUDA used for GPU computation, and it supports bindings to Python/NumPy and MATLAB. Caffe gives multimedia scientists and practitioners an orderly, extensible toolkit for state-of-the-art deep learning algorithms: a complete toolkit for training, testing, fine-tuning, and deploying models, with an expressive architecture, modularity, and Python and MATLAB bindings. Caffe also provides reference models for visual tasks.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (a Linux flavor) installed on the machine, along with Apache Hadoop 2.6 and Apache Spark 1.6.0.

How to do it…

  1. Before installing CaffeOnSpark, install the caffe prerequisites as follows:

          sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
          libopencv-dev libhdf5-serial-dev protobuf-compiler
          sudo apt-get install...

Working with CaffeOnSpark


CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features of the deep learning framework Caffe with big data frameworks such as Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on clusters of GPU and CPU servers. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction.

CaffeOnSpark is a Spark deep learning package. Its API supports DataFrames, so an application can interface with a training dataset prepared by a Spark application, and can extract predictions from the model, or features from intermediate layers, for results and data analysis using MLlib or SQL.
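
To illustrate that flow end to end, here is a minimal Scala sketch modeled on the published CaffeOnSpark examples; the com.yahoo.ml.caffe class names and call signatures are assumptions drawn from those examples, not a listing from this recipe:

          import org.apache.spark.{SparkConf, SparkContext}
          import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

          object CaffeOnSparkSketch {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("CaffeOnSparkSketch"))
              val conf = new Config(sc, args)   // parses -conf, -model, -devices, and so on
              val cos = new CaffeOnSpark(sc)

              // Train against the data source named in the solver/net prototxt files.
              val trainSource = DataSource.getSource(conf, true)
              cos.train(trainSource)

              // Extract predictions or intermediate-layer features into a DataFrame,
              // ready for further analysis with MLlib or Spark SQL.
              val testSource = DataSource.getSource(conf, false)
              val featuresDF = cos.features(testSource)
              featuresDF.show()
            }
          }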

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos), with CaffeOnSpark ready to run on the Spark/YARN cluster.

How to do it…

  1. A deep learning...

Running a feed-forward neural network with DeepLearning4j over Spark


DeepLearning4j (DL4J) is an open source deep learning library, written in Java and Scala, that is designed for use in business environments. It integrates easily with GPUs and scales out on Hadoop or Spark. It supports a stack of neural network architectures for image recognition, text analysis, and speech-to-text, with implementations of algorithms such as binary and continuous restricted Boltzmann machines, deep belief networks, de-noising auto-encoders, convolutional networks, and recursive neural tensor networks.
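
As a flavor of what this recipe builds, here is a hedged Scala sketch of a small feed-forward configuration trained through DL4J's Spark wrapper. The builder and class names follow the 0.4-rc3.x API used in this chapter's build files and should be treated as assumptions; the layer sizes are arbitrary:

          import org.apache.spark.SparkContext
          import org.apache.spark.rdd.RDD
          import org.deeplearning4j.nn.api.OptimizationAlgorithm
          import org.deeplearning4j.nn.conf.NeuralNetConfiguration
          import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
          import org.deeplearning4j.nn.weights.WeightInit
          import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
          import org.nd4j.linalg.dataset.DataSet
          import org.nd4j.linalg.lossfunctions.LossFunctions

          // Assumes sc and an RDD[DataSet] of training examples are already prepared.
          def trainFeedForward(sc: SparkContext, trainData: RDD[DataSet]) = {
            val conf = new NeuralNetConfiguration.Builder()
              .seed(12345)
              .iterations(1)
              .learningRate(0.1)
              .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
              .list(2)
              .layer(0, new DenseLayer.Builder().nIn(4).nOut(10)
                .activation("relu").weightInit(WeightInit.XAVIER).build())
              .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(10).nOut(3).activation("softmax").build())
              .pretrain(false).backprop(true)
              .build()

            // The Spark wrapper trains on partitions in parallel and averages parameters.
            val sparkNet = new SparkDl4jMultiLayer(sc, conf)
            sparkNet.fitDataSet(trainData.toJavaRDD())
          }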

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed...

Running an RBM with DeepLearning4j over Spark


In this recipe, we'll see how to run a restricted Boltzmann machine (RBM) for classifying the iris dataset.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed are Java 7, IntelliJ, and the Maven or SBT build tool.

How to do it…

  1. Start an application named RBMWithSpark. Initially, specify the following libraries in the build.sbt file:

          libraryDependencies ++= Seq( 
          "org.apache.spark" %% "spark-core" % "1.6.0", 
          "org.apache.spark" %% "spark-mllib" % "1.6.0", 
          "org.deeplearning4j" % "deeplearning4j-core" % "0.4-rc3.8", 
          "org.deeplearning4j" ...

Running a CNN for learning MNIST with DeepLearning4j over Spark


In this recipe, we'll see how to run a convolutional neural network (CNN) for classifying handwritten digits from the MNIST dataset.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed are Java 7, IntelliJ, and the Maven or SBT build tool.

How to do it…

  1. The MNIST database is a large set of handwritten digits used to train neural networks and other algorithms for image recognition. The dataset has 60,000 images in its training set and 10,000 in its test set. Each image is 28x28 pixels.

  2. Here is the code for a convolutional neural network that uses the MNIST dataset for digit recognition (a hedged sketch of the core layer stack follows the listing):

          object CNN_MNIST { 
     ...
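
The listing above is abridged. To show the overall shape of such a network, here is a hedged sketch of the core layer stack; the DL4J 0.4-rc3.x builder names, including ConvolutionLayerSetup, are assumptions:

          import org.deeplearning4j.nn.conf.NeuralNetConfiguration
          import org.deeplearning4j.nn.conf.layers.setup.ConvolutionLayerSetup
          import org.deeplearning4j.nn.conf.layers.{ConvolutionLayer, DenseLayer, OutputLayer, SubsamplingLayer}
          import org.nd4j.linalg.lossfunctions.LossFunctions

          val builder = new NeuralNetConfiguration.Builder()
            .seed(123)
            .iterations(1)
            .list(4)
            // 5x5 convolution over the single-channel 28x28 input.
            .layer(0, new ConvolutionLayer.Builder(5, 5)
              .nIn(1).nOut(20).activation("relu").build())
            // 2x2 max pooling halves each spatial dimension.
            .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
              .kernelSize(2, 2).stride(2, 2).build())
            .layer(2, new DenseLayer.Builder().nOut(500).activation("relu").build())
            // 10 outputs: one per digit class.
            .layer(3, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
              .nOut(10).activation("softmax").build())
            .backprop(true).pretrain(false)

          // Propagates the input image shape (28x28, 1 channel) through the layers.
          new ConvolutionLayerSetup(builder, 28, 28, 1)
          val cnnConf = builder.build()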

Installing TensorFlow


TensorFlow is an interface for expressing machine learning algorithms, together with an implementation for executing such algorithms. A computation expressed in TensorFlow can be executed, with little or no change, on a wide variety of heterogeneous systems, from mobile phones and tablets up to large-scale distributed systems of hundreds of machines. The interface is flexible enough to express a wide variety of algorithms, including training and inference for deep neural network models, and it is used to deploy machine learning systems into production across many areas, such as speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and so on.

The application of tensors and their networks is a relatively new (but fast-evolving) approach in machine learning. Tensors, if you recall your algebra classes, are simply n-dimensional data arrays (so a scalar is a 0th order tensor, a vector is 1st order, and a matrix...

Working with Spark TensorFlow


As Spark offers distributed computation, it can be used to train neural networks on large datasets and to deploy models at scale. Distributed training cuts down training time, improves accuracy, and speeds up model validation compared with running on a single node. The ability to scale model selection and neural network tuning by adopting tools such as Spark and TensorFlow may be a boon for the data science and machine learning communities, as cloud computing and parallel resources become available to a wider range of engineers.
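
The recipe that follows is written in Python, but the pattern behind scaled model selection is framework-agnostic: broadcast the data once, parallelize a grid of candidate hyperparameters, and train and score one model per Spark task. A minimal Scala sketch of that pattern is below; trainAndScore and loadData are hypothetical stand-ins for a real TensorFlow training call and data loader:

          import org.apache.spark.{SparkConf, SparkContext}

          object ModelSelectionSketch {
            // Hypothetical stand-in: a real version would run a TensorFlow training
            // job with this learning rate and return its validation accuracy.
            def trainAndScore(lr: Double, data: Array[(Array[Double], Int)]): Double =
              1.0 / (1.0 + math.abs(math.log(lr) - math.log(0.01)))   // dummy score

            // Hypothetical stand-in for loading the training/validation data.
            def loadData(): Array[(Array[Double], Int)] =
              Array((Array(0.0, 1.0), 0), (Array(1.0, 0.0), 1))

            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("ModelSelectionSketch"))
              val data = sc.broadcast(loadData())      // ship the dataset to executors once
              val grid = Seq(0.001, 0.01, 0.1, 0.5)    // candidate learning rates

              // Each task trains and validates one candidate in parallel.
              val scored = sc.parallelize(grid)
                .map(lr => (lr, trainAndScore(lr, data.value)))
                .collect()

              val best = scored.maxBy(_._2)
              println(s"best learning rate: ${best._1} (score ${best._2})")
              sc.stop()
            }
          }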

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, refer to the Installing TensorFlow recipe for details on the installation.

How to do it…

  1. Here is the Python code to run TensorFlow in distributed mode:

          import numpy as np 
          import tensorflow...