Chapter 9. Deep Learning on Spark

In this chapter, we'll cover the following recipes:

  • Installing CaffeOnSpark

  • Working with CaffeOnSpark

  • Running a feed-forward neural network with DeepLearning4j over Spark

  • Running an RBM with DeepLearning4j over Spark

  • Running a CNN for learning MNIST with DeepLearning4j over Spark

  • Installing TensorFlow

  • Working with Spark TensorFlow

Introduction


Deep learning is a new area of machine learning, introduced with the objective of moving machine learning closer to one of its original goals: artificial intelligence (AI). It is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud-detection applications in finance.

Deep learning is the implementation of neural networks with more than a single hidden layer of neurons. Deep architectures vary considerably, with different implementations optimized for different tasks or goals. To get familiar with the fundamentals of neural networks, see http://www.analyticsvidhya.com/blog/2016/03/introduction-deep-learning-fundamentals-neural-networks/ and http://neuralnetworksanddeeplearning.com/. Deep networks use many layers of non-linear information processing that are hierarchical in nature.
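
To make the idea of stacked, non-linear layers concrete, here is a minimal Scala sketch of a forward pass through two hidden layers. It uses plain arrays and no framework; the weights, sizes, and the choice of ReLU activation are illustrative assumptions:

          object ForwardPassSketch {
            // Rectified linear unit: the non-linearity applied at each layer.
            def relu(x: Double): Double = math.max(0.0, x)

            // One layer: output(i) = relu(dot(w(i), x) + b(i))
            def layer(w: Array[Array[Double]], b: Array[Double], x: Array[Double]): Array[Double] =
              w.zip(b).map { case (row, bias) =>
                relu(row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias)
              }

            def main(args: Array[String]): Unit = {
              val input = Array(0.5, -0.2, 0.1)
              // Two hidden layers: each layer's output feeds the next layer's input,
              // so later layers operate on increasingly abstract representations.
              val w1 = Array(Array(0.1, 0.4, -0.3), Array(0.2, -0.1, 0.5))
              val b1 = Array(0.0, 0.1)
              val w2 = Array(Array(0.3, -0.2), Array(-0.4, 0.1))
              val b2 = Array(0.05, 0.0)
              val h1 = layer(w1, b1, input)
              val h2 = layer(w2, b2, h1)
              println(h2.mkString(", "))
            }
          }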

The deep models are capable of extracting useful, high-level, structured representations...

Installing CaffeOnSpark


Caffe is a fully open source deep learning framework that provides access to deep architectures. The code is written in C++, with CUDA used for GPU computation, and it supports bindings to Python/NumPy and MATLAB. Caffe gives multimedia scientists and practitioners an orderly, extensible toolkit for state-of-the-art deep learning algorithms: a complete toolkit for training, testing, fine-tuning, and deploying models, with an expressive architecture, modularity, and Python and MATLAB bindings. Caffe also provides reference models for visual tasks.

Getting ready

To step through this recipe, you need Ubuntu 14.04 (a Linux flavor) installed on the machine, along with Apache Hadoop 2.6 and Apache Spark 1.6.0.

How to do it…

  1. Before installing CaffeOnSpark, install the caffe prerequisites as follows:

          sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
          libopencv-dev libhdf5-serial-dev protobuf-compiler
          sudo apt-get install...

Working with CaffeOnSpark


CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features of the deep learning framework Caffe with big data frameworks such as Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on clusters of GPU and CPU servers. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction.

CaffeOnSpark is a Spark deep learning package. Its API supports DataFrames, so an application can interface with a training dataset prepared by a Spark application, and can extract predictions from the model, or features from intermediate layers, for results and data analysis using MLlib or SQL.
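
To illustrate that flow end to end, here is a minimal Scala sketch modeled on the published CaffeOnSpark examples; the com.yahoo.ml.caffe class names and call signatures are assumptions drawn from those examples, not a listing from this recipe:

          import org.apache.spark.{SparkConf, SparkContext}
          import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

          object CaffeOnSparkSketch {
            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("CaffeOnSparkSketch"))
              val conf = new Config(sc, args)   // parses -conf, -model, -devices, and so on
              val cos = new CaffeOnSpark(sc)

              // Train against the data source named in the solver/net prototxt files.
              val trainSource = DataSource.getSource(conf, true)
              cos.train(trainSource)

              // Extract predictions or intermediate-layer features into a DataFrame,
              // ready for further analysis with MLlib or Spark SQL.
              val testSource = DataSource.getSource(conf, false)
              val featuresDF = cos.features(testSource)
              featuresDF.show()
            }
          }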

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos), with CaffeOnSpark ready to run on the Spark/YARN cluster.

How to do it…

  1. A deep learning...

Running a feed-forward neural network with DeepLearning4j over Spark


DeepLearning4j (DL4J) is an open source deep learning library, written in Java and Scala, that is designed for use in business environments. It integrates easily with GPUs and scales out on Hadoop or Spark. It supports a stack of neural network architectures for image recognition, text analysis, and speech-to-text, with implementations of algorithms such as binary and continuous restricted Boltzmann machines, deep belief networks, de-noising auto-encoders, convolutional networks, and recursive neural tensor networks.
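
As a flavor of what this recipe builds, here is a hedged Scala sketch of a small feed-forward configuration trained through DL4J's Spark wrapper. The builder and class names follow the 0.4-rc3.x API used in this chapter's build files and should be treated as assumptions; the layer sizes are arbitrary:

          import org.apache.spark.SparkContext
          import org.apache.spark.rdd.RDD
          import org.deeplearning4j.nn.api.OptimizationAlgorithm
          import org.deeplearning4j.nn.conf.NeuralNetConfiguration
          import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
          import org.deeplearning4j.nn.weights.WeightInit
          import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
          import org.nd4j.linalg.dataset.DataSet
          import org.nd4j.linalg.lossfunctions.LossFunctions

          // Assumes sc and an RDD[DataSet] of training examples are already prepared.
          def trainFeedForward(sc: SparkContext, trainData: RDD[DataSet]) = {
            val conf = new NeuralNetConfiguration.Builder()
              .seed(12345)
              .iterations(1)
              .learningRate(0.1)
              .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
              .list(2)
              .layer(0, new DenseLayer.Builder().nIn(4).nOut(10)
                .activation("relu").weightInit(WeightInit.XAVIER).build())
              .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                .nIn(10).nOut(3).activation("softmax").build())
              .pretrain(false).backprop(true)
              .build()

            // The Spark wrapper trains on partitions in parallel and averages parameters.
            val sparkNet = new SparkDl4jMultiLayer(sc, conf)
            sparkNet.fitDataSet(trainData.toJavaRDD())
          }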

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed...

Running an RBM with DeepLearning4j over Spark


In this recipe, we'll see how to run a restricted Boltzmann machine (RBM) for classifying the iris dataset.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed are Java 7, IntelliJ, and the Maven or SBT build tool.

How to do it…

  1. Start an application named RBMWithSpark. Initially, specify the following libraries in the build.sbt file:

          libraryDependencies ++= Seq( 
          "org.apache.spark" %% "spark-core" % "1.6.0", 
          "org.apache.spark" %% "spark-mllib" % "1.6.0", 
          "org.deeplearning4j" % "deeplearning4j-core" % "0.4-rc3.8", 
          "org.deeplearning4j" ...

Running a CNN for learning MNIST with DeepLearning4j over Spark


In this recipe, we'll see how to run a convolutional neural network (CNN) for classifying handwritten digits from the MNIST dataset.

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, get familiar with ND4S, that is, n-dimensional arrays for Scala (Scala bindings for ND4J). ND4J and ND4S are scientific computing libraries for the JVM; visit http://nd4j.org/ for details. The prerequisites to be installed are Java 7, IntelliJ, and the Maven or SBT build tool.

How to do it…

  1. The MNIST database is a large set of handwritten digits used to train neural networks and other algorithms for image recognition. The dataset has 60,000 images in its training set and 10,000 in its test set. Each image is 28x28 pixels.

  2. Here is the code for a convolutional neural network that uses the MNIST dataset for digit recognition (a hedged sketch of the core layer stack follows the listing):

          object CNN_MNIST { 
     ...
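
The listing above is abridged. To show the overall shape of such a network, here is a hedged sketch of the core layer stack; the DL4J 0.4-rc3.x builder names, including ConvolutionLayerSetup, are assumptions:

          import org.deeplearning4j.nn.conf.NeuralNetConfiguration
          import org.deeplearning4j.nn.conf.layers.setup.ConvolutionLayerSetup
          import org.deeplearning4j.nn.conf.layers.{ConvolutionLayer, DenseLayer, OutputLayer, SubsamplingLayer}
          import org.nd4j.linalg.lossfunctions.LossFunctions

          val builder = new NeuralNetConfiguration.Builder()
            .seed(123)
            .iterations(1)
            .list(4)
            // 5x5 convolution over the single-channel 28x28 input.
            .layer(0, new ConvolutionLayer.Builder(5, 5)
              .nIn(1).nOut(20).activation("relu").build())
            // 2x2 max pooling halves each spatial dimension.
            .layer(1, new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
              .kernelSize(2, 2).stride(2, 2).build())
            .layer(2, new DenseLayer.Builder().nOut(500).activation("relu").build())
            // 10 outputs: one per digit class.
            .layer(3, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
              .nOut(10).activation("softmax").build())
            .backprop(true).pretrain(false)

          // Propagates the input image shape (28x28, 1 channel) through the layers.
          new ConvolutionLayerSetup(builder, 28, 28, 1)
          val cnnConf = builder.build()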

Installing TensorFlow


TensorFlow is an interface for expressing machine learning algorithms, together with an implementation for executing such algorithms. A computation expressed in TensorFlow can be executed, with little or no change, on a wide variety of heterogeneous systems, from mobile phones and tablets up to large-scale distributed systems of hundreds of machines. The interface is flexible enough to express a wide variety of algorithms, including training and inference for deep neural network models, and it is used to deploy machine learning systems into production across many areas, such as speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and so on.

The application of tensors and their networks is a relatively new (but fast-evolving) approach in machine learning. Tensors, if you recall your algebra classes, are simply n-dimensional data arrays (so a scalar is a 0th order tensor, a vector is 1st order, and a matrix...

Working with Spark TensorFlow


As Spark offers distributed computation, it can be used to train neural networks on large datasets and to deploy models at scale. Distributed training cuts down training time, improves accuracy, and speeds up model validation compared with running on a single node. The ability to scale model selection and neural network tuning by adopting tools such as Spark and TensorFlow may be a boon for the data science and machine learning communities, as cloud computing and parallel resources become available to a wider range of engineers.
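
The recipe that follows is written in Python, but the pattern behind scaled model selection is framework-agnostic: broadcast the data once, parallelize a grid of candidate hyperparameters, and train and score one model per Spark task. A minimal Scala sketch of that pattern is below; trainAndScore and loadData are hypothetical stand-ins for a real TensorFlow training call and data loader:

          import org.apache.spark.{SparkConf, SparkContext}

          object ModelSelectionSketch {
            // Hypothetical stand-in: a real version would run a TensorFlow training
            // job with this learning rate and return its validation accuracy.
            def trainAndScore(lr: Double, data: Array[(Array[Double], Int)]): Double =
              1.0 / (1.0 + math.abs(math.log(lr) - math.log(0.01)))   // dummy score

            // Hypothetical stand-in for loading the training/validation data.
            def loadData(): Array[(Array[Double], Int)] =
              Array((Array(0.0, 1.0), 0), (Array(1.0, 0.0), 1))

            def main(args: Array[String]): Unit = {
              val sc = new SparkContext(new SparkConf().setAppName("ModelSelectionSketch"))
              val data = sc.broadcast(loadData())      // ship the dataset to executors once
              val grid = Seq(0.001, 0.01, 0.1, 0.5)    // candidate learning rates

              // Each task trains and validates one candidate in parallel.
              val scored = sc.parallelize(grid)
                .map(lr => (lr, trainAndScore(lr, data.value)))
                .collect()

              val best = scored.maxBy(_._2)
              println(s"best learning rate: ${best._1} (score ${best._2})")
              sc.stop()
            }
          }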

Getting ready

To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes (standalone, YARN, or Mesos). Also, refer to the Installing TensorFlow recipe for details on the installation.

How to do it…

  1. Here is the Python code to run TensorFlow in distributed mode:

          import numpy as np 
          import tensorflow...