You're reading from  Deep Learning with Hadoop

Product type: Book
Published in: Feb 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787124769
Edition: 1st Edition
Author: Dipayan Dev

Dipayan Dev completed his M.Tech at the National Institute of Technology, Silchar, with a first class first, and currently works as a software professional in Bengaluru, India. He has extensive knowledge of and experience in non-relational database technologies, having worked primarily with large-scale data over the last few years. His core expertise lies in the Hadoop framework. During his postgraduate studies, Dipayan built an infinitely scalable framework for Hadoop called Dr. Hadoop, which was published in a top-tier SCI-E indexed Springer journal (http://link.springer.com/article/10.1631/FITEE.1500015). Dr. Hadoop has recently been cited by Wikipedia in its Apache Hadoop article. He is also interested in a wide range of distributed system technologies, such as Redis, Apache Spark, Elasticsearch, Hive, Pig, Riak, and other NoSQL databases. Dipayan has also authored various research papers and book chapters published by IEEE and top-tier Springer journals. To learn more about him, visit his LinkedIn profile at https://www.linkedin.com/in/dipayandev.

Chapter 2.  Distributed Deep Learning for Large-Scale Data

 

"In God we trust, all others must bring data"
--W. Edwards Deming

In today's rapidly growing digital world, big data and deep learning are two of the hottest technical trends. They are closely interrelated topics in data science, and in terms of technological growth, each depends critically on the other and both are equally significant.

Digital data and cloud storage follow a general trend, often likened to Moore's law [50]: the world's data roughly doubles every two years, while the cost of storing that data falls at approximately the same rate. This profusion of data brings more features and varieties, so better deep learning models must be built to extract all the valuable information from it.

This abundance of data creates huge opportunities for multiple sectors. Moreover, big data, with its analytic part, has produced lots of challenges in the field of...

Deep learning for massive amounts of data


In this exabyte-scale era, data is growing at an exponential rate. Many organizations and researchers analyze this growth in various ways and for many different purposes. According to a survey by the International Data Corporation (IDC), the Internet processes approximately 2 petabytes of data every day [51]. In 2006, the volume of digital data was around 0.18 ZB; by 2011, it had increased to 1.8 ZB. It was expected to reach 10 ZB by 2015, and by 2020 the world's data volume is expected to reach approximately 30 ZB to 35 ZB. The timeline of this data mountain is shown in Figure 2.1. These immense amounts of data in the digital world are formally termed big data.
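
As a rough sanity check on these figures, the doubling time implied by steady exponential growth can be computed directly. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes growth was smoothly exponential between the quoted data points, which real measurements need not follow.

```java
// Illustrative arithmetic only; the volumes are the figures quoted in the text.
final class DataGrowth {

    /** Years needed for a quantity to double, assuming steady exponential growth. */
    static double doublingTimeYears(double startVolume, double endVolume, double years) {
        return years * Math.log(2.0) / Math.log(endVolume / startVolume);
    }

    public static void main(String[] args) {
        // 0.18 ZB in 2006 -> 1.8 ZB in 2011
        System.out.printf("2006-2011: doubles every %.2f years%n",
                doublingTimeYears(0.18, 1.8, 5));
        // 1.8 ZB in 2011 -> 30 ZB and 35 ZB projected for 2020
        System.out.printf("2011-2020 (30 ZB): doubles every %.2f years%n",
                doublingTimeYears(1.8, 30.0, 9));
        System.out.printf("2011-2020 (35 ZB): doubles every %.2f years%n",
                doublingTimeYears(1.8, 35.0, 9));
    }
}
```

Under that assumption, the 2006 to 2011 figures imply a doubling time of about 1.5 years, and the 2011 to 2020 projection implies roughly 2.1 to 2.2 years.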

 

"The world of Big Data is on fire"
--The Economist, Sept 2011

Figure 2.1: The increasing trend of data over a time span of around 20 years

Facebook has almost 21 PB in 200M objects [52], whereas Jaguar ORNL...

Challenges of deep learning for big data


The potential of big data is certainly noteworthy. However, to fully extract valuable information at this scale, we need new innovations and promising algorithms that address the related technical problems. For example, most traditional machine learning algorithms load the entire dataset into memory to train a model. With a massive amount of data, this approach is not feasible, as the system may run out of memory. Overcoming these thorny problems and getting the most out of big data with deep learning techniques will require considerable brainstorming.
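
One common way around the memory limit is to stream the data and update the model one mini-batch at a time, so that only a single batch is ever held in memory. The following is a minimal sketch of that idea in plain Java, not code from any particular framework. The class `StreamingSgd`, the synthetic data source (rows of y = 3x + 1 standing in for a file or HDFS reader), and all parameter values are assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.Random;

final class StreamingSgd {

    /** Hypothetical data source: yields rows {x, y} with y = 3x + 1, one at a time. */
    static Iterator<double[]> syntheticStream(long rows, long seed) {
        Random rng = new Random(seed);
        return new Iterator<double[]>() {
            private long served = 0;

            @Override public boolean hasNext() { return served < rows; }

            @Override public double[] next() {
                served++;
                double x = rng.nextDouble() * 2 - 1; // x in [-1, 1)
                return new double[] {x, 3.0 * x + 1.0};
            }
        };
    }

    /** Fits y = w*x + b by mini-batch gradient descent, buffering one batch at a time. */
    static double[] fit(Iterator<double[]> stream, int batchSize, double learningRate) {
        double w = 0.0, b = 0.0;
        double[][] batch = new double[batchSize][];
        while (stream.hasNext()) {
            // Fill the buffer with the next batch (the last one may be shorter).
            int n = 0;
            while (n < batchSize && stream.hasNext()) {
                batch[n++] = stream.next();
            }
            // Mean gradient of the squared error over this batch only.
            double gradW = 0.0, gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double err = w * batch[i][0] + b - batch[i][1];
                gradW += err * batch[i][0];
                gradB += err;
            }
            w -= learningRate * gradW / n;
            b -= learningRate * gradB / n;
        }
        return new double[] {w, b};
    }

    public static void main(String[] args) {
        double[] model = fit(syntheticStream(200_000, 42L), 100, 0.1);
        System.out.println("w = " + model[0] + ", b = " + model[1]);
    }
}
```

Memory use here depends on the batch size rather than on the number of rows, which is the property that makes this style of training workable when the dataset does not fit on one machine.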

Although, as discussed in the earlier section, large-scale deep learning has achieved a great deal in the past decade, the field is still growing. Big data constantly imposes new limitations through its 4Vs (volume, variety, velocity, and veracity). Tackling all of them will require many more advancements in the models.

Challenges of deep learning due to massive volumes of...

Distributed deep learning and Hadoop


The earlier sections of this chapter give enough insight into why and how the relationship between deep learning and big data can bring major changes to the research community. Over time, a centralized system will not be able to support this relationship adequately. Hence, distributing the deep learning network across multiple servers has become a primary goal of current deep learning practitioners. However, dealing with big data in a distributed environment always brings several challenges, most of which are explained in depth in the previous section. These include handling high-dimensional data, data with too many features, the amount of memory available for storage, processing massive datasets, and so on. Moreover, big datasets place high demands on computational resources such as CPU time and memory, so reducing processing time has become an extremely significant criterion. The following are the...

Deeplearning4j - an open source distributed framework for deep learning


Deeplearning4j (DL4J) [82] is an open source deep learning framework written for the JVM and aimed mainly at commercial-grade use. The framework is written in Java, hence the '4j' in its name. Because it works with Java, Deeplearning4j has started to gain popularity with a much wider audience and range of practitioners.

At its core, the framework is a distributed deep learning library integrated with Hadoop and Spark. With the help of Hadoop and Spark, we can easily distribute the model and big datasets, and run on multiple GPUs and CPUs to perform operations in parallel. Deeplearning4j has shown substantial success in pattern recognition in images, sound, text, time series data, and so on. It can also be applied to various customer use cases such as facial recognition, fraud detection, business analytics, recommendation engines, image and...

Setting up Deeplearning4j on Hadoop YARN


Deeplearning4j primarily works on networks having multiple layers. To get started with Deeplearning4j, one needs to become familiar with the prerequisites and with how to install all the dependent software. Most of the documentation can be found on the official Deeplearning4j website at https://deeplearning4j.org/ [88].

In this section of the chapter, we will help you become familiar with Deeplearning4j code. First, we will show the implementation of a simple multilayer neural network with Deeplearning4j. The later part of the section discusses distributed deep learning with the Deeplearning4j library, which trains distributed deep neural networks on multiple distributed GPUs using Apache Spark, and introduces the setup of Apache Spark for Deeplearning4j.
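
To make the idea of data-parallel training concrete, here is a minimal sketch of one common scheme, parameter averaging: every worker starts from the same parameters, trains on its own shard of the data, and a driver averages the results before the next round. This is plain Java with no Deeplearning4j or Spark calls. The text at this point does not spell out the scheme, so parameter averaging, the class `ParameterAveragingSketch`, the tiny linear model, and the synthetic shards are all assumptions chosen for illustration. The workers are also simulated in a sequential loop rather than run on separate machines.

```java
import java.util.Random;

final class ParameterAveragingSketch {

    /** One worker: a few epochs of gradient descent for y = w*x + b on its own shard. */
    static double[] trainOnShard(double[] params, double[][] shard, int epochs, double lr) {
        double w = params[0], b = params[1];
        for (int e = 0; e < epochs; e++) {
            double gradW = 0.0, gradB = 0.0;
            for (double[] row : shard) {
                double err = w * row[0] + b - row[1];
                gradW += err * row[0];
                gradB += err;
            }
            w -= lr * gradW / shard.length;
            b -= lr * gradB / shard.length;
        }
        return new double[] {w, b};
    }

    /** Driver: share the parameters, let each worker train locally, then average. */
    static double[] train(double[][][] shards, int rounds, int localEpochs, double lr) {
        double[] global = {0.0, 0.0};
        for (int r = 0; r < rounds; r++) {
            double[] sum = new double[global.length];
            for (double[][] shard : shards) { // on a real cluster these run in parallel
                double[] local = trainOnShard(global, shard, localEpochs, lr);
                for (int i = 0; i < sum.length; i++) {
                    sum[i] += local[i];
                }
            }
            for (int i = 0; i < sum.length; i++) {
                global[i] = sum[i] / shards.length;
            }
        }
        return global;
    }

    /** Hypothetical data: synthetic rows of y = 2x - 1 split across the workers. */
    static double[][][] makeShards(int workers, int rowsPerWorker, long seed) {
        Random rng = new Random(seed);
        double[][][] shards = new double[workers][rowsPerWorker][];
        for (double[][] shard : shards) {
            for (int i = 0; i < rowsPerWorker; i++) {
                double x = rng.nextDouble() * 2 - 1; // x in [-1, 1)
                shard[i] = new double[] {x, 2.0 * x - 1.0};
            }
        }
        return shards;
    }

    public static void main(String[] args) {
        double[] model = train(makeShards(4, 500, 7L), 50, 5, 0.5);
        System.out.println("w = " + model[0] + ", b = " + model[1]);
    }
}
```

The trade-off in this scheme is between communication and accuracy: more local epochs per round mean fewer synchronization steps across the cluster, but the workers' parameters drift further apart before they are averaged.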

Getting familiar with Deeplearning4j

This part will mainly introduce the 'Hello World' programs of deep learning with...

Summary


In contrast to traditional machine learning algorithms, deep learning models are capable of addressing the challenges imposed by massive amounts of input data. Deep learning networks are designed to automatically extract complex representations from unstructured data. This property makes deep learning a valuable tool for learning the hidden information in big data. However, because the volume and variety of data are increasing so rapidly day by day, deep learning networks need to be stored and processed in a distributed manner. Hadoop, being the most widely used big data framework for such requirements, is extremely convenient in this situation. We explained the primary components of Hadoop that are essential for a distributed deep learning architecture. The crucial characteristics of distributed deep learning networks were also explained in depth. Deeplearning4j, an open source distributed deep learning framework, integrates with Hadoop to achieve...
