
Chapter 7. Miscellaneous Deep Learning Operations using Hadoop

 

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."

 
 --Grace Hopper

So far in this book, we have discussed various deep neural network models, their concepts and applications, and how to implement these models in distributed environments. We have also explained why it is difficult for a centralized computer to store and process vast amounts of data and to extract information from that data using these models. Hadoop has been used to overcome the limitations imposed by large-scale data.

As we have now reached the final chapter of this book, we will mainly discuss the design of three of the most commonly used machine learning applications. We will explain the general concepts of large-scale video processing, large-scale image processing, and natural language processing using the Hadoop framework.

The organization...

Distributed video decoding in Hadoop


Most popular video compression formats, such as MPEG-2 and MPEG-4, follow a hierarchical structure in their bit-streams. In this subsection, we will assume that the compression format used has such a hierarchical bit-stream structure. For simplicity, we have divided the decoding task into two different MapReduce jobs:

  1. Extraction of video sequence-level information: The header information of a video dataset can be found in its first block. In this phase, the aim of the MapReduce job is to collect the sequence-level information from the first block of the video dataset and output the result as a text file in HDFS. The sequence header information is needed to set the format for the decoder object.

    For the video files, a new FileInputFormat should be implemented with its own record reader. Each record reader will then provide a <key, value> pair in this format to each...
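As a rough illustration of the input format mentioned above, a minimal sketch using the org.apache.hadoop.mapreduce API follows. The class names, and the choice of a Text key (file name plus block offset) with a BytesWritable value (the raw bytes of one bit-stream block), are assumptions made for this sketch rather than a fixed implementation:

// A minimal sketch of a custom FileInputFormat for video data, assuming each
// input split (one HDFS block of the compressed bit-stream) is handed to a
// mapper as a single record. Class names and the <Text, BytesWritable>
// key/value choice are illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class VideoInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new VideoRecordReader();
    }

    // Reads one split (a block of the compressed bit-stream) as a single record.
    public static class VideoRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Key: file name plus byte offset, so the decoder can order the blocks.
            Path path = fileSplit.getPath();
            key.set(path.getName() + "@" + fileSplit.getStart());

            // Value: the raw bytes of this block of the compressed video bit-stream.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            FSDataInputStream in = fs.open(path);
            try {
                in.seek(fileSplit.getStart());
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}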

Large-scale image processing using Hadoop


We have already mentioned in earlier chapters how the size and volume of images are increasing day by day, and how difficult it is for centralized computers to store and process such vast amounts of image data. Let's consider an example to get a practical idea of such situations. Take a large-scale image of 81025 pixels by 86273 pixels. Each pixel is composed of three values: red, green, and blue. Assume that storing each of these values requires a 32-bit precision floating-point number. The total memory consumption of that image can then be calculated as follows:

86273 * 81025 * 3 * 32 bits = 78.12 GB

Leaving aside any post-processing of this image, it is clearly impossible for a traditional computer to even hold this amount of data in its main memory. Although some advanced computers come with higher configurations, given the return on investment, most companies do not opt for such machines...
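As a quick sanity check of this arithmetic, the footprint can be reproduced with a few lines of Java. This is only an illustrative calculation, using the dimensions from the example above:

// Reproduces the memory estimate from the example above (illustrative only).
public class ImageMemoryEstimate {
    public static void main(String[] args) {
        long width = 86273;           // pixels
        long height = 81025;          // pixels
        long channels = 3;            // red, green, and blue
        long bitsPerValue = 32;       // 32-bit floating-point number per channel value

        long totalBits = width * height * channels * bitsPerValue;
        double gigabytes = totalBits / 8.0 / (1024 * 1024 * 1024);

        // Prints roughly 78.12 GB, far beyond the main memory of a single machine.
        System.out.printf("Approximate footprint: %.2f GB%n", gigabytes);
    }
}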

Natural language processing using Hadoop


The exponential growth of information on the Web has led to a rapid spread of large-scale, unstructured natural language text resources. Hence, in the last few years, interest in extracting, processing, and sharing this information has increased substantially. Processing these sources of knowledge within a stipulated time frame has turned out to be a major challenge for various research and commercial organizations. In this section, we will describe the process of crawling web documents, discovering information, and running natural language processing in a distributed manner using Hadoop.

To design an architecture for natural language processing (NLP), the first task to be performed is the extraction of annotated keywords and key phrases from the large-scale unstructured data. To perform NLP on a distributed architecture, the Apache Hadoop framework can be chosen for its efficient and scalable solution, and also to improve the failure...
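As a rough illustration of this extraction phase, a mapper could look like the following sketch. The whitespace tokenization and length filter are simple placeholders for a real NLP annotator, and the class and variable names are assumptions made for this example:

// A rough sketch of the keyword-extraction phase. The whitespace tokenization
// below is a placeholder for a real NLP annotator; class and method names are
// illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KeywordExtractionMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyword = new Text();

    @Override
    protected void map(LongWritable offset, Text document, Context context)
            throws IOException, InterruptedException {
        // Placeholder "annotation": lower-case the text and split on whitespace.
        // A real pipeline would call an NLP library here to extract keywords
        // and key phrases from the crawled document.
        for (String token : document.toString().toLowerCase().split("\\s+")) {
            if (token.length() > 3) {          // crude filter for candidate keywords
                keyword.set(token);
                context.write(keyword, ONE);   // downstream jobs can aggregate these
            }
        }
    }
}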

Summary


This chapter discussed some of the most widely used applications of machine learning and how they can be designed on the Hadoop framework. First, we started with a large video dataset and showed how video can be decoded in HDFS and later converted into a sequence file of images for further processing. Large-scale image processing was discussed next: the mapper used for this purpose invokes a shell script that performs all the necessary tasks, so no reducer is needed for this operation. Finally, we discussed how a natural language processing model can be deployed on Hadoop.
