Chapter 11. Hadoop in the Cloud

In the previous chapters, we covered some basic concepts of machine learning, machine learning case studies, streaming data ingestion, and so on. So far, we have walked you through the different components of data ingestion and data processing, some advanced concepts of the big data ecosystem, and a few design best practices to keep in mind while designing and implementing Hadoop applications. Any data pipeline needs infrastructure to execute on, and that infrastructure can be set up either on premises or in the cloud. In this chapter, we will cover the following topics:

  • Logical view of Hadoop in the cloud
  • How a network setup looks in the cloud
  • Resource management made easy
  • How to build data pipelines in the cloud
  • Cloud high availability

Technical requirements


You will need basic knowledge of AWS services and Apache Hadoop 3.0.

Check out the following video to see the code in action: http://bit.ly/2tGbwWz


Logical view of Hadoop in the cloud


The use of cloud infrastructure has been growing rapidly. Companies that once hesitated to adopt the cloud for their compute and storage needs because of security and availability concerns now have much more confidence, thanks to the many architectural and feature improvements made by the different cloud service providers. Apache Hadoop started out with on-premises deployments, and in its early years Hadoop infrastructure was mostly self-managed. Over the last few years, however, cloud service providers have focused on Hadoop infrastructure, and today they offer everything needed for a big data/Hadoop setup. Almost every company adopting big data now uses the cloud for its data storage and infrastructure needs. In this section, we will look at what the logical architecture of Hadoop looks like in the cloud:

  • Ingestion layer: Data ingestion...
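As one concrete illustration of the ingestion layer mentioned above, the following is a minimal sketch of landing raw files in S3 using the AWS SDK for Python (boto3). The bucket name, prefix, and file path are hypothetical, and boto3 with configured AWS credentials is assumed; this is not the only way to build an ingestion layer.

```python
# Minimal sketch of an S3-based ingestion (raw) zone; bucket/prefix are hypothetical.
import boto3

s3 = boto3.client("s3")

def ingest_raw_file(local_path, bucket="my-datalake-raw", prefix="events/2019/02/"):
    """Land a raw local file in the S3 ingestion zone and return its S3 URI."""
    key = prefix + local_path.split("/")[-1]
    s3.upload_file(local_path, bucket, key)  # boto3 handles multipart upload if needed
    return "s3://{}/{}".format(bucket, key)

if __name__ == "__main__":
    print(ingest_raw_file("/tmp/clickstream-000.json"))
```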

Network


Sooner or later, your organization or your customers are likely to migrate their software and data infrastructure to the cloud. Given the shortage of engineers with cloud infrastructure expertise, you may well need to understand cloud network infrastructure yourself. In this section, we will cover different cloud networking concepts and see how on-premises infrastructure differs from what is available on cloud platforms. It does not matter which cloud provider you choose, whether AWS, Azure, GCP, or another; the networking concepts remain the same across providers.

Regions and availability zones

Regions are geographical locations around the world where network and compute resources are available. Regions are governed by the laws that apply to that location; for example, a region based anywhere in China has to follow China's policies on cloud usage. Each resource may or may not be available in...
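To see how regions and availability zones are exposed in practice, here is a minimal sketch using boto3 against AWS; the region name is an assumption and credentials are expected to be configured already.

```python
# Minimal sketch: list regions visible to the account and the AZs in one region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region choice is arbitrary here

# All regions enabled for this account
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
print("Regions:", regions)

# Availability zones inside the region the client is bound to
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```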

Managing resources


Resource management is a continuous process, whether the infrastructure is on premises or in the cloud. The instances you deploy, the clusters you spin up, and the storage you use all have to be continuously monitored and managed by the infrastructure team. Sometimes you need to attach a volume to an already-running instance, add an extra node to your distributed processing databases, or add instances to handle heavy traffic arriving at a load balancer. Other resource management work includes configuration management, such as changing firewall rules, adding new users who can access the resources, adding new rules that allow the current resource to access other resources, and so on.
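As a small illustration of two of the routine tasks mentioned above, the following boto3 sketch attaches an extra EBS volume to a running instance and opens a firewall port in a security group. All of the resource IDs, the device name, the port, and the CIDR range are hypothetical placeholders.

```python
# Minimal sketch of day-to-day resource management on AWS; all IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Attach an extra EBS volume to an already-running instance
ec2.attach_volume(
    VolumeId="vol-0123456789abcdef0",
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)

# Change a firewall rule: allow HDFS NameNode RPC (port 8020) from an internal CIDR
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpProtocol="tcp",
    FromPort=8020,
    ToPort=8020,
    CidrIp="10.0.0.0/16",
)
```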

Initially, due to a lack of good GUI interfaces and tools for monitoring and managing resources, this was done with custom scripts or commands. Today, almost all cloud providers offer well-developed graphical user interfaces and tools...

Data pipelines


Data pipelines represent the flow of data from extraction to business reporting, with intermediate steps such as cleansing, transformation, and loading gold data to the reporting layer. Data pipelines may be real-time, near real-time, or batch. The storage system may be either distributed storage, such as HDFS or S3, or a distributed, high-throughput messaging queue, such as Kafka or Kinesis.
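For example, on AWS one transformation step of such a pipeline can be submitted to an existing EMR cluster. The following is a minimal, hypothetical sketch with boto3: the cluster ID, the script location, and the S3 paths are assumptions, and the cleansing logic itself would live in the referenced Spark script.

```python
# Minimal sketch: submit a cleanse/transform step to a running EMR cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-ABCDEFGH12345",  # hypothetical existing cluster
    Steps=[
        {
            "Name": "cleanse-and-load",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-pipeline-code/cleanse.py",   # transformation script
                    "s3://my-datalake-raw/events/",       # input (raw zone)
                    "s3://my-datalake-gold/events/",      # output (gold/reporting zone)
                ],
            },
        }
    ],
)
print(response["StepIds"])
```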

In this section, we will talk about how we can use cloud providers' tools together to build data pipelines for customers. If you have already gone through the preceding section on the logical architecture of Hadoop in the cloud, this section will be easy to understand. We have repeatedly mentioned that every cloud provider available today offers tools equivalent to the open source ones. Each cloud provider claims a rich set of features and has benchmark tests available to support its claims, but remember that these benchmark tests are specific to their...

High availability (HA)


High availability (HA) is a primary focus for every framework and application available today. An application can be deployed either on premises or in the cloud. There are many cloud service providers available today, such as Amazon AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, and so on. Achieving high availability with an on-premises deployment has limited capability; for example, even if we have a multi-node cluster on premises with both HDFS storage and a processing engine up and running, that does not guarantee the required level of availability if a disaster happens. It also requires strict monitoring of the on-premises cluster to avoid any major loss and to guarantee high availability.
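One basic cloud HA building block is spreading identical nodes across several availability zones so that the loss of one zone does not take the service down. The sketch below shows this idea with an AWS Auto Scaling group via boto3; the group name, launch template, zone names, and sizes are all hypothetical, and a real setup would typically also involve subnets, health checks, and a load balancer.

```python
# Minimal sketch: keep a small fleet of nodes spread across two availability zones.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="hadoop-edge-nodes",                      # hypothetical
    LaunchTemplate={"LaunchTemplateName": "hadoop-edge-node",      # hypothetical
                    "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    AvailabilityZones=["us-east-1a", "us-east-1b"],  # instances replaced automatically on failure
)
```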

In other words, cloud services provide more robust and reliable high availability features, and there are multiple scenarios to consider depending on what level of high availability we want to achieve. Let us look into a few conceptual scenarios...

Summary


In this chapter, we studied the logical view of Hadoop in the cloud and what the logical architecture of Hadoop looks like there. We also learned about managing resources, which is a continuous process whether the infrastructure is on premises or in the cloud. We were introduced to data pipelines and how to use cloud providers' tools together to build data pipelines for customers. This chapter also focused on high availability, which is a primary focus for all frameworks and applications available today; an application can be deployed either on premises or in the cloud.

In the next chapter, we will study Hadoop Cluster Profiling. 
