Chapter 11. Hadoop in the Cloud

In the previous chapters, we covered some basic concepts of machine learning, machine learning case studies, streaming data ingestion, and so on. So far, we have walked you through the different components of data ingestion and data processing, some advanced concepts of the big data ecosystem, and a few design best practices to keep in mind while designing and implementing Hadoop applications. Any data pipeline needs infrastructure to execute on, and that infrastructure can be set up either on premises or in the cloud. In this chapter, we will cover the following topics:

  • Logical view of Hadoop in the cloud
  • How a network setup looks in the cloud
  • Resource management made easy
  • How to build data pipelines in the cloud
  • Cloud high availability

Technical requirements


You will need basic knowledge of AWS services and Apache Hadoop 3.0.

Check out the following video to see the code in action: http://bit.ly/2tGbwWz


Logical view of Hadoop in the cloud


The use of cloud infrastructure has been growing rapidly. Companies that once hesitated to adopt the cloud for their compute and storage needs because of security and availability concerns now have much more confidence, thanks to the many architectural and feature improvements made by the different cloud service providers. Apache Hadoop started out with on-premises deployments, and in its early years Hadoop infrastructure was mostly self-managed. Over the last few years, however, cloud service providers have focused on Hadoop infrastructure, and today they offer everything needed for a big data/Hadoop setup. Almost every company adopting big data now uses the cloud for its data storage and infrastructure needs. In this section, we will look at what the logical architecture of Hadoop looks like in the cloud:

  • Ingestion layer: Data ingestion...
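As one concrete illustration of the ingestion layer mentioned above, the following is a minimal sketch of landing raw files in S3 using the AWS SDK for Python (boto3). The bucket name, prefix, and file path are hypothetical, and boto3 with configured AWS credentials is assumed; this is not the only way to build an ingestion layer.

```python
# Minimal sketch of an S3-based ingestion (raw) zone; bucket/prefix are hypothetical.
import boto3

s3 = boto3.client("s3")

def ingest_raw_file(local_path, bucket="my-datalake-raw", prefix="events/2019/02/"):
    """Land a raw local file in the S3 ingestion zone and return its S3 URI."""
    key = prefix + local_path.split("/")[-1]
    s3.upload_file(local_path, bucket, key)  # boto3 handles multipart upload if needed
    return "s3://{}/{}".format(bucket, key)

if __name__ == "__main__":
    print(ingest_raw_file("/tmp/clickstream-000.json"))
```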

Network


Sooner or later, your organization or your customers are likely to migrate their software and data infrastructure to the cloud. Given the shortage of engineers with cloud infrastructure expertise, you may well need to understand cloud network infrastructure yourself. In this section, we will cover different cloud networking concepts and see how on-premises infrastructure differs from what is available on cloud platforms. It does not matter which cloud provider you choose, whether AWS, Azure, GCP, or another; the networking concepts remain the same across providers.

Regions and availability zones

Regions are geographical locations around the world where network and compute resources are available. Regions are governed by the laws that apply to that location; for example, a region based anywhere in China has to follow China's policies on cloud usage. Each resource may or may not be available in...
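To see how regions and availability zones are exposed in practice, here is a minimal sketch using boto3 against AWS; the region name is an assumption and credentials are expected to be configured already.

```python
# Minimal sketch: list regions visible to the account and the AZs in one region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region choice is arbitrary here

# All regions enabled for this account
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
print("Regions:", regions)

# Availability zones inside the region the client is bound to
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], zone["State"])
```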

Managing resources


Resource management is a continuous process, whether the infrastructure is on premises or in the cloud. The instances you deploy, the clusters you spin up, and the storage you use all have to be continuously monitored and managed by the infrastructure team. Sometimes you need to attach a volume to an already-running instance, add an extra node to your distributed processing databases, or add instances to handle heavy traffic arriving at a load balancer. Other resource management work includes configuration management, such as changing firewall rules, adding new users who can access the resources, adding new rules that allow the current resource to access other resources, and so on.
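As a small illustration of two of the routine tasks mentioned above, the following boto3 sketch attaches an extra EBS volume to a running instance and opens a firewall port in a security group. All of the resource IDs, the device name, the port, and the CIDR range are hypothetical placeholders.

```python
# Minimal sketch of day-to-day resource management on AWS; all IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Attach an extra EBS volume to an already-running instance
ec2.attach_volume(
    VolumeId="vol-0123456789abcdef0",
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)

# Change a firewall rule: allow HDFS NameNode RPC (port 8020) from an internal CIDR
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpProtocol="tcp",
    FromPort=8020,
    ToPort=8020,
    CidrIp="10.0.0.0/16",
)
```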

Initially, due to a lack of good GUI interfaces and tools for monitoring and managing resources, this was done with custom scripts or commands. Today, almost all cloud providers offer well-developed graphical user interfaces and tools...

Data pipelines


Data pipelines represent the flow of data from extraction to business reporting, with intermediate steps such as cleansing, transformation, and loading gold data to the reporting layer. Data pipelines may be real-time, near real-time, or batch. The storage system may be either distributed storage, such as HDFS or S3, or a distributed, high-throughput messaging queue, such as Kafka or Kinesis.
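For example, on AWS one transformation step of such a pipeline can be submitted to an existing EMR cluster. The following is a minimal, hypothetical sketch with boto3: the cluster ID, the script location, and the S3 paths are assumptions, and the cleansing logic itself would live in the referenced Spark script.

```python
# Minimal sketch: submit a cleanse/transform step to a running EMR cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-ABCDEFGH12345",  # hypothetical existing cluster
    Steps=[
        {
            "Name": "cleanse-and-load",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-pipeline-code/cleanse.py",   # transformation script
                    "s3://my-datalake-raw/events/",       # input (raw zone)
                    "s3://my-datalake-gold/events/",      # output (gold/reporting zone)
                ],
            },
        }
    ],
)
print(response["StepIds"])
```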

In this section, we will talk about how we can use cloud providers' tools together to build data pipelines for customers. If you have already gone through the preceding section on the logical architecture of Hadoop in the cloud, this section will be easy to understand. We have repeatedly mentioned that every cloud provider available today offers tools equivalent to the open source ones. Each cloud provider claims a rich set of features and has benchmark tests available to support its claims, but remember that these benchmark tests are specific to their...

High availability (HA)


High availability (HA) is a primary focus for every framework and application available today. An application can be deployed either on premises or in the cloud. There are many cloud service providers available today, such as Amazon AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, and so on. Achieving high availability with an on-premises deployment has limited capability; for example, even if we have a multi-node cluster on premises with both HDFS storage and a processing engine up and running, that does not guarantee the required level of availability if a disaster happens. It also requires strict monitoring of the on-premises cluster to avoid any major loss and to guarantee high availability.
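One basic cloud HA building block is spreading identical nodes across several availability zones so that the loss of one zone does not take the service down. The sketch below shows this idea with an AWS Auto Scaling group via boto3; the group name, launch template, zone names, and sizes are all hypothetical, and a real setup would typically also involve subnets, health checks, and a load balancer.

```python
# Minimal sketch: keep a small fleet of nodes spread across two availability zones.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="hadoop-edge-nodes",                      # hypothetical
    LaunchTemplate={"LaunchTemplateName": "hadoop-edge-node",      # hypothetical
                    "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
    AvailabilityZones=["us-east-1a", "us-east-1b"],  # instances replaced automatically on failure
)
```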

In other words, cloud services provide more robust and reliable high availability features, and there are multiple scenarios to consider depending on what level of high availability we want to achieve. Let us look into a few conceptual scenarios...

Summary


In this chapter, we studied the logical view of Hadoop in the cloud and what the logical architecture of Hadoop looks like there. We also learned about managing resources, which is a continuous process whether the infrastructure is on premises or in the cloud. We were introduced to data pipelines and how to use cloud providers' tools together to build data pipelines for customers. This chapter also focused on high availability, which is a primary focus for all frameworks and applications available today; an application can be deployed either on premises or in the cloud.

In the next chapter, we will study Hadoop Cluster Profiling. 
