
You're reading from Big Data Analytics with Hadoop 3

Product type: Book
Published in: May 2018
Publisher: Packt
ISBN-13: 9781788628846
Edition: 1st Edition
Author (1)
Sridhar Alla

Sridhar Alla is the co-founder and CTO of Blue Whale Consulting and is an expert at helping companies (big and small) define their vision for systems and capabilities that will allow them to establish a strategic execution plan to deal with the ever-growing data collected to support analytics and product teams. He is very experienced at dealing with all aspects of data collection, security, governance, and processing as part of end-to-end big data analytics and machine learning initiatives (including predictive modeling, deep learning, and ML automation). Sridhar is a published book author and an avid presenter at numerous conferences, including Strata, Hadoop World, and Spark Summit. He also has several patents filed with the US PTO on large-scale computing and distributed systems. He has over 18 years' experience writing code in Scala, Java, C, C++, Python, R, and Go, and has extensive hands-on knowledge of Spark, Flink, TensorFlow, Keras, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing. Sridhar lives with his wife and daughter in New Jersey and in his spare time loves blogging and coaching organizations on next-generation advancements in technology and their alignment with business goals.

Chapter 12. Using Amazon Web Services

This chapter introduces Amazon Web Services (AWS) and the services that are useful for big data analytics, including setting up a Hadoop cluster in the AWS Cloud with Elastic MapReduce (EMR). We will look at the key components and services offered by AWS and get an idea of what each of them lets us do.

In a nutshell, the following topics will be covered in this chapter:

  • Amazon Elastic Compute Cloud
  • Launching multiple instances from an AMI
  • What is AWS Lambda?
  • Introduction to Amazon S3
  • Amazon DynamoDB
  • Amazon Kinesis Data Streams
  • AWS Glue
  • Amazon EMR

Amazon Elastic Compute Cloud


Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable computing capacity in the Cloud. It is designed to make web-scale Cloud computing easier for developers.

Amazon EC2's simple web service interface allows you to obtain and configure capacity with ease. It provides you with complete control of your computing resources and lets you use Amazon's computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity (both up and down) as your computing requirements change. Amazon EC2 allows you to save computing costs, as you pay only for the capacity that you actually use. Amazon EC2 also provides developers with the tools to build failure-resilient applications and to isolate them from common failure scenarios.

Elastic web-scale computing

Amazon EC2 enables you to increase or decrease capacity within a span of minutes. You can commission one or several server...

Launching multiple instances of an AMI


Your instances keep running until you stop or terminate them, or until they fail. If an instance fails, you can launch a new one from the AMI.

Instances

You can launch different types of instances from a single AMI. An instance type essentially determines the hardware of the host computer used for your instance. Each instance type offers different compute and memory capabilities. Select an instance type based on the amount of memory and computing power that you need for the application or software that you plan to run on the instance. For more information about the hardware specifications for each Amazon EC2 instance type, see Amazon EC2 instances at https://aws.amazon.com/ec2/instance-types/.

After you launch an instance, it looks like a traditional host, and you can interact with it as you would any computer. You have complete control of your instances; you can use sudo to run commands that require root privileges.
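
As a minimal sketch of launching several instances from one AMI, the following Python builds the parameters for the EC2 RunInstances operation and, optionally, sends them with boto3 (the AWS SDK for Python). The helper names build_run_instances_request and launch, the AMI ID, the instance type, and the region are all assumptions made for illustration and do not come from this chapter. The boto3 call runs only if you call launch yourself with valid AWS credentials configured.

```python
def build_run_instances_request(ami_id, instance_type, count):
    """Build keyword arguments for the EC2 RunInstances operation.

    MinCount and MaxCount are both set to `count`, so the request asks
    for exactly that many instances of the same AMI.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }


def launch(request, region_name="us-east-1"):
    """Send the request with boto3 and return the new instance IDs.

    Requires the boto3 package and AWS credentials; the default region
    is an assumption for this example.
    """
    import boto3  # imported lazily so the builder works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.run_instances(**request)
    return [instance["InstanceId"] for instance in response["Instances"]]


# Placeholder AMI ID and instance type, for illustration only.
request = build_run_instances_request("ami-0123456789abcdef0", "t2.micro", 3)
print(request)
```

Keeping the request builder separate from the network call makes it possible to inspect or log exactly what will be launched before any instance is started.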

AMIs

Amazon Web Services (AWS...

What is AWS Lambda?


AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda executes your code only when needed and scales automatically, from a few requests per day to thousands per second. You pay only for the compute time you consume—there is no charge when your code is not running. With AWS Lambda, you can run code for virtually any type of application or backend service, all with zero administration. AWS Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring, and logging. All you need to do is supply your code in one of the languages that AWS Lambda supports (currently Node.js, Java, C#, Go, and Python).

You can use AWS Lambda to run your code in response to events, such as changes to data in an Amazon S3 bucket or an Amazon DynamoDB table; to run...
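
To make the event-driven model more concrete, here is a small sketch of a Python handler that reacts to an Amazon S3 notification. It assumes the standard S3 event layout (a Records list whose entries carry the bucket name and object key); the handler name, bucket, and key are hypothetical. The sample event at the bottom lets you run the handler locally without AWS.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Return the bucket and key of every S3 object named in the event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    return {"processed": len(objects), "objects": objects}


# Local invocation with a hand-written event; all names are hypothetical.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "logs/2018/app+log.txt"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

A real function would do something with each object (for example, read it and write a transformed copy), but the shape of the handler stays the same.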

Introduction to Amazon S3


Amazon S3 runs on the world's largest global Cloud infrastructure, and was built from the ground up to deliver a customer promise of 99.999999999% durability. Data is automatically distributed across a minimum of three physical facilities that are geographically separated within an AWS region, and Amazon S3 can also automatically replicate data to any other AWS region.
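
To get a feel for what eleven nines means in practice, the arithmetic below assumes the figure is read as an annual, per-object durability, so the expected yearly loss is the object count multiplied by (1 - durability). That reading and the object count are assumptions made for illustration.

```python
from fractions import Fraction

# 99.999999999% expressed exactly as a fraction.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Expected number of objects lost per year under the assumption above."""
    return object_count * (1 - DURABILITY)


losses = expected_annual_losses(10_000_000)
years_per_loss = 1 / losses
print(f"expected losses per year: {float(losses)}")
print(f"roughly one object every {int(years_per_loss)} years")
```

Under this reading, storing ten million objects corresponds to an expected loss of about one object every 10,000 years.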

Learn more about the AWS Global Cloud Infrastructure at https://aws.amazon.com/.

Getting started with Amazon S3

Amazon S3 is storage for the internet. You can use Amazon S3 to store and retrieve any amount of data at any time, from anywhere on the web. You can accomplish these tasks using the AWS Management Console, which is a simple and intuitive web interface. This guide introduces you to Amazon S3 and how to use the AWS Management Console to manage the storage space offered by Amazon S3.

Companies today need the ability to easily and securely collect, store, and analyze their data on a massive scale...

Amazon DynamoDB


Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database so that you don't have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling. DynamoDB also offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data. For more information, see Amazon DynamoDB Encryption at Rest at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EncryptionAtRest.html.

With DynamoDB, you can create database tables that can store and retrieve any amount of data, and serve any level of request traffic. You can scale up or scale down your tables' throughput capacity without downtime or performance degradation, and use the AWS Management Console to monitor resource utilization and performance metrics...
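
Creating a table and later changing its throughput can be sketched with boto3 as follows. This is an illustration under stated assumptions: the table name, key attribute, capacity numbers, region, and the helper function names are hypothetical, and the table is assumed to use provisioned throughput with a single string partition key. The apply function contacts AWS only if you call it with credentials configured.

```python
def build_create_table_request(table_name, hash_key, read_units, write_units):
    """Arguments for CreateTable with a single string partition key."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": hash_key, "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": hash_key, "KeyType": "HASH"},
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def build_update_throughput_request(table_name, read_units, write_units):
    """Arguments for UpdateTable that change only the provisioned throughput."""
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }


def apply(create_request, update_request, region_name="us-east-1"):
    """Create the table, wait until it exists, then resize its throughput."""
    import boto3  # imported lazily so the builders work without boto3

    dynamodb = boto3.client("dynamodb", region_name=region_name)
    dynamodb.create_table(**create_request)
    waiter = dynamodb.get_waiter("table_exists")
    waiter.wait(TableName=create_request["TableName"])
    dynamodb.update_table(**update_request)


# Hypothetical table and capacity values, for illustration only.
create_request = build_create_table_request("Events", "event_id", 5, 5)
update_request = build_update_throughput_request("Events", 20, 10)
print(create_request)
print(update_request)
```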

Amazon Kinesis Data Streams


You can use Amazon Kinesis Data Streams to collect and process large streams of data records in real time. You'll create data-processing applications, known as Amazon Kinesis Data Streams applications. A typical Amazon Kinesis Data Streams application reads data from a Kinesis data stream as data records. These applications can use the Kinesis Client Library, and they can run on Amazon EC2 instances. The processed records can be sent to dashboards, used to generate alerts or to dynamically change pricing and advertising strategies, or sent on to a variety of other AWS services. For information about Kinesis Data Streams features and pricing, see Amazon Kinesis Data Streams.
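
Before an application can read records, something has to write them. The sketch below shows only that producing side: it builds the parameters for the Kinesis PutRecord operation and can send them with boto3. Consuming with the Kinesis Client Library is not shown. The stream name, payload fields, partition key, region, and helper names are hypothetical.

```python
import json


def build_put_record_request(stream_name, payload, partition_key):
    """Arguments for the Kinesis PutRecord operation.

    The record data is sent as bytes; here the payload is JSON-encoded.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": partition_key,
    }


def send(request, region_name="us-east-1"):
    """Write one record with boto3 and return its sequence number."""
    import boto3  # imported lazily so the builder works without boto3

    kinesis = boto3.client("kinesis", region_name=region_name)
    response = kinesis.put_record(**request)
    return response["SequenceNumber"]


# Hypothetical click event, for illustration only.
record = build_put_record_request(
    "example-clickstream",
    {"user_id": "u-42", "action": "view", "page": "/pricing"},
    partition_key="u-42",
)
print(record)
```

Using the user ID as the partition key is one possible design choice: records with the same partition key are routed to the same shard, which keeps each user's events in order.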

Kinesis Data Streams is part of the Kinesis streaming data platform, along with Amazon Kinesis Data Firehose. For more information, see the Amazon Kinesis Data Firehose Developer Guide. For more information about AWS big data solutions, see Big Data. For more information about AWS streaming...

AWS Glue


AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and job retries/reattempts on failure. AWS Glue is serverless, so there's no infrastructure to set up or manage.

Use the AWS Glue console to discover data, transform it, and make it available for searching and querying. The console calls the underlying services to orchestrate the work required to transform your data. You can also use the AWS Glue API operations to interface with AWS Glue services. Edit, debug, and test your Python or Scala Apache Spark ETL code using a familiar development environment.
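
As a rough illustration of the AWS Glue API operations mentioned above, the following sketch starts a run of an existing ETL job through boto3 and reads back its state. The job name, argument names, region, and helper names are hypothetical, and the job itself is assumed to have been defined in AWS Glue already. The '--' prefix on argument names follows the usual convention for Glue job parameters.

```python
def build_start_job_run_request(job_name, arguments):
    """Arguments for the Glue StartJobRun operation.

    Job parameter names are prefixed with '--' and values are passed as
    strings.
    """
    return {
        "JobName": job_name,
        "Arguments": {
            f"--{name}": str(value) for name, value in arguments.items()
        },
    }


def start_and_check(request, region_name="us-east-1"):
    """Start the job run with boto3 and return its current state."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    run_id = glue.start_job_run(**request)["JobRunId"]
    job_run = glue.get_job_run(JobName=request["JobName"], RunId=run_id)
    return job_run["JobRun"]["JobRunState"]


# Hypothetical job and parameters, for illustration only.
job_request = build_start_job_run_request(
    "example-clean-logs",
    {"source_path": "s3://example-bucket/raw/", "year": 2018},
)
print(job_request)
```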

When should I use AWS Glue?

You can use AWS Glue to build a data...

Amazon EMR


Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. You can also use Amazon EMR to transform and move large amounts of data in and out of other AWS data stores and databases, such as Amazon S3 and Amazon DynamoDB.

Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
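
The sketch below shows what requesting such a cluster might look like through the EMR RunJobFlow operation in boto3. Everything specific in it is an assumption made for illustration: the cluster name, the release label, the instance type and counts, the region, and the helper names. It also assumes that the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in the account. Nothing is launched unless you call start_cluster with credentials configured.

```python
def build_run_job_flow_request(name, release_label, instance_type, core_count):
    """Arguments for RunJobFlow: one master node plus `core_count` core nodes."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def start_cluster(request, region_name="us-east-1"):
    """Create the cluster with boto3 and return its cluster (job flow) ID."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]


# Placeholder values, for illustration only.
cluster_request = build_run_job_flow_request(
    "example-analytics-cluster", "emr-5.13.0", "m4.large", core_count=2
)
print(cluster_request)
```

Because the cluster is kept alive when it has no steps, it continues to accrue charges until it is terminated explicitly.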

Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis...

Summary


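In this chapter, we discussed AWS as a Cloud provider and looked at its key services for big data analytics: Amazon EC2, AWS Lambda, Amazon S3, Amazon DynamoDB, Amazon Kinesis Data Streams, AWS Glue, and Amazon EMR.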

In the next chapter, we will bring everything together to understand what it takes to realize the business goals of building a practical big data analytics practice.
