You're reading from AWS Certified Solutions Architect - Associate Guide

Product type: Book
Published in: Oct 2018
Publisher: Packt
ISBN-13: 9781789130669
Edition: 1st Edition
Authors (2):
Gabriel Ramirez

Gabriel Ramirez is a passionate technologist with broad experience in the software industry. He currently works as an Authorized Trainer for Amazon Web Services and Google Cloud. He holds 9/9 AWS Certifications and does community work by organizing the AWS User Groups in Mexico.

Stuart Scott

Stuart Scott is the AWS content lead at Cloud Academy, where he has created over 40 courses reaching tens of thousands of students. His content focuses heavily on cloud security and compliance, specifically on how to implement and configure AWS services to protect, monitor, and secure customer data in an AWS environment. He has written numerous cloud security blogs for Cloud Academy and other AWS advanced technology partners. He has taken part in a series of cloud security webinars to share his knowledge and experience within the industry and to help those looking to implement a secure and trusted environment. In January 2016, Stuart was awarded 'Expert of the Year' by Experts Exchange for sharing his knowledge of cloud services with the community.

Introducing Amazon Elastic MapReduce

The volume of data created by mankind is increasing massively. In the last two years, we have created more data than in the entire previous history of the human race, and unstructured data grows every second. This is why new paradigms must be used to manage it properly.

The term big data is used more and more frequently, but what exactly is big data? How big is big data? It all depends on the perspective. Imagine a small company that works with spreadsheets accumulating data every year to the point where this tool is no longer useful. The company needs a new strategy such as relational databases and ERP software.

This same analogy works for big companies. Big data is using non-traditional methods to analyze vast amounts of information from different sources and types. Latency plays an important role in the big-data pipeline, because, depending on...

Technical requirements

You will need access to the AWS CLI, Python 2.6.5 or higher, and an IAM user with sufficient permissions to create roles, EC2 instances, and related resources. An AdministratorAccess policy can be used.

Clustering in AWS

Clustering is a way to group compute resources physically: the closer the nodes are to each other, the better the communication performance and the lower the jitter. Clusters can be tightly or loosely coupled and have a master node that performs all the orchestration activities of the compute nodes. Every cluster in AWS is a single Availability Zone (AZ) concept. To gain resilience, a cluster can use specialized persistence services such as EFS, EBS, and Amazon S3.

There are two main groups of clusters in AWS, each one with a specific purpose:

  • Cluster HPC: This cluster is tightly coupled, and network performance is a major concern. This model uses higher-throughput instances, placement groups, jumbo frames, and single-AZ compute nodes, and it needs strong orchestration mechanisms. Examples of these workloads are media transcoding services and fraud risk analysis.
  • Distributed...
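Jumbo frames, mentioned above for tightly coupled clusters, can be checked from inside a Linux instance by reading the interface MTU. The following is a minimal sketch, not the book's procedure: the interface name `eth0` and the sysfs path are assumptions about a typical Linux instance, and the helper names are hypothetical.

```python
import os

STANDARD_MTU = 1500   # typical Ethernet MTU, in bytes
AWS_JUMBO_MTU = 9001  # MTU AWS uses for jumbo frames inside a VPC


def read_mtu(interface="eth0", sysfs_root="/sys/class/net"):
    """Return the MTU of a network interface as reported by Linux sysfs.

    The interface name and sysfs location are assumptions; adjust them
    for your instance.
    """
    path = os.path.join(sysfs_root, interface, "mtu")
    with open(path) as handle:
        return int(handle.read().strip())


def supports_jumbo_frames(mtu):
    """Treat any MTU above the standard 1500 bytes as a jumbo-frame MTU."""
    return mtu > STANDARD_MTU
```

On an instance, `supports_jumbo_frames(read_mtu())` would report whether the primary interface is configured for frames larger than the standard size.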

Placement groups

Placement groups are a great way to improve network performance (the highest packets per second between instances and the lowest latency) for network-intensive applications by physically co-locating instances on the same hardware.

Spread placement groups overcome the single-hardware limitation of a placement group by placing instances on distinct hardware, eliminating single points of failure.

Creating a placement group

  1. To create a placement group, navigate to EC2, and select Placement groups and Create Placement Group, as follows:
  1. The allocation of instances inside the placement group is a one-time-only action. If you want to modify the placement group by adding instances, you'll need to relaunch...
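The console steps above also have a programmatic equivalent. The sketch below assumes the boto3 SDK and builds the arguments for the EC2 `create_placement_group` and `run_instances` calls; the helper names, group name, AMI ID, and instance type are hypothetical placeholders, and only the two strategies discussed in this section are accepted.

```python
def placement_group_request(name, strategy="cluster"):
    """Build keyword arguments for the EC2 create_placement_group call."""
    allowed = ("cluster", "spread")  # the two strategies covered in the text
    if strategy not in allowed:
        raise ValueError("strategy must be one of %s" % (allowed,))
    return {"GroupName": name, "Strategy": strategy}


def launch_request(ami_id, instance_type, count, group_name):
    """Build keyword arguments for run_instances inside a placement group."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,  # launch all instances together
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }


def create_group_and_launch(ec2_client, name, ami_id, instance_type, count):
    """Create the group, then launch every instance into it in one request."""
    ec2_client.create_placement_group(**placement_group_request(name))
    return ec2_client.run_instances(
        **launch_request(ami_id, instance_type, count, name)
    )

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   ec2 = boto3.client("ec2")
#   create_group_and_launch(ec2, "hpc-group", "ami-xxxxxxxx", "c5.large", 2)
```

Launching all instances in a single request reflects the point made above: the allocation inside the group happens once.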

Elastic MapReduce

Elastic MapReduce (EMR) is a fully managed cluster platform for running big-data and analytics frameworks such as Apache Hadoop, Spark, HBase, Presto, Impala, Cascading, and Flink. Running Hadoop clusters yourself is a complex and time-consuming task, so EMR provisions the cluster and installs frequently used frameworks for data scientists, analysts, and engineers.
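Hadoop, one of the frameworks listed above, is built around the MapReduce programming model. The following pure-Python sketch illustrates the three phases (map, shuffle, reduce) on a handful of log lines. It is not EMR or Hadoop code, and the log format (`<ip> <status> <path>`) is a made-up assumption for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (status_code, 1) pair for every well-formed log line."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1], 1  # assumed format: "<ip> <status> <path>"


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return dict((key, sum(values)) for key, values in groups.items())


def count_status_codes(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster, the map and reduce functions run in parallel on many nodes and the framework performs the shuffle over the network; here everything runs in one process to keep the data flow visible.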

EMR gives you the flexibility to bootstrap your cluster with a series of customer-defined steps that install and configure software and prepare your data for processing. EMR can use the Hadoop Distributed File System (HDFS) on EBS volumes, or EMRFS with Amazon S3 as the backing persistence service.
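As a rough illustration of the bootstrap and step concepts described above, the sketch below builds a request for the boto3 EMR `run_job_flow` call. Everything specific in it is an assumption: the release label, instance types, instance count, and the S3 paths passed in are placeholders, and the helper name is hypothetical. The role names are the default EMR roles.

```python
def emr_cluster_request(name, log_uri, bootstrap_script, step_jar, step_args):
    """Build keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,              # S3 location for cluster logs
        "ReleaseLabel": "emr-5.17.0",   # assumption: any supported release works
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m4.large",  # assumption
            "SlaveInstanceType": "m4.large",   # assumption
            "InstanceCount": 3,
            # Terminate the cluster once the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Bootstrap actions run on every node before the frameworks start.
        "BootstrapActions": [
            {
                "Name": "prepare-nodes",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        # Steps are the units of work submitted to the running cluster.
        "Steps": [
            {
                "Name": "process-data",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": step_jar, "Args": list(step_args)},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage, assuming boto3 is installed and the default EMR roles exist:
#   import boto3
#   request = emr_cluster_request(
#       "demo", "s3://my-bucket/logs/", "s3://my-bucket/bootstrap.sh",
#       "command-runner.jar", ["spark-submit", "s3://my-bucket/job.py"])
#   boto3.client("emr").run_job_flow(**request)
```

Reading input from and writing output to `s3://` paths in the step arguments is how a job would use EMRFS instead of HDFS for persistence.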

EMR clusters have a variety of use cases, from ETL and batch processing to real-time applications integrating Amazon Kinesis Data Firehose or Apache Spark, and a wide range of connectors and integration architectures. Clusters on EMR can be...

Summary

In this chapter, you have learned about some of the options available for clustering in AWS. We remarked on the differences between Cluster HPC and Distributed Grids, and we created a cluster with the CfnCluster framework.

We also discussed some of the networking optimizations available at the hypervisor and interface level. We learned how to inspect for jumbo frame capabilities, performed a TCP benchmark between instances, and created a compute placement group.

We introduced EMR and learned how the MapReduce programming model works, creating an EMR cluster that aggregates logs from a public dataset.

Further reading

The books Mastering Hadoop 3 and Big Data Architect's Handbook are recommended. For a deep dive into Extract, Transform, and Load (ETL) workflows, read about AWS Glue (https://aws.amazon.com/glue).

EMR works well with Amazon S3 for building data lakes; for more information, see https://aws.amazon.com/big-data/datalakes-and-analytics/. For a deep understanding of Hadoop architecture and HDFS, use the following links:
