You're reading from AWS for Solutions Architects - Second Edition

Product type: Book
Published in: Apr 2023
Publisher: Packt
ISBN-13: 9781803238951
Edition: 2nd Edition
Authors (4):
Saurabh Shrivastava

Saurabh Shrivastava is a technology leader, author, inventor, and public speaker with over 18 years of experience in the IT industry. He currently works at Amazon Web Services (AWS) as a Global Solutions Architect Leader and enables global consulting partners and enterprise customers on their journey to the cloud. Saurabh led the AWS global technical partnerships, set his team's vision and execution model, and nurtured multiple new strategic initiatives. Saurabh has authored various blogs and whitepapers across a diverse range of technologies, such as big data, IoT, machine learning, and cloud computing. He is passionate about the latest innovations and their impact on our society and daily life. He holds a patent in the area of cloud platform automation. Before AWS, Saurabh worked as an enterprise solution architect, software architect, and software engineering manager in Fortune 50 enterprises, start-ups, and global product and consulting organizations.

Neelanjali Srivastav

Neelanjali Srivastav is a technology leader, product manager, agile coach, and cloud practitioner with over 16 years of experience in the software industry. She currently works at Amazon Web Services (AWS) as a Senior Product Manager and enables global customers on their data journey to the cloud. Neelanjali evangelizes and enables AWS customers and partners in AWS database, analytics, and machine learning services. She sets the product vision and cultivates new products in incubation. Before AWS, Neelanjali led teams of software engineers, solutions architects, and systems analysts to modernize IT systems and develop innovative software solutions for large enterprises. Neelanjali has held multiple roles in the IT services industry and R&D, focusing on enterprise application management, cloud service management, and orchestration.

Alberto Artasanchez

Alberto Artasanchez is a solutions architect with expertise in the cloud, data solutions, and machine learning, with a career spanning over 28 years in various industries. He is an AWS Ambassador and publishes frequently in a variety of cloud and data science publications. He is often tapped as a speaker on topics including data science, big data, and analytics. He has a strong and extensive track record of designing and building end-to-end machine learning platforms at scale. He also has a long track record of leading data engineering teams and mentoring, coaching, and motivating them. He has a great understanding of how technology drives business value and has a passion for creating elegant solutions to complicated problems.

Imtiaz Sayed

Imtiaz (Taz) Sayed leads the Worldwide Data Analytics Solutions Architecture community at AWS. He is a Principal Solutions Architect and works with diverse customers, engaging in thought leadership, strategic partnerships, and specialized guidance on building modern data platforms on AWS. He is a technologist with over 20 years of experience across several domains, including distributed architectures, data analytics, service mesh, databases, and DevOps.

Data Warehouses, Data Queries, and Visualization in AWS

The decreasing cost of storage in the cloud means that businesses no longer need to choose which data to keep and which to discard. Additionally, with pay-as-you-go and on-demand storage and compute options available, analyzing data to gain insights is now more accessible. Businesses can store all relevant data points, even as they grow to massive volumes, and analyze the data in various ways to extract insights. This can drive innovation within an organization and result in a competitive advantage.

In Chapter 5, Storage in AWS – Choosing the Right Tool for the Job, you learned about the file and object storage services offered by AWS. In Chapter 7, Selecting the Right Database Service, we covered many of the AWS database services. Now, the question is how to query and analyze the data held in these different storage services and databases. One of the most popular ways to analyze structured data is using data warehouses and...

Data warehouses in AWS with Amazon Redshift

Data is a strategic asset for every organization, not just for new businesses and gaming companies. The cost and difficulty of storing data have fallen significantly in recent times, making data an essential part of many companies’ business models. Nowadays, organizations are leveraging data to make informed decisions, such as launching new product offerings, introducing revenue streams, automating processes, and earning customer trust. These data-driven decisions can propel innovation and steer your business toward success.

You want to leverage your data to gain business insights, but this data is distributed into silos. For example, structured data resides in relational databases, semi-structured data is stored in object stores, and clickstream data that is streaming from the internet is stored in streaming storage. In addition, you also need to address emerging use cases such as machine learning (ML). Business users in your organization...

Querying your data lake in AWS with Amazon Athena

Water, water everywhere, and not a drop to drink… This may be the feeling you get in today’s enterprise environments. We are producing data at an exponential rate, but it is sometimes difficult to find a way to analyze this data and gain insights from it. Some of the data that we are generating at a prodigious rate is of the following types:

  • Application logging
  • Clickstream data
  • Surveillance video
  • Smart and IoT devices
  • Commercial transactions

Often, this data is captured without analysis or is at least not analyzed to the fullest extent. Analyzing this data properly can translate into the following:

  • Increased sales
  • Cross-selling opportunities
  • Avoiding downtime and errors before they occur
  • Serving customer bases more efficiently

Previously, one stumbling block to analyzing this data was that much of this information resided in flat files....

Deep-diving into Amazon Athena

As mentioned previously, Amazon Athena is quite flexible and can handle simple and complex database queries using standard SQL. It supports joins and arrays. It can use a wide variety of file formats, including these:

  • CSV
  • JSON
  • ORC
  • Avro
  • Parquet

It also supports other formats, but these are the most common. In some cases, the files you are using have already been created, and you may have little flexibility regarding their format. But where you can specify the file format, it’s important to understand the advantages and disadvantages of each one. In other cases, it may even make sense to convert the files into another format before using Amazon Athena. Let’s take a quick look at these formats and understand when to use them.
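As a rough illustration of converting files before querying them, the sketch below turns CSV text into newline-delimited JSON using only the Python standard library. The function name and the sample columns are hypothetical. It assumes the target table reads JSON with one record per line, so check that against the SerDe you plan to use. Converting to a columnar format such as Parquet or ORC would need a third-party library and is not shown here.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text (with a header row) to newline-delimited JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One compact JSON object per line. Values stay strings because
    # CSV carries no type information.
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data for illustration only.
sample = "order_id,country,amount\n1001,US,25.50\n1002,DE,14.00\n"
print(csv_to_json_lines(sample))
```

Each output line is a standalone JSON object keyed by the CSV header names, which keeps the column names attached to every record.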

CSV files

A Comma-Separated Value (CSV) file is a file where a comma separator delineates each value, and a return character delineates...

Using Amazon Athena Federated Query

Unless your organization has specific requirements, it’s likely that you store data in various storage types, selecting the most appropriate storage type based on its purpose. For example, you may choose graph databases when they are the best fit, relational databases for certain use cases, and S3 object storage or Hadoop HDFS when they are the most suitable. Amazon Neptune (a graph database) may be the best choice if you are building a social network application. Or, if you are building an application that requires a flexible schema, Amazon DynamoDB may be a solid choice. AWS offers many different types of persistence solutions, such as these:

  • Relational database services
  • Key-value database services
  • Document database services
  • In-memory database services
  • Search database services
  • Graph database services
  • Time-series database services
  • Ledger database services
  • Plain object data...

Learning about Amazon Athena workgroups

Another feature that comes with Amazon Athena is the concept of workgroups. Workgroups enable administrators to give different groups of users different access to databases, tables, and other Athena resources. They also enable you to establish limits on how much data a query or a whole workgroup can access, and they provide the ability to track costs. Since workgroups act like any other resource in AWS, resource-level identity-based policies can be set up to control access to individual workgroups.
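To make the idea of a resource-level policy concrete, here is a minimal sketch that builds an identity-based policy document scoped to a single workgroup. The region, account ID, and workgroup name are placeholders, and the action list is an assumption about what a query user might need rather than a recommended baseline. A real setup would normally also need permissions on the underlying S3 locations and the data catalog.

```python
import json


def workgroup_policy(region: str, account_id: str, workgroup: str) -> dict:
    """Build an identity-based policy limited to one Athena workgroup."""
    # Workgroups are addressed by ARN like other AWS resources.
    arn = f"arn:aws:athena:{region}:{account_id}:workgroup/{workgroup}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "athena:GetWorkGroup",
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:StopQueryExecution",
                ],
                "Resource": arn,
            }
        ],
    }


# Placeholder values for illustration only.
policy = workgroup_policy("us-east-1", "123456789012", "marketing-analysts")
print(json.dumps(policy, indent=2))
```

Attaching a policy like this to a user or role would restrict the listed Athena actions to the named workgroup only.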

Workgroups can be integrated with SNS and CloudWatch as well. If query metrics are turned on, these metrics can be published to CloudWatch. Additionally, alarms can be created for certain workgroup users if their usage goes above a pre-established threshold.
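The limits and metrics described above can be sketched as a workgroup definition. The example below only builds the request as a plain dictionary, so it runs without AWS credentials. The workgroup name, bucket, and 10 GiB cutoff are hypothetical, and the keys follow the parameters of the boto3 Athena `create_work_group` call as far as I am aware, so verify them against the current API reference before use. It covers only the per-query scan limit.

```python
def workgroup_request(name: str, output_location: str, cutoff_bytes: int) -> dict:
    """Build keyword arguments for creating a workgroup with a per-query scan limit."""
    return {
        "Name": name,
        "Description": "Ad hoc queries with a per-query data scan limit",
        "Configuration": {
            # Where query results are written.
            "ResultConfiguration": {"OutputLocation": output_location},
            # Workgroup settings override client-side settings.
            "EnforceWorkGroupConfiguration": True,
            # Publish query metrics to CloudWatch.
            "PublishCloudWatchMetricsEnabled": True,
            # Cancel any query that scans more than this many bytes.
            "BytesScannedCutoffPerQuery": cutoff_bytes,
        },
    }


# Hypothetical names; 10 GiB per-query limit.
request = workgroup_request(
    "marketing-analysts",
    "s3://example-athena-results/marketing/",
    10 * 1024**3,
)
# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("athena").create_work_group(**request)
print(request["Configuration"]["BytesScannedCutoffPerQuery"])
```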

By default, Amazon Athena queries run in the default primary workgroup. AWS administrators can add new workgroups and then run separate workloads in each workgroup. A common use case is...

Optimizing Amazon Athena

As with any SQL operation, you can take steps to optimize the performance of your queries and inserts. As with traditional databases, optimizing your data access performance usually comes at the expense of data ingestion and vice versa.

Let’s look at some tips that you can use to increase and optimize performance.

Optimization of data partitions

One way to improve performance is to break up the data into smaller groups of files called partitions. A common partition scheme splits the data on a value that recurs regularly in it. Some examples follow:

  • Country
  • Region
  • Date
  • Product

Partitions operate as virtual columns and reduce the amount of data that needs to be read for each query. Partitions are normally defined at the time a table or file is created.

Amazon Athena can use Apache Hive partitions. Hive partitions use this naming convention:

s3://BucketName/TablePath/<PARTITION_COLUMN_NAME...
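As an illustration of how partitioning reduces the data read, the sketch below lays out objects using Hive-style `key=value` path segments and then keeps only those under one partition. The bucket, table path, partition columns, and file names are all hypothetical, and the prefix filter is a simplified stand-in for the pruning a query engine performs, not a description of Athena internals.

```python
from typing import List


def partition_prefix(bucket: str, table_path: str, **partitions: str) -> str:
    """Build an S3 prefix with Hive-style key=value partition segments."""
    segments = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"s3://{bucket}/{table_path}/{segments}/"


def prune(keys: List[str], prefix: str) -> List[str]:
    """Keep only the objects stored under one partition prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical layout: a sales table partitioned by country, then by date.
objects = [
    partition_prefix("example-bucket", "sales", country="US", dt="2023-04-01") + "part-0000.parquet",
    partition_prefix("example-bucket", "sales", country="US", dt="2023-04-02") + "part-0000.parquet",
    partition_prefix("example-bucket", "sales", country="DE", dt="2023-04-01") + "part-0000.parquet",
]

# A filter on country = 'US' only needs the objects under that prefix.
us_only = prune(objects, partition_prefix("example-bucket", "sales", country="US"))
print(len(us_only), "of", len(objects), "objects need to be read")
```

Because the partition value is encoded in the path, a filter on that column can skip whole prefixes without opening any files in them.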

Using Amazon Athena versus Redshift Spectrum

Amazon Athena and Redshift Spectrum are two data querying services offered by AWS that allow users to analyze data stored in Amazon S3 using standard SQL.

Amazon Athena is a serverless interactive query service that quickly analyzes data in Amazon S3 using standard SQL. It allows users to analyze data directly from Amazon S3 without creating or managing any infrastructure. Athena is best suited for ad hoc querying and interactive analysis of large amounts of unstructured data that is stored in Amazon S3.

For example, imagine a marketing team needs to analyze customer behavior data stored in Amazon S3 to make informed decisions about their marketing campaigns. They can use Athena to query the data in S3, extract insights, and make informed decisions about how to improve their campaigns.
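A hedged sketch of what such an ad hoc query might look like follows. The table, columns, database, workgroup, and bucket are invented for illustration. The dictionary keys follow the parameters of the boto3 Athena `start_query_execution` call as far as I am aware, and the example only builds the request, so it runs without AWS access.

```python
def ad_hoc_query_request(database: str, workgroup: str, output_location: str) -> dict:
    """Build keyword arguments for an ad hoc Athena query over campaign clicks."""
    # Hypothetical table and columns: count clicks per campaign.
    sql = (
        "SELECT campaign_id, COUNT(*) AS clicks "
        "FROM clickstream_events "
        "WHERE event_type = 'click' "
        "GROUP BY campaign_id "
        "ORDER BY clicks DESC "
        "LIMIT 10"
    )
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = ad_hoc_query_request(
    "marketing", "marketing-analysts", "s3://example-athena-results/marketing/"
)
# With AWS credentials configured, the query could be started like this:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
print(request["QueryString"])
```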

On the other hand, Amazon Redshift Spectrum (an extension of Amazon Redshift) allows users to analyze data stored in Amazon S3 with the same...

Visualizing data with Amazon QuickSight

Data is an organizational asset that needs to be available easily and securely to anyone who needs access. Data is no longer solely the property of analysts and scientists. Presenting data simply and visually enables teams to make better and more informed decisions, improve efficiency, uncover new opportunities, and drive innovation.

Most traditional on-premises business intelligence solutions come with a client-server architecture and have minimum licensing requirements. To get started with these tools, you must sign up for annual commitments around users or servers, which requires upfront investment. You also need to set up extensive monitoring and management, plan for infrastructure growth, patch software, and take periodic data backups to keep your systems in compliance. On top of that, if you want to deliver data and insights to your customers and other third parties, this usually requires separate systems and tools for each audience.

...

Putting AWS analytic services together

In the previous chapter, Chapter 10, Big Data and Streaming Data Processing in AWS, you learned about AWS ETL services such as EMR and Glue. In this chapter, let’s combine that with learning how to build a data processing pipeline. The following diagram shows a data processing and analytics architecture in AWS that applies various analytics services to build an end-to-end solution:

Figure 11.6: Data analytic architecture in AWS

As shown in the preceding diagram, data is ingested into S3 from various sources, such as operational systems, marketing, and other systems. You want to ingest data fast without losing it, so this data is collected in a raw format first. You can clean, process, and transform this data using an ETL platform such as EMR or Glue. Using the Apache Spark framework and writing data processing code from scratch is recommended when using Glue; otherwise, you can use EMR if you have Hadoop skill sets in your team...

Summary

In this chapter, you learned how to query and visualize data in AWS. You started with learning about Amazon Redshift, your data warehouse in the cloud, before diving deeper into the Redshift architecture and learning about its key capabilities. Further, you learned about Amazon Athena, a powerful service that can “convert” any file into a database by allowing us to query and update the contents of that file by using the ubiquitous SQL syntax.

You then learned how we could add some governance to the process using Amazon Athena workgroups and how they can help us to control access to files by adding a level of security to the process. As you have learned throughout the book, there is not a single AWS service (or any other tool or product, for that matter) that is a silver bullet for solving all problems. Amazon Athena is no different, so you learned about some scenarios where Amazon Athena is an appropriate solution and other use cases where other AWS services...
