You're reading from Modern Data Architecture on AWS
Why does a performant and cost-effective data platform matter?
One of the key pillars of a modern data architecture on AWS is the performance and cost of the data platform being built. Users of the platform will not wait 5 minutes for a report to load. Likewise, if an organization measures the return on investment of its data platform, getting a dollar's worth of benefit is not sustainable when the result costs two dollars to produce.
The performant and cost-effective pillar of modern data architecture on AWS matters for several reasons:
- Cost-efficiency: Optimizing costs is crucial for any organization. By implementing a cost-optimized data architecture, you can minimize unnecessary expenses and achieve a better return on investment. AWS provides a wide range of services and tools to help you control and optimize your data-related costs.
- Scalability: AWS offers highly scalable services that allow you to scale your data infrastructure based...
Data storage optimizations
In any data platform, the storage layer is the foundation: every system in the platform persists its data in some type of storage. Even though storage cost is often not the dominant part of the overall expenditure on the data platform, it can start to creep up if best practices are not followed.
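One of those best practices is tiering aging data into cheaper storage classes with S3 lifecycle rules. The following is a minimal sketch that builds such a configuration; the prefix, day thresholds, and storage-class choices are illustrative assumptions, not recommendations for every workload.

```python
# Sketch: building an S3 lifecycle configuration that moves aging data to
# cheaper storage classes. Thresholds and classes are illustrative assumptions.

def build_lifecycle_rules(prefix="raw/"):
    """Return lifecycle rules in the shape expected by S3's
    PutBucketLifecycleConfiguration API."""
    return {
        "Rules": [
            {
                "ID": "tier-aging-data",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to Infrequent Access after 30 days,
                # then to Glacier Flexible Retrieval after 90 days.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Clean up failed multipart uploads -- a common source
                # of invisible storage cost.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }

# Applying it would look like this (requires boto3 and AWS credentials):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket",
#       LifecycleConfiguration=build_lifecycle_rules())
```

Keeping the rules in code (rather than clicking them together in the console) makes them reviewable and repeatable across the many buckets a petabyte-scale platform accumulates.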
Let’s consider a scenario that requires a deep dive into storage optimization.
Use case for storage optimization
GreatFin has established a data platform on AWS and uses many of the data and analytics services provided by AWS to operate different areas of the platform. After onboarding data from a variety of sources, the combined platform storage across all LOBs has grown to a petabyte scale. GreatFin’s storage infrastructure on AWS lacks optimization, leading to potential challenges such as high storage costs, performance bottlenecks, limited scalability, and inadequate data protection.
The...
Compute resource optimizations
In any typical modern data platform built using AWS data and analytics services, the platform's infrastructure expenses will be dominated by compute. Take any service we discussed in this book – DMS for data ingestion, Glue and EMR for data processing, Kinesis and MSK for streaming data, Redshift for data warehousing, Athena for ad hoc analytics on the data lake, the various SageMaker tools for ML, OpenSearch Service for operational analytics, QuickSight for business intelligence, and many other supporting services – and you will find that the vast majority of its cost comes from the compute resources backing it. The reason is simple – CPUs/GPUs are significantly more expensive than storage, memory, and networking.
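A quick back-of-envelope calculation illustrates how lopsided this split can be. The prices below are assumed figures for the sake of the sketch (roughly in the ballpark of a Glue DPU-hour and an S3 Standard GB-month), not current AWS list prices:

```python
# Back-of-envelope illustration of compute vs. storage cost dominance.
# All prices are illustrative assumptions, not current AWS list prices.

GLUE_DPU_HOUR = 0.44   # assumed $ per Glue DPU-hour
S3_GB_MONTH = 0.023    # assumed $ per GB-month in S3 Standard

def monthly_costs(dpu_hours_per_day, stored_gb):
    """Return (compute, storage) dollar cost for a 30-day month."""
    compute = dpu_hours_per_day * 30 * GLUE_DPU_HOUR
    storage = stored_gb * S3_GB_MONTH
    return compute, storage

compute, storage = monthly_costs(dpu_hours_per_day=100, stored_gb=5_000)
# 100 DPU-hours/day comes to $1,320/month of compute, versus $115/month
# to keep 5 TB in S3 -- compute dominates by an order of magnitude.
```

Even with the assumed prices varied considerably, the conclusion holds: shaving compute hours usually pays off far faster than shaving stored bytes.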
Compute resources are also one of the most important dimensions regarding the optimal...
Cost optimization tools
AWS provides several cost optimization tools that can help you manage and optimize your AWS spending. The following sections show some key cost optimization tools offered by AWS.
AWS Cost Explorer
AWS Cost Explorer is a built-in cost management tool that provides visibility into your AWS costs and usage. It allows you to analyze your costs, view historical spending patterns, and forecast future costs. You can drill down into specific cost categories, services, or regions to identify areas where cost optimizations can be made.
Cost Explorer allows you to view your spend per service for each month, as shown in the following screenshot. This gives you a good understanding of rising costs that might warrant an optimization review:
Figure 16.14 – AWS Cost Explorer
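The same review can be done programmatically. The sketch below flags services whose cost grew sharply month over month; the input is a hard-coded sample in a simplified shape (service name mapped to monthly totals) rather than a live call to Cost Explorer's GetCostAndUsage API, and all the dollar figures are made up.

```python
# Sketch: spotting rising service costs from monthly totals. The data here
# is a hard-coded sample standing in for Cost Explorer API output.

def rising_services(monthly_costs, threshold=0.2):
    """Return services whose most recent month's cost grew by more than
    `threshold` (as a fraction) over the previous month."""
    flagged = []
    for service, costs in monthly_costs.items():
        prev, last = costs[-2], costs[-1]
        if prev > 0 and (last - prev) / prev > threshold:
            flagged.append(service)
    return sorted(flagged)

sample = {  # service -> cost per month, oldest first (sample numbers)
    "Amazon Redshift": [4_000, 4_100, 5_500],   # +34% last month
    "AWS Glue":        [1_200, 1_250, 1_300],   # +4%
    "Amazon S3":       [800,   900,   1_150],   # +28%
}
# rising_services(sample) -> ['Amazon Redshift', 'Amazon S3']
```

Wired to a scheduled job, a check like this turns the monthly screenshot review into an automatic early-warning signal.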
AWS Budgets
AWS Budgets enables you to set cost and usage budgets for your AWS resources. You can define spending thresholds and receive alerts when your...
Tool-specific performance tuning
We covered a lot about optimizing service infrastructure in this chapter. Often, however, optimizations happen at the service or tool level, by changing a service's configuration or fine-tuning the logic that runs on it. Typically, tuning is done to improve performance, which in turn helps save costs. It is not possible to cover every aspect of performance tuning in this section, but we will cover some of the most common measures for the key services that help build a data platform on AWS.
Performance tuning measures on Amazon Redshift
Many aspects of performance tuning depend on root cause analysis, so we cannot cover every tunable setting in Redshift. Also, recent advancements in Redshift's autonomics have made many settings automatic; data distribution, sorting, and analyze and vacuum operations can all be handled automatically by the service. However, some common tunable...
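The automatic distribution and sorting mentioned above can be requested explicitly in DDL via the AUTO keyword. A minimal sketch that renders such a statement follows; the table and column names are hypothetical, and the helper only builds SQL text (running it would require a Redshift connection):

```python
# Sketch: generating Redshift DDL that delegates distribution and sort-key
# choices to the service via DISTSTYLE AUTO / SORTKEY AUTO.
# Table and column names are hypothetical.

def create_table_auto(table, columns):
    """Render a CREATE TABLE statement with automatic distribution
    and sort-key management. `columns` is a list of (name, type) pairs."""
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n    {cols}\n) "
        "DISTSTYLE AUTO SORTKEY AUTO;"
    )

ddl = create_table_auto("sales", [("sale_id", "BIGINT"),
                                  ("sale_ts", "TIMESTAMP")])
```

Starting new tables with AUTO and only pinning explicit distribution or sort keys after a measured root-cause analysis keeps the tuning surface small.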
Summary
This chapter was all about ensuring that the data platform that’s built is performant as well as cost-effective. We started by understanding the need for a data platform that operates optimally. If any part of the data platform is either not performing well or is very expensive, it often creates a snowball effect and affects the business negatively.
A lot of cost optimization can be achieved by optimizing the infrastructure used by the AWS services under the covers. By optimizing storage and compute resources, we can save significant costs. We also looked at some of the tools AWS provides that help in the cost optimization process.
Finally, we looked at some of the service-specific tuning settings that can help with performance improvements. The list of such improvements can be quite long for each service, but the key message was to leverage the best practices for each service and always perform a Well-Architected Review (WAR) before deploying workloads in production.
In the next and...
References
AWS Well-Architected Framework – Data Analytics Lens: https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/analytics-lens.html