AWS Certified DevOps Engineer - Professional Certification and Beyond

By Adam Book
About this book

The AWS Certified DevOps Engineer certification is one of the highest AWS credentials and is widely recognized across the cloud computing and software development industries. This book is an extensive guide to help you strengthen your DevOps skills as you work with your AWS workloads on a day-to-day basis.

You'll begin by learning how to create and deploy a workload using the AWS code suite of tools, and then move on to adding monitoring and fault tolerance to your workload. You'll explore enterprise scenarios that'll help you to understand various AWS tools and services. This book is packed with detailed explanations of essential concepts to help you get to grips with the domains needed to pass the DevOps professional exam. As you advance, you'll delve into AWS with the help of hands-on examples and practice questions to gain a holistic understanding of the services covered in the AWS DevOps professional exam. Throughout the book, you'll find real-world scenarios that you can easily incorporate in your daily activities when working with AWS, making you a valuable asset for any organization.

By the end of this AWS certification book, you'll have gained the knowledge needed to pass the AWS Certified DevOps Engineer exam, and be able to implement different techniques for delivering each service in real-world scenarios.

Publication date:
November 2021


Chapter 1: Amazon Web Service Pillars

DevOps is, at its heart, a combination of development and operations skills, together with breaking down the walls between these two different teams. DevOps includes enabling developers to perform operational tasks easily. DevOps also involves empowering operational team members to create their Infrastructure as Code and use other coding techniques, such as continuous integration pipelines, to spin up the same infrastructure in multiple regions quickly.

In this book, we will go through the services and concepts that are part of the DevOps professional exam so that you have a solid understanding from a practical standpoint, in terms of both explanations and hands-on exercises.

Becoming Amazon Web Services (AWS) Certified not only gives you instant validation of the technical skills that you hold and maintain – it also strengthens you as a technical professional. The AWS DevOps Engineer Professional Certification is a cumulative test that incorporates the base knowledge of fundamental AWS services, including system operations capabilities for running, managing, and monitoring workloads in AWS. This is in addition to developing and deploying code to functions, containers, and instances.

We will look at the test itself in more depth in Chapter 23, Overview of the DevOps Professional Certification Test, and provide tips for taking the exam.

The AWS pillars are the five principles that guide architects and developers in generally accepted cloud architecture and design. They are subtly referenced in the DevOps Professional exam, but the pillars and their guidelines are tenets of best practices for working with any cloud service provider – especially Amazon Web Services. These are all guiding principles in DevOps practices and pipelines, and having a sound understanding of these five items will not only help you come exam time, but also serve you throughout your DevOps career journey.

In this chapter, we're going to cover the following main topics:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

Service pillars overview

At first glance, you may be wondering why we aren't just jumping right into AWS, continuous integration/continuous delivery (CI/CD), and other DevOps topics. The main reason is that these five pillars are the foundational fabric of the exams. In addition, they will help you provide the most effective, dependable, and efficient environment for your company or clients. These design principles are not only important when architecting for success on Amazon Web Services, or any cloud provider for that matter, but in guiding the practices that you use throughout your day-to-day endeavors.

Once you become familiar with these pillars, you will see them and their themes in the testing questions as you go down your path for certification. This is especially true when working to obtain the DevOps Professional Certification as there are specific sections for Operations, Security, and Reliability.

The following are the five pillars of the AWS Well-Architected Framework:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

Use these pillars as the guiding principles, not only for designing your workloads in AWS but also for improving and refactoring current workloads. Every organization should strive to achieve well-architected applications and systems. Therefore, improving any AWS applications you are working on will make you a valuable asset. Now, let's look at each of these pillars in detail.


Operational excellence

As we look at the operational excellence pillar, especially in the context of DevOps, this is one of the most important service pillars, if not the most important, for your day-to-day responsibilities. We will start by thinking about how our teams are organized; after all, the DevOps movement came about from breaking down silos between Development and Operations teams.

Question – How does your team determine what its priorities are?

* Does it talk to customers (whether they're internal or external)?

* Does it get its direction from product owners who have drawn out a roadmap?

Amazon outlines five design principles that incorporate operational excellence in the cloud:

  • Performing Operations as Code
  • Refining operations frequently
  • Making small, frequent, and reversible changes
  • Anticipating failure
  • Learning from all operational failures

Let's take a look at each of these operational design principles in detail to see how they relate to your world as a DevOps engineer. As you go through the design principles of not only this pillar but all the service pillars, you will find that the best practices are spelled out, along with different services, to help you complete the objective.

Performing Operations as Code

With the advent of Infrastructure as Code, the cloud allows teams to create their applications using code alone, without the need to interact with a graphical interface. Moreover, it allows any of the underlying networking, services, datastores, and more that are required to run your applications and workloads to be created in the same way. Moving most, if not all, of the operations to code does quite a few things for a team:

  • Distributes knowledge quickly and prevents only one person on the team from being able to perform an operation
  • Allows for a peer review of the environment to be conducted, along with quick iterations
  • Allows changes and improvements to be tested quickly, without the production environment being disrupted

In AWS, you can perform Operations as Code using a few different services, such as CloudFormation, the Cloud Development Kit (CDK), language-specific software development kits (SDKs), or the command-line interface (CLI).
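As a minimal sketch of what Operations as Code can look like, the following Python snippet builds a small CloudFormation template as a dictionary and serializes it to JSON so that it can be committed and peer reviewed. The logical ID LogBucket and the function names are hypothetical and chosen only for illustration.

```python
import json


def build_template() -> dict:
    """Return a minimal CloudFormation template describing one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: a versioned S3 bucket",
        "Resources": {
            # "LogBucket" is a hypothetical logical ID used for this sketch.
            "LogBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


def render_template() -> str:
    """Serialize the template so it can be stored in version control."""
    return json.dumps(build_template(), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_template())
```

Because the template is plain data produced by code, a change to the infrastructure becomes a reviewable diff rather than a sequence of console clicks.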

Refining operations frequently

As you run your workload in the cloud, you should be in a continual improvement process for not only your application and infrastructure but also your methods of operation. Teams that run in an agile process are familiar with having a retrospective meeting after each sprint to ask three questions: what went well, what didn't go well, and what has room for improvement?

Operating a workload in the cloud presents the same opportunities for retrospection and to ask those same three questions. It doesn't have to be after a sprint, but it should occur after events such as the following:

  • Automated, manual, or hybrid deployments
  • Automated, manual, or hybrid testing
  • After a production issue
  • Running a game day simulation

After each of these situations, you should be able to look at your current operational setup and see what could be better. If you have step-by-step runbooks that have been created for incidents or deployments, ask yourself and your team whether there were any missing steps or steps that are no longer needed. If you had a production issue, did you have the correct monitoring in place to troubleshoot that issue?

Making small, frequent, and reversible changes

As we build and move workloads into the cloud, instead of placing multiple systems on a single server, the best design practices are to break any large monolith designs into smaller, decoupled pieces. With the pieces being smaller, decoupled, and more manageable, you can work with smaller changes that are more reversible, should a problem arise.

The ability to reverse changes can also come in the form of good coding practices. AWS CodeCommit allows Git tags in code repositories. By tagging each release once it has been deployed, you can quickly redeploy a previous version of your working code, should a problem arise in the code base. Lambda has a similar feature called versions.
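The rollback idea can be sketched in a few lines of Python. The example assumes release tags follow a vMAJOR.MINOR.PATCH naming scheme, which is an assumption made for this sketch and not something the tagging feature requires; the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn a tag such as 'v1.4.2' into a sortable tuple (1, 4, 2)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def previous_release(tags: List[str], current: str) -> Optional[str]:
    """Return the newest tag that is older than the current release."""
    older = [t for t in tags if parse_version(t) < parse_version(current)]
    return max(older, key=parse_version) if older else None


if __name__ == "__main__":
    release_tags = ["v1.0.0", "v1.2.0", "v1.10.0", "v1.9.3"]
    print(previous_release(release_tags, "v1.10.0"))
```

Comparing parsed tuples instead of raw strings matters here: as strings, "v1.10.0" would sort before "v1.9.3".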

Anticipating failure

Just because you are moving to the cloud and the service that your application relies on is labeled as a managed service, don't expect that you no longer need to worry about failures. Failures happen, maybe not often; however, when running a business, any sort of downtime can translate into lost revenue. Having a plan to mitigate risks (and also testing that plan) can genuinely mean the difference between keeping your service-level agreement (SLA) and having to apologize or, even worse, having to give customers credits or refunds.

Learning from failure

Things fail from time to time, but when they do, it's important not to dwell on the failures. Instead, perform post-mortem analysis and find the lessons that can make the team and the workloads stronger and more resilient for the future. Sharing learning across teams helps bring everyone's perspective into focus. One of the main questions that should be asked and answered after failure is, Could the issue be resolved with automatic remediation?

One of the significant issues in larger organizations today is that in their quest of trying to be great, they stop being good. Sometimes, you need to be good at the things you do, especially on a daily basis. It can be a steppingstone to greatness. However, the eternal quest for excellence without the retrospective of what is preventing you from becoming good can sometimes be an exercise in spinning your wheels, and not gaining traction.

Example – operational excellence

Let's take a look at the following relevant example, which shows the implementation of automated patching for the instances in an environment:

Figure 1.1 – Operational excellence – automated patching groups

If you have instances in your environment that you are self-managing and that need to be kept updated with patches, then you can use Systems Manager Patch Manager to help automate the task of keeping your operating systems up to date. This can be done on a regular basis by scheduling the patching as a task in a Systems Manager maintenance window.

The initial step would be to make sure that the SSM Agent (SSM formerly stood for Simple Systems Manager) is installed on the machines that you want to keep up to date with patching.

Next, you would create a patch baseline, which includes rules for auto-approving patches within days of their release, as well as a list of both approved and rejected patches.

After that, you may need to modify the IAM role on the instance to make sure that the SSM service has the correct permissions.

Optionally, you can set up patch groups. In the preceding diagram, we can see that we have two different types of servers, and they are both running on the same operating system. However, they perform different functions, so we would want to set up one patch group for the Linux servers and one patch group for the Database servers. The Database servers may only get critical patches, whereas the Linux servers may get the critical patches as well as the update patches.
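To make the grouping logic concrete, here is a small, hedged model of the approval rules from the example. It is not the Patch Manager API: the group names, the category labels "critical" and "update", and the seven-day waiting period are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical patch groups mirroring the example: database servers receive
# only critical patches, while the Linux servers also receive update patches.
PATCH_GROUPS = {
    "linux-servers": {"categories": {"critical", "update"}, "approve_after_days": 7},
    "database-servers": {"categories": {"critical"}, "approve_after_days": 7},
}


def is_approved(group: str, category: str, released: date, today: date) -> bool:
    """Approve a patch once it matches the group's categories and has aged enough."""
    rule = PATCH_GROUPS[group]
    old_enough = (today - released).days >= rule["approve_after_days"]
    return category in rule["categories"] and old_enough
```

With these rules, an update patch released on November 1 would be approved for the Linux servers by November 30 but never for the Database servers, while a critical patch released the previous day would still be waiting out its approval delay.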



Security

Next is the Security pillar of the AWS Well-Architected Framework. Today, security is at the forefront of everyone's minds. Bad actors are constantly trying to find vulnerabilities in any code and infrastructure (both on-premises and in the cloud). When looking back at the lessons learned from the first 10 years of AWS, CTO Werner Vogels said, "Protecting your customers should always be your number one priority… And it certainly has been for AWS." (Vogels, 2016)

It is everyone's job these days to have secure practices across all cloud systems. This (protection) includes the infrastructure and networking components that serve the application and using secure coding practices and data protection, ultimately ensuring that the customer has a secure experience.

When you think about security, there are four main areas that the security pillar focuses on. They are shown in the following diagram:

Figure 1.2 – The four main areas of security in the security pillar

The security pillar is constructed of seven principles:

  • Implementing a strong identity foundation
  • Enabling traceability
  • Applying security at all layers
  • Automating security best practices
  • Protecting data in transit and at rest
  • Keeping people away from data
  • Preparing for security events

As we move through this book, you will find practical answers and solutions to some of the security principles introduced here in the security pillar. This will help you develop the muscle memory needed to instill security in everything you build, rather than putting your piece out there and letting the security team worry about it. Remember, security is everyone's responsibility. First, let's look at these security principles in a bit more detail.

Implementing a strong identity foundation

When building a strong identity foundation, it all starts with applying the principle of least privilege. No user or role should have more or fewer permissions than it actually needs to perform its job or duties. Taking this a step further, if you are using IAM to manage your users, then ensure that a password policy is in place to confirm that passwords are being rotated on a regular basis, and that they don't become too stale. It is also a good idea to check that the IAM password policy is in sync with your corporate password policy.

Also, as your organization grows and managing users and permissions starts to become a more complex task, you should look to establish central identity management either with AWS Single Sign-On or by connecting a corporate Active Directory server.
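As a hedged illustration of least privilege, the following sketch shows an IAM-style policy document that grants only read access to a single bucket, along with a small helper that flags statements granting every action or every resource. The bucket name and the helper are hypothetical, and the check is deliberately simple: it assumes Statement is a list and only looks for a bare "*".

```python
# Hypothetical policy for a role that only needs to read one bucket.
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}


def overly_broad_statements(policy: dict) -> list:
    """Return Allow statements that grant every action or every resource."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # These fields may hold either a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if stmt.get("Effect") == "Allow" and ("*" in actions or "*" in resources):
            flagged.append(stmt)
    return flagged
```

A check like this could run in a pipeline before a policy is deployed, so that an accidental "Action": "*" is caught in review rather than in production.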

Enabling traceability

Security events can leave you in a reactive state; however, your ability to react can rely on the amount of information you can gather about the event. Putting proper monitoring, logging, alerting, and the ability to audit your environments and systems in place before an event happens is crucial to being able to perform the correct assessments and steps, when the need arises.

Capturing enough logs from a multitude of sources can be done with AWS services such as CloudWatch Logs, VPC Flow Logs, CloudTrail, and others. We will look at logging and monitoring extensively in Part 3 of this book as it is important to the DevOps Professional exam.

Think about the following scenario:

Someone has gained access to a server via a weak password and compromised some data. You feel that you are currently capturing many logs; however, would you be able to figure out the following?

  • The username used to access the system
  • The IP address that was used where the access originated
  • The time access was started
  • The records that were changed, modified, or deleted
  • How many systems were affected
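If your logs are detailed enough, questions like these can be answered with a short script. The sketch below assumes the log records have already been parsed into dictionaries; the field names eventTime, eventName, sourceIPAddress, and userIdentity are modeled loosely on CloudTrail records, while the resource field, the sample data, and the function name are invented for illustration.

```python
# Invented sample records; timestamps are ISO 8601 strings in UTC.
RECORDS = [
    {"eventTime": "2021-11-02T10:15:00Z", "eventName": "DeleteObject",
     "sourceIPAddress": "203.0.113.7", "userIdentity": {"userName": "svc-backup"},
     "resource": "db-01"},
    {"eventTime": "2021-11-02T10:05:00Z", "eventName": "ConsoleLogin",
     "sourceIPAddress": "203.0.113.7", "userIdentity": {"userName": "svc-backup"},
     "resource": "web-01"},
    {"eventTime": "2021-11-02T09:00:00Z", "eventName": "GetObject",
     "sourceIPAddress": "198.51.100.4", "userIdentity": {"userName": "alice"},
     "resource": "web-01"},
]


def summarize_access(records: list, username: str) -> dict:
    """Summarize where, when, and what a given user touched."""
    mine = sorted(
        (r for r in records if r["userIdentity"]["userName"] == username),
        key=lambda r: r["eventTime"],  # same-format ISO strings sort by time
    )
    return {
        "source_ips": sorted({r["sourceIPAddress"] for r in mine}),
        "first_seen": mine[0]["eventTime"] if mine else None,
        "actions": [r["eventName"] for r in mine],
        "systems": sorted({r["resource"] for r in mine}),
    }
```

If any of these fields cannot be filled in from the logs you collect today, that gap is worth closing before an incident forces the question.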

Applying security at all layers

Securing all the levels of your environment gives you defense in depth, so that a weakness in one layer does not expose everything behind it. To address network-level security, VPCs can be secured using simple techniques such as Security Groups and Network ACLs. Seasoned AWS professionals know that additional security layers widen that protection – for example, at the edge (network access points to the AWS cloud), at the operating system level, and even by shifting left to secure the application code itself.

Automating security best practices

As you and your team become more educated about secure practices in the cloud, repetitive tasks should become automated. This allows you to react more quickly to events as they happen, and even to respond to events that you did not realize were happening.

This is a topic worth diving into headfirst. As a DevOps specialist, you are used to taking repetitive manual processes and making them more efficient with automation. Automation can take the form of automatically analyzing logs, removing or remediating resources that don't comply with your organization's security posture, and intelligently detecting threats.

Amazon Web Services has released tools to help with this process, including GuardDuty, CloudWatch, EventBridge, and AWS Config.
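As a rough sketch of what an automated compliance check might look like, the following function flags firewall rules that expose an administrative port to the whole internet. It does not call any AWS service: the rule dictionaries, their keys, and the choice of ports 22 and 3389 are assumptions made for this illustration.

```python
def find_open_admin_ports(rules: list, admin_ports=(22, 3389)) -> list:
    """Return rules that expose an administrative port to any IPv4 address."""
    findings = []
    for rule in rules:
        exposes_port = any(
            rule["from_port"] <= port <= rule["to_port"] for port in admin_ports
        )
        if rule["cidr"] == "0.0.0.0/0" and exposes_port:
            findings.append(rule)
    return findings


if __name__ == "__main__":
    sample_rules = [
        {"from_port": 443, "to_port": 443, "cidr": "0.0.0.0/0"},
        {"from_port": 22, "to_port": 22, "cidr": "0.0.0.0/0"},
        {"from_port": 22, "to_port": 22, "cidr": "10.0.0.0/16"},
    ]
    for finding in find_open_admin_ports(sample_rules):
        print("Non-compliant rule:", finding)
```

In practice, a check like this would be triggered by a configuration change event and followed by a remediation step, so that nobody has to notice the problem manually.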

Protecting data in transit and at rest

Bad actors are all around, constantly looking for exploitable data that is traveling across the internet unprotected. You definitely can't rely on end users to use best practices such as secure communications over VPN, so it is up to you and your team to put the best practices in place on the server side. Basic items such as implementing certificates on your load balancers, on your CloudFront distribution, or even at the server level allow transmissions to be encrypted while going from point to point.

By the same token, figuratively speaking, enabling Transport Layer Security (TLS) or IPsec at the protocol layer helps ensure that network communications are authenticated.

There are AWS services to help protect your data, both in transit and at rest, such as AWS Certificate Manager, AWS Shield, AWS Web Application Firewall (the other WAF), and Amazon CloudFront. The Key Management Service (KMS) can also help protect your data at rest by allowing you to create, use, and rotate cryptographic keys easily.

For a deeper look at protecting data in transit and at rest, see Chapter 19, Protecting Data in Flight and at Rest.

Using mechanisms to keep people away from data

There are ways to automate how data is accessed, rather than allowing individuals to directly access the data. It is a better idea to access the data through items that can be validated through a change control process, such as Systems Manager runbooks or Lambda functions. The opposite of this would be allowing direct access to data sources through either a bastion host or an Elastic IP address/CNAME.

Providing this direct access can lead either to human mistakes or to a username and password being compromised, both of which can ultimately lead to data loss or leakage.

Preparing for security events

Even if you enact all the security principles described previously, there is no guarantee that a security event won't occur in the future. You are much better off practicing and having a prepared set of steps to enact quickly in case the need ever arises.

You may need to create one or more runbooks or playbooks that outline the steps of how to do things such as creating an AMI or snapshot of an affected instance for forensic analysis and moving it to a secured account (if available). If the time comes when these steps are necessary, there will be questions coming from many different places, and the answers will be time-sensitive. If the team whose responsibility it is to perform these duties has never practiced any of these tasks, and no guide has been established to help them through the process, then valuable cycles will be wasted just trying to get organized.

The following is the Shared Responsibility Model between AWS and you, the customer:

Figure 1.3 – The AWS shared responsibility model

Questions to ask

* How do you protect your root account?

- Is there a Multi-Factor Authentication (MFA) device on the root account?

- Is the root account left unused for day-to-day tasks?

* How do you assign IAM users and groups?

* How do you delegate API/CLI access?

Next, let's learn about the five design principles for reliability in the cloud.



Reliability

There are five design principles for reliability in the cloud:

  • Automating recovery from failure
  • Testing recovery procedures
  • Scaling horizontally to increase aggregate workload availability
  • Stopping guessing capacity
  • Managing changes in automation

Automating recovery from failure

When you think of automating recovery from failure, the first thing most people think of is a technology solution. However, this is not necessarily the context that is being referred to in the reliability service pillar. These points of failure really should be based on Key Performance Indicators (KPIs) set by the business.

As part of the recovery process, it's important to know both the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) of the organization or workload:

  • RTO (Recovery Time Objective): The maximum acceptable delay between the service being interrupted and restored
  • RPO (Recovery Point Objective): The maximum acceptable amount of time since the last data recovery point (backup) (Amazon Web Services, 2021)
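The two definitions translate directly into simple date arithmetic. The following sketch, with hypothetical function and parameter names, checks a single incident against both objectives: downtime is measured from interruption to restoration, and the data-loss window is measured from the last backup to the interruption.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup: datetime, outage_start: datetime,
                     service_restored: datetime,
                     rto: timedelta, rpo: timedelta) -> dict:
    """Compare one incident against the agreed RTO and RPO."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_backup
    return {"rto_met": downtime <= rto, "rpo_met": data_loss_window <= rpo}


if __name__ == "__main__":
    result = meets_objectives(
        last_backup=datetime(2021, 11, 1, 0, 0),
        outage_start=datetime(2021, 11, 1, 3, 30),
        service_restored=datetime(2021, 11, 1, 4, 15),
        rto=timedelta(hours=1),
        rpo=timedelta(hours=4),
    )
    print(result)
```

In this invented incident, the service was down for 45 minutes and up to 3.5 hours of data could be lost, so a one-hour RTO and a four-hour RPO would both be met, whereas a two-hour RPO would not.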

Testing recovery procedures

In your cloud environment, you should test not only that your workload functions properly, but also that it can recover from single or multiple component failures, whether they happen at a service, Availability Zone, or regional level.

Using practices such as Infrastructure as Code, CD pipelines, and regional backups, you can quickly spin up an entirely new environment. This could include your application and infrastructure layers, which will give you the ability to test that things work the same as in the current production environment and that data is restored correctly. You can also time how long the restoration takes and work to shorten it by automating the recovery process.

Taking the proactive measure of documenting each of the necessary steps in a runbook or playbook allows for knowledge sharing, as well as fewer dependencies on specific team members who built the systems and processes.

Scaling horizontally to increase workload availability

When coming from a data center environment, planning for peak capacity means finding a machine that can run all the different components of your application. Once you hit the maximum resources for that machine, you need to move to a bigger machine.

As you move from development to production or as your product or service grows in popularity, you will need to scale your resources. There are two main methods for achieving this: scaling vertically or scaling horizontally:

Figure 1.4 – Horizontal versus vertical scaling

One of the main issues with scaling vertically is that you will hit the ceiling at some point in time, moving to larger and larger instances. At some point, you will find that there is no longer a bigger instance to move up to, or that the larger instance will be too cost-prohibitive to run.

Scaling horizontally, on the other hand, allows you to gain the capacity that you need at the time in a cost-effective manner.

Moving to a cloud mindset means decoupling your application components, placing multiple groupings of the same servers behind a load balancer, or pulling from a queue and optimally scaling up and down based on the current demand.
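Scaling up and down based on demand can be illustrated with a simple proportional rule. This is only a sketch and not the algorithm that any AWS scaling service uses; the function name, the metric (for example, average CPU utilization), and the minimum and maximum sizes are assumptions.

```python
import math


def desired_capacity(current: int, metric_value: float, target_value: float,
                     min_size: int, max_size: int) -> int:
    """Scale the fleet in proportion to how far the metric is from its target."""
    raw = math.ceil(current * metric_value / target_value)
    # Clamp the result so the fleet never shrinks or grows past its bounds.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # Four servers at 90% average CPU with a 60% target.
    print(desired_capacity(4, 90.0, 60.0, min_size=2, max_size=10))
```

Four servers running at 90% against a 60% target would grow to six, while the same rule shrinks the fleet when demand falls, which is what makes horizontal scaling cost-effective.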

Stop guessing capacity

If resources become overwhelmed, then they have a tendency to fail, especially on-premises, as demands spike and those resources don't have the ability to scale up or out to meet demand.

There are service limits to be aware of, many of which are soft limits. These can be raised with a simple request or phone call to support. Others are hard limits: they are set at a specified number for every account, and they cannot be raised.


Although there isn't a necessity to memorize all these limitations, it is a good idea to become familiar with them and know about some of them since they do show up in some of the test questions – not as pure questions, but as context for the scenarios.

Managing changes in automation

Although it may seem easier and sometimes quicker to make a change to the infrastructure (or application) by hand, this can lead to infrastructure drift and is not a repeatable process. A best practice is to automate all changes using Infrastructure as Code, a code versioning system, and a deployment pipeline. This way, the changes can be tracked and reviewed.
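Infrastructure drift is simply the difference between what the code declares and what is actually running. The following sketch compares two flat dictionaries of settings; the property names and values are invented, and real drift detection tools compare far richer resource models.

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Return {setting: (declared_value, actual_value)} for every difference."""
    drift = {}
    for key in declared.keys() | actual.keys():
        if declared.get(key) != actual.get(key):
            drift[key] = (declared.get(key), actual.get(key))
    return drift


if __name__ == "__main__":
    declared = {"InstanceType": "t3.micro", "Port": 443}
    # Someone resized the instance and opened a port by hand.
    actual = {"InstanceType": "t3.large", "Port": 443, "OpenPort": 22}
    print(detect_drift(declared, actual))
```

Any non-empty result means a change was made outside the pipeline and should either be reverted or captured in the code.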


Performance efficiency

If you and your architectural design team are coming from a data center infrastructure where the provisioning process takes weeks or months to deliver the system you need, then the speed and availability of cloud resources is certainly a breath of fresh air. There is a need to understand how to select either the correct instance type or compute option (that is, server-based, containerized, or function-based compute) based on the workload requirements.

Once you have made an initial selection, a benchmarking process should be undertaken so that you can see if you are utilizing all the CPU and memory resources that you have allocated, as well as to confirm that the workload can handle the duty that it is required to handle. As you select your instance types, don't forget to factor in costs and make a note of the cost differences that could either save you money or cost you more as you perform your baseline testing.

AWS provides native tools to create, deploy, and monitor benchmark tests, as shown in the following diagram:

Figure 1.5 – Baseline testing with AWS tooling

Using the tools provided by AWS, you can quickly spin up an environment for right-sizing, benchmarking, and load testing the initial value that you chose for your compute instance. You can also easily swap out other instance types to see how performant they are with the same test. Using CloudFormation to build the infrastructure, you can run the tests in a quick and repeatable fashion using CodeBuild, all while gathering the metrics with CloudWatch to compare the results and make sure that you have made the best decision – with data to back up that decision. We will go into much more detail on how to use CloudFormation in Chapter 7, Using CloudFormation Templates to Deploy Workloads.
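Once the benchmark metrics have been gathered, comparing options is a matter of arithmetic. The sketch below ranks candidates by cost per million requests; the option names, hourly prices, and throughput figures are invented for illustration and are not real AWS prices or measurements.

```python
def cost_per_million_requests(hourly_price: float, requests_per_second: float) -> float:
    """Convert an hourly price and sustained throughput into a unit cost."""
    requests_per_hour = requests_per_second * 3600
    return hourly_price / requests_per_hour * 1_000_000


def best_option(results: dict) -> str:
    """Pick the option with the lowest cost per million requests."""
    return min(results, key=lambda name: cost_per_million_requests(*results[name]))


if __name__ == "__main__":
    # Hypothetical benchmark results: (hourly price in USD, requests per second).
    benchmark = {"small": (0.10, 200.0), "large": (0.40, 1000.0)}
    for name, (price, rps) in benchmark.items():
        print(name, round(cost_per_million_requests(price, rps), 4))
    print("Best:", best_option(benchmark))
```

In this made-up comparison, the larger option costs four times as much per hour but handles five times the load, so it is the cheaper choice per request. Without the benchmark data, the smaller hourly price could easily have looked like the better deal.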

The performance efficiency pillar includes five design principles to help you maintain efficient workloads in the cloud:

  • Making advanced technologies easier for your team to implement
  • Being able to go global in minutes
  • Using serverless architectures
  • Allowing your teams to experiment
  • Using technology that aligns with your goals

Making advanced technologies easier for your team to implement

Having the ability to use advanced technologies has become simplified in the cloud with the advent of managed services. No longer do you need full-time DBAs on staff who specialize in each different flavor of database, to test whether Postgres or MariaDB will perform in a more optimal fashion. In the same way, if you need replication for that database, you simply check a box, and you instantly have a Highly Available setup.

Time that would otherwise be spent poring over documentation, trying to figure out how to install and configure particular systems, is now spent on the things that matter the most to your customers and your business.

Being able to go global in minutes

Depending on the application or service you are running, your customers may be centralized into one regional area, or they may be spread out globally. Once you have converted your infrastructure into code, there are built-in capabilities, either through constructs in CloudFormation templates or the CDK, that allow you to use regional parameters to quickly reuse a previously built pattern or architecture and deploy it to a new region of the world.
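One such construct is a CloudFormation Mappings section combined with the AWS::Region pseudo parameter, which lets the same template resolve region-specific values wherever it is deployed. The sketch below builds such a template as a Python dictionary; the logical IDs are hypothetical and the image IDs are placeholders, since real AMI IDs differ per region and must be looked up.

```python
import json

# Placeholder image IDs: real AMI IDs differ per region and must be looked up.
REGION_IMAGES = {
    "us-east-1": {"ImageId": "ami-PLACEHOLDER-USEAST1"},
    "eu-west-1": {"ImageId": "ami-PLACEHOLDER-EUWEST1"},
}


def build_regional_template() -> dict:
    """Return a template whose image ID is resolved from the deployment region."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Mappings": {"RegionImages": REGION_IMAGES},
        "Resources": {
            # "WebServer" is a hypothetical logical ID used for this sketch.
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": "t3.micro",
                    "ImageId": {
                        "Fn::FindInMap": [
                            "RegionImages", {"Ref": "AWS::Region"}, "ImageId"
                        ]
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_regional_template(), indent=2))
```

Adding a new region then means adding one entry to the mapping rather than maintaining a separate copy of the template.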

Even without deploying your full set of applications and architecture, there are still capabilities that allow you to serve a global audience using the Content Delivery Network (CDN) known as CloudFront. Here, you can create a secure global presence while the application or content stays deployed in the primary region, which acts as the origin.

Using serverless architectures

First and foremost, moving to serverless architectures means servers are off your to-do list. This means no more configuring servers with packages at startup, no more right-sizing servers, and no more patching servers.

Serverless architectures also mean that you have decoupled your application. Whether you are using functions, events, or microservices, each of these should be doing a specific task. And with each component doing only its distinct task, you can fine-tune memory and CPU at the task level, as well as scale out at a particular task level.

This is not the best option for every workload, but don't allow a workload to be disqualified just because it would need a little refactoring. When an application can be moved to a serverless architecture, it can make life easier, the application itself more efficient, and there are usually cost savings to reap as a result – especially in the long run.

Allowing your teams to experiment

Once you move to the cloud, you can quickly and constantly refactor your workload to improve it for both performance and cost. If you have built your Infrastructure as Code, creating a new temporary environment just for testing can be a quick and cost-efficient way to try new modular pieces of your application, without having to worry about disrupting any customers or other parts of the organization.

Many of the experiments may not work, but that is the nature of experimentation. Business is extremely competitive in this day and age, and finding an item that does work and makes your service faster, cheaper, and better can be a real game changer.

Using technology that aligns with your workload's goals

List your business goals and let the product owner help drive some of the product and service selections based on those goals. If a development team has previous familiarity with certain technologies, they may be inclined to sway toward those technologies that they already feel confident using.

On the other hand, there are other teams that strive to use the latest and greatest technologies – but not necessarily because the technology solves a problem that has been identified. Rather, they are interested in constantly resume-building and making sure that they have both exposure to and experience with cutting-edge services as soon as they become available.


Cost optimization

Many have the misconception that moving to the cloud will become an instant cost saver for their company or organization. A stark reality is faced once more, and more teams find out that provisioning new resources is as easy as clicking a button. Once the bills start appearing from an environment that doesn't have strict guardrails or the ability to chargeback the workloads to the corresponding teams, most of the time, there comes a cost reduction movement from the top down.

As you look to optimize your costs, understand that managed cloud services come at a higher cost per minute; however, the human resources cost is much lower. There is no need to care for and feed the underlying servers, nor to worry about updating the underlying operating systems. Many of the services also scale to user demand automatically.

The ability to monitor cost and usage is also a key element in a cost optimization strategy. Having a sound strategy for resource tagging allows those who are responsible for financially managing the AWS account to perform chargebacks for the correct department.

There are five design principles for cost optimization in the cloud:

  • Implementing cloud financial management
  • Adopting a consumption model
  • Measuring overall efficiency
  • Stop spending money on undifferentiated heavy lifting
  • Analyzing and attributing expenditure

Implementing cloud financial management

Cloud financial management is growing rapidly across organizations, both big and small. It takes a dedicated team (or a group of team members that has been partially allocated this responsibility) to build out the ability to see where the cloud spend is going. This part of the organization will be looking at the cost and usage reports, setting the budget alarms, tracking the spend, and hopefully enforcing a costing tag that can show the chargebacks for each department, cost center, or project.

What is a chargeback?

An IT chargeback is a process that allows units to associate costs with specific departments, offices, or projects to try and accurately track the spend of IT. We are specifically referring to cloud spend in this example, but IT chargebacks are used in other areas of IT as well, such as for accounting purposes.
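As a rough illustration of a chargeback, the sketch below totals spend per cost center from a list of billing line items. The record layout, the `CostCenter` tag key, and the `UNTAGGED` bucket are assumptions made for this example; they are not the format of any AWS billing report.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical tag key used to attribute spend; use whatever key your organization enforces.
COST_TAG = "CostCenter"


def chargeback_totals(line_items, tag_key=COST_TAG):
    """Sum the cost of each line item under the value of its cost tag.

    Items without the tag are grouped under "UNTAGGED" so that
    unattributed spend stays visible instead of disappearing.
    """
    totals = defaultdict(Decimal)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[owner] += Decimal(item["cost"])
    return dict(totals)


# Illustrative data only.
SAMPLE_ITEMS = [
    {"service": "AmazonEC2", "cost": "120.50", "tags": {"CostCenter": "marketing"}},
    {"service": "AmazonS3", "cost": "30.25", "tags": {"CostCenter": "marketing"}},
    {"service": "AmazonRDS", "cost": "200.00", "tags": {"CostCenter": "finance"}},
    {"service": "AmazonEC2", "cost": "15.00", "tags": {}},
]

if __name__ == "__main__":
    for owner, total in sorted(chargeback_totals(SAMPLE_ITEMS).items()):
        print(f"{owner}: {total}")
```

Costs are kept as `Decimal` values rather than floats so that currency totals do not pick up rounding noise.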

Adopting a consumption model

Using the cloud doesn't require a sophisticated forecasting model to keep costs under control, especially when you have multiple environments. Development and testing environments should have the ability to spin down or be suspended when not in use, hence saving on charges when they would otherwise be sitting idle. This is the beauty of on-demand pricing in the cloud. If developers gripe over the loss of data, then educate them on the use of snapshotting database instances before shutting them down; then, they can start their development from where they left off.

An even better strategy is to automate the process of shutting down the development and test environments when the workday is finished, and to require a specialized tag to exempt an instance from being shut down after hours or on the weekend. You can also automate the process of restarting instances 30 to 60 minutes before the workday begins so that there is ample time for operating systems to become functional, allowing the team to think that they had never been turned off in the first place. Just be sure to watch out for any EC2 instances running on the instance store that may lose data.
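The selection logic behind such an after-hours shutdown could look something like the following minimal sketch. The `KeepRunning` tag name, the office hours, and the instance dictionaries are all assumptions for illustration; a real job would fetch the instance list from the EC2 API and then stop the selected IDs.

```python
from datetime import datetime, time

# Hypothetical tag that exempts an instance from the after-hours shutdown.
KEEP_ALIVE_TAG = "KeepRunning"
# Assumed office hours, in the team's local time.
WORKDAY_START = time(8, 0)
WORKDAY_END = time(18, 0)


def is_working_hours(now):
    """True on Monday through Friday between WORKDAY_START and WORKDAY_END."""
    return now.weekday() < 5 and WORKDAY_START <= now.time() < WORKDAY_END


def instances_to_stop(instances, now):
    """Return the IDs of running instances that should be stopped at `now`."""
    if is_working_hours(now):
        return []
    return [
        inst["id"]
        for inst in instances
        if inst["state"] == "running"
        and inst.get("tags", {}).get(KEEP_ALIVE_TAG, "").lower() != "true"
    ]


# Illustrative data only.
SAMPLE_INSTANCES = [
    {"id": "i-dev-web", "state": "running", "tags": {"Environment": "dev"}},
    {"id": "i-dev-batch", "state": "running", "tags": {"KeepRunning": "true"}},
    {"id": "i-test-db", "state": "stopped", "tags": {"Environment": "test"}},
]
```

Keeping the decision in a pure function makes the schedule easy to test with fixed timestamps before it is wired to anything that can actually stop instances.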

Measuring overall efficiency

One of the most evident ways that organizations lose efficiency when it comes to their cloud budgets is neglecting to decommission unused assets. Although it is easy to spin up new services and create backups in the cloud, not having a plan to retire deprecated data, volume backups, machine images, log files, and other items adds to the bottom line of the monthly bill. This should be done with a scalpel and not a machete. Data, once deleted, is gone and irretrievable; however, there is no need to keep everything forever. Even with compliance, there is a fade-out period, and data can be stored in cold storage at a much more reasonable rate.

A perfect example is EBS snapshots. A customer who is trying to be proactive about data protection may be both snapshotting volumes multiple times per day and copying those snapshots to a Disaster Recovery region. If there is no way to expire the old snapshots after 30, 60, or 90 days, then this cost center item can become an issue rather quickly.
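A retention rule of this kind reduces to a date comparison. The sketch below picks out snapshots older than a chosen number of days; the snapshot dictionaries and their `id` and `start_time` fields are assumed for the example, and nothing is deleted here, only selected.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, retention_days, now=None):
    """Return the IDs of snapshots older than `retention_days`.

    Each snapshot is a dict with an "id" and a timezone-aware "start_time".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [snap["id"] for snap in snapshots if snap["start_time"] < cutoff]


# Illustrative data only, relative to a fixed reference time.
NOW = datetime(2021, 11, 1, tzinfo=timezone.utc)
SAMPLE_SNAPSHOTS = [
    {"id": "snap-a", "start_time": NOW - timedelta(days=5)},
    {"id": "snap-b", "start_time": NOW - timedelta(days=45)},
    {"id": "snap-c", "start_time": NOW - timedelta(days=100)},
]
```

In keeping with the scalpel-not-machete advice, it is worth reviewing the returned list (or logging it) before any deletion step is attached to it.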

Stop spending money on undifferentiated heavy lifting

When we talk about heavy lifting, we're talking about racking, stacking, and cooling servers in a data center. Running a data center is a 24/7/365 endeavor, and you can't easily turn off machines and storage when you're not using them. Moving workloads to the cloud takes the onus of running those data centers off your team members and allows more time and effort to go into focusing on customer needs and features, rather than caring for and feeding servers and hardware.

Analyzing and attributing expenditure

The cloud – and the AWS cloud, in particular – has tools available to help you analyze and reference where the charges for your account(s) are coming from. The first tool in your toolbox is tags and a tagging strategy. Once you have decided on a solid set of base tags, including things such as cost center, department, and application, then you have a foundation for the rest of the tools that are available to you.
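One way to keep a tagging strategy honest is to check resources against the base tag set. The following is a small sketch under assumptions: the key spellings `CostCenter`, `Department`, and `Application` and the resource dictionaries are hypothetical stand-ins for whatever your inventory tooling returns.

```python
# Base tags named in the text; the exact key spelling is an assumption.
REQUIRED_TAGS = ("CostCenter", "Department", "Application")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required if not resource_tags.get(key, "").strip()]


def untagged_report(resources, required=REQUIRED_TAGS):
    """Map each non-compliant resource ID to the tags it is missing."""
    report = {}
    for res in resources:
        gaps = missing_tags(res.get("tags", {}), required)
        if gaps:
            report[res["id"]] = gaps
    return report


# Illustrative data only.
SAMPLE_RESOURCES = [
    {"id": "vol-1", "tags": {"CostCenter": "1001", "Department": "eng", "Application": "shop"}},
    {"id": "i-2", "tags": {"CostCenter": "1001", "Department": " "}},
    {"id": "bucket-3"},
]

if __name__ == "__main__":
    for res_id, gaps in untagged_report(SAMPLE_RESOURCES).items():
        print(f"{res_id} is missing: {', '.join(gaps)}")
```

A report like this gives the financial management team a list of resources whose spend cannot yet be attributed to a department or project.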

Breaking out from a single account structure into multiple accounts and organizational units using AWS Organizations can automatically categorize spend at the account level, even without the use of tags.

AWS Cost Explorer allows your financial management team to dig into the services and regions where spend is occurring, as well as create automatic dashboards to visualize the spend quickly. Amazon Web Services also has pre-set service quotas in place. Some of these are hard quotas that cannot be changed, but many are soft quotas that you can raise for a particular service (in a region) with a simple request.


Overarching service pillar principles

The Well-Architected Framework identifies a set of general design principles to facilitate good design in the cloud:

  • Test systems at full production scale.
  • Automate as many components as possible to make experimentation as easy as possible.
  • Drive architectures using data.
  • Stop guessing capacity needs.
  • Allow architectures to evolve with new technologies and services.
  • Use game days to drive team improvement.

As you are thinking about these service principles and how to put them into practice, realize that sometimes, the principles can feel like they are contradicting each other. The most obvious case is with the cost optimization pillar. If this is the pillar that the organization you are working for is trying to give the most attention, the other pillars can get in the way of pure cost savings. Strengthening weaknesses that you have found in the reliability pillar, most times, means more assets, and assets mean money. However, you can still strive to make those assets as cost-effective as possible so that you comply with all the pillars.



Summary

In this chapter, we learned about the five service pillars that guide architects and developers toward well-architected workloads. We talked about how these are the underlying themes that run through the test questions in the DevOps pro exam, and how having this foundational knowledge can help when you're trying to determine the correct answer to a question. As we discussed each service pillar, we also talked about its underlying design principles.

We briefly mentioned several different AWS services and the service pillars or design principles where those services come into play. In the next chapter, we will learn about the fundamental AWS services that are used throughout the environments and accounts you will be working in.


Review questions

  1. What are the five pillars of the Well-Architected Framework?
  2. What are the five main areas that security in the cloud consists of?
  3. What are the four areas that the performance efficiency pillar consists of?
  4. What are the three areas that the reliability pillar consists of?
  5. What is the definition of RTO?
  6. What is the definition of RPO?

Review answers

  1. Cost Optimization, Reliability, Operational Excellence, Performance Efficiency, and Security. (Use the mnemonic CROPS to help remember the five pillars using the first letter of each pillar.)
  2. Data protection, infrastructure protection, privilege management, incident response, and detective controls.
  3. Compute, storage, database, and network.
  4. Foundations, change management, and failure management.
  5. Recovery Time Objective – The maximum acceptable delay between interrupting the service and restoring the service.
  6. Recovery Point Objective – The maximum acceptable amount of time since the last data recovery point (backup).
About the Author
  • Adam Book

    Adam Book has been programming since the age of six and has been constantly tapped by founders and CEOs as one of the pillars to start their online or cloud businesses. Adam has developed applications and websites. He's been involved in cloud computing and datacenter transformation professionally since 1996, focusing on bringing the benefits of cloud computing to his clients. He's led technology teams in transformative changes such as the shift to programming in sprints, with Agile formats. Adam is a cloud evangelist with a track record of migrating thousands of applications to the cloud and guiding businesses in understanding cloud economics to create use cases and identify operating model gaps. He has been certified on AWS since 2014.
