Infrastructure Monitoring with Amazon CloudWatch

By Ewere Diagboya

About this book

CloudWatch is Amazon's monitoring and observability service, designed to help those in the IT industry who want to optimize resource utilization, visualize operational health, and improve infrastructure performance. This book helps IT administrators, DevOps engineers, network engineers, and solutions architects make optimum use of this cloud service for effective infrastructure productivity.

You’ll start with a brief introduction to monitoring and Amazon CloudWatch and its core functionalities. Next, you’ll get to grips with CloudWatch features and their usability. Once the book has helped you develop your foundational knowledge of CloudWatch, you’ll be able to build your practical skills in monitoring and alerting various Amazon Web Services, such as EC2, EBS, RDS, ECS, EKS, DynamoDB, AWS Lambda, and ELB, with the help of real-world use cases. As you progress, you'll also learn how to use CloudWatch to detect anomalous behavior, set alarms, visualize logs and metrics, define automated actions, and rapidly troubleshoot issues. Finally, the book will take you through monitoring AWS billing and costs.

By the end of this book, you'll be capable of making decisions that enhance your infrastructure performance and maintain it at its peak.

Publication date:
April 2021
Publisher
Packt
Pages
314
ISBN
9781800566057

 

Chapter 1: Introduction to Monitoring

Monitoring is a broad topic that spans many human endeavors. Ignorance of its ideals and concepts can adversely affect how effectively engineering and computer systems are handled and managed. Systems are rarely 100% efficient, and there are times they break down or do not work as intended. The only way to understand and predict a breakdown is to monitor the system. When a system is monitored, its pattern of behavior can be better understood, which helps to predict a failure before it eventually happens. A proper maintenance process, based on what has been monitored, can then be used to minimize failure of the system.

To start the journey into monitoring, we will begin by understanding what monitoring is and the building blocks of every monitoring setup and infrastructure. We will explore the techniques used to monitor any infrastructure, the scenarios each of them is designed for, and the relationships that exist between different monitoring components. Then, I will explain the importance of monitoring, using real-life scenarios to reinforce each point. To crown it all, I will explain how the AWS Well-Architected Framework treats monitoring as a very important aspect of your AWS workload, using the principles of its pillars to reinforce what we have already discussed and to show how monitoring completes the architecture of any cloud workload. The purpose of this chapter is to help you understand what monitoring is, provide a little historical background of monitoring, explain the different ways software applications can be monitored, and shed light on the importance of monitoring software applications.

In this chapter, we are going to cover the following topics:

  • Introducing monitoring
  • Discovering the types of monitoring
  • Understanding the components of monitoring
  • Getting to know Amazon CloudWatch
  • Introducing the relationship between Amazon CloudWatch and Well-Architected
 

Technical requirements

To engage with the technical section of this chapter, you will need an AWS account. If you do not have one, you can quickly sign up for the free tier.

Check out the following link to see how to sign up for an AWS account:

https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/

 

Introducing monitoring

Man has always found a way to take note of everything. In ancient times, man invented a way to create letters and characters. Combinations of letters and characters made words, then sentences, then paragraphs. This information was stored in scrolls. Man also observed and monitored his environment and continued to document findings and draw insights based on this collected information. In some cases, this information might be in a raw form, with too many details that might not be relevant, or it might have been processed into another form that removed irrelevant information, to allow for better understanding and insight.

This means the data was collected as historic data after an activity occurred. This could be a memorable coronation ceremony, a grand wedding occasion, a festival, or even a period of war or hunger and starvation. Whatever the activity, it was documented for various purposes. One of those purposes is to look at the way things were done in the past and find ways they can either be stopped or made better. There is a saying that goes as follows:

"If you cannot measure it, you cannot improve it."

– Lord Kelvin (1824-1907)

So, being able to make records of events is not only valuable in helping to draw insights but can also spur the next line of action based on the insight that has been drawn from the data.

Borrowing from this understanding of how man has been recording and documenting, we can list two major reasons for monitoring: to draw insights from the data collected and to act based on those insights. The same applies to the systems we build. For every system man has developed, from the time of the pyramids of Egypt, where massive engineering was needed to design, architect, and build the pyramids and other iconic structures, documentation of historic works has been essential. It helped the engineers of those days to understand the flaws in earlier designs and structures, to figure out how new structures could be designed, and eventually to fix the flaws that were identified. It is usually a continuous process: keep evaluating what was done before to get better and better with time, using past experiences and results. Documented information is also very helpful when the new project is bigger than the earlier one, since the historical metrics that have been acquired give foundational knowledge and understanding of what can be done for the new, bigger project.

Applying new methods goes beyond just the data that has been collected; there is also the culture and mindset of understanding that change is constant and of always being positioned to learn from earlier implementations. Building new systems should be about applying what has been learned and building something better or, in some cases, improving the existing system based on close observation:

Figure 1.1 – A basic monitoring flow

What we have been explaining so far is monitoring. Monitoring is the act or process of collecting, analyzing, and drawing insights from data that has been collected from the system. In software systems and infrastructure, this includes analyzing and drawing insights from the data that has been collected from systems performing specific tasks or multiple tasks. Every system or application is made up of a series of activities, which we also call events. Systems in this context can mean mechanical systems (cars, industrial machines, or trucks), electrical systems (home appliances, transformers, industrial electronics machines, or mobile phones), or computer systems (laptops, desktops, or web or mobile applications).

Algorithms, step-by-step approaches to solving a problem, are the bedrock of how complex systems are built. When a complex system is built, each of the step-by-step processes built in to solve a specific problem or set of problems can be called an event.

Consider the following example of the process of making a bottle of beer:

  1. Malted barley or sorghum is put in huge tanks and blended.
  2. Yeast is added to the mixture to allow the fermentation process to occur to generate alcohol.
  3. After fermentation, sugar is added to the mixture to sweeten it.
  4. The beer is stored in huge drums.
  5. An old bottle undergoes a mechanical process that washes and disinfects the bottle.
  6. The washed bottle is taken through a conveyor belt to be filled up with beer.
  7. After being filled up, the bottle is corked under high pressure with CO2.
  8. The bottle is then inserted into a crate with other bottles.

In this algorithm for preparing beer, there are various stages; each stage has various touchpoints, and each touchpoint is a potential point of failure. The failure can be within a process itself or during interaction with other processes. For example, fermentation might not happen properly if the right amount of yeast is not added to the sorghum or if the fermentation vessel is not air-tight enough, because air is not wanted during the fermentation process. Challenges could also arise from the machine that sends the bottle to be crated after corking; the conveyor belt could fail or, during corking, the bottle might explode. These are possibilities and touchpoints that need close monitoring.

In a nutshell, when a system is designed to perform a specific task, or a group of systems is integrated to achieve a common goal, there are always touchpoints, both internal and external, that need to be understood. Understanding these touchpoints includes the metrics that can be derived from each step of the operation, what normal or good working conditions look like both within a system and across an integration of two systems, and globally acceptable standards. All of this information helps in detecting and finding anomalies when they occur. The only way to detect that an activity or metric is an anomaly is by monitoring the system, then collecting and analyzing the data and comparing it with perfect working conditions.

Now that we have defined monitoring, the next step is to take a sneak peek into the history of monitoring and how it has evolved over time, down to present-day monitoring tools and techniques.

The history of monitoring

We can say for certain that monitoring is as old as man; as old as when man started building systems and reviewing what had been done to find and fix faults and to improve the next build. But this book is focused on software monitoring, so we will stick to that.

A computer is made up of different components, such as the memory, CPU, hard disk, and operating system software. The ability to know what is going on with any of these components goes back to your operating system's event logs. Microsoft developed the Event Viewer in 1993 as part of Windows NT. This internal application takes note of every event in the system, which together form a list of logs. These logs help to track both core operating system activities that keep the system running and the events of other applications installed on the operating system. The Event Viewer can log both normal activities and system failures. The following screenshot shows the Event Viewer:

Figure 1.2 – Windows Event Viewer

Windows Event Viewer categorizes events into three groups, as shown in Figure 1.2: Custom Views, Windows Logs, and Application and Services Logs. The events captured are also divided into the following categories:

  • Error: This means that the event did not complete successfully; the entry gives details about the failed event along with other relevant information.
  • Warning: This is a signal about an anomaly that could lead to an eventual error and requires attention.
  • Information: This is a notification of a successful event.
  • Audit Success: This means that the audit of an event was successful.
  • Audit Failure: This means that the audit of an event was unsuccessful.

The logs in the Windows Event Viewer look like this:

Figure 1.3 – Event Viewer logs

Figure 1.3 shows a list of events, which eventually forms a log.

Over time, monitoring systems have grown and evolved. Due to the importance of monitoring in applications, different organizations have designed purpose-built monitoring systems. There is a whole industry around application and system monitoring, and it has gone from just events and logging to alerting and graph visualization of log data. The list of monitoring tools and services goes on and on. Here is a summarized list:

  • Datadog
  • Nagios Core
  • ManageEngine OpManager
  • Zabbix
  • Netdata
  • Uptime Robot
  • Pingdom
  • Amazon CloudWatch

Now that we have seen the meaning and a brief history of monitoring, we understand that monitoring is about making records of events, and that events can be labeled as warnings of something to come or as something that has happened and needs resolution. Bearing that in mind, let's go deeper into the types of monitoring that are available, based on the way we respond to metrics and the information from logs of events.

 

Discovering the types of monitoring

We now have an understanding of what monitoring is and a brief history of its evolution in terms of techniques and tooling. There are some concepts we should keep in mind when designing and architecting monitoring solutions; these concepts apply to any monitoring tool or service we want to implement, including the ones we will deploy in this book. Let's now take a look at the types of monitoring and the techniques peculiar to each of them, including their pros and cons.

Proactive monitoring

Before anything goes bad, there are usually warning signs and signals. In the earlier section, where we defined monitoring and looked at the Windows Event Viewer, we talked about a category of event called Warning. A warning signal helps you to prepare for a failure; in most cases, when warnings become too frequent, that part of the system can eventually fail, or another part of the system may be affected. Proactive monitoring helps you to prepare for the possibility of failure with warning signs, such as notifications and alarms, which can be in the form of mobile push notifications, emails, or chat messages that hold details of the warning.

Acting on these warning signs can help to avert the failure the warning points to. An example is an application that used to be fast and, after a while, starts to slow down, with users complaining about the speed. A monitoring tool can pick up the relevant metric and show that the response time (the time it takes a website, API, or web application to respond to a request) is high. A quick investigation into what makes it slow can then be done and, once the cause is found, the issue can be resolved, restoring the application or service to being fast and responsive.

Another good proactive monitoring scenario is tracking the percentage of disk space left on a server. The monitoring tool is configured to send warning alerts when 70% or more of the disk space is utilized. This ensures that the Site Reliability Engineer or System Administrator in charge can take action and empty out the disk for more space, because if that is not done and the disk fills up, the server where the application is deployed will no longer be available.
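The disk-space warning just described can be sketched in a few lines of Python. The 70% threshold mirrors the example in the text; the byte figures are invented for illustration.

```python
def disk_usage_percent(used_bytes, total_bytes):
    """Return disk utilization as a percentage."""
    return used_bytes / total_bytes * 100

def should_warn(used_bytes, total_bytes, threshold_percent=70):
    """True when utilization reaches the warning threshold (70% and above)."""
    return disk_usage_percent(used_bytes, total_bytes) >= threshold_percent

# 75 GB used out of 100 GB is 75% utilization, above the 70% threshold
print(should_warn(75 * 1024**3, 100 * 1024**3))  # True
```

On a real server, the used and total figures could come from Python's shutil.disk_usage; a monitoring agent would run this check on a schedule and raise an alert whenever it returns True.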

There are many scenarios where proactive monitoring can be used to predict failure, and it is immensely helpful in saving a system from a total failure or shutdown. It requires that action is taken as soon as a signal is received. In certain scenarios, the notification can be tied to another event that is triggered to help salvage the system from an interruption.

Proactive monitoring works with metrics and logs, or historical data, to understand the nature of the system it is watching. When a series of events has occurred, those events are captured in the form of logs, which are then used to estimate the behavior of the system and give feedback. An example is collecting logs from an nginx application server. The requests made to the server combine to form the logs on the nginx server. The logs can be aggregated, and an alert can be configured to check the number of 404 errors received within a five-minute interval. If the threshold is met, say, more than ten 404 error messages within a 5-minute interval, an alert is triggered. This is a warning sign that the website is not fully available to users, which is a symptom of a bigger problem that requires some investigation to find the reason for that high number of 404 errors within such a short period of time.

Important Note

404 is the HTTP status code for a page that does not exist.
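The 404-counting check described above can be sketched as follows. The sample lines imitate nginx's "combined" access-log format, and the threshold of ten errors per window mirrors the example in the text; a real aggregator would read the lines from the server's access log file.

```python
import re

# Invented sample lines in the style of an nginx access log
sample_log = [
    '10.0.0.1 - - [01/Apr/2021:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Apr/2021:10:00:03 +0000] "GET /missing HTTP/1.1" 404 153',
    '10.0.0.3 - - [01/Apr/2021:10:01:07 +0000] "GET /gone HTTP/1.1" 404 153',
]

STATUS = re.compile(r'" (\d{3}) ')  # the status code follows the quoted request

def count_status(lines, code="404"):
    """Count log lines whose HTTP status matches the given code."""
    return sum(
        1
        for line in lines
        if (m := STATUS.search(line)) and m.group(1) == code
    )

def should_alert(n_errors, threshold=10):
    """Fire when the 404 count in the window exceeds the threshold."""
    return n_errors > threshold

print(count_status(sample_log))  # 2 status-404 lines in this window
```

With only two 404s, should_alert stays quiet; once the count in a window passes ten, the alert fires and the investigation the text describes can begin.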

Reactive monitoring

This type of monitoring is aftermath monitoring: it alerts you when a major incident has already occurred. Reactive monitoring usually comes into play when the warnings of the proactive monitors are not heeded and actions are not taken to resolve the symptoms presented. This leads to an eventual failure of the full system or some part of it, depending on whether the architecture is a monolith or microservices. A basic example of reactive monitoring is a ping that continuously hits your application URL and checks for the HTTP success status code, 200. If it keeps getting this response, the service is running fine and is up. In any situation where it does not get a 2xx (or 3xx) response, the service is no longer available; the service is down.

Important Note

A 2xx response code means anything from 200 to 205, which indicates a service is OK. A 3xx response code is for redirection; it could be a permanent or temporary redirect. Responses that indicate failure include 4xx, which are application errors, and 5xx, which are server-side errors.

This is what the monitoring tool checks, and it sends a notification or alert immediately if it does not get a 200 status code. This is typically used for APIs, web applications, websites, and any application with a URL that receives requests over the HTTP/TCP protocol.
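The decision logic of such a ping check can be sketched as below. The classification follows the note above; a real check would obtain the code by issuing the request, for example with Python's urllib.request, and feed it into service_is_up.

```python
def classify_status(code):
    """Map an HTTP status code to the coarse categories described above."""
    if 200 <= code <= 299:
        return "ok"            # 2xx: the request succeeded
    if 300 <= code <= 399:
        return "redirect"      # 3xx: permanent or temporary redirect
    if 400 <= code <= 499:
        return "client-error"  # 4xx: application errors
    if 500 <= code <= 599:
        return "server-error"  # 5xx: server-side errors
    return "unknown"

def service_is_up(code):
    """Treat 2xx and 3xx responses as 'up'; anything else means down."""
    return classify_status(code) in ("ok", "redirect")

print(service_is_up(200), service_is_up(301), service_is_up(503))  # True True False
```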

Since this monitoring throws an alert after the failure, it is termed reactive monitoring. It is after the alert that you find out something has gone wrong; you then go in to restore the service and investigate what caused the failure and how to fix it. In most cases, you have to do root cause analysis, which involves using techniques from proactive monitoring and looking at the logs and events that have occurred in the system to understand what led to the failure.

Important Note

Root cause analysis (RCA for short) is a method of problem solving that involves deep investigation into the main cause, or the trigger of the cause, of a system malfunction or failure. It involves analyzing different touchpoints of the system and corroborating all findings to come to a final conclusion on the cause of the failure.

Endpoint monitoring services, such as Amazon CloudWatch Synthetics canaries, which we will talk about later in this book, are used for reactive monitoring. We will not stop at simple endpoint pinging for status codes, because Synthetics canaries can be configured to do much more than ping endpoints.

 

Understanding the components of monitoring

Now that we understand the types of monitoring and how they can be applied in different scenarios, the next thing is to look at the components that make monitoring possible. Every monitoring architecture or setup works with some base components when implementing either of the types of monitoring from the previous section. They are as follows:

  • Alerts/notifications
  • Events
  • Logging
  • Metrics
  • System availability
  • Dashboards

Alerts/notifications

An alert or notification is a message triggered to inform the system administrator or site reliability engineer about a potential issue, or an issue that has already happened. Alerts are configured with a certain metric in mind: when an alert or notification is set up, a metric is evaluated and validated against a condition, and when that condition is met, the alert is triggered to send a notification.

Notifications can be sent through different media: SMS (Short Message Service), email, mobile app push notifications, HTTP push events, and many more. The message sent via these media contains information about the incident that has occurred or the metric condition that has been met. Alerts can be used both for proactive monitoring, to warn sysadmins about high network I/O, and for reactive monitoring, to notify a Site Reliability Engineer (SRE) that an API endpoint is down. The AWS service specialized in this area is Amazon SNS, and the notifications we configure in this book will use Amazon SNS.
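As a hedged sketch, the evaluate-then-notify cycle looks like this. The notify callable stands in for whatever medium is configured, such as an SNS publish; the threshold and metric values are invented.

```python
def evaluate_alarm(metric_value, threshold, notify):
    """Check a metric against its threshold and, on breach, send a
    notification through the supplied medium (email, SMS, push, ...)."""
    if metric_value >= threshold:
        notify(f"ALARM: value {metric_value} breached threshold {threshold}")
        return True
    return False

sent = []                                # collects "delivered" messages
evaluate_alarm(95.0, 90.0, sent.append)  # high network I/O: alert fires
evaluate_alarm(42.0, 90.0, sent.append)  # normal reading: no alert
print(sent)
```

Swapping sent.append for a function that publishes to an SNS topic turns the same loop into the proactive and reactive alerting described above.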

Important Note

Amazon SNS is a fully managed messaging service used for sending notifications via SMS, push notification, HTTP call, or email. SNS does not require you to set up or manage any SMTP or SMPP servers for sending emails or SMS; AWS manages all of that for you and gives you an interactive UI, CLI, and API to manage the SNS service and use all of these media to send notifications.

Most users only use the SNS email medium, to get notified when something in a system goes down or goes wrong, or to receive warnings. An SNS HTTP topic, however, can be used to trigger another service, such as an event bus, to start a background process, for example cleaning temporary files on the server or creating a backup based on the warning signal received. The SRE can also tie automated runbooks to HTTP endpoints that an SNS topic triggers as a notification.

Important Note

An SRE is someone in charge of making sure that applications and services maintain the highest possible uptime. Uptime, usually measured as a percentage, expresses how rarely the system goes down or is unavailable for customers to use. A good uptime for a website is 99.9%. A handy tool for working with uptime figures is https://uptime.is.

Events

Any action or activity, or series of activities, that occurs in a system is called an event. In computer systems, there are various events and activities going on in the background to keep the computer running. A very simple example of an event is the system clock: a background process ensures the clock continues to tick so that time is kept, and each tick of the clock can be called an event. The hardware components that make up a PC, such as the CPU, memory, and hard disk, all have their own series of events that they perform from time to time. The disk is the storage for data in the computer, and it usually performs two basic operations, reading data that has been written to it and writing new data. We can also call these operations events of the disk.

In software programs, every function or method can be called an event. Software programs are made up of hundreds to thousands of methods or functions. Each of these functions has a unique operation they perform to be able to solve a specific problem. The ability to track each of these events is very important in monitoring software systems and applications.

Logging

A log is a historical record of an event. A log does not only hold the event and its details; it also contains the time the event occurred. A series of events forms logs. Every programming language generates logs during development, and it is through logs that developers are able to spot bugs in code. When a log is generated by the interpreter, it is read and interpreted, informing the developer of what the bug could be and what needs to be tweaked to fix the particular bug that has been identified.
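A minimal Python sketch of what makes a log entry: each record carries the event's severity, timestamp, and message. The logger name and messages here are invented for illustration, and an in-memory stream stands in for a log file.

```python
import io
import logging

stream = io.StringIO()  # stands in for a log file on disk
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("demo-app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("user login succeeded")  # an 'Information' event
logger.warning("disk is 71% full")   # a 'Warning' event

print(stream.getvalue())  # two timestamped entries: a small log
```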

We also showed the Microsoft Event Viewer in the previous section, which contains a list of events. This list of events eventually forms what is called logs: events that have taken place, with their descriptions, statuses, and the dates and times at which they occurred.

The following screenshot shows an example of a list of events that forms logs:

Figure 1.4 – List of events forms logs

Logs are the heart of monitoring because they provide the raw data that can be analyzed to draw insights into the behavior of the system. In many organizations, logs are kept for a specific period of time for system audits, security analysis/audits, and compliance inspection. In some cases, logs can contain sensitive information about an organization, which is a potential vulnerability that hackers and crackers can use to attack and exploit the system.

Mostly, logs are stored on the filesystem where the application is running, but logs can grow so large that storing them on a filesystem is no longer efficient. There are other places logs can be stored that scale almost infinitely in storage size; these will be revealed as we go deeper into this book:

Figure 1.5 – A sample of an nginx log

Figure 1.5 is another example of events forming a log; it is taken from the access log file of an nginx server.

Metrics

A metric is the smallest unit of insight obtained from a log. Metrics give meaning to the logs collected from a system; they provide a standard of measurement for different system components. In a huge collection of logs, what is usually needed is a single summary of all the information that has been captured, such as the estimated disk space left or the percentage of memory being consumed. This single piece of information helps the SRE or sysadmin know how to react. In some cases, the metric is fed into a more automated system that responds according to the data received.

A simple example is the auto-scaling feature in AWS, which can create or spin up a new server when something goes wrong with an existing one. One metric that can trigger this is the CPU consumption of the currently running server. If CPU consumption is above 90%, the server may become unreachable or unavailable within minutes or hours, so a remedy needs to be applied before that happens. That information can be used to create a new server to either replace the existing server or be added behind the load balancer to ensure that the application or service does not suffer downtime.
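The scaling decision described above can be sketched as follows. The 90% scale-out threshold comes from the text, while the 30% scale-in threshold is an assumption added for symmetry; real auto-scaling policies are configured in AWS, not hand-rolled like this.

```python
def desired_capacity(current_instances, cpu_percent,
                     scale_out_at=90, scale_in_at=30):
    """Add a server when CPU crosses the high threshold; retire one when
    load is comfortably low (never dropping below a single instance)."""
    if cpu_percent > scale_out_at:
        return current_instances + 1
    if cpu_percent < scale_in_at and current_instances > 1:
        return current_instances - 1
    return current_instances

print(desired_capacity(2, 95))  # 3: scale out before the server is lost
print(desired_capacity(2, 55))  # 2: within the normal band, no change
```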

The following diagram illustrates how auto-scaling works:

Figure 1.6 – Autoscaling based on instance metrics

Another use of a metric is in detecting malicious network activity. When the network activity of your cloud resources is closely monitored, an anomaly might show up in a metric such as NetworkIn (which measures the number of bytes of data transferred into the network infrastructure). An anomaly could be very high traffic at a particular time, which could mean the resources on that network are being hit by unnecessary DDoS requests, leading to a range of scenarios that are negative for the application.

Metrics are key to understanding, in summarized form, what is going on: they attach a label to the huge stream of events and logs received from various systems so that action can be taken based on this intelligence.

System availability

Availability, in a simple context, means that something or someone is accessible. System availability, in that same context, means that a system is available for use by the users or customers who require it. In software systems, the availability of your website, application, or API means that it is accessible to whoever needs it, whenever they need it. It could be the shopping website a customer needs to access to purchase those Nike sneakers, or the payment service API a developer needs to integrate so that users can make payments. If the customer or developer is not able to access it whenever they need it, then that service is termed not highly available.

To understand the availability of any system, monitoring plays a very key role. Knowing when the system is up or down can be aggregated to get the system availability over a period of time. This is generally called system uptime, or just uptime, and it can be calculated as follows:

Figure 1.7 – Formula for calculating availability

In the preceding formula, we have the following:

  • Total Uptime: How long the system has been available to the user or customer, in hours.
  • Total Downtime: How long the system has been unavailable to the user or customer, in hours.
  • Availability: The final system availability as a decimal fraction, which is then multiplied by 100 to get the percentage availability.

Another scenario in which to apply this: say we want to calculate the availability of an API serving third-party customers who integrate with it for forex indices. Within a month, the API was available for a total of 300 hours. In that same month, a huge surge of traffic on the API, due to an announcement in the news, left the API unavailable for about 3 hours. The development team then had to release an update involving changes to the API functionality because of the surge of users, and this release cost another 4 hours of downtime, bringing the total so far to 7 hours. Finally, the security team needed to look at the logs during monthly system maintenance, which led to another 1 hour 30 minutes of downtime.

We can calculate the availability of this system as follows:

  • Total Uptime = 300 hours
  • Downtime1 = 3 hours
  • Downtime2 = 4 hours
  • Downtime3 = 1 hour 30 mins = 1.5 hours
  • Total Downtime = 3 + 4 + 1.5 = 8.5 hours
  • Total Uptime + Total Downtime = 300 + 8.5 = 308.5 hours
  • Availability = 300 / 308.5 = 0.9724
  • Availability as a percentage = 97.24%
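The calculation above can be reproduced directly from the formula in Figure 1.7:

```python
def availability_percent(uptime_hours, downtime_hours):
    """Availability = Total Uptime / (Total Uptime + Total Downtime) x 100."""
    return uptime_hours / (uptime_hours + downtime_hours) * 100

total_downtime = 3 + 4 + 1.5  # the three outages from the example
print(round(availability_percent(300, total_downtime), 2))  # 97.24
```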

But how do we actually interpret the meaning of 97.24% availability? We might say it is a good number because it is quite close to 100%, right? There is a chart that will help us understand that it is actually more than that:

Table 1.1 – Uptime chart

If we are to approximate the uptime of our system based on the calculation, it will round down to 97%. Taking this value and checking it on the preceding chart, we can see that this value means the following:

  • 10.96 days of downtime in a year
  • 21.92 hours of downtime in a month
  • 5.04 hours of downtime in a week
  • 43.2 minutes of downtime in a day
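The chart's figures can be derived directly from the availability percentage. This sketch assumes the 365.25-day year that uptime charts commonly use; other charts use 365 days, which shifts the yearly and monthly figures slightly.

```python
def downtime_budget(availability_pct):
    """Convert an availability percentage into allowed downtime per period."""
    down_fraction = 1 - availability_pct / 100
    return {
        "days_per_year": round(down_fraction * 365.25, 2),
        "hours_per_month": round(down_fraction * 365.25 * 24 / 12, 2),
        "hours_per_week": round(down_fraction * 7 * 24, 2),
        "minutes_per_day": round(down_fraction * 24 * 60, 2),
    }

budget = downtime_budget(97)
print(budget["days_per_year"], budget["hours_per_week"])  # 10.96 5.04
```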

The fact that monitoring can help us understand the availability of our system is one step. But the system being monitored is used by our customers, who expect it to be up and running 24/7. They are hardly concerned with whatever excuses you might give for downtime, and in some cases downtime could mean losing them to your competition. Organizations therefore do well to communicate and promise their customers a level of system availability. This gives the customer a level of expectation of the Quality of Service (QoS) they will receive. It also helps to boost customer confidence and gives the business a benchmark to meet.

This indicator or metric is called an SLA, an acronym for Service Level Agreement. According to Wikipedia, an SLA is a commitment between a service provider and a client. In simple terms, an SLA is the percentage of uptime a service provider promises the customer; if availability falls below that number, the customer is entitled to lay claims and receive compensation. The onus is on the service provider to ensure they do not go below the SLA that has been communicated to the customer.
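As a trivial sketch (the numbers here are made up for illustration), checking a measured availability against a promised SLA is a one-line comparison:

```python
def sla_met(measured_pct: float, promised_pct: float) -> bool:
    """True if measured availability meets or exceeds the promised SLA."""
    return measured_pct >= promised_pct

# Our 97.24% from earlier would breach a promised 99.9% ("three nines") SLA
print(sla_met(97.24, 99.9))  # False
```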

Dashboard

For every event, log, or metric that is measured or collected, there is always a better way to represent the data. Dashboards present log and metric data in a visually appealing manner. A dashboard is a combination of different graphical representations of data, which could be in the form of line graphs, bar charts, histograms, scatter plots, or pie charts. These representations give the user a summarized view of the logs, which makes it easy to spot things such as trends in a graph.

When there is a rise in disk I/O, say in the number of bytes of data written per second, one of the fastest ways to represent this is through a line graph, whose slope shows the gradual rise in traffic from one point to another. If, during the night, there was a sudden spike in the memory consumption of one of the servers due to high customer usage of the service, a line graph makes it easy to spot the time of day the spike happened and when consumption came back down.

These, and many more, are the benefits of having a dashboard that graphically represents the data collected from the logs of the system. Metrics are also represented in graphs for much easier interpretation. Amazon CloudWatch has a built-in dashboard feature where different types of graphs can be created and added to a dashboard, either based on certain specifications or to group related data, making it easier to understand the logs and derive meaning from the log data collected:

Figure 1.8 – Sample CloudWatch dashboard
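As a sketch of how such a dashboard could be created programmatically (the widget layout, names, and instance ID here are illustrative assumptions), a CloudWatch dashboard is defined by a JSON body of widgets, which can then be pushed with the `PutDashboard` API:

```python
import json

def build_dashboard_body(instance_id: str, region: str = "us-east-1") -> str:
    """Build a minimal dashboard body with one CPU line graph for an EC2 instance."""
    body = {
        "widgets": [
            {
                "type": "metric",
                "x": 0, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                    "period": 300,
                    "stat": "Average",
                    "region": region,
                    "title": "EC2 CPU utilization",
                },
            }
        ]
    }
    return json.dumps(body)

# Pushing the dashboard requires AWS credentials, for example with boto3:
# import boto3
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="ops-overview",
#     DashboardBody=build_dashboard_body("i-0123456789abcdef0"))
```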

Next, we will understand what an incident is.

Incidents

An incident is an event, condition, or situation that causes a disruption to the normal operation of a system or an organization. Incidents are negative events and are associated with reactive monitoring. Incidents make a website, API, or application slow or unavailable to a user. Different things can trigger or cause an incident, ranging from a bug that makes the application totally unavailable to a security breach in which an attacker collects sensitive user or customer information. They are all termed incidents. Some of these incidents can be captured by the monitoring tool to show when the incident occurred, and can form part of the incident report that is documented in your organization.

It is advisable for every organization to have an Incident Management Framework. This framework defines how failures, or any form of incident reported, are managed by the SRE/sysadmin team. Incidents are usually captured by the monitoring tool. When an attacker performs a brute-force attack on a Linux server and gains access, this activity can be picked up by monitoring tools and an alert sent over to the team, helping the security team investigate the issue and ensure it never occurs again. The incident framework guides every team within the organization on how to react in the event of an incident. There are usually levels of incidents labeled according to their severity. In most cases, these are SEV1, SEV2, and SEV3, which mean severity 1, severity 2, and severity 3, respectively. The number indicates the priority or intensity of the incident.

Phewww! That was quite a lot of information, but these components are at the heart of monitoring architecture and infrastructure. We have seen how dashboards help with proactive monitoring and with understanding the infrastructure even before disaster strikes. The next thing is to look at the importance of monitoring, and how these components each contribute to it.

When it comes to the value monitoring gives, there are major reasons it sits at the very core of every system that is designed and implemented. As long as a system has been designed and is bound to meet unforeseen circumstances, the importance of monitoring it can never be overstated. Disregarding monitoring means that the lifetime of the system is not being put into consideration. We can also say, fine, we want to monitor this system or that system, but what core value can be derived from monitoring activities, be it a proactive or a reactive monitoring approach? There are key reasons to monitor a system:

  • Realizing when things go south
  • Ability to debug
  • Gaining insights
  • Sending data/notifications to other systems
  • Controlling Capital Expenditure (CapEx) to run cloud infrastructure

We have listed some reasons to ensure monitoring is part of your infrastructure and application deployments. Let's expand on these with examples to help drive home the essence of each point mentioned in the preceding list.

Realizing when things go south

Organizations that do not have any kind of monitoring service suffer from customers being the ones to inform them of downtime. The ability to know before customers raise it is very important: it paints a bad picture when users go to social media to share negative opinions about a system being down before the company itself finds out there was a downtime. Reactive monitoring is the technique that helps here. Simple endpoint monitoring that pings your endpoints and services from time to time and reports back is very valuable. There are times an application might be running, yet customers are unable to reach it for various reasons. Endpoint monitoring can send email alerts or SMS notifications to the SRE team to notify them of downtime before the customer makes any kind of complaint, so the issue can quickly be resolved, improving overall service availability and reducing MTTR.

Important Note

MTTR is an acronym for Mean Time to Recover. It is the measure of how quickly a system recovers from failure.
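A minimal sketch of such an endpoint check, using only the Python standard library (a real setup would use CloudWatch Synthetics or a dedicated uptime service, and would run the check on a schedule):

```python
import urllib.error
import urllib.request

def is_healthy(status_code: int) -> bool:
    """Treat 2xx and 3xx responses as 'up'; 4xx/5xx as 'down'."""
    return 200 <= status_code < 400

def check_endpoint(url: str, timeout: float = 5.0):
    """Ping an HTTP endpoint; return (healthy, status_code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status), resp.status
    except urllib.error.HTTPError as err:
        return is_healthy(err.code), err.code
    except (urllib.error.URLError, OSError):
        return False, None  # unreachable or timed out: the service is down

# A real monitor would call check_endpoint(...) periodically and alert
# the SRE team (email/SMS) whenever it reports the service as down.
```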

Ability to debug

When an application has a bug, be it functional or non-functional, the developer needs logs, or a way to trace the round trip of a request through the application, to understand where in the flow the bottleneck is. Without logs or a bird's-eye view of the application's behavior, it is almost impossible to debug it and come up with a solution once the problem is understood. Reactive monitoring is the technique to apply here: logs from the application server or web server will point to the bug in the system.

Gaining insights

Insight into the behavior of your system is critical to the progress, or retrogression, of your application. The insights that can be gained from an application are quite broad, ranging from the internal server components to the behavior of the network over time. These insights might not be used to fix a specific problem, but rather to understand the state of the system from time to time. Spotting trends in the system, viewing intermittent behavior, planning infrastructure capacity, and improving cost optimization are some of the activities that can make the system better based on the insights obtained. For example, when the environment is monitored, a rogue NAT gateway that is not needed in an architecture can be spotted and deleted, which could save huge costs, considering what Amazon VPC NAT gateways cost, especially when they are not actively in use.

Sending data/notifications to other systems

Monitoring is not just about watching events and generating logs and traces of those events. It also involves taking action based on the logs and metrics that have been generated. With the data or notifications obtained from monitoring systems, an automated recovery operation can be tied to a metric, recovering the system without any manual intervention from the SREs or sysadmins. Amazon EventBridge is a service that can also send events to third-party SaaS solutions; CloudWatch can be configured to send triggers to Amazon EventBridge to carry out operations on systems that are not within the AWS infrastructure.

Controlling CapEx to run cloud infrastructure

CapEx can be managed when things are monitored. When using different AWS services, there is always the possibility of not keeping track of the resources being provisioned and overspending. The capital expense is what it costs to run a particular cloud service. Monitoring with a budget and billing alarm can be a lifesaver, alerting you when you are spending above the budget that has been set for that particular month. This means the bill is being monitored, and when the running services go over budget, an email alert is sent to notify you. There are also alarms that fire at the beginning of every month to notify you of the forecast for that month.
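As a hedged sketch of such a billing alarm (the alarm name, SNS topic ARN, and threshold below are invented for illustration), CloudWatch exposes the month-to-date bill as the `EstimatedCharges` metric in the `AWS/Billing` namespace, which is published only in the us-east-1 Region:

```python
def billing_alarm_params(threshold_usd: float, sns_topic_arn: str) -> dict:
    """Parameters for a CloudWatch alarm on the month-to-date estimated bill."""
    return {
        "AlarmName": "monthly-budget-alarm",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # billing metrics update roughly every six hours
        "EvaluationPeriods": 1,
        "Threshold": threshold_usd,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that emails/SMSes you
    }

# Creating the alarm requires AWS credentials, for example:
# import boto3
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(
#     **billing_alarm_params(100.0, "arn:aws:sns:us-east-1:123456789012:billing-alerts"))
```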

We have understood the meaning of monitoring and its historical background, from the days of the Microsoft Windows Event Viewer to the new tools that have evolved from that basic understanding. Then, we discussed the types of monitoring that can be employed and the strategies behind them. We also identified the major components that must be considered when setting up or configuring any monitoring infrastructure. Finally, we looked at the importance of monitoring, drawing on the types, components, and strategies we learned and the value each of these brings. The next step is to introduce the monitoring service this book is based on: Amazon CloudWatch.

 

Getting to know Amazon CloudWatch

Amazon CloudWatch is a monitoring service designed by AWS. It is an all-encompassing, end-to-end monitoring solution used for monitoring applications, servers, serverless applications, on-premises systems, cloud-native applications, storage services, networking services, database services, and many more. CloudWatch can collect and store logs from applications deployed in any environment, and you do not have to set up, configure, or manage any servers for log storage and management. CloudWatch stores petabytes of logs for AWS users, and it is embedded with tools that make it easy to search through and interpret the log data that has been collected.

CloudWatch does not only store logs; it also has its own dashboard feature, which is used to draw different types of graphs for data interpretation. With its built-in CloudWatch Events, which is gradually being migrated into Amazon EventBridge, it can respond to events based on specific conditions, and those events can in turn trigger other operations that have been configured. CloudWatch Event rules can be configured to trigger a specific CloudWatch event. A simple example is configuring a rule to shut down an EC2 instance at 7 p.m. when the business closes and start it up at 7 a.m. when everyone is back at work.
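As a sketch of the 7 p.m. shutdown example (the rule name and schedule are illustrative; in practice the rule's target would typically be a Lambda function that stops the instance), a CloudWatch Events/EventBridge rule is defined with a cron schedule expression:

```python
def close_of_business_rule() -> dict:
    """EventBridge rule parameters: fire at 19:00 UTC, Monday to Friday."""
    return {
        "Name": "stop-ec2-at-close-of-business",
        "ScheduleExpression": "cron(0 19 ? * MON-FRI *)",
        "State": "ENABLED",
        "Description": "Trigger the target that stops EC2 instances at 7 p.m.",
    }

# With AWS credentials, the rule and its target would be registered like this:
# import boto3
# events = boto3.client("events")
# events.put_rule(**close_of_business_rule())
# events.put_targets(Rule="stop-ec2-at-close-of-business",
#                    Targets=[{"Id": "stop-ec2", "Arn": "<stop-instances-lambda-arn>"}])
```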

Important Note

A CloudWatch Event rule is functionality within Amazon CloudWatch that is used to schedule when an operation should be performed. It is usually associated with Amazon CloudWatch Events.

One very important feature of every monitoring tool is also present in CloudWatch: alerts. CloudWatch has a rich alerting system that works in conjunction with Amazon SNS. Alerts can be configured to trigger based on specific metrics identified in the logs CloudWatch receives. Mathematical expressions can be used to configure granular metrics and specific time intervals that determine when an alert is sent and for what reason. In some cases, alerts can serve as the first port of call for solving the issue identified, such as rebooting an EC2 instance that refused to start up for whatever reason.

CloudWatch also has a feature called Synthetics, which makes it possible for CloudWatch to send intermittent pings to an endpoint, website, API, or web application to check its status, and to send a notification when it is down. This means that CloudWatch can be used for both proactive and reactive monitoring.

This book will show how to set up, configure, and manage different types of infrastructure from the monitoring perspective. From Chapter 3, CloudWatch Logs, Metrics and Dashboard, through to Chapter 9, Monitoring Storage Services with Amazon CloudWatch, we will be configuring the monitoring of different AWS services and resources. This will be after a brief introduction to the service and its components.

 

Introducing the relationship between Amazon CloudWatch and Well-Architected

The AWS Well-Architected framework is a set of principles that govern how an application is architected, developed, deployed, and scaled in the cloud. It is a compilation of the experience of hundreds of AWS solutions architects who, over decades and across various industries, have designed, managed, and scaled many types of systems. All of this knowledge and experience has been distilled into the summarized principles that make up the AWS Well-Architected framework, which consists of five pillars:

  • Security
  • Cost Optimization
  • Performance Efficiency
  • Operational Excellence
  • Reliability

Each of these pillars covers a wide range of tenets for different aspects of infrastructure setup, scaling, security, and deployment. But we will be looking at the monitoring aspects of the Well-Architected framework.

The Reliability pillar focuses on building systems that are reliable, stand the test of time, and are always available to do what they were designed to do. This requires close monitoring of the system from time to time. It also covers managing service quotas for different AWS resources as you continue to use the various AWS services.

Important Note

AWS provides on-demand scaling for server resources and application services. This is managed using service quotas, a regulatory technique AWS uses to manage the maximum number of resources, actions, and items in your AWS account (https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html).

CloudWatch alarms can be configured for these quotas so that alerts are received when you are close to hitting the limit of a service allocation. This protects against possible workload failure in situations where a service is needed but its limit has been exceeded.

With Performance Efficiency, the focus is on ensuring that the application performs well at all times. Two ways to achieve this are continuously gaining insights and understanding the behavior of the workloads and application. Rigorous testing using methods such as load testing can be very helpful in observing the behavior of the system under load. When a load test is carried out, metrics and logs are collected and then studied to understand and gain insights into the system's behavior. This can be done on the staging or test setup of the application, and it helps SREs understand what to expect when the application is eventually released to customers.

The fear of bills is what chases away a lot of first-time cloud users. The Cost Optimization pillar of the Well-Architected framework is focused on optimizing your AWS bill by using cost-effective services and designs when deploying workloads in your AWS infrastructure. The part of CloudWatch most connected to your cost is its role in auto scaling, which can be very helpful in reducing the cost of your overall workload. CloudWatch metrics can be used to trigger the scaling up or scaling down of your infrastructure based on thresholds that have been configured. This goes a long way toward saving costs: when the server resources being consumed are low, the number of servers in use is reduced, and when consumption climbs, CloudWatch can identify that and trigger a scale-up to add more servers and balance the load hitting the application.
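A sketch of such threshold-based scaling (the group and policy names are invented for illustration): with a target tracking policy, EC2 Auto Scaling creates and manages the CloudWatch alarms that add or remove servers to keep a metric near a target value:

```python
def cpu_target_tracking_policy(asg_name: str, target_cpu_pct: float) -> dict:
    """Parameters for an EC2 Auto Scaling target tracking policy on average CPU."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "keep-average-cpu-near-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu_pct,  # scale out above, scale in below
        },
    }

# Applied with AWS credentials, for example:
# import boto3
# boto3.client("autoscaling").put_scaling_policy(
#     **cpu_target_tracking_policy("web-asg", 50.0))
```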

 

Summary

Monitoring is quite a large topic; a lot can be said about it. What we have done in this chapter is consider monitoring as a natural human attitude and characteristic. The fact that anything that is built can fail means there is a natural instinct to ensure that things are monitored, understood, and augmented to work better over time. We applied this concept to computing and explained that computing brings automation to this natural human process of monitoring. We covered the different components of monitoring computer systems, namely logs, metrics, dashboards, and incidents, and explained the meaning of each. Next, we explained the importance of monitoring, pinpointing specific key reasons to monitor your application workload and infrastructure. Then, we moved on to Amazon CloudWatch, the AWS managed end-to-end monitoring service built with the features any monitoring infrastructure or service requires. Lastly, the icing on the cake was the AWS Well-Architected framework, which serves as a boilerplate for everything cloud-native, monitoring included.

This has given us a solid foundation in the fundamentals and components of monitoring and its importance in the day-to-day activity of an SRE. We have also seen that CloudWatch is a managed service that takes away the operational expense of running our own monitoring infrastructure. This foundational knowledge will be beneficial as we go deeper into this book.

In the next chapter, we will take our first step into Amazon CloudWatch to understand the components, events, and alarms.

 

Questions

  1. Which pillars of the AWS Well-Architected framework are focused on monitoring?
  2. Which protocol is mostly used to send alert notifications?
  3. Being able to know and forecast when things are going to have a problem is an important part of monitoring. What is this type of monitoring called?
  4. What is the smallest unit of insight that helps to make all of the logs and data collected from applications and infrastructure meaningful called?
 

Further reading

Refer to the following links for more information on topics covered in this chapter:

About the Author

  • Ewere Diagboya

Ewere Diagboya is a technology lover at heart who started his journey into technology back in 2003. At an early age, he had already mastered Windows 98 applications, and he learned coding in Visual Basic 6.0 with the help of his brother.

    Afterward, he mastered other programming languages, building both desktop and web applications. Ewere moved into cloud computing and DevOps and has been practicing, training, and mentoring for close to a decade now.

    He worked closely with other like-minded people to start the AWS User Group in Nigeria and helped foster the vibrancy of the DevOps Nigeria group. In 2020, he was awarded the first AWS Community Hero award in Africa. Ewere also runs a training institute and has a blog on Medium where he writes about cloud computing and DevOps.
