Datadog Cloud Monitoring Quick Start Guide

By Thomas Kurian Theakanath
About this book

Datadog is an essential cloud monitoring and operational analytics tool which enables the monitoring of servers, virtual machines, containers, databases, third-party tools, and application services. IT and DevOps teams can easily leverage Datadog to monitor infrastructure and cloud services, and this book will show you how.

The book starts by describing basic monitoring concepts and types of monitoring that are rolled out in a large-scale IT production engineering environment. Moving on, the book covers how standard monitoring features are implemented on the Datadog platform and how they can be rolled out in a real-world production environment. As you advance, you'll discover how Datadog is integrated with popular software components that are used to build cloud platforms. The book also provides details on how to use monitoring standards such as Java Management Extensions (JMX) and StatsD to extend the Datadog platform. Finally, you'll get to grips with monitoring fundamentals, learn how monitoring can be rolled out using Datadog proactively, and find out how to extend and customize the Datadog platform.

By the end of this Datadog book, you will have gained the skills needed to monitor your cloud infrastructure and the software applications running on it using Datadog.

Publication date:
June 2021
Publisher
Packt
Pages
318
ISBN
9781800568730

 

Chapter 1: Introduction to Monitoring

Monitoring is a vast area and there is a confusing array of tools and solutions that address various requirements in that space. However, upon a closer look, it becomes clear that these products share a lot of common features. Therefore, before we start discussing Datadog as a monitoring tool, it is important to understand the core ideas and terminology in monitoring.

In this chapter, we are going to cover the following topics:

  • Why monitoring?
  • Proactive monitoring
  • Monitoring use cases
  • Monitoring terminology and processes
  • Types of monitoring
  • Overview of monitoring tools
 

Technical requirements

There are no technical requirements for this chapter.

 

Why monitoring?

Monitoring is a generic term in the context of operating a business. As part of effectively running a business, various elements of the business operations are measured for their health and effectiveness. The outputs of such measurements are compared against the business goals. When such measurements are made periodically and systematically, the practice can be called monitoring.

Monitoring could be done directly or indirectly, and voluntarily or dictated by some law. Successful businesses are good at tracking growth by using various metrics and taking corrective actions to fine-tune their operations and goals as needed.

The previous statements are common knowledge and applicable to any successful operation, even if it is a one-time event. However, there are a few important keywords we already mentioned – metrics, measurement, and goal. A monitoring activity is all about measuring something and comparing that result against a goal or a target.

While it is important to keep track of every aspect of running a business, this book focuses on monitoring software application systems running in production. Information Technology (IT) is a key element of running businesses and that involves operating software systems. With software available as a service through the internet (commonly referred to as Software as a Service or just SaaS), most software users don't need to run software systems in-house, and thus there's no need for them to monitor those systems.

However, SaaS providers need to monitor the software systems they run for their customers. Big corporations such as banks and retail chains may still have to run software systems in-house due to the non-availability of the required software service in the cloud or due to security or privacy reasons.

The primary focus of monitoring a software system is to check its health. By keeping track of the health of a software system, it is possible to determine whether the system is available or its health is deteriorating.

If it is possible to catch the deteriorating state of a software system ahead of time, it might be possible to fix the underlying issue before any outage happens, which would ensure business continuity. Such a method of proactive monitoring that provides warnings well in advance is the ideal method of monitoring. But that is not always possible. There must also be processes in place to deal with the outage of software systems.

A software system running in production has three major components:

  • Application software: A software service provider builds applications that will be used by customers or the software is built in-house for internal use.
  • Third-party software: Existing third-party software such as databases and messaging platforms are used to run application software. These are subscribed to as SaaS services or deployed internally.
  • Infrastructure: This usually refers to the network, compute, and storage infrastructure used to run the application and third-party software. This could be bare-metal equipment in data centers or services provisioned on a public cloud platform such as AWS, Azure, or GCP.

Though we discussed monitoring software systems in broad terms earlier, a closer look shows that it involves monitoring the three main components listed previously. The information collected, and the metrics used to measure it, differ for each category.

The monitoring of these three components constitutes core monitoring. There are many other aspects, both internal to the system (such as its health) and external (such as security), that would make the monitoring of a software system complete. We will look at all the major aspects of software system monitoring in general in this introductory chapter, and at how Datadog supports them in later chapters.

 

Proactive monitoring

Technically, monitoring is not part of a software system running in production. The applications in a software system can run without any monitoring tools rolled out. As a best practice, software applications must be decoupled from monitoring tools anyway.

This scenario sometimes results in taking software systems to production with minimal or no monitoring, which would eventually result in issues going unnoticed, or, in the worst-case scenario, users of those systems discovering those issues while using the software service. Such situations are not good for the business due to these reasons:

  • An issue in production impacts the business continuity of the customers and, usually, there would be a financial cost associated with it.
  • Unscheduled downtime of a software service would leave a negative impression on the users about the software service and its provider.
  • Unplanned downtime usually creates chaos at the business level and triaging and resolving such issues can be stressful for everyone involved and expensive to the businesses impacted by it.

One of the mitigating steps taken in response to a production issue is adding some monitoring so that the same issue will be caught and reported to the operations team. Usually, such a reactive approach increases the coverage of monitoring organically rather than by following a monitoring strategy. While such an approach will sometimes help to catch issues, there is no guarantee that an organically grown monitoring infrastructure will be capable of checking the health of the software system and warning about it in time for remediation steps to be taken proactively to minimize future outages.

Proactive monitoring refers to rolling out monitoring solutions for a software system to report on issues with the components of the software system, and the infrastructure the system runs on. Such reporting can help with averting an impending issue by taking mitigating steps manually or automatically. The latter method is usually called self-healing, a highly desirable end state of monitoring, but hard to implement.

The key aspects of a proactive monitoring strategy are as follows.

Implementing a comprehensive monitoring solution

Traditionally, the focus of monitoring has been the infrastructure components – compute, storage, and network. As you will see later in this chapter, there are more aspects of monitoring that would make the list complete. All relevant types of monitoring have to be implemented for a software system so that issues with any component, whether software or infrastructure, would be caught and reported.

Setting up alerts to warn of impending issues

The monitoring solution must be designed to warn of impending issues with the software system. This is easy with infrastructure components as it is easy to track metrics such as memory usage, CPU utilization, and disk space, and alert on any usage over the limits.

However, such a requirement would be tricky at the application level. Sometimes applications can fail on perfectly configured infrastructure. To mitigate that, software applications should provide insights into what is going on under the hood. In monitoring jargon, this is called observability, and we will see later in the book how it can be implemented in Datadog.

Having a feedback loop

A mature monitoring system that warns of impending issues so that mitigation steps can be taken is not good enough on its own. Such warnings must also be used to resolve issues automatically (for example, spinning up a new virtual machine with enough disk space when an existing virtual host runs out of disk space), or be fed into the redesign of the application or infrastructure to prevent the issue from happening in the future.

 

Monitoring use cases

Monitoring applications are installed and configured based on where the applications to be monitored are run. Here, we will look at a few use cases of how monitoring is rolled out in different scenarios to understand the typical configurations.

All in a data center

This is a classic scenario of monitoring in which both the infrastructure that hosts the applications and the monitoring tools are in one or more data centers. The data centers could be privately owned by the business or a hosting facility that is rented out from a data center provider. The latter option is usually called co-location.

This figure illustrates how both the software application and monitoring tool are running from the same data center.

Figure 1.1 – Single data center

The following figure illustrates how the software application and monitoring tool are running from two data centers, which ensures the availability of the software system:

Figure 1.2 – All in a data center with multiple data centers

If the application is hosted from multiple data centers, and if one of them becomes inaccessible, it would be possible for the monitoring system to alert on that. If only one data center is in use, then that will not be viable as the entire monitoring system would also become inaccessible along with the software system it monitors.

Application in a data center with cloud monitoring

This is an emerging scenario where businesses have their infrastructure hosted in data centers and a cloud monitoring service such as Datadog is used for monitoring. In this scenario, the monitoring backend is in the cloud and its agents run in the data centers alongside the application system.

Figure 1.3 – Application in a data center with cloud monitoring

There is no need to monitor the monitoring system itself, as the SaaS provider reports on the status of its own service.

All in the cloud

There could be two different cases in an all-in-the-cloud scenario of monitoring. In both cases, the infrastructure that runs the software system will be in the cloud, typically on a public cloud platform such as AWS. The entire monitoring system can be deployed on the same infrastructure or a cloud monitoring service such as Datadog can be used, in which case, only its agent will be running alongside the application system.

Figure 1.4 – All in the cloud with in-house monitoring

In the first case of an all-in-the-cloud scenario, you need to set up monitoring or leverage monitoring services provided by a public cloud provider such as CloudWatch on AWS or use a combination of both.

Figure 1.5 – All in the cloud with cloud monitoring

In the second case of an all-in-the-cloud scenario, a third-party SaaS monitoring service such as Datadog is used. The main attractions of using SaaS monitoring services are their rich set of features, the high availability of such services, and the minimal overhead of rolling out and maintaining monitoring.

Using cloud infrastructure has the added advantage of having access to native monitoring tools such as CloudWatch on AWS. These services are highly reliable and can be used to enhance monitoring in multiple ways:

  • To monitor cloud resources that are hard to monitor using standard monitoring tools
  • To serve as a secondary monitoring system, mainly to cover infrastructure monitoring
  • To monitor the rest of the monitoring infrastructure as a meta-monitoring tool

The scenarios we discussed here were simplified to explain the core concepts. In real life, a monitoring solution is rolled out with multiple tools and some of them might be deployed in-house and others might be cloud-based. When such complex configurations are involved, not losing track of the main objectives of proactive monitoring is the key to having a reliable monitoring system that can help to minimize outages and provide operational insights that will contribute to fine-tuning the application system.

 

Monitoring terminology and processes

Now, let's look at the most commonly used monitoring terminology in both literature and tools. The difference between some of these terms is subtle and you may have to pay close attention to understand them.

Host

A host used to mean a physical server during the data center era. In the monitoring world, it usually refers to a device with an IP address. That covers a wide variety of equipment and resources – bare-metal machines, network gear, IoT devices, virtual machines, and even containers.

Some of the first-generation monitoring tools, such as Nagios and Zenoss, are built around the concept of a host, meaning everything that can be done on those platforms must be tied to a host. Such restrictions are relaxed in new-generation monitoring tools such as Datadog.

Agent

An agent is a service that runs alongside the application software system to help with monitoring. It runs various tasks for the monitoring tools and reports information back to the monitoring backend.

The agents are installed on the hosts where the application system runs. An agent could be a simple process running directly on the operating system or a microservice. Datadog supports both options, and when the application software is deployed as microservices, the agent is also deployed as a microservice.

Metrics

Metrics in monitoring refers to a time-bound measurement of some information that would provide insight into the workings of the system being monitored. These are some familiar examples:

  • Disk space available on the root partition of a machine
  • Free memory on a machine
  • Days left until the expiration of an SSL certificate

The important thing to note here is that a metric is time-bound and its value will change. For that reason, the total disk space on the root partition is not considered a metric.

A metric is measured periodically to generate time-series data. We will see that this time-series data can be used in various ways – to plot charts on dashboards, to analyze trends, and to set up monitors.
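
As a minimal sketch of how periodic measurement turns a metric into time-series data, the following Python snippet samples the free disk space on a path at a fixed interval. The function names, the sampling interval, and the layout of each data point are illustrative assumptions, not part of any monitoring tool's API.

```python
import shutil
import time
from typing import List, Tuple

# One data point: (Unix timestamp in seconds, metric value in bytes).
Sample = Tuple[float, int]

def free_disk_bytes(path: str = "/") -> int:
    """Return the free space, in bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free

def collect_series(path: str, count: int, interval_s: float) -> List[Sample]:
    """Measure the metric `count` times, `interval_s` seconds apart."""
    series: List[Sample] = []
    for i in range(count):
        series.append((time.time(), free_disk_bytes(path)))
        if i < count - 1:
            time.sleep(interval_s)  # wait before taking the next measurement
    return series

# Example usage: three samples of the root partition, one second apart.
# series = collect_series("/", count=3, interval_s=1.0)
```

Each tuple pairs a timestamp with a value, which is the shape a charting or alerting component would typically consume.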

A wide variety of metrics, especially those related to infrastructure, are generated by monitoring tools. There are options to generate your own custom metrics too:

  • Monitoring tools provide options to run scripts to generate metrics.
  • Applications can publish metrics to the monitoring tool.
  • Either the monitoring tool or others might provide plugins to generate metrics specific to third-party tools used by the software system. For example, Datadog provides such integrations for most of the popular tools, such as NGINX. So, if you are using NGINX in your application stack by enabling the integration, you can get NGINX-specific metrics.
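
To illustrate the second option, an application publishing its own metric, here is a hedged sketch that sends a gauge in the plain StatsD text format over UDP. It assumes a StatsD-compatible listener on 127.0.0.1 port 8125; the metric name `myapp.queue.depth` and both helper functions are hypothetical.

```python
import socket

def format_gauge(name: str, value: float) -> str:
    """Build a StatsD gauge datagram of the form <name>:<value>|g."""
    return f"{name}:{value}|g"

def send_gauge(name: str, value: float,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge over UDP (fire-and-forget, no delivery guarantee)."""
    payload = format_gauge(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# Example usage (assumes a listener is running locally):
# send_gauge("myapp.queue.depth", 42)
```

Because UDP is connectionless, the application is not blocked if the listener is down, which keeps the application decoupled from the monitoring tool.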

Up/down status

A metric measurement can have a range of values, but up or down is a binary status. These are a few examples:

  • A process is up or down
  • A website is up or down
  • A host is pingable or not

Tracking the up/down status of various components of a software system is core to all monitoring tools and they have built-in features to check on a variety of resources.
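
A simple way to picture an up/down check is a TCP connection attempt: the result is either success or failure, with no range of values in between. The sketch below is an illustration using only the Python standard library; the function name and the timeout default are assumptions.

```python
import socket

def is_port_up(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Example usage: is_port_up("127.0.0.1", 443)
```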

Check

A check is used by the monitoring system to collect the value of metrics. When it is done periodically, time-series data for that metric is generated.

While time-series data for standard infrastructure-level metrics is available out of the box in monitoring systems, custom checks, which usually involve some scripting, could be implemented to generate custom metrics.

Figure 1.6 – Active check/pull model

A check can be active or passive. An active check is initiated by the monitoring backend to collect metrics values and up/down status info, with or without the help of its agents. This is also called the pull method of data collection.

Figure 1.7 – Passive check/push model

A passive check reports such data to the monitoring backend, typically with its own agents or some custom script. This is also called the push method of data collection.

The active/passive or pull/push model of data collection is standard across all monitoring systems. The method would depend on the type of metrics a monitoring system collects. You will see in later chapters that Datadog supports both methods.
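
The difference between the two models comes down to which side initiates the data transfer. The toy sketch below keeps everything in one process to make that visible; `MonitoringBackend`, its methods, and the metric names are hypothetical and do not correspond to any real tool's interface.

```python
from typing import Callable, Dict, List

class MonitoringBackend:
    """Toy backend that stores metric values by name."""

    def __init__(self) -> None:
        self.store: Dict[str, List[float]] = {}

    def _record(self, name: str, value: float) -> None:
        self.store.setdefault(name, []).append(value)

    def pull(self, name: str, probe: Callable[[], float]) -> None:
        """Active check: the backend initiates the measurement."""
        self._record(name, probe())

    def receive(self, name: str, value: float) -> None:
        """Passive check: the monitored side reports on its own schedule."""
        self._record(name, value)

backend = MonitoringBackend()
backend.pull("web.response_ms", lambda: 120.0)    # the backend asks for a value
backend.receive("batch.rows_processed", 5000.0)   # a job reports its result
```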

Threshold

A threshold is a fixed value in the range of values possible for a metric. For example, on a root partition with a total disk space of 8 GB, the available disk space could be anything from 0 GB to 8 GB. A threshold in this specific case could be 1 GB, which could be set as an indication of low storage on the root partition.

There could be multiple thresholds defined too. In this specific example, 1 GB could be a warning threshold and 500 MB could be a critical or high-severity threshold.
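
The two-threshold example can be sketched as a small classification function. The 1 GB and 500 MB values come from the text; the `Status` names and the choice to treat a value equal to a threshold as a breach are assumptions made for illustration.

```python
from enum import Enum

class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"

GB = 1024 ** 3
MB = 1024 ** 2

def classify_free_space(free_bytes: int,
                        warning_bytes: int = 1 * GB,
                        critical_bytes: int = 500 * MB) -> Status:
    """Compare available disk space against the two thresholds.

    Lower is worse for this metric, so the critical threshold is checked first.
    """
    if free_bytes <= critical_bytes:
        return Status.CRITICAL
    if free_bytes <= warning_bytes:
        return Status.WARNING
    return Status.OK
```

For a metric where higher is worse, such as CPU utilization, the comparisons would be reversed.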

Monitor

A monitor looks at the time-series data generated for a metric and alerts the alert recipients if the values cross the thresholds. A monitor can also be set up for the up/down status, in which case it alerts the alert recipients if the related resource is down.

Alert

An alert is produced by a monitor when the associated check determines that the metric value crosses a threshold set on the monitor. A monitor could be configured to notify the alert recipients of the alerts.
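
The relationship between a monitor, an alert, and the notification of recipients can be sketched as follows. This is a simplified illustration: the `Monitor` and `Alert` classes are hypothetical, and raising an alert only when the metric first crosses the threshold (rather than on every breaching data point) is a design assumption, not a description of how any particular tool behaves.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Alert:
    monitor: str
    value: float
    message: str

class Monitor:
    """Raises an alert when a metric value drops below a threshold."""

    def __init__(self, name: str, threshold: float,
                 notify: Optional[Callable[[Alert], None]] = None) -> None:
        self.name = name
        self.threshold = threshold
        self.notify = notify
        self.alerting = False  # current state of the monitor

    def evaluate(self, value: float) -> Optional[Alert]:
        """Process one data point; alert only on the OK-to-alerting transition."""
        breached = value < self.threshold
        alert = None
        if breached and not self.alerting:
            alert = Alert(self.name, value,
                          f"{self.name}: {value} is below {self.threshold}")
            if self.notify:
                self.notify(alert)
        self.alerting = breached
        return alert

# Feed a short time series (free space in GB) through the monitor.
sent: List[Alert] = []
monitor = Monitor("root_free_gb", threshold=1.0, notify=sent.append)
for point in [3.2, 1.4, 0.9, 0.7, 2.5, 0.8]:
    monitor.evaluate(point)
```

In this run the monitor alerts at 0.9, stays quiet at 0.7 because it is already alerting, recovers at 2.5, and alerts again at 0.8.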

Alert recipient

The alert recipient is a user in the organization who signs up to receive alerts sent out by a monitor. An alert could be received by the recipient through one or more communication channels, such as email, SMS, and Slack.

Severity level

Alerts are classified by the seriousness of the issue that they unearth about the software system and that is set by the appropriate severity level. The response to an alert is tied to the severity level of the alert.

A sample set of severity levels could consist of Informational, Warning, and Critical. For instance, in our example of available disk space on the root partition, the monitor could be configured to alert as a warning when 30% of the disk space remains available and as critical when only 20% remains.

As you can see, setting up severity levels for an increasing level of seriousness would provide the opportunity to catch issues and take mitigative actions ahead of time, which is the main objective of proactive monitoring. Note that this is possible only in situations where a system component degrades over a period of time.

A monitor that tracks an up/down status cannot provide any advance warning, so the only mitigating action is to bring the related service back up as soon as possible. However, in a real-life scenario, there should be multiple monitors so that at least one of them can catch an underlying issue ahead of time. For example, having no disk space on the root partition can stop some services, and monitoring the available space on the root partition would help prevent those services from going down.

Notification

A message sent out as part of an alert specific to a communication platform such as email is called a notification. There is a subtle difference between an alert and a notification, but at times they are considered the same. An alert is a state within the monitoring system that can trigger multiple actions such as sending out notifications and updating monitoring dashboards with that status.

Traditionally, email distribution groups used to be the main communication method used to send out notifications. Currently, there are much more sophisticated options, such as chat and texting, available out of the box on most monitoring platforms. Also, escalation tools such as PagerDuty could be integrated with modern monitoring tools such as Datadog to route notifications based on severity.
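
Routing notifications by severity can be pictured as a lookup from severity level to communication channels. The mapping below is purely illustrative: the channel names and the fallback to a dashboard-only route are assumptions, and a real escalation tool would be configured through its own interface.

```python
from typing import Dict, List

# Hypothetical routing table: severity level -> notification channels.
ROUTES: Dict[str, List[str]] = {
    "informational": ["dashboard"],
    "warning": ["dashboard", "chat"],
    "critical": ["dashboard", "chat", "pager"],
}

def channels_for(severity: str) -> List[str]:
    """Return the channels an alert of this severity should notify."""
    return ROUTES.get(severity.lower(), ["dashboard"])
```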

Downtime

The downtime of a monitor is a time window during which alerting is disabled on that monitor. Usually, this is done for a temporary period while some change to the underlying infrastructure or software component is rolled out, during which time monitoring on that component is irrelevant. For example, a monitor that tracks the available space on a disk drive will be impacted while the maintenance task to increase the storage size is ongoing.

Monitoring platforms such as Datadog support this feature. The practical use of this feature is to avoid receiving notifications from the impacted monitors. By integrating a CI/CD pipeline with the monitoring application, the downtimes could be scheduled automatically as a prerequisite for deployments.
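
The effect of a downtime on notifications can be sketched as a time-window check. The function names and the window representation below are assumptions for illustration; they are not taken from any monitoring platform's API.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end), with end exclusive

def in_downtime(now: datetime, windows: List[Window]) -> bool:
    """Return True if `now` falls inside any scheduled downtime window."""
    return any(start <= now < end for start, end in windows)

def should_notify(alerting: bool, now: datetime, windows: List[Window]) -> bool:
    """Suppress notifications while a downtime is active."""
    return alerting and not in_downtime(now, windows)

# A one-hour maintenance window, as a deployment pipeline might schedule it.
start = datetime(2021, 6, 1, 2, 0, tzinfo=timezone.utc)
windows = [(start, start + timedelta(hours=1))]
```

The monitor keeps evaluating data during the window; only the notification step is skipped.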

Event

An event published by a monitoring system usually provides details of a change that happened in the software system. Some examples are the following:

  • The restart of a process
  • The deployment or shutting down of a microservice due to a change in user traffic
  • The addition of a new host to the infrastructure
  • A user logging into a sensitive resource

Note that none of these events demands immediate action; they are informational. That's how an event differs from an alert. A critical alert is actionable, whereas an event has no severity level attached to it and so is not actionable. Events are recorded in the monitoring system and provide valuable information when triaging an issue.

Incident

When a product feature is not available to users, it is called an incident. An incident usually occurs when there is an outage in the infrastructure, whether hardware or software, but not always. It could also happen due to external network or internet-related access issues, though such issues are uncommon.

The process of handling incidents and mitigating them is an area by itself and not generally considered part of core monitoring. However, monitoring and incident management are intertwined for these reasons:

  • Without comprehensive monitoring, incidents are far more likely because there is no opportunity to mitigate an issue before it causes an outage.
  • And of course, action items from a Root Cause Analysis (RCA) of an incident would have tasks to implement more monitors, a typical reactive strategy (or the lack thereof) that must be avoided.

On call

The critical alerts sent out by monitors are responded to by an on-call team. Though the actual requirements can vary based on the Service-Level Agreement (SLA) requirements of the application being monitored, on-call teams are usually available 24x7.

In a mature service engineering organization, three levels of support would be available, where an issue is escalated from L1 to L3:

  • The L1 support team consists of product support staff who are knowledgeable about the applications and can respond to issues using runbooks.
  • The L2 support team consists of Site Reliability Engineers (SREs) who might also rely on runbooks, but they are also capable of triaging and fixing the infrastructure and software components.
  • The L3 support team would consist of the DevOps and software engineers who designed and built the infrastructure and software system in production. Usually, this team gets involved only to triage issues that are not known.

Runbook

A runbook provides steps for on-call support personnel to respond to an alert notification. The steps might not always provide a resolution to the reported issue; they could be as simple as escalating the issue to an engineering point of contact for investigation.

 

Types of monitoring

There are different types of monitoring. All types of monitoring that are relevant to a software system must be implemented to make it a comprehensive solution. Another aspect to consider is the business need of rolling out a certain type of monitoring. For example, if customers of a software service insist on securing the application they subscribe to, the software provider has to roll out security monitoring.

(The discussion on types of monitoring in this section originally appeared in the article Proactive Monitoring, published by me on DevOps.com.)

Figure 1.8 – Types of monitoring

Let us now explore these types of monitoring in detail.

Infrastructure monitoring

The infrastructure that runs the application system is made up of multiple components: servers, storage devices, a load balancer, and so on. Checking the health of these devices is the most basic requirement of monitoring. The popular monitoring platforms support this feature out of the box. Very little customization is required except for setting up the right thresholds on those metrics for alerting.

Platform monitoring

An application system is usually built using multiple third-party tools such as the following:

  • Databases, both RDBMS (MySQL, Postgres) and NoSQL (MongoDB, Couchbase, Cassandra) data repositories
  • Full-text search engines (Elasticsearch)
  • Big data platforms (Hadoop, Spark)
  • Messaging systems (RabbitMQ)
  • Memory object caching systems (Memcached, Redis)
  • BI and reporting tools (MicroStrategy, Tableau)

Checking the health of these application components is important too. Most of these tools provide an interface, mainly a REST API, that can be leveraged to implement plugins on the main monitoring platform.

Application monitoring

Having a healthy infrastructure and platform is not good enough for an application to function correctly. Defective code from a recent deployment, third-party component issues, or incompatible changes with external systems can cause application failures. Application-level checks can be implemented to detect such issues. A functional or integration test would unearth such issues in a testing/staging environment, and an equivalent check should also be implemented in the production environment.

The implementation of application-level monitoring could be simplified by building hooks or API endpoints in the application. In general, improving the observability of the application is the key.

Monitoring is usually an afterthought, and the need for such instrumentation is overlooked during the design phase of an application. The participation of the DevOps team in design reviews improves the operability of a system. Planning for application-level monitoring in production is one area where DevOps can provide inputs.

Business monitoring

The application system runs in production to meet certain business goals. You could have an application that runs flawlessly on a healthy infrastructure but still, the business might not be meeting its goals. It is important to provide that feedback to the business at the earliest opportunity to take corrective actions that might trigger enhancements of the application features and/or the way the business is run using the application.

These efforts should only complement the more complex BI-based data analysis methods that could provide deeper insights into the state of the business. Business-level monitoring can be based on transactional data readily available in the data repositories and the data aggregates generated by BI systems.

Both application- and business-level monitoring are company-specific, and plugins have to be developed for such monitoring requirements. Implementing a framework to access standard sources of information such as databases and REST APIs from the monitoring platform could minimize the requirement of building plugins from scratch every time.

Last-mile monitoring

A monitoring platform deployed in the same public cloud or data center environment as where the applications run cannot check the end user experience. To address that gap, there are several SaaS products available on the market, such as Catchpoint and Apica. These services are backed up by actual infrastructure to monitor the applications in specific geographical locations. For example, if you are keen on knowing how your mobile app performs on iPhones in Chicago, that could be tracked using the service provider's testing infrastructure in Chicago.

Alerts are set up on these tools to notify the site reliability engineering team if the application is not accessible externally or if there are performance issues with the application.

Log aggregation

In a production environment, a huge amount of information is logged in various log files by the operating system, platform components, and applications. These logs get some attention when issues happen and are normally ignored otherwise. Traditional monitoring tools such as Nagios couldn't handle the constantly changing log files except for alerting on some patterns.

The advent of log aggregation tools such as Splunk changed that situation. Using aggregated and indexed logs, it is possible to detect issues that would have gone unnoticed earlier. Alerts can be set up based on the info available in the indexed log data. For example, Splunk provides a custom query language to search indexes for operational insights. Using APIs provided by these tools, the alerting can actually be integrated with the main monitoring platform.

To leverage the aggregation and indexing capabilities of these tools, structured data outputs can be generated by the application or scripts that will be indexed by the log aggregation tool later.
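
One common way for an application to produce structured output is to emit one JSON object per log line. The sketch below uses Python's standard `logging` and `json` modules; the field names and the `build_logger` helper are illustrative assumptions, and a given log aggregation tool may expect different keys.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to the standard error stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Example usage: build_logger("orders").info("order %s shipped", 1042)
```

Named fields such as the level and logger can then be indexed and queried individually rather than being extracted from free text.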

Meta-monitoring

It is important to make sure that the monitoring infrastructure itself is up and running. Disabling alerting during deployment and forgetting about enabling it later is one of the common oversights that has been seen in operations. Such missteps are hard to monitor and only improvements in the deployment process can address such issues.

Let's look at a couple of popular methods used in meta-monitoring:

Pinging hosts

If there are multiple instances of the monitoring application running, or if there is a standby node, then cross-checks can be implemented to verify the availability of hosts used for monitoring. In AWS, CloudWatch can be used to monitor the availability of an EC2 node.

Health-check for monitoring

Checking on the availability of monitoring UI and activity in a monitoring tool's log files would ensure that the monitoring system itself is fully functional and it continues to watch for issues in the production environment. If a log aggregation tool is used, tracking the application's log files would be the most effective method to check whether there is any activity in the log file. The same index can also be queried for any potential issues by using standard keywords such as Error and Exception.
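Where no log aggregation tool is in place, the same two checks can be approximated directly against the log file. The sketch below assumes a plain-text log and a five-minute freshness threshold; the function names and the default threshold are illustrative choices.

```python
import os
import time


def log_is_active(path: str, max_age_seconds: float = 300.0) -> bool:
    """True if the log file was modified within the last max_age_seconds."""
    return (time.time() - os.path.getmtime(path)) <= max_age_seconds


def find_problem_lines(path: str, keywords=("Error", "Exception")) -> list[str]:
    """Return the log lines that contain any of the given keywords."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if any(word in line for word in keywords)
        ]
```

A stale modification time suggests that the monitoring tool has stopped doing work, while the keyword scan surfaces problems the tool is reporting about itself.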

Noncore monitoring

The monitoring types that have been discussed thus far make up the components of a core monitoring solution. You will see most of these monitoring categories in a comprehensive solution. There are more monitoring types that are highly specialized and that would be important components in a specific business situation.

Security monitoring

Security monitoring is a vast area by itself and there are specialized tools available for that, such as SIEM tools. However, that is slowly changing and general-purpose monitoring tools including Datadog have started offering security features to be more competitive in the market. Security monitoring usually covers these aspects:

  • The vulnerability of the application system, including infrastructure, due to changes made to its state
  • The vulnerability of infrastructure components with respect to known issues
  • Monitoring attacks and catching security breaches

As you can see, these objectives might not strictly be covered by the core monitoring concepts we have discussed thus far. We'll have to bring in a new set of terminology and concepts to understand them better, and we will look at those details later in the book.

Application Performance Monitoring (APM)

As the name suggests, APM helps to fine-tune the application's performance. This is made possible by the improved observability of the application system in which the interoperability of various components is made more visible. Though these monitoring tools started off as dedicated APM solutions, full-stack monitoring is available these days so they can be used for general-purpose monitoring.

 

Overview of monitoring tools

In this section, you will obtain a good understanding of all the popular monitoring tools available on the market that will help you to evaluate Datadog better.

There are lots of monitoring tools available on the market, ranging from open source and freeware products to licensed and cloud-based ones. While many tools, such as Datadog, are general-purpose applications that cover the various monitoring types we discussed earlier, some tools, such as Splunk and AppDynamics, address very specialized monitoring problems.

One challenge a DevOps architect would encounter when planning a monitoring solution is to evaluate the available tools for rolling out a proactive monitoring solution. In that respect, as we will see in this book, Datadog stands out as one of the best general-purpose monitoring tools as it supports the core monitoring features and also provides some non-core features such as security monitoring.

To bring some structure to the large and varied collection of monitoring tools available on the market, they are classified into three broad categories on the basis of where they actually run. Some of these applications are offered both on-premises and as a SaaS solution.

We will briefly look at what other monitoring applications are available on the market besides Datadog. Some of these applications are competing with Datadog and the rest could be complementary solutions to complete the stack of tools needed for rolling out proactive monitoring.

On-premises tools

This group of monitoring applications has to be deployed on your infrastructure to run alongside the application system. Some of these tools might also be available as a SaaS offering, and that will be mentioned where needed.

The objective here is to introduce the landscape of the monitoring ecosystem to newcomers to the area and show how varied it is.

Nagios

Nagios is a popular, first-generation monitoring application that is well known for monitoring systems and network infrastructure. Nagios is general-purpose, open source software that has both free and licensed versions. It is highly flexible software that could be extended using hundreds of plugins available widely. Also, writing plugins and deploying them to meet custom monitoring requirements is relatively easy.

Zabbix

Zabbix is another popular, first-generation monitoring application that is open source and free. It's a general-purpose monitoring application like Nagios.

TICK Stack

TICK stands for Telegraf, InfluxDB, Chronograf, and Kapacitor. These open source software components make up a highly distributed monitoring application stack and it is one of the popular new-generation monitoring platforms. While first-generation monitoring tools are basically monolithic software, new-generation platforms are divided into components that make them flexible and highly scalable. The core components of the TICK Stack perform these tasks:

  • Telegraf: Generates metrics time-series data.
  • InfluxDB: Stores time-series monitoring data for it to be consumed in various ways.
  • Chronograf: Provides a UI for metrics time-series data.
  • Kapacitor: Sets monitors on metrics time-series data.

Prometheus

Prometheus is a popular, new-generation, open source monitoring tool that collects metrics values by scraping the target systems. In other words, it relies on collecting data using active checks, or the pull method, as we discussed earlier. Prometheus-based monitoring has the following components:

  • The Prometheus server scrapes and stores time-series monitoring data.
  • Alertmanager handles alerts and integrates with other communication platforms, especially escalation tools such as PagerDuty and OpsGenie.
  • Node exporter is an agent that queries the operating system for a variety of metrics and exposes them over HTTP for other services to consume.
  • Grafana is not part of the Prometheus suite of tools specifically, but it is the most popular data visualization tool used along with Prometheus.

The ELK Stack

The ELK Stack is one of the most popular log aggregation and indexing systems currently in use. ELK stands for Elasticsearch, Logstash, and Kibana. Each component performs the following task in the stack:

  • Elasticsearch: It is the search and analytics engine.
  • Logstash: It collects and parses the logs and forwards them to Elasticsearch for indexing.
  • Kibana: It is the UI visualization tool that users use to interact with the stack.

The ELK Stack components are open source software and free versions are available. SaaS versions of the stack are also available from multiple vendors as a licensed software service.

Splunk

Splunk is pioneering licensed software with a large install base in the log aggregation category of monitoring applications.

Zenoss

Zenoss is a first-generation monitoring application like Nagios and Zabbix.

Cacti

Cacti is a first-generation monitoring tool primarily known for network monitoring. Its features include automatic network discovery and network map drawing.

Sensu

Sensu is a modern monitoring platform that recognizes the dynamic nature of infrastructure at various levels. Using Sensu, the monitoring requirements can be implemented as code. The latter feature makes it stand out in a market with a large number of competing monitoring products.

Sysdig

The Sysdig platform offers standard monitoring features available with a modern monitoring system. Its focus on microservices and security makes it an important product to consider.

AppDynamics

AppDynamics is primarily known as an Application Performance Monitoring (APM) platform, although its current version covers standard monitoring features as well. However, tools like this are usually an add-on to a more general-purpose monitoring platform.

SaaS solutions

Most new-generation monitoring tools such as Datadog are primarily offered as monitoring services in the cloud. What this means is that the backend of the monitoring solution is hosted on the cloud, and yet, its agent service must run on-premises to collect metrics data and ship that to the backend. Some tools are available both on-premises and as a cloud service.
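The division of labor between an on-premises agent and a hosted backend can be sketched roughly as follows. This is a generic illustration and not how the Datadog agent is implemented. The `MetricBuffer` class, the payload layout, and the `send` callback are all hypothetical.

```python
import json
import time
from typing import Callable


class MetricBuffer:
    """Collects samples locally and ships them to a backend in batches."""

    def __init__(self, send: Callable[[bytes], None], batch_size: int = 3):
        # `send` stands in for whatever transport delivers data to the backend.
        self._send = send
        self._batch_size = batch_size
        self._pending: list[dict] = []

    def record(self, name: str, value: float, host: str) -> None:
        """Queue one sample and flush once the batch is full."""
        self._pending.append(
            {"metric": name, "value": value, "host": host, "ts": int(time.time())}
        )
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Serialize all queued samples into one payload and ship it."""
        if not self._pending:
            return
        payload = json.dumps({"series": self._pending}).encode("utf-8")
        self._send(payload)
        self._pending = []
```

In a real agent, `send` would typically be an authenticated HTTPS request to the vendor's intake endpoint. Batching keeps the number of outbound requests small, which matters when thousands of hosts report to the same backend.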

Sumo Logic

Sumo Logic is a SaaS offering primarily for log aggregation and searching. However, with its impressive security-related features, it could also be used as a Security Information and Event Management (SIEM) platform.

New Relic

Though New Relic was initially known primarily as an APM platform, like AppDynamics, it also supports standard monitoring features.

Dynatrace

Dynatrace is also a major player in the APM space, like AppDynamics and New Relic. Besides having the standard APM features, it also positions itself as an AI-driven tool that correlates monitoring events and flags abnormal activities.

Catchpoint

Catchpoint is an end user experience monitoring or last-mile monitoring solution. By design, such a service needs to be third-party provided as the related metrics have to be measured close to where the end users are.

There are several product offerings in this type of monitoring. Apica and Pingdom are other well-known vendors in this space.

Cloud-native tools

Popular public cloud platforms such as AWS, Azure, and GCP offer a plethora of services and monitoring is just one of them. Actually, there are multiple services that could be used for monitoring purposes. For example, AWS offers CloudWatch, which is primarily an infrastructure and platform monitoring service, and there are services such as GuardDuty that provide sophisticated security monitoring options.

Cloud-native monitoring services are yet to be widely used as general-purpose monitoring solutions outside of the related cloud platform even though Google operations and Azure Monitor are full-featured monitoring platforms.

However, when it comes to monitoring a cloud-specific compute, storage, or networking service, a cloud-native monitoring tool might be better suited. In such scenarios, the integration provided by the main monitoring platform can be used to consolidate monitoring in one place.

AWS CloudWatch

AWS CloudWatch provides infrastructure-level monitoring for the cloud services offered on AWS. It could be used as an independent platform to augment the main monitoring system or be integrated with the main monitoring system.

Google operations

This monitoring service available on GCP (formerly known as Stackdriver) is a full-stack, API-based monitoring platform that also provides log aggregation and APM features.

Azure Monitor

Azure Monitor is also a full-stack monitoring platform, like Google operations on GCP.

Enterprise monitoring solutions

Though they don't strictly fall into the category of monitoring tools used for rolling out proactive monitoring, there have been other monitoring solutions used in large enterprises to cover varied requirements such as ITIL compliance. Let's look at some of those for the completeness of this overview:

  • IBM Tivoli Netcool/OMNIbus: An SLM system to monitor large, complex networks and IT domains. It's used in large IBM setups.
  • Oracle Enterprise Manager Grid Control: System management software that delivers centralized monitoring, administration, and life cycle management functionality for the complete Oracle IT infrastructure, including non-Oracle technologies. Commonly found in large Oracle hardware and software setups.
  • HPE OneView: Hewlett Packard Enterprise's integrated IT solution for system management, monitoring, and software-defined infrastructure. Used in big HP, TANDEM, and HPE installations.
 

Summary

In this chapter, you learned why monitoring a software system matters and how it supports operating the business. You were also introduced to various types of monitoring, real-life use cases of monitoring, popular monitoring tools on the market, and, most importantly, the core monitoring concepts and terminology that are used across all the monitoring tools.

While this chapter provided a comprehensive overview of monitoring concepts, tools, and the market, the next chapter will introduce you to Datadog specifically and provide details on its agent, which has to be installed on your infrastructure to start using Datadog.

About the Author
  • Thomas Kurian Theakanath

    Thomas has over 15 years of experience in software development and operations engineering, and is currently focused on Cloud Engineering and DevOps. He has architected, developed, and rolled out automation tools and processes at startups and large companies. He has led and mentored engineering teams to the successful completion of large DevOps and monitoring projects. He contributes regularly to tech blogs on the latest trends and best practices in his areas of expertise.
