Chapter 8. Monitoring and Logging
This chapter will cover the usage and customization of both built-in and third-party monitoring tools on our Kubernetes cluster. We will cover how to use the tools to monitor the health and performance of our cluster. In addition, we will look at built-in logging, the Google Cloud Logging service, and Sysdig.
This chapter will discuss the following topics:
- How Kubernetes uses cAdvisor, Heapster, InfluxDB, and Grafana
- Customizing the default Grafana dashboard
- Using FluentD and Grafana
- Installing and using logging tools
- Working with popular third-party tools, such as StackDriver and Sysdig, to extend our monitoring capabilities
Real-world monitoring goes far beyond checking whether a system is up and running. Health checks, like those you learned about in Chapter 2, Pods, Services, Replication Controllers, and Labels, in the Health checks section, can help us isolate problem applications, but operations teams can best serve the business when they can anticipate issues and mitigate them before a system goes offline.
Best practices in monitoring are to measure the performance and usage of core resources and watch for trends that stray from the normal baseline. Containers are no different here, and a key component of managing our Kubernetes cluster is having a clear view into the performance and availability of the OS, network, system (CPU and memory), and storage resources across all nodes.
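To make the idea of straying from a baseline concrete, here is a minimal Python sketch. It is not part of any Kubernetes tooling: the function name, the sample values, and the three-standard-deviation cutoff are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def stray_from_baseline(baseline, recent, threshold=3.0):
    """Return the recent samples that lie more than `threshold`
    standard deviations away from the mean of the baseline samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]


# Hypothetical CPU utilization percentages for one node.
baseline_cpu = [40, 42, 38, 41, 39, 40, 43, 37]
recent_cpu = [41, 44, 75, 39]

print(stray_from_baseline(baseline_cpu, recent_cpu))  # prints [75]
```

In practice, the baseline would come from a longer history of measurements, and the cutoff would be tuned to how noisy each resource normally is.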
In this chapter, we will examine several options to monitor and measure the performance and availability of all our cluster resources. In addition, we will look at a few options for alerting and notifications...
If you recall from Chapter 1, Introduction to Kubernetes, we noted that our nodes were already running a number of monitoring services. We can see these once again by running the get pods command with the kube-system namespace specified as follows:
$ kubectl get pods --namespace=kube-system
The following screenshot is the result of the preceding command:
System pod listing
Again, we see a variety of services, but how does this all fit together? If you recall the Node (formerly minions) section from Chapter 2, Pods, Services, Replication Controllers, and Labels, each node is running a kubelet. The kubelet is the main interface for nodes to interact with and update the API server. One such update is the metrics of the node resources. The actual reporting of the resource usage is performed by a program named cAdvisor.
cAdvisor is another open-source project from Google, which provides various metrics on container resource use. Metrics include CPU, memory, and network statistics. There...
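As an illustration of how such metrics are typically consumed, the following Python sketch turns two readings of a cumulative CPU counter into an average usage figure. It assumes that CPU usage is reported as an ever-increasing total in nanoseconds, as we understand cAdvisor to do. The function name and the sample readings are hypothetical.

```python
def cpu_cores_used(prev_usage_ns, curr_usage_ns, prev_time_s, curr_time_s):
    """Average number of CPU cores in use between two samples of a
    cumulative CPU usage counter (assumed to be in nanoseconds)."""
    elapsed_ns = (curr_time_s - prev_time_s) * 1e9
    if elapsed_ns <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr_usage_ns - prev_usage_ns) / elapsed_ns


# Hypothetical readings taken 10 seconds apart: the container consumed
# 3 seconds of CPU time in that window, so it averaged 0.3 cores.
cores = cpu_cores_used(5_000_000_000, 8_000_000_000, 100.0, 110.0)
print(round(cores, 2))  # prints 0.3
```

Dashboards generally apply the same kind of rate calculation before plotting a cumulative counter, since the raw total only ever grows.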
FluentD and Google Cloud Logging
Looking back at the System pod listing screenshot at the beginning of the chapter, you may have noted a number of pods starting with the words fluentd-cloud-logging-kubernetes.... These pods appear when using the GCE provider for your K8s cluster. A pod like this exists on every node in our cluster and its sole purpose is to handle the processing of Kubernetes logs.
If we log in to our Google Cloud Platform account, we can see some of the logs processed there. In the left-side menu, under Stackdriver, select Logging. This will take us to a log listing page with a number of drop-down menus at the top. If this is your first time visiting the page, the first dropdown will likely be set to Cloud HTTP Load Balancer.
In this drop-down menu, we'll see a number of GCE entry types. Select GCE VM Instances and then the Kubernetes master or one of the nodes. In the second dropdown, we can choose various log groups, including kubelet.
We can also filter by the event...
Maturing our monitoring operations
While Grafana gives us a great start to monitor our container operations, it is still a work in progress. In the real world of operations, having a complete dashboard view is great once we know there is a problem. However, in everyday scenarios, we'd prefer to be proactive and actually receive notifications when issues arise. This kind of alerting capability is a must to keep the operations team ahead of the curve and out of reactive mode.
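The difference between a dashboard and an alert can be sketched in a few lines of Python. This is only an illustration of the idea and not how any of the tools discussed here implement alerting. The function name, the threshold, and the requirement of three consecutive breaches are assumptions.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """Return True once `min_consecutive` samples in a row exceed
    `threshold`, so that a single short spike does not page anyone."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it otherwise.
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical memory utilization percentages, one sample per minute.
print(should_alert([70, 95, 60, 92, 93, 96], threshold=90))  # prints True
print(should_alert([70, 95, 60, 92, 93, 80], threshold=90))  # prints False
```

Requiring a sustained breach is a common way to trade a little detection delay for fewer false alarms.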
There are many solutions available in this space, and we will take a look at two in particular—GCE monitoring (StackDriver) and Sysdig.
StackDriver is a great place to start for infrastructure in the public cloud. It is actually owned by Google, so it's integrated as the Google Cloud Platform monitoring service. Before your lock-in alarm bells start ringing, StackDriver also has solid integration with AWS. In addition, StackDriver has alerting capability with support for notification to a variety of platforms...
We took a quick look at monitoring and logging with Kubernetes. You should now be familiar with how Kubernetes uses cAdvisor and Heapster to collect metrics on all the resources in a given cluster. Furthermore, we saw how Kubernetes saves us time by providing InfluxDB and Grafana set up and configured out of the box. Dashboards are easily customizable for our everyday operational needs.
In addition, we looked at the built-in logging capabilities with FluentD and the Google Cloud Logging service. Here too, Kubernetes saves us considerable time by setting up the basics for us.
Finally, you learned about the various third-party options available to monitor our containers and clusters. Using these tools will allow us to gain even more insight into the health and status of our applications. All these tools combine to give us a solid toolset to manage day-to-day operations.
In the next chapter, we will explore the new cluster federation capabilities. Still mostly in beta, this functionality...