Collecting and Querying Metrics and Sending Alerts

Insufficient facts always invite danger.

- Spock

So far, we explored how to leverage some of Kubernetes' core features. We used HorizontalPodAutoscaler and Cluster Autoscaler. While the former relies on Metrics Server, the latter is not based on metrics, but on the Scheduler's inability to place Pods within the existing cluster capacity. Even though Metrics Server does provide some basic metrics, we are in desperate need of more.

We have to be able to monitor our cluster, and Metrics Server is just not enough. It contains a limited set of metrics, it keeps them for a very short period, and it does not allow us to execute anything but the simplest queries. I can't say that we are blind if we rely only on Metrics Server, but we are severely impaired. Without increasing the number of metrics we're collecting, as well...

Creating a cluster

We'll continue using definitions from the vfarcic/k8s-specs (https://github.com/vfarcic/k8s-specs) repository. To be on the safe side, we'll pull the latest version first.

All the commands from this chapter are available in the 03-monitor.sh (https://gist.github.com/vfarcic/718886797a247f2f9ad4002f17e9ebd9) Gist.
    cd k8s-specs

    git pull

In this chapter, we'll need a few things that were not required before, even though you have probably already used them.

We'll start using UIs, so we'll need NGINX Ingress Controller to route traffic from outside the cluster. We'll also need the environment variable LB_IP holding the IP through which we can access the worker nodes. We'll use it to configure a few Ingress resources.
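How you obtain that IP depends on where your cluster is running, so the snippet that follows is only a minimal sketch; the [...] placeholder is an assumption you need to replace with the address of your external load balancer or of one of the worker nodes.

    # Set LB_IP to the address through which the cluster can be reached.
    # Replace [...] with the IP appropriate for your platform
    # (for example, the external IP of a load balancer or of a worker node).
    export LB_IP=[...]

    # Confirm that the variable is set before using it in Ingress definitions.
    echo $LB_IP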

The Gists used to test the examples in this chapter are below. Please use them as they are, or as inspiration...

Choosing the tools for storing and querying metrics and alerting

HorizontalPodAutoscaler (HPA) and Cluster Autoscaler (CA) provide essential, yet very rudimentary mechanisms to scale our Pods and clusters.

While they handle scaling decently well, they do not solve our need to be alerted when something is wrong, nor do they provide enough information to find the cause of an issue. We'll need to expand our setup with additional tools that will allow us to store and query metrics, as well as to receive notifications when there is an issue.

If we focus on tools that we can install and manage ourselves, there is very little doubt about what to use. If we look at the list of Cloud Native Computing Foundation (CNCF) projects (https://www.cncf.io/projects/), only two have graduated so far (October 2018). Those are Kubernetes and Prometheus (https://prometheus.io/). Given...

A quick introduction to Prometheus and Alertmanager

We'll continue the trend of using Helm as the installation mechanism. Prometheus' Helm Chart is maintained as one of the official Charts. You can find more info in the project's README (https://github.com/helm/charts/tree/master/stable/prometheus). If you focus on the variables in the Configuration section (https://github.com/helm/charts/tree/master/stable/prometheus#configuration), you'll notice that there are quite a few things we can tweak. We won't go through all the variables. You can check the official documentation for that. Instead, we'll start with a basic setup, and extend it as our needs increase.

Let's take a look at the variables we'll use as a start.

    cat mon/prom-values-bare.yml

The output is as follows.

server:
  ingress:
    enabled: true
    annotations:
      ingress...
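Once we're happy with the values, installing the Chart is a single Helm command. The one below is only a sketch, assuming Helm 2 with the stable repository; the release name prometheus and the metrics Namespace are illustrative assumptions, not necessarily the ones used later in the chapter.

    # Install (or upgrade) Prometheus using the values we just explored.
    helm upgrade -i prometheus \
        stable/prometheus \
        --namespace metrics \
        --values mon/prom-values-bare.yml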

Which metric types should we use?

If this is the first time you're using Prometheus hooked into metrics from the Kube API, the sheer number of them might be overwhelming. On top of that, consider that the configuration excluded many of the metrics offered by the Kube API, and that we could extend the scope even further with additional exporters.

While every situation is different and you are likely to need some metrics specific to your organization and architecture, there are some guidelines that we should follow. In this section, we'll discuss the key metrics. Once you understand them through a few examples, you should be able to extend their use to your specific use-cases.

The four key metrics everyone should utilize are latency, traffic, errors, and saturation.

Those four metrics have been championed by Google Site Reliability Engineers (SREs) as the most fundamental metrics for tracking...
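To make those signals less abstract, the expressions below sketch how two of them (errors and latency) might be queried in Prometheus. They are illustrations only, assuming a service instrumented with an http_requests_total counter (with a code label) and an http_request_duration_seconds histogram; those metric names are assumptions, not something we configured earlier.

    # Errors: the ratio of 5xx responses to all requests over the last five minutes.
    sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))

    # Latency: the 95th percentile of request durations over the last five minutes.
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le))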

Alerting on unschedulable or failed pods

Knowing whether our applications are having trouble responding to requests fast, whether they are being bombarded with more requests than they can handle, whether they produce too many errors, and whether they are saturated, is of no use if they are not even running. Even if our alerts detect that something is wrong by notifying us that there are too many errors or that response times are slow due to an insufficient number of replicas, we should still be informed if, for example, one, or even all, replicas failed to run. In the best-case scenario, such a notification would provide additional info about the cause of an issue. In a much worse situation, we might find out that one of the replicas of the DB is not running. That would not necessarily slow it down, nor would it produce any errors, but would put us in a situation where data could...
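As a rough sketch of what such alerts could be based on, the expressions below assume that kube-state-metrics is being scraped by Prometheus; the thresholds and label filters are assumptions rather than the exact definitions we'll end up with.

    # Fire when the Scheduler cannot place one or more Pods.
    sum(kube_pod_status_unschedulable) > 0

    # Fire when one or more Pods are in the Failed phase.
    sum(kube_pod_status_phase{phase="Failed"}) > 0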

Upgrading old Pods

Our primary goal should be to prevent issues from happening by being proactive. In cases when we cannot predict that a problem is about to materialize, we must, at least, be quick with our reactive actions that mitigate the issues after they occur. Still, there is a third category that can only loosely be characterized as being proactive. We should keep our system clean and up-to-date.

Among the many things we could do to keep the system up-to-date is making sure that our software is relatively recent (patched, updated, and so on). A reasonable rule could be to try to renew software after ninety days, if not earlier. That does not mean that everything we run in our cluster should be newer than ninety days, but it might be a good starting point. Further on, we might create finer policies that would allow some kinds of applications (usually third-party) to...
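A sketch of how such a rule might be expressed follows. It assumes that kube-state-metrics exposes kube_pod_start_time (a Unix timestamp in seconds) and that Pods in the kube-system Namespace are excluded; both are assumptions you may want to adjust.

    # Pods that have been running for more than ninety days.
    (time() - kube_pod_start_time{namespace!="kube-system"})
      > (60 * 60 * 24 * 90)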

Measuring containers' memory and CPU usage

If you are familiar with Kubernetes, you understand the importance of defining resource requests and limits. Since we already explored the kubectl top pods command, you might have set the requested resources to match the current usage, and you might have defined the limits as being above the requests. That approach might work on the first day. But, with time, those numbers will change, and we will not be able to get the full picture through kubectl top pods. We need to know how much memory and CPU containers use when under peak load, and how much when they are under less stress. We should observe those metrics over time, and adjust them periodically.
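As a starting point, the queries below sketch how such usage might be observed, assuming the cAdvisor metrics scraped from the kubelets; note that the container_name and pod_name labels used here are assumptions tied to the Kubernetes version (newer releases rename them to container and pod).

    # Memory currently used by each container (empty names are Pod-level cgroups).
    container_memory_usage_bytes{container_name!=""}

    # CPU used by each container, expressed in cores and averaged over five minutes.
    rate(container_cpu_usage_seconds_total{container_name!=""}[5m])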

Even if we do somehow manage to guess how much memory and CPU a container needs, those numbers might change from one release to another. Maybe we introduced a feature that requires more memory or...

Comparing actual resource usage with defined requests

If we define container resources inside a Pod without relying on actual usage, we are just guessing how much memory and CPU we expect a container to use. I'm sure that you already know why guessing, in the software industry, is a terrible idea, so I'll focus on Kubernetes aspects only.

Kubernetes treats Pods with containers that do not have specified resources as BestEffort Quality of Service (QoS). As a result, if it ever runs out of memory or CPU to serve all the Pods, those are the first to be forcefully removed to leave space for others. If such Pods are short-lived, as, for example, those used as one-shot agents for continuous delivery processes, BestEffort QoS is not a bad idea. But, when our applications are long-lived, BestEffort QoS should be unacceptable. That means that in most cases, we do have...
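A sketch of such a comparison follows. It assumes the kube-state-metrics naming of that era (kube_pod_container_resource_requests_memory_bytes; newer versions use kube_pod_container_resource_requests with a resource label) and cAdvisor's pod_name label, so treat the metric and label names as assumptions.

    # Ratio of actual memory usage to requested memory, per Pod.
    # Values well above 1 suggest that requests are set too low.
    sum(label_join(
        container_memory_usage_bytes{container_name!=""},
        "pod", ",", "pod_name")) by (pod)
      /
    sum(kube_pod_container_resource_requests_memory_bytes) by (pod)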

Comparing actual resource usage with defined limits

Knowing when a container uses too much or too few resources compared to requests helps us be more precise with resource definitions and, ultimately, helps Kubernetes make better decisions about where to schedule our Pods. In most cases, having too big a discrepancy between requested and actual resource usage will not result in malfunctioning. Instead, it is more likely to result in an unbalanced distribution of Pods, or in having more nodes than we need. Limits, on the other hand, are a different story.

If the resource usage of the containers wrapped in our Pods reaches the specified limits, Kubernetes might kill those containers if there's not enough memory for all. It does that as a way to protect the integrity of the rest of the system. Killed Pods are not a permanent problem since Kubernetes will almost immediately reschedule them...
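A similar sketch can compare usage against limits; as before, the metric and label names (kube_pod_container_resource_limits_memory_bytes, pod_name) are assumptions that depend on the kube-state-metrics and Kubernetes versions in use.

    # Memory usage as a fraction of the defined limit, per Pod.
    # Values approaching 1 indicate containers at risk of being OOM-killed.
    sum(label_join(
        container_memory_usage_bytes{container_name!=""},
        "pod", ",", "pod_name")) by (pod)
      /
    sum(kube_pod_container_resource_limits_memory_bytes) by (pod)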

What now?

We explored quite a few Prometheus metrics, expressions, and alerts. We saw how to connect Prometheus alerts with Alertmanager and, from there, to forward them from one application to another.

What we did so far is only the tip of the iceberg. It would take too much time (and space) to explore all the metrics and expressions we might use. Nevertheless, I believe that now you know some of the more useful ones and that you'll be able to extend them with those specific to you.

I urge you to send me expressions and alerts you find useful. You know where to find me (DevOps20 (http://slack.devops20toolkit.com/) Slack, viktor@farcic email, @vfarcic on Twitter, and so on).

For now, I'll leave you to decide whether to move straight into the next chapter, to destroy the entire cluster, or only to remove the resources we installed. If you choose the latter, please use the...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
The DevOps 2.5 Toolkit
Published in: Nov 2019Publisher: PacktISBN-13: 9781838647513
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Author (1)

author image
Viktor Farcic

Viktor Farcic is a senior consultant at CloudBees, a member of the Docker Captains group, and an author. He codes using a plethora of languages starting with Pascal (yes, he is old), Basic (before it got the Visual prefix), ASP (before it got the .NET suffix), C, C++, Perl, Python, ASP.NET, Visual Basic, C#, JavaScript, Java, Scala, and so on. He never worked with Fortran. His current favorite is Go. Viktor's big passions are Microservices, Continuous Deployment, and Test-Driven Development (TDD). He often speaks at community gatherings and conferences. Viktor wrote Test-Driven Java Development by Packt Publishing, and The DevOps 2.0 Toolkit. His random thoughts and tutorials can be found in his blog—Technology Conversations
Read more about Viktor Farcic