Observability is a fast-growing new discipline, and many organizations want to adopt it. As you will see throughout this book, implementing observability is a journey that involves multiple teams and practices. Before you embark on this journey, it is important to understand what observability means, why it emerged, and how it can help.
This chapter will be the foundation for all the other chapters in this book. Additionally, we will introduce a fictional company that will be used throughout this book to discuss the concepts.
In a nutshell, the following topics will be covered in this chapter:
- What is observability?
- What was used before observability?
- Issues with traditional monitoring techniques
- Key benefits of observability
What is observability?
A quick Google search will give you definitions of observability in many forms from a variety of authors, vendors, and organizations. Since this book assumes you have a fair understanding of observability, we will not repeat a detailed definition here. However, we will explain a few key concepts that are required for this book. In short, observability is not a tool, a technology, or a strategy but a concept or capability that forces you to think, at the conceptual stage of application development itself, about how you are going to gain insights into the health of your applications and services. It’s a combination of robust architecture and development practices, streamlined data management tools, and standardized processes that support both.
In simplistic terms, many people call observability next-generation monitoring or supercharged monitoring. But it’s fundamentally different in many ways. For starters, monitoring is fully dependent on a set of tools to generate the information required for operating a healthy application, while a highly observable system will itself generate the data that points toward existing or potential problems. For this to be achieved, the system developers and architects have to build observability capabilities into the product as a core function of the application itself, thereby reducing the dependency on external systems or tools for monitoring. This is the ideal scenario; in reality, we still have to depend on external applications to analyze the health of our applications and services. When observability is built into the application, it can reveal a lot more information about it, and as a result both the dependency on external systems or tools and the cost can be reduced significantly.
Observability does not replace any application’s existing monitoring tools, but it standardizes and amalgamates the capabilities of Application Performance Monitoring (APM), log and metrics management tools, and the data that’s generated from applications, and effectively uses the distributed tracing methodology to achieve observability.
The Holy Grail of automation is the ability of the applications or systems to find out their issues and problems and self-heal before the users are impacted. Hence, observability can be considered a stepping stone for self-healing applications.
Throughout this book, we will use an example of a fictional company called MK Tea that supplies varieties of tea across the globe. They collect tea leaves from various locations, get them trucked to their plants, sort the tea by flavor, quality, and grade, and then package and ship it all over the world. This entire process has many moving parts – each location where tea grows has different soil, moisture, and altitude characteristics; tea leaves, once collected, are packed and trucked to the plant by a trucking company; tea leaf sorting happens at the plant, which is an important, tedious, and manual process; tea leaves are crushed into powders of different grain sizes or dried and retained as leaves, packaged by machines, and shipped off to suppliers all over the world based on the demand for various flavors. We will see how observability can help MK Tea manage its overwhelming process, which involves human labor, skilled technicians, and fully automated machines.
What was used before observability?
Observability, as a term, started doing the rounds in technical talks and presentations, with Google’s definition stating “observability is defined as a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” This definition is not new; it was coined by engineer Rudolf E. Kálmán in 1960 in his paper on control theory. In the modern IT world, observability is just a concept. Even before it became the talk of the town, some engineers were probably already building well-rounded monitoring systems, and the ecosystem around them, that made their services observable. It’s just that they did not know the buzzword!
In a single instance of a web application, you can add some scripting to check the service’s status, use Nagios to monitor the infrastructure, write smart logs and scrape them with scripts or some tool to keep an eye on connectivity and errors, plug results into a ticketing system such as BMC or set up SNMP traps, and there you go! The system is observable, yes that’s true – all aspects of the system are covered, engineers have a hold on the infrastructure and services, they know whether the systems have connectivity, and tickets are raised when something goes wrong. It’s all there. Hold on – there is something still missing though, which we will discover at the end of this section.
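As a rough illustration of the "scrape the logs with a script" style of monitoring described above, here is a minimal Python sketch. The log format, the sample lines, and the function names count_levels and should_raise_ticket are assumptions made up for this example; a real setup would read the application's own log files and call the ticketing system instead of printing a result.

```python
import re
from collections import Counter

# Hypothetical log lines; the "date time LEVEL message" format is an assumption.
SAMPLE_LOG = """\
2021-03-01 10:00:01 INFO order-service started
2021-03-01 10:00:05 ERROR database connection refused
2021-03-01 10:00:09 WARN retrying database connection
2021-03-01 10:00:12 ERROR database connection refused
"""

LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def count_levels(log_text):
    """Count how many log lines were emitted at each severity level."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


def should_raise_ticket(counts, max_errors=1):
    """Return True when the number of ERROR lines exceeds the allowed limit."""
    return counts["ERROR"] > max_errors


counts = count_levels(SAMPLE_LOG)
print(dict(counts))
print("raise ticket:", should_raise_ticket(counts))
```

A script like this tells you that errors occurred and how many, which is exactly the kind of coverage this section describes. It does not connect those errors to anything else in the system.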
When thinking of what was used before observability, we are not talking about mainframe systems that were a black box for decades until some bright brains opened up that tough nut with Syncsort; there is no need to start from the very beginning. In the 90s, software and desktops were batch-oriented, had single instances, and focused less on the GUI. The outputs that they emitted were either hardware signals or code that only a few skilled technicians could decipher. With the advent of sophisticated OSs such as Linux, the game started to evolve, and you might be surprised that, for a long time, humble commands and syslogs were sought after as monitoring tools for Linux and Unix-based OSs. But we will not start from there either.
As an example, take a look at the following figure for a quick contrast between the humble beginnings of monitoring and its current state:
Figure 1.1 – Monitoring then (left) and now (right) (Creative Commons—Attribution 2.0 Generic—CC BY 2.0)
Let’s fast forward a bit. The world started shrinking with the internet when the era of eCommerce started. All of a sudden, single-instance apps started evolving into monolithic apps (which we know entered the black hole soon after). And this is where we will start!
With eCommerce, infrastructure monitoring and traffic-light monitoring of services were not enough. Businesses needed frequent metrics on products, web traffic, and, most importantly, user behavior to assess the current state of the business, along with actionable insights to make future decisions about expansion. These came to be known as business metrics – data for the eyes of the executives. Logs could no longer be left at the mercy of the developers; logging frameworks and normalization techniques were introduced to help developers produce meaningful logs that could be used to derive application health and business metrics. Early-age monitoring tools such as Cacti, Nagios, scripts (shell or Python), and some commands could only cater to a handful of the monitoring requirements. Areas such as APM, customer behavior analytics, and measuring incident impact on customers remained largely untouched. As eCommerce platforms gained popularity, the volume of customers increased, and data volumes started exploding way beyond the capacity of the available monitoring tools. System architectures evolved from monolithic to distributed, making it even more difficult for traditional monitoring techniques to provide meaningful insights.
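To make the idea of normalized logs more concrete, the following sketch uses Python's standard logging module to emit every record as one JSON object with a fixed set of keys. The JsonFormatter class, the build_logger helper, and the order_id, amount, and currency fields are illustrative assumptions, not part of any particular logging framework discussed in this book.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a predictable shape."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("order_id", "amount", "currency"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(name="checkout"):
    """Create a logger that writes JSON lines to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info(
    "order placed",
    extra={"order_id": "A-1001", "amount": 24.5, "currency": "USD"},
)
```

Because every line has the same shape, the same log stream can feed both an application health view (count of records by level) and a business metric (sum of amount per currency) without custom parsing for each developer's message style.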
As the tech stack was increasing, each technology or tool started offering a monitoring tool. Windows had Event Viewer and SCOM, Linux had its commands, databases had RockSolid and OEM, Unix had HP products, and Apache servers had highly structured standard logs – this list can go on. Soon, the monitoring space was cluttered with micro-monitoring tools when the need was to have macro monitoring that could provide an end-to-end unified view of the distributed platforms consisting of various technologies.
Figure 1.2 – Log volume ingestion growth (source: Gartner)
So, before observability, there was only monitoring, which was limited to a particular technology in most cases. Then, a lot of big data monitoring tools were introduced, such as AppDynamics, New Relic, Splunk, Dynatrace, and others, that could collect data from various sources and make it available to end users on a single screen. The micro-based monitoring bubbles soon started converging toward these tools and a mature ecosystem started shaping up. When you look at the fancy visualizations that these tools offer, it’s hard to believe that monitoring in its primitive days was based on hardware-based signals, commands, and scripts.
Issues with traditional monitoring techniques
Traditional monitoring techniques focused on collecting a few predefined metrics and using them to analyze the system’s health and to drive alerting. IT systems were managed and operated in isolation, and all the IT management and engineering processes in an organization were framed around this construct. Many IT system providers created monitoring tools primarily to monitor the application’s health in isolation.
Let’s discuss the issues with traditional monitoring techniques and why they no longer fit the bill for observability implementations. You may already be past these challenges, but we still recommend reading through each of the challenges as we talk about them from an observability perspective.
Let’s consider a service that depends on three applications. The traditional approach would have identified key parameters that define the health of each of these three individual applications. Each of these applications would be monitored separately, on the assumption that if the applications are healthy individually, then the business service that depends on them (fully or partially) would also be healthy and would serve the customers efficiently. There was no concept of a service in this approach.
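The gap between "every application is healthy" and "the service is healthy" can be shown with a small sketch. It assumes, purely for illustration, that the three applications are called one after another for a single customer request, so their response times add up. The application names, latency figures, and limits are invented.

```python
from dataclasses import dataclass


@dataclass
class AppStatus:
    name: str
    latency_ms: float        # observed response time of this application
    latency_limit_ms: float  # alert threshold defined for this application alone

    @property
    def healthy(self):
        return self.latency_ms <= self.latency_limit_ms


def all_apps_healthy(apps):
    """Traditional view: each application is judged against its own threshold."""
    return all(app.healthy for app in apps)


def service_healthy(apps, service_limit_ms):
    """Service view: the customer waits for the whole chain of calls."""
    total_latency = sum(app.latency_ms for app in apps)
    return total_latency <= service_limit_ms


apps = [
    AppStatus("web", latency_ms=450, latency_limit_ms=500),
    AppStatus("orders", latency_ms=480, latency_limit_ms=500),
    AppStatus("payments", latency_ms=490, latency_limit_ms=500),
]

print("all applications healthy:", all_apps_healthy(apps))
print("service healthy:", service_healthy(apps, service_limit_ms=1200))
```

With these numbers, every application stays within its own limit, yet the combined response time of 1,420 ms breaches the assumed 1,200 ms limit for the service. Three green dashboards would hide a slow customer experience.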
This method would have worked well for a traditional infrastructure, where the application was monolithic and hosted on physical hardware in data centers. This guaranteed a certain amount of resources for the application to run. Then came virtualization, which added another layer on top of the physical hardware, and the guarantee of dedicated resources was gone. The adoption of cloud infrastructure services such as AWS and GCP and cloud-native technologies such as serverless architecture, microservices, and containers has completely decoupled infrastructure and applications, making the IT system more complex and interdependent. These technologies have introduced a level of unpredictability in IT systems’ operations. Hence, the concepts, practices, and tools used for managing and maintaining the health of applications also have to change accordingly.
One of the key issues with the traditional monitoring approach is that you have to decide in advance which metrics need to be collected and monitored. Many of these key indicators or metrics are decided based on the past experiences of vendors, administrators, and system engineers. With more experience, engineers can come up with more and better key indicators. While this was effective to a certain extent in traditional infrastructure environments, modern distributed architecture has introduced a lot of interdependencies and complexity in IT environments, where the source of the problems or issues can vary drastically. Hence, choosing potential health indicators or metrics up front can be quite inaccurate and challenging.
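The following sketch illustrates why a fixed list of metrics can miss a problem. It derives a green, amber, or red status from two predefined thresholds. The metric names, the threshold values, and the db_pool_wait_ms field are assumptions chosen for the example.

```python
# Metrics chosen up front, based on past experience (illustrative values).
PREDEFINED_THRESHOLDS = {"cpu_percent": 90.0, "memory_percent": 85.0}


def traffic_light(sample, thresholds=PREDEFINED_THRESHOLDS, amber_ratio=0.8):
    """Return 'green', 'amber', or 'red' using only the predefined metrics."""
    status = "green"
    for metric, limit in thresholds.items():
        value = sample.get(metric, 0.0)
        if value >= limit:
            return "red"
        if value >= amber_ratio * limit:
            status = "amber"
    return status


# The database connection pool is badly congested, but nobody anticipated
# that failure mode, so no threshold exists for it.
sample = {"cpu_percent": 35.0, "memory_percent": 40.0, "db_pool_wait_ms": 4200.0}
print(traffic_light(sample))
```

The status is computed only from the metrics someone thought of beforehand, so the congested connection pool in this sample never influences it. Even when the status does change, it reports that something is wrong without indicating where or why.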
Identifying why and where the problem exists
The main purpose of conventional monitoring is to detect when there is a problem. This provides a simple green, amber, or red health status indication but doesn’t answer why and where the issue originates. Once the issues have been flagged, it’s up to the administrators and engineers to figure out where and why the problem exists. Since modern infrastructure services are very transient, identifying the source of the problem is quite difficult or time-consuming. Hence, answering why and where as quickly as possible is critical in reducing MTTR and maintaining a stable service.
Key benefits of observability
The first step toward implementing observability is not just knowing application design, infrastructure, and business functions – it’s also about considering customer behavior, the impact of incidents, application performance, adoption in the market, and the dollar value, to name a few. All members of the team need to come together to implement observability.
From the inception stage, you will require inputs from architects for design, developers for putting it together, operations for ensuring the right alert triggers, the business for clearly defining what they need, and a strategy to assess customer behavior and impact. As the project proceeds in the development and testing phases, continue to assess measures that help establish the success of a business function. Ensure that those measures are captured in outputs (logs/metrics/traces). Ensure that applications are not seen in silos but can be correlated as per business functions. This will give you visibility into business metrics and their impact on customers when things go south. The responsibility of knowing the fine-grained details of the app is shifted from architects and business analysts to every member of the team.
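One common way to make outputs from separate applications correlatable is to attach the same identifier and the same business function label to everything a transaction touches. The sketch below shows the idea with plain Python dictionaries standing in for log lines or trace spans. The function names, field names, and the MK Tea application names are hypothetical.

```python
import uuid


def new_correlation_id():
    """Create one identifier that follows a transaction across applications."""
    return uuid.uuid4().hex


def record_event(events, app, business_function, correlation_id, message):
    """Append one structured event; in practice this would be a log line or span."""
    events.append({
        "app": app,
        "business_function": business_function,
        "correlation_id": correlation_id,
        "message": message,
    })


def events_for_transaction(events, correlation_id):
    """Return every event for one transaction, whichever application emitted it."""
    return [e for e in events if e["correlation_id"] == correlation_id]


def apps_behind_function(events, business_function):
    """List the applications that took part in a business function."""
    return sorted({e["app"] for e in events if e["business_function"] == business_function})


events = []
order_id = new_correlation_id()
record_event(events, "web-shop", "order-fulfilment", order_id, "order received")
record_event(events, "inventory", "order-fulfilment", order_id, "stock reserved")
record_event(events, "shipping", "order-fulfilment", order_id, "no carrier available")

for event in events_for_transaction(events, order_id):
    print(event["app"], "-", event["message"])
print(apps_behind_function(events, "order-fulfilment"))
```

With the shared identifier in place, a failure reported by the shipping application can be read in the context of the order it belongs to, rather than as an isolated error in one silo.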
By now, if you have gathered that observability requires planning and hard work to implement, then you are on the right path! Congratulations, you have achieved your first milestone in your observability journey. It’s not something that you think about at the end of the project so that you can tick a box before it’s released into production. You need to think of observability from the inception of your new projects and plan for it; for existing projects, you need to reframe the perspective and replan your observability strategy. We will talk about this a lot throughout this book. After all, this book has been purposely written to help you plan for observability.
- Correlated applications that deliver higher business value
Modern architectures are delivered with crippling complexity, sophisticated infrastructure, smart networks, and an intertwined web of applications. A transaction originating in an on-premises web application may end up traversing containerized applications hosted in the cloud before it reaches completion. Observability lets you embrace this complexity as it focuses on correlating applications. When any application breaks or slows down, you can quickly map out the impact on other applications, business functions, and customers. If your applications are observable, you will observe that the conversation in war rooms will change from bringing up the application to restoring business functions and minimizing customer impact.
- An improved customer experience that drives customer loyalty
Observability delivers information faster. A high-severity incident may be super critical for infrastructure but if that particular infrastructure is only serving a very small percentage of low-value customers, it is not a high-priority incident. Observability gives you this information. It also tells you the symptoms before the customers sense them, giving you a thin window to analyze, detect, and act. Sometimes, the issues can’t be fixed in this thin window, but you can still use the time to prepare your response to the customers so that social media doesn’t explode and the service desk responds coherently. All your investments in observability are bound to result in improved customer experience.
- Tools rationalization for improved ROI
Cut down the time spent interacting with various teams to identify the epicenter of the problem by integrating available tools that provide relevant insights for your application. Allow the tools to work in their own space but integrate the important metrics (infrastructure, application processes, deployments, database, networks, SRE, business, capacity, and more) from all the tools into a single tool that can easily construct and deconstruct your application, enabling you to measure performance on good days and manage incidents. A single tool or a set of carefully chosen tools for observing business functions will also increase transparency in the team, as every member will have access to the same level of insights. Modern applications can generate a ton of data at high velocity. Observability helps in optimizing the data generation and collection mechanism to improve reliability and reduce cost by managing big data problems.
- Focus on not just tech but also the process
To get the data you need, don’t just look at writing the enterprise-grade application code. You should also invest in the process so that the problem’s remediation is part of the system design. Automate all repetitive tasks along the way. It will give your team agility and reduce the room for human error. Choosing the best tool and technology will only pay off if it opens up the visibility of your system. It’s not always possible to achieve 100% automation, so introduce robust practices that provide enough checkpoints to trace a problem, such as Git commits and peer reviews. Writing code can’t be fully automated, but introducing Git builds a strong process around the manual task that gives end-to-end visibility into what has been deployed on the servers.
- Data noise is converted into actionable insights
Correlating applications, consolidating different tools, and ingesting telemetry data can easily lead to large volumes of data, often referred to as data noise. Your observability design may be capturing thousands of parameters; what brings value is knowing which measures are central to delivering a particular business function. In observable systems, post-incident analysis is more fruitful as all the parties involved have access to information from all other teams that are involved. There is no place for playing the blame game or meddling with the information that was available only in silos earlier. Just imagine the magic observability would bring to MTTR with all its correlated systems and the involvement of different perspectives. Observable systems will allow you to take a head-on approach for the busiest days of the year as every aspect of the system is being watched and the slightest of slip-ups can be easily identified and assessed for impact. It empowers the decision-makers.
- Foundation for a self-healing architecture
In a complex and interconnected IT system environment, a self-healing architecture can help in guaranteeing the service’s health by quickly identifying an outage in a component or a situation that can cause an outage, and then deploying countermeasures to prevent the issue from happening or resolving the issue quickly to reduce the impact on the end customer. As you may have noticed, identifying a problem or a potential problem is a critical part of the self-healing architecture. For the self-healing actions to be effective, the detection of the problems has to be as close to real time as possible, and they must be effective and comprehensive. This is where the need for observability comes in – to be able to monitor the health of an application at the OS, application, and user experience levels.
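The detect-then-act loop at the heart of a self-healing architecture can be sketched in a few lines. This is a simplified illustration: the self_heal function, the component names, and the in-memory state dictionary are assumptions, and a real system would run health checks against live telemetry and trigger actual remediation such as a restart or failover.

```python
from typing import Callable, Dict, List


def self_heal(
    checks: Dict[str, Callable[[], bool]],
    remedies: Dict[str, Callable[[], None]],
) -> List[str]:
    """Run every health check and apply the matching remedy for each failure."""
    actions = []
    for component, is_healthy in checks.items():
        if is_healthy():
            continue
        remedy = remedies.get(component)
        if remedy is None:
            # No automated countermeasure is known, so hand over to a person.
            actions.append(f"{component}: escalate to on-call")
        else:
            remedy()
            actions.append(f"{component}: remedy applied")
    return actions


# Hypothetical component states; True means healthy.
state = {"packaging-api": False, "sorting-queue": True, "label-printer": False}


def restart_packaging_api():
    state["packaging-api"] = True


checks = {name: (lambda name=name: state[name]) for name in state}
remedies = {"packaging-api": restart_packaging_api}

print(self_heal(checks, remedies))
print("packaging-api healthy:", state["packaging-api"])
```

The loop can only remediate what the checks are able to detect, which is why the quality and timeliness of the detection step matter so much.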
Along with the benefits outlined in this section, observability brings many intangible benefits, such as a focus on creating service maps, strengthening CMDB, a change in the mindset of developers, and supporting people. It brings about not just a technical shift but also a cultural one. You can use these benefits to pitch for your observability journey. Also, keep them in mind while designing observability so that you build a quality observability mechanism in the first iteration.
Now that we have come to the end of this chapter, we hope that you have a fair understanding of observability and how it differs from monitoring. Infrastructure and application monitoring has always existed in the IT landscape, but the techniques are no longer enough for modern and complex application architectures. The major drivers of this fallout are the volume and velocity of the data generated by complex and modern architectures. Therefore, observability is seen as next-generation monitoring that correlates assets, applications, businesses, and customers.
People, tools, and the organization’s culture play a major role in observability implementations; we will discuss them in detail in further chapters.