Chapter 1: Machine Learning for IT
A decade ago, the idea of using machine learning (ML)-based technology in IT operations or IT security seemed a little like science fiction. Today, however, it is one of the most common buzzwords used by software vendors. Clearly, there has been a major shift in both the perceived need for the technology and the capabilities that state-of-the-art implementations can bring to bear. Understanding this evolution is important to fully appreciate how Elastic ML came to be and what problems it was designed to solve.
This chapter is dedicated to reviewing the history and concepts behind how Elastic ML works. It also discusses the different kinds of analysis that can be done and the kinds of use cases that can be solved. Specifically, we will cover the following topics:
- Overcoming the historical challenges in IT
- Dealing with the plethora of data
- The advent of automated anomaly detection
- Unsupervised versus supervised ML
- Using unsupervised ML for anomaly detection
- Applying supervised ML to data frame analytics
Overcoming the historical challenges in IT
IT application support specialists and application architects have a demanding job with high expectations. Not only are they tasked with moving new and innovative projects into place for the business, but they also have to keep currently deployed applications up and running as smoothly as possible. Today's applications are significantly more complicated than ever before—they are highly componentized, distributed, and possibly virtualized/containerized. They could be developed using Agile, or by an outsourced team. Plus, they are most likely constantly changing. Some DevOps teams claim they can typically make more than 100 changes per day to a live production system. Trying to understand a modern application's health and behavior is like a mechanic trying to inspect an automobile while it is moving.
IT security operations analysts have similar struggles in keeping up with day-to-day operations, but they obviously have a different focus of keeping the enterprise secure and mitigating emerging threats. Hackers, malware, and rogue insiders have become so ubiquitous and sophisticated that the prevailing wisdom is that it is no longer a question of whether an organization will be compromised—it's more of a question of when they will find out about it. Clearly, knowing about a compromise as early as possible (before too much damage is done) is preferable to learning about it for the first time from law enforcement or the evening news.
So, how can they be helped? Is the crux of the problem that application experts and security analysts lack access to data to help them do their job effectively? Actually, in most cases, it is the exact opposite. Many IT organizations are drowning in data.
Dealing with the plethora of data
IT departments have invested in monitoring tools for decades, and it is not uncommon to have a dozen or more tools actively collecting and archiving data that can be measured in terabytes, or even petabytes, per day. The data can range from rudimentary infrastructure- and network-level data to deep diagnostic data and/or system and application log files.
Business-level key performance indicators (KPIs) could also be tracked, sometimes including data about the end user's experience. The sheer depth and breadth of the data available is, in some ways, more comprehensive than it has ever been. To detect emerging problems or threats hidden in that data, there have traditionally been several main approaches to distilling the data into informational insights:
- Filter/search: Some tools allow the user to define searches to help trim down the data into a more manageable set. While extremely useful, this capability is most often used in an ad hoc fashion once a problem is suspected. Even then, the success of using this approach usually hinges on the ability for the user to know what they are looking for and their level of experience—both with prior knowledge of living through similar past situations and expertise in the search technology itself.
- Visualizations: Dashboards, charts, and widgets are also extremely useful to help us understand what data has been doing and where it is trending. However, visualizations are passive and require being watched for meaningful deviations to be detected. Once the number of metrics being collected and plotted surpasses the number of eyeballs available to watch them (or even the screen real estate to display them), visual-only analysis becomes less and less useful.
- Thresholds/rules: To remove the need for someone to physically watch the data in order to be proactive, many tools allow the user to define rules or conditions that get triggered upon known conditions or known dependencies between items. However, it is unlikely that you can realistically define all appropriate operating ranges or model all of the actual dependencies in today's complex and distributed applications. Plus, the amount and velocity of changes in the application or environment could quickly render any static rule set useless. Analysts find themselves chasing down many false positive alerts, setting up a boy-who-cried-wolf paradigm that leads to resentment of the tools generating the alerts and skepticism of the value that alerting could provide.
Ultimately, there needed to be a different approach—one that wasn't necessarily a complete repudiation of past techniques, but could bring a level of automation and empirical augmentation of the evaluation of data in a meaningful way. Let's face it, humans are imperfect—we have hidden biases and limitations of capacity for remembering information and we are easily distracted and fatigued. Algorithms, if used correctly, can easily make up for these shortcomings.
The advent of automated anomaly detection
ML, while a very broad topic that encompasses everything from self-driving cars to game-winning computer programs, was a natural place to look for a solution. If you realize that most of the requirements of effective application monitoring or security threat hunting are merely variations on the theme of "find me something that is different from normal," then the discipline of anomaly detection emerges as the natural place to begin using ML techniques to solve these problems for IT professionals.
The science of anomaly detection is certainly nothing new. Many very smart people have researched and employed a variety of algorithms and techniques for many years. However, the practical application of anomaly detection to IT data poses some interesting constraints that make otherwise academically worthy algorithms inappropriate for the job. These include the following:
- Timeliness: Notification of an outage, breach, or other significant anomalous situation should be known as quickly as possible to mitigate it. The cost of downtime or the risk of a continued security compromise is minimized if remedied or contained quickly. Algorithms that cannot keep up with the real-time nature of today's IT data have limited value.
- Scalability: As mentioned earlier, the volume, velocity, and variation of IT data continue to explode in modern IT environments. Algorithms that inspect this vast data must be able to scale linearly with the data to be usable in a practical sense.
- Efficiency: IT budgets are often highly scrutinized for wasteful spending, and many organizations are constantly being asked to do more with less. Tacking on an additional fleet of super-computers to run algorithms is not practical. Rather, modest commodity hardware with typical specifications must be able to be employed as part of the solution.
- Generalizability: While highly specialized data science is often the best way to solve a specific information problem, the diversity of data in IT environments drives a need for something that can be broadly applicable across most use cases. Reusability of the same techniques is much more cost-effective in the long run.
- Adaptability: Ever-changing IT environments will render a brittle algorithm useless in no time. Training and retraining the ML model would only introduce yet another time-wasting venture that cannot be afforded.
- Accuracy: We already know that alert fatigue from legacy threshold and rule-based systems is a real problem. Swapping one false alarm generator for another will not impress anyone.
- Ease of use: Even if all of the previously mentioned constraints could be satisfied, any solution that requires an army of data scientists to implement it would be too costly and would be disqualified immediately.
So, now we are getting to the real meat of the challenge—creating a fast, scalable, accurate, low-cost anomaly detection solution that everyone will use and love because it works flawlessly. No problem!
As daunting as that sounds, Prelert founder and CTO Steve Dodson took on that challenge back in 2010. While Dodson certainly brought his academic chops to the table, the technology that would eventually become Elastic ML had its genesis in the throes of trying to solve real IT application problems, the first being a pesky intermittent outage in a trading platform at a major London finance company. Dodson, and a handful of engineers who joined the venture, helped the bank's team use the anomaly detection technology to automatically surface only the needles in the haystacks, which allowed the analysts to focus on the small set of relevant metrics and log messages that were going awry. The identification of the root cause (a failing service whose recovery caused a cascade of subsequent network problems that wreaked havoc) ultimately brought stability to the application and spared the bank from spending lots of money on the previously proposed solution, which was an unplanned, costly network upgrade.
As time passed, however, it became clear that even that initial success was only the beginning. A few years and a few thousand real-world use cases later, the marriage of Prelert and Elastic was a natural one—a combination of a platform making big data easily accessible and technology that helped overcome the limitations of human analysis.
Fast forward to 2021, a full 5 years after the joining of forces, and Elastic ML has come a long way in the maturation and expansion of capabilities of the ML platform. This second edition of the book encapsulates the updates made to Elastic ML over the years, including the introduction of integrations into several of the Elastic solutions around observability and security. This second edition also includes the introduction of "data frame analytics," which is discussed extensively in the third part of the book. To get a grounded, intuitive understanding of how Elastic ML works, we first need to get to grips with some terminology and concepts.
Unsupervised versus supervised ML
While there are many subtypes of ML, two very prominent ones (and the two that are relevant to Elastic ML) are unsupervised and supervised.
In unsupervised ML, there is no outside guidance or direction from humans. In other words, the algorithms must learn (and model) the patterns of the data purely on their own. In general, the biggest challenge here is to have the algorithms accurately surface detected deviations from the input data's normal patterns to provide meaningful insight for the user. If the algorithm is not able to do this, then it is of little practical use. Therefore, the algorithms must be quite robust and able to account for all of the intricacies of the way that the input data is likely to behave.
In supervised ML, input data (often multivariate data) is used to help model the desired outcome. The key difference from unsupervised ML is that the human decides, a priori, what variables to use as the input and also provides "ground-truth" examples of the expected target variable. Algorithms then assess how the input variables interact and influence the known output target. To accurately get the desired output (a prediction, for example), the algorithm must have "the right kind of data": data that not only expresses the situation, but is also diverse enough for the algorithm to effectively learn the relationship between the input data and the output target.
As such, both cases require good input data, good algorithmic approaches, and a good mechanism to allow the ML to both learn the behavior of the data and apply that learning to assess subsequent observations of that data. Let's dig a little deeper into the specifics of how Elastic ML leverages unsupervised and supervised learning.
Using unsupervised ML for anomaly detection
To get a more intuitive understanding of how Elastic ML's anomaly detection works using unsupervised ML, we will discuss the following:
- A rigorous definition of unusual with respect to the technology
- An intuitive example of learning in an unsupervised manner
- A description of how the technology models, de-trends, and scores the data
Anomaly detection is something almost all of us have a basic intuition about. Humans are quite good at pattern recognition, so it should be of no surprise that if I asked 100 people on the street what's unusual in the following graph, a vast majority (including non-technical people) would identify the spike in the green line:
Similarly, let's say we ask what's unusual in the following photo:
We will, again, likely get a majority that rightly claims that the seal is the unusual thing. But people may struggle to articulate in salient terms the actual heuristics that are used in coming to those conclusions.
There are two different heuristics that we could use to define the different kinds of anomalies shown in these images:
- Something is unusual if its behavior has significantly deviated from an established pattern or range based upon its past history.
- Something is unusual if some characteristic of that entity is significantly different from the same characteristic of the other members of a set or population.
These key definitions will be relevant to Elastic ML's anomaly detection, as they form the two main fundamental modes of operation of the anomaly detection algorithms (temporal versus population analysis, as will be explored in Chapter 3, Anomaly Detection). As we will see, the user will have control over what mode of operation is employed for a particular use case.
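To make the two heuristics concrete, here is a minimal Python sketch. It uses a simple z-score (distance from the mean, measured in standard deviations) as a stand-in measure of deviation. That choice is an assumption made purely for illustration and is not how Elastic ML scores anomalies. The function names and sample data are hypothetical.

```python
from statistics import mean, stdev

def temporal_deviation(history, latest):
    """Heuristic 1: compare an entity's latest value with its own past."""
    return abs(latest - mean(history)) / stdev(history)

def population_deviation(peers, candidate):
    """Heuristic 2: compare one entity's value with the same value across its peers."""
    return abs(candidate - mean(peers)) / stdev(peers)

# Hypothetical data: one server's requests per minute over time...
own_history = [102, 98, 101, 97, 103, 99, 100, 100]
# ...and the daily login counts of several users in the same department.
peer_logins = [12, 15, 11, 14, 13, 12, 16, 13]

temporal_score = temporal_deviation(own_history, latest=160)
population_score = population_deviation(peer_logins, candidate=95)
print(f"temporal: {temporal_score:.1f} standard deviations from its own history")
print(f"population: {population_score:.1f} standard deviations from its peers")
```

In the first call the entity is judged only against itself. In the second, its own history plays no role and only the behavior of its peers matters.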
Learning what's normal
As we've stated, Elastic ML's anomaly detection uses unsupervised learning in that the learning occurs without anything being taught. There is no human assistance to shape the decisions of the learning; it simply does so on its own, via inspection of the data it is presented with. This is slightly analogous to the learning of a language via the process of immersion, as opposed to sitting down with books of vocabulary and rules of grammar.
To go from a completely naive state where nothing is known about a situation to one where predictions could be made with good certainty, a model of the situation needs to be constructed. How this model is created is extremely important, as the efficacy of all subsequent actions taken based upon this model will be highly dependent on the model's accuracy. The model will need to be flexible and continuously updated based upon new information, because that is all that it has to go on in this unsupervised paradigm.
Probability distributions can serve this purpose quite well. There are many fundamental types of distributions (and Elastic ML uses a variety of distribution types, such as Poisson, Gaussian, log-normal, or even mixtures of models), but the Poisson distribution is a good one to discuss first, because it is appropriate in situations where there are discrete occurrences (the "counts") of things with respect to time:
There are three different variants of the distribution shown here, each with a different mean (λ), which also determines where the most likely values of k fall. We can make an analogy that says that these distributions model the expected amount of postal mail that a person gets delivered to their home on a daily basis, represented by k on the x axis:
- For λ = 1, there is about a 37% chance that zero pieces of mail are delivered on a given day, and the same chance that exactly one piece is delivered. Perhaps this is appropriate for a college student who doesn't receive much postal mail.
- For λ = 4, there is about a 20% chance of receiving exactly three pieces, and the same chance of receiving exactly four. This might be a good model for a young professional.
- For λ = 10, there is about a 13% chance that 10 pieces are received per day—perhaps representing a larger family or a household that has somehow found themselves on many mailing lists!
The discrete points on each curve also give the likelihood (probability) of other values of k. As such, the model can be informative and answer questions such as "Is getting 15 pieces of mail likely?" As we can see, it is not likely for a student (λ = 1) or a young professional (λ = 4), but it is somewhat likely for a large family (λ = 10). Obviously, there was a simple declaration made here that the models shown were appropriate for the certain people described—but it should seem obvious that there needs to be a mechanism to learn that model for each individual situation, not just assert it. The process for learning it is intuitive.
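The probabilities quoted above follow directly from the standard Poisson formula, P(k) = e^(-λ) λ^k / k!. The short Python sketch below evaluates it for the three households. The function name is our own, and nothing here is specific to Elastic ML.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval when the mean rate is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# The three households from the analogy: student, young professional, large family.
for lam in (1, 4, 10):
    print(f"lambda={lam:>2}: P(k=15) = {poisson_pmf(15, lam):.3g}")
```

For λ = 10, the chance of exactly 15 pieces works out to roughly 3.5%, whereas for λ = 1 and λ = 4 it is vanishingly small. That matches the intuition that 15 pieces is plausible only for the large family.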
Learning the models
Sticking with the postal mail analogy, it is easy to see that the best-fitting model for a particular household could be determined simply by hanging out by the mailbox every day and recording what the postal carrier drops into it. It should also seem obvious that the more observations made, the higher your confidence should be that your model is accurate. In other words, spending only 3 days by the mailbox would provide less complete information and confidence than spending 30 days, or 300 for that matter.
Algorithmically, a similar process could be designed to self-select the appropriate model based upon observations. Careful scrutiny of the algorithm's choices of the model type itself (that is, Poisson, Gaussian, log-normal, and so on) and the specific coefficients of that model type (as in the preceding example of λ) would also need to be part of this self-selection process. To do this, constant evaluation of the appropriateness of the model is done. Bayesian techniques are also employed to assess the model's likely parameter values, given the dataset as a whole, but allowing for tempering of those decisions based upon how much information has been seen prior to a particular point in time. The ML algorithms accomplish this automatically.
For those that want a deeper dive into some of the representative mathematics going on behind the scenes, please refer to the academic paper at http://www.ijmlc.org/papers/398-LC018.pdf.
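As a rough illustration of how a Bayesian estimate of λ firms up as observations accumulate, the sketch below uses the textbook Gamma-Poisson conjugate update. This is an assumption chosen for its simplicity and is not a description of Elastic ML's actual modeling code. The mailbox counts, the prior, and the function names are all hypothetical.

```python
def posterior_after(daily_counts, prior_shape=1.0, prior_rate=1.0):
    """Gamma posterior (shape, rate) for a Poisson mean after the observed counts."""
    return prior_shape + sum(daily_counts), prior_rate + len(daily_counts)

def mean_and_spread(shape, rate):
    """Posterior mean of lambda and its standard deviation."""
    return shape / rate, shape ** 0.5 / rate

# Hypothetical record of pieces of mail per day, repeated to cover 300 days.
mailbox_log = [3, 5, 4, 4, 2, 6, 4, 3, 5, 4] * 30

for days in (3, 30, 300):
    estimate, uncertainty = mean_and_spread(*posterior_after(mailbox_log[:days]))
    print(f"after {days:>3} days: lambda ~ {estimate:.2f} +/- {uncertainty:.2f}")
```

The estimate of λ barely moves between 30 and 300 days, but the uncertainty around it keeps shrinking, which is the "more observations, more confidence" idea expressed numerically.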
Most importantly, the modeling that is done is continuous, so that new information is considered along with the old, with an exponential weighting given to information that is fresher. Such a model, after 60 observations, could resemble the following:
It will then seem very different after 400 observations, as the data presents itself with a slew of new observations with values between
Also, notice that there is the potential for the model to have multiple modes or areas/clusters of higher probability. The complexity and trueness of the fit of the learned model (shown as the blue curve) with the theoretically ideal model (in black) matters greatly. The more accurate the model, the better representation of the state of normal for that dataset and thus, ultimately, the more accurate the prediction of how future values comport with this model.
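The idea of giving fresher information more weight can be sketched with an exponentially weighted mean. This is only an illustration of the weighting principle, not Elastic ML's implementation. The `decay` parameter and the sample observations are hypothetical.

```python
def exponentially_weighted_mean(values, decay=0.9):
    """Mean in which each step further back in time is down-weighted by `decay`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for value in values:  # values are ordered oldest first
        weighted_sum = decay * weighted_sum + value
        total_weight = decay * total_weight + 1.0
    return weighted_sum / total_weight

# Hypothetical series whose typical level shifts after the first 60 observations.
observations = [2] * 60 + [8] * 40

plain_mean = sum(observations) / len(observations)
recent_weighted = exponentially_weighted_mean(observations)
print(f"plain mean: {plain_mean:.2f}, exponentially weighted: {recent_weighted:.2f}")
```

The plain mean is pulled toward the older level, while the weighted version tracks the recent behavior much more closely. Old data is never discarded outright; its influence simply fades.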
The continuous nature of the modeling also drives the requirement that this model is capable of serialization to long-term storage, so that if model creation/analysis is paused, it can be reinstated and resumed at a later time. As we will see, the operationalization of this process of model creation, storage, and utilization is a complex orchestration, which is fortunately handled automatically by Elastic ML.
Another important aspect of faithfully modeling real-world data is to account for prominent overtone trends and patterns that naturally occur. Does the data ebb and flow hourly and/or daily with more activity during business hours or business days? If so, this needs to be accounted for. Elastic ML automatically hunts for prominent trends in the data (linear growth, cyclical harmonics, and so on) and factors them out. Let's observe the following graph:
Here, the periodic daily cycle is learned, then factored out. The model's prediction boundaries (represented in the light-blue envelope around the dark-blue signal) dramatically adjust after automatically detecting three successive iterations of that cycle.
Therefore, as more data is observed over time, the models gain accuracy both from the perspective of the probability distribution function getting more mature, as well as via the auto-recognizing and de-trending of other routine patterns (such as business days, weekends, and so on) that might not emerge for days or weeks. In the following example, several trends are discovered over time, including daily, weekly, and an overall linear slope:
These model changes are recorded as system annotations. Annotations, as a general concept, will be covered in later chapters.
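To see why factoring out a periodic pattern matters, consider the following simplified Python sketch. It assumes the period is already known (24 samples per cycle) and removes the cycle by subtracting the average value at each position in it. Elastic ML discovers such trends on its own, and this sketch does not attempt to reproduce how. The synthetic signal and function names are hypothetical.

```python
import math
from statistics import mean

def seasonal_profile(values, period):
    """Average value at each position (phase) within the repeating cycle."""
    return [mean(values[phase::period]) for phase in range(period)]

def detrend(values, period):
    """Subtract the learned cycle so only the unexplained residual remains."""
    profile = seasonal_profile(values, period)
    return [value - profile[i % period] for i, value in enumerate(values)]

period = 24  # hypothetical: hourly samples with a daily cycle
signal = [100 + 50 * math.sin(2 * math.pi * (hour % period) / period)
          for hour in range(3 * period)]
signal[60] += 40  # inject one unusual hour into the third day

residuals = detrend(signal, period)
most_unusual_hour = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(f"largest residual at hour {most_unusual_hour}")
```

The raw value at the injected hour (about 140) is still below the signal's normal daily peak (150), so a static threshold set above that peak would miss it. Once the cycle is removed, that hour has the largest residual in the series.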
Scoring of unusualness
Once a model has been constructed, the likelihood of any future observed value can be found within the probability distribution. Earlier, we had asked the question "Is getting 15 pieces of mail likely?" This question can now be empirically answered, depending on the model, with a number between 0 (no possibility) and 1 (absolute certainty). Elastic ML will use the model to calculate this fractional value out to approximately 300 decimal places (which can be helpful when dealing with very low probabilities). Let's observe the following graph:
Here, the probability of the observation of the actual value of 921 is now calculated to be 1.444e-9 (or, more commonly, a mere 0.0000001444% chance). This very small value is perhaps not that intuitive to most people. As such, ML will take this probability calculation, and via the process of quantile normalization, re-cast that observation on a severity scale between 0 and 100, where 100 is the highest level of unusualness possible for that particular dataset. In the preceding case, the probability calculation of 1.444e-9 is normalized to a score of 94. This normalized score will come in handy later as a means by which to assess the severity of the anomaly for the purposes of alerting and/or triage.
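The arithmetic above, and the general idea of re-casting a tiny probability onto a 0 to 100 scale, can be sketched in Python. The rank-based mapping below is a deliberately simplified stand-in for quantile normalization. It is not Elastic ML's normalization algorithm and will not reproduce the score of 94 from the example. The list of past probabilities is hypothetical.

```python
from bisect import bisect_right

def as_percent(probability):
    """Express a probability between 0 and 1 as a percentage."""
    return probability * 100.0

def rank_based_score(probability, past_probabilities):
    """Score from 0 to 100: the share of past observations that were more probable."""
    ordered = sorted(past_probabilities)
    more_probable = len(ordered) - bisect_right(ordered, probability)
    return 100.0 * more_probable / len(ordered)

# Hypothetical probabilities previously computed for the same dataset.
past = [0.2, 0.1, 0.05, 0.3, 0.01, 1e-4, 0.15, 0.25, 1e-6, 0.08]

print(f"{as_percent(1.444e-9):.10f}% chance")
print(rank_based_score(1.444e-9, past))  # rarer than everything seen so far
print(rank_based_score(0.05, past))      # fairly ordinary for this dataset
```

The point of the design is that the score is relative to what has been seen for that particular dataset, so the same raw probability can map to different severities in different contexts.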
The element of time
In Elastic ML, all of the anomaly detection that we will discuss throughout the rest of the book will have an intrinsic element of time associated with the data and analysis. In other words, for anomaly detection, Elastic ML expects the data to be time series data and that data will be analyzed in increments of time. This is a key point and also helps discriminate between anomaly detection and data frame analytics in addition to the unsupervised/supervised paradigm.
You will see that there's a slight nuance with respect to population analysis (covered in Chapter 3, Anomaly Detection) and outlier detection (covered in Chapter 10, Outlier Detection). While they effectively both find entities that are distinctly different from their peers, population analysis in anomaly detection does so with respect to time, whereas outlier detection analysis isn't constrained by time. More will become obvious as these topics are covered in depth in later chapters.
Applying supervised ML to data frame analytics
With the exception of outlier detection (covered in Chapter 10, Outlier Detection), which is actually an unsupervised approach, the rest of data frame analytics uses a supervised approach. Specifically, there are two main types of problems that Elastic ML data frame analytics allows you to address:
- Regression: Used to predict a continuous numerical value (a price, a duration, a temperature, and so on)
- Classification: Used to predict whether something is of a certain class label (fraudulent transaction versus non-fraudulent, and more)
In both cases, models are built using training data to map input variables (which can be numerical or categorical) to output predictions by training decision trees. The particular implementation used by Elastic ML is a custom variant of XGBoost, an open source gradient-boosted decision tree framework that has gained considerable renown among data scientists for helping them win Kaggle competitions.
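For readers unfamiliar with gradient-boosted decision trees, the following toy Python sketch shows the core idea for regression: each new tree is fit to the errors (residuals) left by the trees before it. It uses one-split trees ("stumps") and squared error. It is neither Elastic ML's implementation nor XGBoost itself, and the job-duration data, function names, and parameter values are hypothetical.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Find the single feature/threshold split that best explains the residuals."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_value, right_value = mean(left), mean(right)
            error = (sum((r - left_value) ** 2 for r in left)
                     + sum((r - right_value) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    return best[1:]

def predict(model, row):
    """Start from the base value, then add each stump's scaled correction."""
    base, learning_rate, stumps = model
    total = base
    for feature, threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if row[feature] <= threshold else right_value)
    return total

def fit_boosted_stumps(rows, targets, rounds=50, learning_rate=0.3):
    """Repeatedly fit a stump to whatever the current model still gets wrong."""
    model = (mean(targets), learning_rate, [])
    for _ in range(rounds):
        residuals = [t - predict(model, row) for row, t in zip(rows, targets)]
        model[2].append(fit_stump(rows, residuals))
    return model

def mean_squared_error(predictions, targets):
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))

# Hypothetical training data: (input size in GB, worker count) -> job duration in minutes.
rows = [(10, 4), (20, 4), (40, 4), (40, 8), (80, 8), (160, 8)]
durations = [12, 22, 41, 23, 44, 85]

model = fit_boosted_stumps(rows, durations)
baseline_error = mean_squared_error([mean(durations)] * len(rows), durations)
model_error = mean_squared_error([predict(model, row) for row in rows], durations)
print(f"error predicting the average: {baseline_error:.1f}, boosted stumps: {model_error:.3f}")
```

The small learning rate means each stump only nudges the prediction, so many simple trees combine into a model that is considerably more expressive than any one of them.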
The process of supervised learning
The overall process of supervised ML is very different from the unsupervised approach. In the supervised approach, you distinctly separate the training stage from the predicting stage. A very simplified version of the process looks like the following:
Here, we can see that in the training phase, features are extracted out of the raw training data to create a feature matrix (also called a data frame) to feed to the ML algorithm and create the model. The model can be validated against portions of the data to see how well it did, and subsequent refinement steps could be made to adjust which features are extracted, or to refine the parameters of the ML algorithm used to improve the accuracy of the model's predictions.
Once the user decides that the model is efficacious, that model is "moved" to the prediction workflow, where it is used on new data. One at a time, a single new feature vector is inferenced against the model to form a prediction.
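The separation of training, validation, and prediction can be sketched in a few lines of Python. To keep the focus on the workflow, the "model" here is a trivial nearest-centroid classifier rather than the decision trees Elastic ML uses, and the features are left unscaled. The transaction records, field names, and function names are all hypothetical.

```python
from statistics import mean

def extract_features(record):
    """Turn one raw record into a feature vector (one row of the data frame)."""
    return (record["amount"], record["hour"])

def train(feature_matrix, labels):
    """Nearest-centroid 'model': the average feature vector of each class."""
    model = {}
    for label in set(labels):
        members = [row for row, lab in zip(feature_matrix, labels) if lab == label]
        model[label] = tuple(mean(column) for column in zip(*members))
    return model

def infer(model, features):
    """Predict the class whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

raw = [
    {"amount": 25, "hour": 14, "label": "legitimate"},
    {"amount": 40, "hour": 10, "label": "legitimate"},
    {"amount": 18, "hour": 16, "label": "legitimate"},
    {"amount": 60, "hour": 12, "label": "legitimate"},
    {"amount": 900, "hour": 3, "label": "fraudulent"},
    {"amount": 1200, "hour": 2, "label": "fraudulent"},
    {"amount": 750, "hour": 4, "label": "fraudulent"},
    {"amount": 1000, "hour": 1, "label": "fraudulent"},
]
training_records = raw[0:3] + raw[4:7]  # used to build the model
validation_records = [raw[3], raw[7]]   # held back to check the model

# Training phase: extract features, then build the model.
model = train([extract_features(r) for r in training_records],
              [r["label"] for r in training_records])

# Validation: compare predictions with known labels the model has not seen.
correct = sum(infer(model, extract_features(r)) == r["label"] for r in validation_records)
validation_accuracy = correct / len(validation_records)

# Prediction phase: a single new, unlabeled feature vector is inferenced.
prediction = infer(model, extract_features({"amount": 820, "hour": 3}))
print(validation_accuracy, prediction)
```

Note that the labels are used only during training and validation. At prediction time, the model sees features alone.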
To get an intuitive sense of how this works, imagine a scenario in which you want to sell your house, but don't know what price to list it for. You research prior sales in your area and notice the price differentials for homes based on different factors (number of bedrooms, number of bathrooms, square footage, proximity to schools/shopping, age of home, and so on). Those factors are the "features" that are considered altogether (not individually) for every prior sale.
This corpus of historical sales is your training data. It is helpful because you know for certain how much each property sold for (and that's the thing you'd ultimately like to predict for your house). If you study this enough, you might get an intuition about how the prices of houses are driven strongly by some features (for instance, the number of bedrooms) and that other features (perhaps the age of the home) may not affect the pricing much. This is a concept called "feature importance" that will be visited again in a later chapter.
Armed with enough training data, you might have a good idea of what your home should be priced at, given that it is a three-bedroom, two-bath, 1,700-square-foot, 30-year-old home. In other words, you've constructed a model in your mind based on your research of comparable homes that have sold in the last year or so. If the past sales are the "training data," your home's specifications (bedrooms, bathrooms, and so on) are the feature vector that will define the expected price, given the "model" that you've learned.
Your simple mental model is obviously not as rigorous as one that could be constructed through ML-based regression analysis with dozens of relevant input features, but this simple analogy hopefully cements the idea of the process that is followed in learning from prior, known situations, and then applying that knowledge to a present, novel situation.
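The "comparable homes" mental model can itself be written down as a tiny program: find the prior sales most similar to your home and average their prices. The Python sketch below does exactly that. It is a nearest-neighbors illustration of the analogy, not the regression technique used by Elastic ML, and the prior sales, the feature scales, and the function name are hypothetical.

```python
# Hypothetical scales that make differences in each feature roughly comparable.
FEATURE_SCALES = {"bedrooms": 1, "bathrooms": 1, "sqft": 500, "age": 20}

def predict_price(prior_sales, home, k=3):
    """Average the prices of the k prior sales most similar to `home`."""
    def dissimilarity(sale):
        return sum(abs(sale[name] - home[name]) / scale
                   for name, scale in FEATURE_SCALES.items())
    comparables = sorted(prior_sales, key=dissimilarity)[:k]
    return sum(sale["price"] for sale in comparables) / k

# Training data: prior sales where the outcome (price) is known for certain.
prior_sales = [
    {"bedrooms": 3, "bathrooms": 2, "sqft": 1650, "age": 28, "price": 310_000},
    {"bedrooms": 3, "bathrooms": 2, "sqft": 1800, "age": 35, "price": 325_000},
    {"bedrooms": 3, "bathrooms": 1, "sqft": 1500, "age": 40, "price": 280_000},
    {"bedrooms": 2, "bathrooms": 1, "sqft": 1000, "age": 50, "price": 200_000},
    {"bedrooms": 5, "bathrooms": 3, "sqft": 3000, "age": 5, "price": 520_000},
    {"bedrooms": 4, "bathrooms": 2, "sqft": 2200, "age": 15, "price": 400_000},
]

# The feature vector for the home described in the text.
your_home = {"bedrooms": 3, "bathrooms": 2, "sqft": 1700, "age": 30}

estimated_price = predict_price(prior_sales, your_home)
print(f"suggested list price: {estimated_price:,.0f}")
```

All four features are weighed together when judging similarity, which mirrors the point that the factors are considered altogether rather than individually.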
To summarize what we discussed in this chapter, we covered the genesis story of ML in IT, born out of the necessity to automate analysis of the massive, ever-expanding volume of data collected within enterprise environments. We also got a more intuitive understanding of the different types of ML in Elastic ML, which include both unsupervised anomaly detection and supervised data frame analytics.
As we journey through the rest of the chapters, we will often be mapping the use cases of the problems we're trying to solve to the different modes of operation of Elastic ML.
Remember that if the data is a time series, meaning that it comes into existence routinely over time (metric/performance data, log files, transactions, and so on), it is quite possible that Elastic ML's anomaly detection is all you'll ever need. As you'll see, it is incredibly flexible and easy to use and accomplishes many use cases on a broad variety of data. It's kind of a Swiss Army knife! A large amount of this book (Chapters 3 through 8) will be devoted to how to leverage anomaly detection (and the ancillary capability of forecasting) to get the most out of your time series data that is in the Elastic Stack.
If you are more interested in finding unusual entities within a population/cohort (User/Entity Behavior), you might have a tricky decision between using population analysis in anomaly detection versus outlier detection in data frame analytics. The primary factor may be whether or not you need to do this in near real time, in which case you would likely choose population analysis. If near real time is not necessary and/or if you require the consideration of multiple features simultaneously, you would choose outlier detection. See Chapter 10, Outlier Detection, for more detailed information about the comparison and benefits of each approach.
That leaves many other use cases that require a multivariate approach to modeling. This would not only align with the previous example of real estate pricing but also encompass the use cases of language detection, customer churn analysis, malware detection, and so on. These will fall squarely in the realm of the supervised ML of data frame analytics and be covered in Chapters 11 through 13.
In the next chapter, we will get down and dirty with understanding how to enable Elastic ML and how it works in an operational sense. Buckle up and enjoy the ride!