A decade ago, the idea of using machine learning (ML)-based technology in IT operations or IT security seemed a little like science fiction. Today, however, it is one of the most common buzzwords used by software vendors. Clearly, there has been a major shift in both the perception of the need for the technology and the capabilities that the state-of-the-art implementations of the technology can bring to bear. This evolution is important to understand to fully appreciate how Elastic's ML came to be and what problems it was designed to solve.
This chapter is dedicated to reviewing the history and concepts behind how Elastic's ML works. If you are uninterested and want to jump right into the installation and usage of the product, feel free to skip to Chapter 2, Installing the Elastic Stack with ML.
IT application support specialists and application architects have a demanding job with high expectations. Not only are they tasked with moving new and innovative projects into place for the business, but they also have to keep currently deployed applications up and running as smoothly as possible. Today's applications are significantly more complicated than ever before—they are highly componentized, distributed, and possibly virtualized. They could be developed using Agile, or by an outsourced team. Plus, they are most likely constantly changing. Some DevOps teams claim they can typically make more than a hundred changes per day to a live production system. Trying to understand a modern application's health and behavior is like a mechanic trying to inspect an automobile while it is moving.
IT security operations analysts have similar struggles in keeping up with day-to-day operations, but they obviously have a different focus of keeping the enterprise secure and mitigating emerging threats. Hackers, malware, and rogue insiders have become so ubiquitous and sophisticated that the prevailing wisdom is that there is no longer a question of if an organization will be compromised—it's more a question of when they will find out about it. Clearly, knowing about it as early as possible (before too much damage is done) is far preferable to learning about it for the first time from law enforcement or the evening news.
So, how can they be helped? Is the crux of the problem that application experts and security analysts lack access to data to help them do their job effectively? Actually, in most cases, it is the exact opposite. Many IT organizations are drowning in data.
IT departments have invested in monitoring tools for decades and it is not uncommon to have a dozen or more tools actively collecting and archiving data that can be measured in terabytes, or even petabytes, per day. The data can range from rudimentary infrastructure- and network-level data to deep diagnostic data and/or system and application log files. Business-level key performance indicators (KPIs) could also be tracked, sometimes including data about the end user's experience. The sheer depth and breadth of data available, in some ways, is the most comprehensive that it has ever been.
To detect emerging problems or threats hidden in that data, there have traditionally been several main approaches to distilling the data into informational insights:
- Filter/search: Some tools allow the user to define searches to help trim the data down into a more manageable set. While extremely useful, this capability is most often used in an ad hoc fashion once a problem is suspected. Even then, the success of this approach usually hinges on the user knowing what they are looking for and on their level of experience—both prior knowledge from living through similar past situations and expertise in the search technology itself.
- Visualizations: Dashboards, charts, and widgets are also extremely useful to help us understand what the data has been doing and where it is trending. However, visualizations are passive and require being watched for meaningful deviations to be detected. Once the number of metrics being collected and plotted surpasses the number of eyeballs available to watch them (or even the screen real estate to display them), visual-only analysis becomes less and less useful.
- Thresholds/rules: To be proactive without requiring the data to be physically watched, many tools allow the user to define rules or conditions that get triggered upon known conditions or known dependencies between items. However, it is unlikely that you can realistically define all appropriate operating ranges or model all of the actual dependencies in today's complex and distributed applications. Plus, the amount and velocity of changes in the application or environment could quickly render any static rule set useless. Analysts found themselves chasing down many false positive alerts, setting up a boy who cried wolf paradigm that led to resentment of the tools generating the alerts and skepticism about the value that alerting could provide.
Ultimately, there needed to be a different approach—one that wasn't necessarily a complete repudiation of past techniques, but one that could bring a level of automation and empirical augmentation of the evaluation of data in a meaningful way. Let's face it, humans are imperfect—we have hidden biases and limitations in our capacity for remembering information, and we are easily distracted and fatigued. Algorithms, if designed correctly, can easily make up for these shortcomings.
ML, while a very broad topic that encompasses everything from self-driving cars to game-winning computer programs, was a natural place to look for a solution. If you realize that the majority of the requirements of effective application monitoring or security threat hunting are merely variations on the theme of "find me something that is different from normal", then the discipline of anomaly detection emerges as the natural place to begin using ML techniques to solve these problems for IT professionals.
The science of anomaly detection is certainly nothing new. Many very smart people have researched and employed a variety of algorithms and techniques over the years. However, the practical application of anomaly detection to IT data poses some interesting constraints that make otherwise academically worthy algorithms inappropriate for the job. These include the following:
- Timeliness: Notification of an outage, breach, or other significant anomalous situation should be known as quickly as possible in order to mitigate it. The cost of downtime or the risk of a continued security compromise is minimized if remedied or contained quickly. Algorithms that cannot keep up with the real-time nature of today's IT data have limited value.
- Scalability: As mentioned earlier, the volume, velocity, and variation of IT data continues to explode in modern IT environments. Algorithms that inspect this vast data must be able to scale linearly with the data to be usable in a practical sense.
- Efficiency: IT budgets are often highly scrutinized for wasteful spending, and many organizations are constantly being asked to do more with less. Tacking on an additional fleet of supercomputers to run algorithms is not practical. Rather, modest commodity hardware with typical specifications must be able to be employed as part of the solution.
- Applicability: While highly specialized data science is often the best way to solve a specific information problem, the diversity of data in IT environments drives a need for something that can be broadly applicable across the vast majority of use cases. Reusability of the same techniques is much more cost-effective in the long run.
- Adaptability: Ever-changing IT environments will quickly render a brittle algorithm useless. Constant manual training and retraining of ML models would only introduce yet another time-consuming effort that cannot be afforded.
- Accuracy: We already know that alert fatigue from legacy threshold and rule-based systems is a real problem. Swapping one false alarm generator for another will not impress anyone.
- Ease of use: Even if all of the previously mentioned constraints could be satisfied, any solution that requires an army of data scientists to implement it would be too costly and would be disqualified immediately.
So, now we are getting to the real meat of the challenge—creating a fast, scalable, accurate, low-cost anomaly detection solution that everyone will use and love because it works flawlessly. No problem!
As daunting as that sounds, Prelert Founder and CTO Steve Dodson took on that challenge back in 2010. While Steve certainly brought his academic chops to the table, the technology that would eventually become Elastic's X-Pack ML had its genesis in the throes of trying to solve real IT application problems—the first being a pesky intermittent outage in a trading platform at a major London finance company. Steve, and a handful of engineers who joined the venture, helped the bank's team use the anomaly detection technology to automatically surface only the needles in the haystacks, allowing the analysts to focus on the small set of relevant metrics and log messages that were going awry. The identification of the root cause (a failing service whose recovery caused a cascade of subsequent network problems that wreaked havoc) ultimately brought stability to the application and saved the bank from spending a large sum on the previously proposed remedy: an unplanned, costly network upgrade.
As time passed, however, it became clear that even that initial success was only the beginning. A few years and a few thousand real-world use cases later, the marriage of Prelert and Elastic was a natural one—a combination of a platform making big data easily accessible with technology that helped overcome the limitations of human analysis.
What is described in this text is the theory and operation of the technology in Elastic ML as of version 6.5.
To gain a deeper understanding of how the technology works, we will discuss the following:
- A rigorous definition of unusual with respect to the technology
- An intuitive example of learning in an unsupervised manner
- A description of how the technology models, de-trends, and scores the data
Anomaly detection is something almost all of us have a basic intuition for. Humans are quite good at pattern recognition, so it should come as no surprise that if I asked a hundred people on the street "what's unusual?" in the following graph, a vast majority (including non-technical people) would identify the spike in the green line:
Similarly, let's say we asked "what's unusual?" using the following picture:
We will, again, likely get a majority that rightly claims that the seal is the unusual thing. But people may struggle to articulate, in salient terms, the actual heuristics used to come to those conclusions.
In the first case, the heuristic used to define the spike as unusual could be stated as follows:
- Something is unusual if its behavior has significantly deviated from an established pattern or range based upon its past history
In the second case, the heuristic takes the following form:
- Something is unusual if some characteristic of that entity is significantly different than the same characteristic of the other members of a set or population
These key definitions will be relevant to Elastic ML, as they form the two main fundamental modes of operation of the anomaly detection algorithms. As we will see, the user will have control over what mode of operation is employed for a particular use case.
ML—the discipline—has many variations and techniques for the process of learning. ML—the feature in the Elastic Stack—uses a specific type, called unsupervised learning. The main attribute of unsupervised learning is that the learning occurs without anything being taught. There is no human assistance to shape the decisions of the learning; it simply does so on its own via inspection of the data it is presented with. This is slightly analogous to learning a language via immersion, as opposed to sitting down with books of vocabulary and rules of grammar.
To go from a completely naive state where nothing is known about a situation to one where predictions could be made with good certainty, a model of the situation needs to be constructed. How this model is created is extremely important, as the efficacy of all subsequent actions taken based upon this model will be highly dependent on the model's accuracy. The model will need to be flexible and continuously updated based upon new information, because that is all that it has to go on in this unsupervised paradigm.
Probability distributions can serve this purpose quite well. There are many fundamental types of distributions, but the Poisson distribution is a good one to discuss first because it is appropriate in situations where there are discrete occurrences of things with respect to time:
There are three different variants of the distribution shown here, each with a different mean (λ) and, consequently, a different most likely value of k. We can make an analogy that says that these distributions model the expected amount of postal mail that a person gets delivered to their home on a daily basis, represented by k on the x axis:
- For λ = 1, there is about a 37% chance each that zero pieces or one piece of mail is delivered on a given day. Perhaps this is appropriate for a college student who doesn't receive much postal mail.
- For λ = 4, there is about a 20% chance each that three or four pieces are received. Seemingly, this is a good model for a young professional.
- For λ = 10, there is about a 13% chance that 10 pieces are received per day—perhaps representing a larger family or at least a household that has somehow found themselves on many mailing lists!
The discrete points on each curve also give the likelihood (probability) of other values of k. As such, the model can be informative and answer questions such as "Is getting fifteen pieces of mail likely?". As we can see, it is not likely for the student (λ = 1) or the young professional (λ = 4), but it is somewhat likely for the large family (λ = 10).
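To make this concrete, here is a minimal sketch (assuming Python with SciPy is available) that evaluates the same three Poisson models and answers the fifteen-pieces-of-mail question numerically; the λ values come from the discussion above, and the code is purely illustrative—it is not part of Elastic ML.

```python
# Illustrative only: evaluate the three Poisson models from the text and ask
# how likely it is to receive fifteen pieces of mail under each of them.
from scipy.stats import poisson

for lam, household in [(1, "college student"), (4, "young professional"), (10, "large family")]:
    p15 = poisson.pmf(15, mu=lam)
    print(f"lambda={lam:>2} ({household}): P(k=15) = {p15:.6f}")

# Approximate output:
#   lambda= 1 (college student):    P(k=15) = 0.000000  -> essentially impossible
#   lambda= 4 (young professional): P(k=15) = 0.000015  -> very unlikely
#   lambda=10 (large family):       P(k=15) = 0.034718  -> somewhat likely
```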
Obviously, there was a simple declaration made here that the models shown were appropriate for the people described—but it should seem obvious that there needs to be a mechanism to learn the model for each individual situation, not just assert it. The process for learning it is intuitive.
Sticking with the postal mail analogy, it would be instinctive to realize that a method of determining what model is the best fit for a particular household could be ascertained simply by hanging out by the mailbox every day and recording what the postal carrier drops into the mailbox. It should also seem obvious that the more observations seen, the higher your confidence should be that your model is accurate. In other words, only spending 3 days by the mailbox would provide less complete information and confidence than spending 30 days, or 300 for that matter.
Algorithmically, a similar process could be designed to self-select the appropriate model based upon observations. Careful scrutiny of the algorithm's choices of the model type itself (that is, Poisson, Gaussian, log-normal, and so on) and the specific coefficients of that model type (as in the preceding example of λ) would also need to be part of this self-selection process. To do this, the appropriateness of the model is constantly evaluated. Bayesian techniques are also employed to assess the model's likely parameter values given the dataset as a whole, while tempering those decisions based upon how much information has been seen prior to a particular point in time. The ML algorithms accomplish this automatically.
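As a rough illustration of this idea (and only an illustration; Elastic ML's actual model selection and Bayesian updating are far more sophisticated), the following hypothetical Python sketch estimates λ from observations and compares two candidate model families by their log-likelihood:

```python
# Hypothetical illustration: let observed daily mail counts choose the model.
import numpy as np
from scipy import stats

observations = np.array([3, 5, 4, 4, 6, 2, 5, 4, 3, 5, 4, 6])  # made-up daily counts

# The maximum-likelihood estimate of lambda for a Poisson model is the sample mean.
lam_hat = observations.mean()

# Score two candidate model families by the total log-likelihood of the data.
poisson_ll = stats.poisson.logpmf(observations, mu=lam_hat).sum()
gaussian_ll = stats.norm.logpdf(observations,
                                loc=observations.mean(),
                                scale=observations.std(ddof=1)).sum()

print(f"lambda_hat = {lam_hat:.2f}")
print(f"Poisson  log-likelihood: {poisson_ll:.2f}")
print(f"Gaussian log-likelihood: {gaussian_ll:.2f}")
# The better-scoring family (penalized for complexity in practice) would be kept,
# and confidence in the choice grows as more observations arrive.
```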
Most importantly, the modeling that is done is continuous, so that new information is considered along with the old, with an exponential weighting to the information that is fresher. Such a model, after 60 observations, could resemble the following:
It will then look much different after 400 observations, as the data presents itself with a slew of new observations with values between 5 and 10:
Also notice that there is the potential for the model to have multiple modes, or areas/clusters of higher probability. The complexity and fidelity of the fit of the learned model (shown as the blue curve) to the theoretically ideal model (in black) matters greatly. The more accurate the model, the better the representation of the state of normal for that dataset, and thus, ultimately, the more accurate the prediction of how future values comport with the model.
The continuous nature of the modeling also drives the requirement that this model be capable of serialization to long-term storage, so that if model creation/analysis is paused, it can be reinstated and resumed at a later time. As we will see, the operationalization of this process of model creation, storage, and utilization is a complex orchestration, which is fortunately handled automatically by ML.
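The following toy sketch conveys the spirit of weighting fresher information more heavily, using a simple exponentially weighted running mean; the real modeling in Elastic ML maintains full probability distributions and serialized model state, so this is only a conceptual analogy:

```python
# Toy analogy only: give fresher observations more weight with an exponentially
# weighted running mean. Elastic ML maintains and serializes full statistical
# models; this merely shows the "fresher data counts more" idea.
def update_estimate(current, new_value, alpha=0.1):
    """Blend a new observation into the running estimate.
    A higher alpha forgets the past faster."""
    if current is None:
        return float(new_value)
    return (1 - alpha) * current + alpha * new_value

estimate = None
for value in [4, 5, 3, 4, 6, 5, 9, 8, 9, 10, 9, 8]:  # the data drifts upward over time
    estimate = update_estimate(estimate, value)
    print(f"observed {value:>2} -> running estimate {estimate:.2f}")
```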
Another important aspect of faithfully modeling real-world data is accounting for the prominent trends and periodic patterns that naturally occur. Does the data ebb and flow hourly and/or daily, with more activity during business hours or business days? If so, then this needs to be accounted for. ML automatically hunts for prominent trends in the data (linear growth, cyclical harmonics, and so on) and factors them out. Let's observe the following graph:
Here, the periodic daily cycle is learned and then factored out. The model's prediction boundaries (represented by the light blue envelope around the dark blue signal) adjust dramatically after three successive iterations of that cycle have been automatically detected.
Therefore, as more data is observed over time, the models gain accuracy both from the probability distribution function becoming more mature and from the de-trending of other patterns that might not emerge for days or weeks.
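To illustrate the de-trending concept (this is a hypothetical sketch, not how Elastic ML is implemented), the following Python example estimates a daily cycle from synthetic hourly data and subtracts it, so that a genuine anomaly stands out in the residuals:

```python
# Hypothetical sketch: learn an hourly "profile" of a daily cycle and subtract it,
# leaving residuals in which a genuine anomaly stands out. Elastic ML detects and
# removes periodic components automatically; this only illustrates the concept.
import numpy as np

rng = np.random.default_rng(42)
hours = np.arange(24 * 14)                                   # two weeks of hourly samples
series = 50 + 30 * np.sin(2 * np.pi * (hours % 24) / 24)     # a strong daily cycle
series = series + rng.normal(0, 3, size=hours.size)          # measurement noise
series[200] += 80                                            # inject one real anomaly

hour_of_day = hours % 24
cycle = np.array([series[hour_of_day == h].mean() for h in range(24)])  # learned cycle
residuals = series - cycle[hour_of_day]                                 # de-trended data

print("largest residual is at index", int(np.argmax(np.abs(residuals))))  # expect ~200
```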
Once a model has been constructed, the likelihood of any future observed value can be found within the probability distribution. Earlier, we asked the question, "Is getting fifteen pieces of mail likely?". This question can now be empirically answered, depending on the model, with a number between zero (no possibility) and one (absolute certainty). ML will use the model to calculate this fractional value out to approximately 300 significant figures (which can be helpful when dealing with very low probabilities). Let's observe the following graph:
Here, the probability of observing the actual value of 921 at this point in time was calculated to be 6.3634e-7 (or, put another way, a mere 0.000063634% chance). This very small value is perhaps not that intuitive to most people. As such, ML will take this probability calculation and, via a process of quantile normalization, recast that observation on a severity scale between 0 and 100, where 100 is the highest level of unusualness possible for that particular dataset. In the preceding case, the probability calculation of 6.3634e-7 was normalized to a score of 94. This normalized score will come in handy later as a means by which to assess the severity of the anomaly for purposes of alerting and/or triage.
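The following sketch conveys the intuition of recasting a raw probability onto a 0 to 100 severity scale by ranking it against previously seen probabilities; the history and values here are made up, and Elastic ML's actual quantile normalization is more sophisticated and adapts over time:

```python
# Intuition only: rank a new bucket's probability against a history of previously
# seen probabilities to produce a 0-100 severity score. The history is fabricated;
# Elastic ML's real quantile normalization is more sophisticated.
import numpy as np

def severity_score(probability, historical_probabilities):
    """The rarer the probability relative to history, the closer the score is to 100."""
    history = np.asarray(historical_probabilities)
    fraction_more_likely = np.mean(history > probability)
    return round(100 * float(fraction_more_likely), 1)

history = np.random.uniform(1e-4, 1.0, size=10_000)   # fabricated past bucket probabilities
print(severity_score(6.3634e-7, history))              # ~100.0: extremely unusual
print(severity_score(0.9, history))                    # ~10.0: unremarkable
```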
While Chapter 2, Installing the Elastic Stack with Machine Learning, will focus on the installation and setup of the product itself, it is good to understand a few key concepts of how ML works from a logistical perspective—where things run and when—and which processes and indices are involved in this complex orchestration.
In Elastic's ML, the job is the unit of work, similar to what a watch is for Elastic's alerting. As we will see in more depth later, the main configuration elements of a job are as follows:
- Job name/ID
- Analysis bucketization window (the Bucket span)
- The definition and settings for the query to obtain the raw data to be analyzed (the datafeed)
- The anomaly detection configuration recipe (the Detector)
ML jobs are independent and autonomous. Multiples can be running at once, doing independent things and analyzing data from different indices. Jobs can analyze historical data, real-time data, or a mixture of the two. Jobs can be created using the Machine Learning UI in Kibana, or programmatically via the API. They also require ML-enabled nodes.
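As a sketch of what creating such a job programmatically might look like (the job ID, index, field names, and credentials are hypothetical placeholders; in the 6.x releases the ML APIs live under the _xpack/ml path, whereas later releases use _ml), consider the following:

```python
# Sketch of creating a simple job via the ML API. The job ID, field names, and
# credentials are hypothetical. The 6.x endpoint prefix is _xpack/ml; newer
# releases use _ml instead.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")

job_config = {
    "description": "Count of web requests over time",
    "analysis_config": {
        "bucket_span": "15m",                        # the analysis bucketization window
        "detectors": [{"function": "count"}]         # the anomaly detection "recipe"
    },
    "data_description": {"time_field": "@timestamp"}
}

# Create the job (the datafeed that supplies its data is covered later in this chapter).
resp = requests.put(f"{ES}/_xpack/ml/anomaly_detectors/web-traffic-count",
                    json=job_config, auth=AUTH)
print(resp.json())
```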
First and foremost, since Elasticsearch is, by nature, a distributed multi-node solution, it is only natural that the ML feature of the Elastic Stack works as a native plugin that obeys many of the same operational concepts. As described in the documentation, ML can be enabled on any or all nodes, but it is a best practice in a production system to have dedicated ML nodes. This is helpful to optimize the types of resources specifically required by ML. Unlike data nodes that are involved in a fair amount of I/O load due to indexing and searching, ML nodes are more compute and memory intensive. With this knowledge, you can size the hardware appropriately for dedicated ML nodes.
One key thing to note: the ML algorithms do not run in the Java Virtual Machine (JVM). They are C++-based executables that use the RAM left over after whatever is allocated for the JVM heap. When running a job, the main process that invokes the analysis (called autodetect) can be seen in the process list:
View of top processes when an ML job is running
There will be one autodetect process for every actively running ML job. In multi-node setups, ML will distribute the jobs to each of the ML-enabled nodes to balance the load of the work.
Bucketing input data is an important concept to understand in ML. Set with a key parameter at the job level called bucket_span, the input data from the datafeed (described next) is collected into mini batches for processing. Think of the bucket span as a pre-analysis aggregation interval—the window of time over which a portion of the data is aggregated for the purposes of analysis. The shorter the duration of the bucket_span, the more granular the analysis, but also the higher the potential for noisy artifacts in the data.
The following graph shows the same dataset aggregated over three different intervals:
Notice that the prominent anomalous spike seen in the version aggregated over the 5-minute interval becomes all but lost if the data is aggregated over a 60-minute interval, because of the spike's short (<2 minute) duration. In fact, at the 60-minute interval, the spike doesn't even seem that anomalous anymore.
This is a practical consideration when choosing the bucket_span. On one hand, a shorter aggregation period is helpful because it increases the frequency of the analysis (and thus reduces the time to notification if there is something anomalous), but making it too short may highlight features in the data that you don't really care about. If the brief spike shown in the preceding data is a meaningful anomaly for you, then the 5-minute view of the data is sufficient. If, however, a very brief perturbation of the data seems like an unnecessary distraction, then avoid a low value of bucket_span.
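The following illustrative sketch, using synthetic per-minute counts, shows how a short spike that dominates a 5-minute bucket can become far less prominent when the same data is summed into 60-minute buckets:

```python
# Synthetic illustration: a 2-minute spike is obvious in 5-minute buckets but far
# less prominent when the same per-minute counts are summed into 60-minute buckets.
import numpy as np

rng = np.random.default_rng(7)
counts = rng.poisson(lam=20, size=6 * 60)       # six hours of per-minute event counts
counts[200:202] += 300                          # a short, sharp spike

def bucketize(values, span_minutes):
    """Sum per-minute values into buckets of span_minutes."""
    n = len(values) // span_minutes
    return values[:n * span_minutes].reshape(n, span_minutes).sum(axis=1)

print("max 5-minute bucket: ", bucketize(counts, 5).max())    # spike stands out (~700 vs ~100)
print("max 60-minute bucket:", bucketize(counts, 60).max())   # diluted (~1,800 vs ~1,200 baseline)
```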
ML obviously needs data to analyze (and use to build and mature the statistical models). This data comes from your time series indices in Elasticsearch. The datafeed is the mechanism by which this data is retrieved (searched) on a routine basis and presented to the ML algorithms. Its configuration is mostly obscured from the user, except in the case of the creation of an advanced job in the UI (or by using the ML API). However, it is important to understand what the datafeed is doing behind the scenes.
Similar to the concept of a watch input in alerting, the datafeed will routinely query for data against the index that contains the data to be analyzed. How often the datafeed queries (and how much data at a time) depends on a few factors, which are described in the following list (a configuration sketch follows the list):
- bucket_span: We have already established that bucket_span controls the width of the ongoing analysis window. Therefore, the job of the datafeed is to make sure that the buckets are full of chronologically ordered data. You can therefore see that the datafeed will make a date range query to Elasticsearch.
- frequency: A parameter that controls how often the raw data is physically queried. If the bucket_span is between 2 and 20 minutes, the frequency will, by default, equal the bucket_span (as in, query every 5 minutes for the last 5 minutes' worth of data). If the bucket_span is longer, the frequency will default to a smaller number (that is, more frequent querying) so that the overall long interval is not queried all at once. This is helpful if the dataset is rather voluminous. In other words, the interval of a long bucket_span will be chopped up into smaller intervals simply for the purposes of querying.
- query_delay: This controls the amount of time "behind now" that the datafeed should query for a bucket span's worth of data. The default is 60s. Therefore, with a bucket_span value of 5m and a query_delay value of 60s, at 12:01 PM the datafeed will request data in the range of 11:55 AM to 12:00 PM. This small delay accommodates latency in the ingest pipeline and ensures that no data is excluded from the analysis simply because its ingestion was delayed for any reason.
- scroll_size: In most cases, the type of search that the datafeed executes against Elasticsearch uses the scroll API. The scroll size defines how much data the datafeed requests from Elasticsearch at a time. For example, if the datafeed is set to query for log data every 5 minutes, but a typical 5-minute window contains 1 million events, scrolling the data means that not all 1 million events are fetched with one giant query. Rather, they are retrieved with many queries in increments of scroll_size. By default, this scroll size is set conservatively to 1,000. So, to get 1 million records returned to ML, the datafeed will ask Elasticsearch for 1,000 rows, a thousand times. Increasing scroll_size to 10,000 reduces the number of scrolls to a hundred. In general, beefier clusters should be able to handle a larger scroll_size and thus be more efficient in the overall process.
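Putting these parameters together, a datafeed configuration might look like the following sketch (the job ID, index pattern, and values are hypothetical; in 6.x the datafeed is created with PUT _xpack/ml/datafeeds/<datafeed_id>, while newer releases use _ml/datafeeds):

```python
# Sketch of a datafeed configuration using the parameters just described; the job ID,
# index pattern, and values are hypothetical placeholders.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")

datafeed_config = {
    "job_id": "web-traffic-count",
    "indices": ["web-logs-*"],          # where the raw data to analyze lives
    "query": {"match_all": {}},         # the search used to retrieve that data
    "frequency": "150s",                # how often the raw data is physically queried
    "query_delay": "60s",               # query this far "behind now" to tolerate ingest lag
    "scroll_size": 10000                # documents fetched per scroll request
}

resp = requests.put(f"{ES}/_xpack/ml/datafeeds/datafeed-web-traffic-count",
                    json=datafeed_config, auth=AUTH)
print(resp.json())
```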
There is an exception, however, in the case of a single metric job. The single metric job (described more later) is a simple ML job that allows only one time series metric to be analyzed. In this case, the scroll API is not used to obtain the raw data—rather, the datafeed will automatically create a query aggregation (using the date_histogram aggregation). This aggregation technique can also be used for an advanced job, but it currently requires direct editing of the job's JSON configuration and should be reserved for expert users.
For Elastic's ML to function, there are several supporting indices that exist and serve specific purposes. We will look at the following indices and describe their roles:
The .ml-state index is the place where ML keeps the internal information about the statistical models that have been learned for a specific dataset, plus additional logistical information. This index is not meant to be understandable by a user—it is the backend algorithms of ML that will read and write entries in this index.
Information in the .ml-state index is compressed and is a small fraction of the size of the raw data that the ML jobs are analyzing.
The .ml-notifications index stores the audit messages for ML that appear in the Job messages section of the Job Management page of the UI:
Audit messages for a particular job in the ML UI
These messages convey the basic information about the job's creation and activity. Additionally, basic operational errors can be found here. Detailed information about the execution of ML jobs, however, is found in the elasticsearch.log file.
The .ml-anomalies-* indices contain the detailed results of ML jobs. There is a single .ml-anomalies-shared index that can contain information from multiple jobs (keyed with the job_id field). If the user chooses to Use a dedicated index in the user interface when creating a job (or sets the results_index_name when using the API), then a dedicated results index for that job will be created.
These indices are instrumental in leveraging the output of the ML algorithms. All information displayed in the ML UI will be driven from this result data. Additionally, proactive alerting on anomalies will be accomplished by having watches configured against these indices. More information on this will be presented in Chapter 6, Alerting on ML Analysis.
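For example, a watch (or any other client) could search the results indices directly for severe anomalies; the following sketch assumes a hypothetical job ID and threshold, and uses the record-level result fields (result_type, record_score) written by ML:

```python
# Sketch of searching the results indices directly for severe anomalies (for example,
# as the basis of an alerting watch). The job ID and score threshold are hypothetical;
# result_type and record_score are fields written by ML for record-level results.
import requests

ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"job_id": "web-traffic-count"}},
                {"term": {"result_type": "record"}},
                {"range": {"record_score": {"gte": 75}}}     # only severe anomalies
            ]
        }
    },
    "sort": [{"record_score": {"order": "desc"}}]
}

resp = requests.post(f"{ES}/.ml-anomalies-*/_search", json=query, auth=AUTH)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["timestamp"], hit["_source"]["record_score"])
```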
ML sequences all of these pieces together when an ML job is configured to run. A simplified version of this process is shown in the following diagram:
In general, the preceding procedures are done once per bucket_span—however, additional optimizations are done to minimize I/O. Those details are beyond the scope of this book. The key takeaway, however, is that this orchestration enables ML to be online (that is, not offline/batch) and constantly learning on newly ingested data. This process is also automatically handled by ML so that the user doesn't have to worry about the complex logistics required to make it all happen.
Now that we have an understanding of both the theoretical and practical operation of Elastic's ML, we can focus our efforts on getting it properly installed and applying it to different use cases. The following chapters will lead us on a journey of solving some real-world problems in IT operations and IT security with Elastic's state-of-the-art automated anomaly detection.