You're reading from Getting Started with Elastic Stack 8.0

Product type Book

Published in Mar 2022

Publisher Packt

ISBN-13 9781800569492

Pages 474 pages

Edition 1st Edition

Author (1):

Asjad Athick

Collecting and ingesting data

So far, we have looked at Elasticsearch, a scalable search and analytics engine for all kinds of data. We also have Kibana, which interfaces with Elasticsearch to help us explore and use our data effectively. The final capability needed to make it all work together is ingestion.

The Elastic Stack provides two products for ingestion, depending on your use cases.

Collecting data from across your environment using Beats

Useful data is generated all over the place in present-day environments, often from varying technology stacks, as well as legacy and new systems. As such, it makes sense to collect data directly from, or closer to, the source system and ship it into your centralized logging or analytics platform. This is where Beats come in; Beats are lightweight applications (also referred to as agents) that can collect and ship several types of data to destinations such as Elasticsearch, Logstash, or Kafka.

Elastic offers a few types of Beats today for various use cases:

Filebeat: Collecting log data
Metricbeat: Collecting metric data
Packetbeat: Decoding and collecting network packet metadata
Heartbeat: Collecting system/service uptime and latency data
Auditbeat: Collecting OS audit data
Winlogbeat: Collecting Windows event, application, and security logs
Functionbeat: Running data collection on serverless compute infrastructure such as AWS Lambda

Beats use an open source library called libbeat that provides generic APIs for configuring inputs and destinations for data output. Beats implement the data collection functionality that's specific to the type of data (such as logs and metrics) that they collect. A range of community-developed Beats are available, in addition to the officially produced Beats agents.

Beats modules and the Elastic Common Schema

The modules that are available in Beats allow you to collect consistent datasets. They also distribute out-of-the-box dashboards, machine learning jobs, and alerts that users can leverage in their use cases.

Importance of a unified data model

One of the most important aspects of ingesting data into a centralized logging platform is paying attention to the data format in use. A Unified Data Model (UDM) is an especially useful tool to have, ensuring data can be easily consumed by end users once ingested into a logging platform. Enterprises typically follow a mixture of two approaches to ensure the log data complies with their unified data model:

Enforcing a logging standard or specification for log-producing applications in the company.

This approach is often considerably costly to implement, maintain, and scale. Changes in the log schema at the source can also have unintended downstream implications in other applications consuming the data. It is common to see UDMs evolve quite rapidly as the nature and the content of the logs that have been collected change. The use of different technology stacks or frameworks in an organization can also make it challenging to log with consistency and uniformity across the environment.

Transforming/renaming fields in incoming data using an ETL tool such as Logstash to comply with the UDM. Organizations can achieve relatively successful results using this approach, with considerably fewer upfront costs when reworking logging formats and schemas. However, the approach does come with some downsides:

(a) Parsers need to be maintained and constantly updated to make sure the logs are extracted and stored correctly.

(b) Most of the parsing work usually needs to be done (or overseen) by a central function in the organization (because of the centralized nature of the transformation), rather than by following a self-service or DevOps-style operating model.
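The second approach, renaming fields in incoming data so that they comply with the UDM, can be illustrated with a minimal Python sketch. This is not Logstash configuration; the source field names, the UDM field names, and the `to_udm` helper are all hypothetical and exist only to show the idea.

```python
from typing import Any, Dict

# Hypothetical mapping from one application's field names to UDM field names.
FIELD_RENAMES: Dict[str, str] = {
    "src": "source_ip",
    "status": "http_status_code",
    "bytes_sent": "response_bytes",
}


def to_udm(event: Dict[str, Any], renames: Dict[str, str] = FIELD_RENAMES) -> Dict[str, Any]:
    """Return a copy of the event with known fields renamed to their UDM names.

    Fields without a rename rule are passed through unchanged.
    """
    return {renames.get(name, name): value for name, value in event.items()}


raw_event = {"src": "10.0.0.5", "status": 500, "bytes_sent": 1024, "message": "upstream timeout"}
udm_event = to_udm(raw_event)
print(udm_event)
```

The sketch also hints at downside (a): if the application later renames `status` to something else, the rule no longer matches and the field passes through under its new name, so the mapping has to be updated alongside the source.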

Elastic Common Schema

The Elastic Common Schema (ECS) is a unified data model set by Elastic. Following the ECS specification has a few advantages over a custom or internal UDM:

ECS sets Elasticsearch index mappings for fields. This is important so that metric aggregations and ranges can be applied properly to data. Numeric fields such as the number of bytes received as part of a network request should be mapped as an integer value. This allows a visualization to sum the total number of bytes received over a certain period. Similarly, the HTTP status code field needs to be mapped as a keyword so that a visualization can count how many 500 errors the application encountered.
Out-of-the-box content such as dashboards, visualizations, machine learning jobs, and alerts can be used if your data is ECS-compliant. Similarly, you can consume content from and share it with the open source community.
You can still add your own custom or internal fields to ECS by following the naming conventions that have been defined as part of the ECS specification. You do not have to use just the fields that are part of the ECS specification.
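To see why field types matter for the first point, consider the following plain Python sketch. It does not call Elasticsearch; it only mimics the two kinds of aggregation described above on a handful of made-up events. The field names are modelled on ECS-style dotted names, and the status code is stored as a string to mirror the keyword treatment described in the text; both choices are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, already-parsed events using ECS-style field names.
events = [
    {"http.response.bytes": 512, "http.response.status_code": "200"},
    {"http.response.bytes": 2048, "http.response.status_code": "500"},
    {"http.response.bytes": 1024, "http.response.status_code": "500"},
]

# Numeric field: a metric aggregation such as a sum is meaningful.
total_bytes = sum(event["http.response.bytes"] for event in events)

# Categorical field: counting occurrences per distinct value is meaningful.
status_counts = Counter(event["http.response.status_code"] for event in events)

print(total_bytes)
print(status_counts["500"])
```

If the byte count had been stored as text instead, the sum would not be possible without converting every value first, which is the kind of problem a consistent mapping avoids.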

Beats modules

Beats modules can automatically convert logs and metrics from various supported data sources into an ECS-compliant schema. Beats modules also ship with out-of-the-box dashboards, machine learning jobs, and alerts. This makes it incredibly easy to onboard a new data source onto Elasticsearch using a Beat and to immediately consume this data as part of a value-adding use case in your organization. There is a growing list of supported Filebeat and Metricbeat modules available on the Elastic integration catalog.

Onboarding and managing data sources at scale

Managing a range of Beats agents can come with significant administrative overhead, especially in large and complex environments. Onboarding new data sources would require updating configuration files, which then need to be deployed on the right host machine.

Elastic Agent is a single, unified agent that can be used to collect logs, metrics, and uptime data from your environment. Elastic Agent orchestrates the core Beats agents under the hood but simplifies the deployment and configuration process, given teams now need to manage just one agent.

Elastic Agent can also work with a component called Fleet Server to simplify the ongoing management of the agents and the data they collect. Fleet can be used to centrally push policies to control data collection and manage agent version upgrades, without any additional administrative effort. We will take a look at Elastic Agent in more detail in Chapter 9, Managing Data Onboarding with Elastic Agent.

Centralized extraction, transformation, and loading of your data with Logstash

While Beats make it very convenient to onboard a new data source, they are designed to be lightweight in terms of performance footprint. As such, Beats do not provide a great deal of heavy processing, transformation, and enrichment capabilities. This is where Logstash comes in to help your ingestion architecture.

Logstash is a general-purpose ETL tool designed to input data from any number of source systems/communication protocols. The data is then processed through a set of filters, where you can mutate, add, enrich, or remove fields as required. Finally, events can be sent to several destination systems. This configuration is defined as a Logstash pipeline (sometimes referred to as a parser). We will dive deeper into Logstash and how it can be used for various ETL use cases in Chapter 7, Using Logstash to Extract, Transform, and Load Data.
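The input, filter, and output flow described above can be sketched conceptually in Python. This is not how Logstash is configured or implemented; every function name here is hypothetical, the log format is made up, and a plain list stands in for a destination system.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Event = Dict[str, Any]
Filter = Callable[[Event], Event]


def read_input(lines: Iterable[str]) -> Iterator[Event]:
    """Input stage: wrap each raw line in an event."""
    for line in lines:
        yield {"message": line.strip()}


def parse_fields(event: Event) -> Event:
    """Filter: split 'LEVEL text' into separate fields."""
    level, _, text = event["message"].partition(" ")
    return {**event, "level": level.lower(), "text": text}


def add_environment(event: Event) -> Event:
    """Filter: enrich the event with a static field."""
    return {**event, "environment": "demo"}


def drop_raw_message(event: Event) -> Event:
    """Filter: remove a field that is no longer needed."""
    return {key: value for key, value in event.items() if key != "message"}


def run_pipeline(lines: Iterable[str], filters: List[Filter]) -> List[Event]:
    """Pass every input event through each filter in order, then collect the output."""
    output: List[Event] = []
    for event in read_input(lines):
        for apply_filter in filters:
            event = apply_filter(event)
        output.append(event)  # Output stage: a list stands in for a destination system.
    return output


processed = run_pipeline(
    ["ERROR disk full", "INFO service started"],
    [parse_fields, add_environment, drop_raw_message],
)
print(processed)
```

Keeping each filter as a small function that takes an event and returns a new one mirrors the idea that filters are applied in sequence and can be added, removed, or reordered independently.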

Deciding between using Beats and Logstash

Beats and Logstash are designed to serve specific requirements when collecting and ingesting data. Users are often unsure whether to choose Beats or Logstash when onboarding a new data source. The following lists aim to make this clearer.

When to use Beats

Beats should be used when the following points apply to your use case:

When you need to collect data from a large number of hosts or systems from across your environment. Some examples are as follows:

(a) Collecting web logs from a dynamic group of hundreds of web servers

(b) Collecting logs from a large number of microservices running on a container orchestration platform such as Kubernetes

When there is a supported Beats module available.
When you do not need to perform a significant amount of transformation/processing before consuming data on Elasticsearch.
When you are consuming from a web source and have no scaling/throughput concerns for a single Beat instance.

When to use Logstash

Logstash should be used when you have the following requirements:

When a large amount of data is consumed from a centralized location (such as a file share, AWS S3, Kafka, or AWS Kinesis) and you need to be able to scale ingestion throughput.
When you need to transform data considerably or parse complex schemas/codecs, especially using regular expressions or Grok.
When you need to be able to load balance ingestion across multiple Logstash instances.
When a supported Beats module is not available.

It is worth noting that Beats agents are continually updated and enhanced with every release. The gap between the capabilities of Logstash and Beats has closed considerably over the last few releases.
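The two checklists can be condensed into a small decision helper. The following Python sketch is only one possible reading of the criteria listed above; the `SourceProfile` fields and the `suggest_ingest_tool` function are hypothetical, and a real decision would weigh more factors than four booleans.

```python
from dataclasses import dataclass


@dataclass
class SourceProfile:
    """Hypothetical description of a data source to onboard."""
    many_distributed_hosts: bool
    beats_module_available: bool
    heavy_transformation: bool
    centralized_high_volume: bool


def suggest_ingest_tool(profile: SourceProfile) -> str:
    """Return 'Beats', 'Logstash', or 'Beats + Logstash' based on the checklists."""
    needs_logstash = (
        profile.heavy_transformation
        or profile.centralized_high_volume
        or not profile.beats_module_available
    )
    needs_beats = profile.many_distributed_hosts
    if needs_beats and needs_logstash:
        return "Beats + Logstash"
    if needs_logstash:
        return "Logstash"
    return "Beats"


# Example: hundreds of web servers with a supported module and light processing.
web_servers = SourceProfile(
    many_distributed_hosts=True,
    beats_module_available=True,
    heavy_transformation=False,
    centralized_high_volume=False,
)
print(suggest_ingest_tool(web_servers))
```

The combined result covers cases where data must be collected from many hosts but also needs heavy processing, which is why organizations often run both tools side by side.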

Using Beats and Logstash together

It is quite common for organizations to get the best of both worlds by leveraging Beats and Logstash together. This allows data to be collected from a range of sources while enabling centralized processing and transformation of events.

Now that we understand how we can ingest data into the Elastic Stack, let's look at the options that are available when running the stack.