What is Azure Data Explorer?

There is a good chance you have already used ADX to some degree without realizing it. If you have used Azure Security Center, Azure Sentinel, Application Insights, Resource Graph Explorer, or enabled diagnostics on your Azure resources, then you have used ADX. All these services rely on Log Analytics, which is built on top of ADX.

Like many tools and products, ADX began circa 2015 with a small group of engineers trying to solve a problem. Developers on Microsoft's Power BI team needed a high-performing big data solution to ingest and analyze their logging and telemetry data, and, being engineers, they built their own when they could not find a service that met their needs. The result was Azure Data Explorer, also known as Kusto.

So, what is ADX? It is a fully managed, append-only, columnar-store big data service capable of elastic scaling and of ingesting hundreds of billions of records per day!

Before moving on to the ADX features, it is important to understand what is meant by PaaS and the other as a service cloud offerings. Understanding the different offerings will help clarify what you and the cloud provider – in our case, Microsoft – are responsible for.

When you strip away the marketing terms, cloud computing is essentially a data center that is managed for you and has the same layers or elements as an on-premises data center, for example, hardware, storage, and networking.

Figure 1.3 shows the common layers and elements of a data center. The items in white are managed by you, the customer, and the items in gray are managed by the cloud provider:

Figure 1.3 – Cloud offerings

In the case of on-premises, you are responsible for everything, from renting the building and ventilation to physical networking and running your applications. Public cloud providers offer three fundamental cloud offerings, known as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The provider typically offers a lot more services, such as Azure App Service, but these additional services are built on top of the aforementioned fundamental services.

In the case of ADX, which is a PaaS service, Microsoft manages all layers except the data and application layers. You are responsible for the data layer, that is, data ingestion, and the application layer, that is, writing your KQL queries and creating dashboards.

ADX features

Let's look at some of the key features ADX provides. Most of the features will be discussed in detail in later chapters:

  • Low-latency ingestion and elastic scaling: ADX can ingest structured, semi-structured, and unstructured data at speeds of up to 200 MB/s (megabytes per second) per node. The vertical and horizontal scaling capabilities of ADX enable it to ingest petabytes of data.
  • Time series analysis: As we will see in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data, ADX supports near real-time monitoring, and combined with the powerful KQL, we can search for anomalies and trends within our data (a short anomaly detection sketch follows this list).
  • Fully managed (PaaS): All the infrastructure, operating system patching, and software updates are taken care of by Microsoft. You can focus on developing your product rather than running a big data platform. You can be up and running in three steps (a minimal end-to-end sketch follows this list):
    • Create a cluster and database (more details in Chapter 2, Building Your Azure Data Explorer Environment).
    • Ingest data (more details in Chapter 4, Ingesting Data in Azure Data Explorer).
    • Explore your data using KQL (more details in Chapter 5, Introducing the Kusto Query Language).
  • Cost-efficient: Like other Azure services, Microsoft provides a pay-as-you-consume model. For more advanced use cases, there is also the option of purchasing reserved instances, which require upfront payments.
  • High availability: Microsoft provides an uptime SLA of 99.9% and supports Availability Zones, which ensures your infrastructure is deployed across multiple physical data centers within an Azure region.
  • Rapid ad hoc query performance: Thanks to architecture decisions discussed in the next section, ADX can query billions of records containing structured, semi-structured, and unstructured data and return results within seconds. ADX is also designed to execute distributed queries across multiple clusters (a cross-cluster query sketch follows this list), which we will see later in the book.
  • Security: We will be covering security in depth in Chapter 10, Azure Data Explorer Security. For now, suffice it to say that ADX supports encryption both at rest and in transit and role-based access control (RBAC), and it allows you to restrict public access to your clusters by deploying them into virtual networks (VNets) and blocking traffic using network security groups (NSGs).
  • Enables custom solutions: Allows developers to build analytics services on top of ADX.
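
To give a flavor of the time series capabilities covered in Chapter 7, here is a minimal KQL sketch that builds a daily series and flags anomalous points. The Telemetry table and its Timestamp and RequestCount columns are hypothetical names used purely for illustration; make-series and series_decompose_anomalies are built-in KQL capabilities:

    // Hypothetical Telemetry table: build a daily series of request counts
    // over the last 30 days and flag anomalous points.
    Telemetry
    | where Timestamp > ago(30d)
    | make-series DailyRequests = sum(RequestCount) default = 0
        on Timestamp step 1d
    | extend (AnomalyFlags, AnomalyScore, Baseline) =
        series_decompose_anomalies(DailyRequests, 1.5)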
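
As an illustration of the three steps above, the following sketch assumes you already have a cluster and database (Chapter 2 covers this) and uses a hypothetical Logs table. It combines standard management commands (.create table, .ingest inline) with a query; run each statement separately:

    // Step 1 is done in the Azure portal: create a cluster and database.
    // Step 2: create a table and ingest a couple of rows inline
    //         (Logs is a hypothetical table used purely for illustration).
    .create table Logs (Timestamp: datetime, Level: string, Message: string)

    .ingest inline into table Logs <|
    2022-03-01T10:00:00Z,INFO,Service started
    2022-03-01T10:05:00Z,ERROR,Connection timeout

    // Step 3: explore the data with KQL.
    Logs
    | where Level == "ERROR"
    | summarize ErrorCount = count() by bin(Timestamp, 1h)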
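
The distributed query capability mentioned above relies on the cluster() and database() functions to reference data in other clusters. A minimal sketch, using Microsoft's public help cluster and its Samples database, could look like this:

    // Count the rows of a table that lives in another cluster, using the
    // public help cluster and its Samples database as an example.
    cluster('help').database('Samples').US_States
    | count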

If you are familiar with database products such as MySQL, MS SQL Server, and Azure SQL, then the core components will be familiar to you. ADX uses the concept of clusters, which can be thought of as the equivalent of an Azure SQL server and are essentially the compute, or virtual machines. Next, we have databases and tables, which map directly to databases and tables in SQL.

Figure 1.4 shows the hierarchical structure displayed in the Data Explorer UI. In this example, help is the ADX cluster and Samples is the database, which contains multiple tables such as US_States:

Figure 1.4 – Cluster, database, and tables hierarchy

A cluster or SQL server can host multiple databases, which in turn can contain multiple tables (see Figure 1.4). We will discuss tables in Chapter 4, Ingesting Data in Azure Data Explorer, where we will demonstrate how to create tables and data mappings.
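
To make the hierarchy concrete: once you connect to the help cluster in the ADX web UI (https://dataexplorer.azure.com) and select the Samples database as the query context, a query only needs to name the table. A minimal sketch:

    // With the Samples database selected as the query context,
    // preview a few rows from the US_States table.
    US_States
    | take 5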

Introducing Azure Data Explorer architecture

PaaS services are great because they allow developers to get started quickly and focus on their product rather than managing complex infrastructure. Being fully managed can also be a disadvantage, though, especially when you experience issues and need to troubleshoot; and as engineers, we tend to be curious and want to understand how things work.

As depicted in Figure 1.5, ADX contains two key services: the data management service and the engine service. Both services are clusters of compute resources that can be scaled automatically or manually, both horizontally and vertically. At the time of writing, Microsoft had recently announced its V3 engine (March 2021), which brings significant performance improvements:

Figure 1.5 – Azure Data Explorer architecture

Now, let's learn more about the data management service and the engine service depicted in the preceding diagram:

  • Data management service: The data management service is primarily responsible for metadata management and for managing the data ingestion pipelines. It ensures data is properly ingested and handed off to the engine service. Data streamed to your cluster is sent to the row store, whereas batched data is sent to the column store.
  • Engine service: The engine service, which is a cluster of compute resources, is responsible for processing the ingested data, managing the hot cache and long-term storage, and executing queries. Each engine node uses its local SSD as the hot cache and ensures the cache is used as much as possible (a caching policy sketch follows this list).
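
The hot cache mentioned above is controlled by a caching policy, which you can inspect and adjust with management commands. A minimal sketch, where MyDatabase and MyTable are hypothetical names:

    // Inspect the caching policy currently applied to a database
    // (MyDatabase and MyTable are hypothetical names).
    .show database MyDatabase policy caching

    // Keep the last 14 days of a table's data in the engine nodes' local SSD cache.
    .alter table MyTable policy caching hot = 14d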

ADX is often referred to as an append-only analytics service because ingested data is stored in immutable shards, each of which is compressed for performance reasons. Data sharding is a method of splitting data into smaller chunks. Since the data is immutable, the engine nodes can safely read the data shards knowing that other nodes in the cluster will not modify the data.
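
In ADX, these immutable shards are called extents, and you can see them behind any table with a management command. A minimal sketch, where MyTable is a hypothetical table name:

    // List the immutable data shards (extents) behind a table, including
    // row counts and compressed sizes (MyTable is a hypothetical name).
    .show table MyTable extents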

Since storage and compute are decoupled, ADX can scale a cluster both vertically and horizontally without having to redistribute the persisted data.

This brief overview of the architecture only scratches the surface; there is a lot more happening, such as column indexing and index maintenance. Still, having an overview helps you appreciate what ADX is doing under the hood.

Important Note

I recommend reading the Azure Data Explorer white paper (https://azure.microsoft.com/mediahandler/files/resourcefiles/azure-data-explorer/Azure_Data_Explorer_white_paper.pdf) if you are interested in learning more about the architecture.
