Chapter 4: Ingesting Data in Azure Data Explorer

In the previous chapters, we created our Azure Data Explorer (ADX) clusters and databases, learned how to use the Data Explorer Web UI, and executed our first Kusto Query Language (KQL) query. Now, we are ready to look at data ingestion in detail and start ingesting data. Data ingestion is the process of taking data from an external source and importing it into your big data solution, in our case, ADX. As you will soon see, once the data has been ingested, we can begin to analyze it and generate visuals such as graphs and reports.

In this chapter, we will introduce data ingestion and discuss the different types of data (structured, semi-structured, and unstructured). Then we will examine the different data ingestion methods that ADX supports and learn how ADX ingests data via its Data Management service, which we introduced in Chapter 1, Introducing Azure Data Explorer.

Next, we will learn about schema mapping, which is the process...

Technical requirements

The code examples for this chapter can be found in the Chapter04 folder of the repo: https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer.git. The Chapter04 directory contains two directories: templates/, which contains our ARM templates, and datasets/, which contains the datasets we will be ingesting.

One of the challenges when writing about data analytics is finding interesting datasets that are large enough to demonstrate the features of ADX and KQL. In this chapter, we will use the English Premier League's results to demonstrate how to ingest data in CSV and JSON format. A copy of the data is included in our repository, and the original dataset can be found at https://datahub.io/sports-data/english-premier-league. The dataset provides Premier League results for the last 10 years.

Note

The infrastructure that we will deploy here will be reused later in the book. Feel free to either preserve the resources...

Understanding data ingestion

Before learning how data ingestion works with ADX, let's revisit the different types of data:

  • Structured data: When we think of structured data, we think of relational databases that are made up of tables consisting of rows and columns. Each column has a data type, such as an integer or string, and may carry additional constraints, such as a fixed length or a specific format (a postcode, for example).
  • Semi-structured data: When we think of semi-structured data, we think of JSON and XML. They have a structure defined with tags, but the format is typically less rigid than that of a relational database.
  • Unstructured data: Unstructured data has no predefined structure or constraints; examples include SMS messages, text files, emails, and social media content such as status posts, messages, and images.

As shown in Figure 4.1, ADX supports four categories of services that enable data ingestion:

Figure 4.1 – Data analysis pipeline

Introducing schema mapping

As you know, before we can ingest any data into our ADX instance, we need to create tables in our database to store the data. Similar to a SQL database, ADX tables are two-dimensional, meaning they consist of rows and columns. When we create tables, we need to declare the column name and data type. A data type refers to the type of data a column can store, such as strings, dates, and numbers. We will create our own tables later in the chapter.

How do we ensure the data we are ingesting is imported into the correct tables and columns? The destination table is specified when the data connection is created, and the columns are mapped to the incoming data using schema maps.

As you will see in the section Ingesting data from Blob storage using Azure Event Grid, it is possible for the source file to contain more columns than you are interested in. We will take a data source with over 60 columns and create a schema map to ingest only the columns we are interested...
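To give a feel for what a schema map looks like before we build the real ones, here is a minimal sketch of a CSV ingestion mapping. The table name, mapping name, columns, and ordinals are illustrative rather than the exact ones created later in the chapter. Each entry binds a destination column to the zero-based ordinal of a column in the source file; any source column without an entry is simply not ingested:

// A sketch only: map a handful of source columns (by ordinal) to destination columns.
// The destination table must already exist.
.create table EPL_Results ingestion csv mapping "EPL_CSV_Mapping_Sketch"
'['
'  {"Column": "MatchDate", "Properties": {"Ordinal": "1"}},'
'  {"Column": "HomeTeam",  "Properties": {"Ordinal": "2"}},'
'  {"Column": "AwayTeam",  "Properties": {"Ordinal": "3"}},'
'  {"Column": "FullTimeHomeGoals", "Properties": {"Ordinal": "4"}},'
'  {"Column": "FullTimeAwayGoals", "Properties": {"Ordinal": "5"}}'
']'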

Ingesting data using one-click ingestion

In this section, we are going to learn how to ingest data using the one-click ingestion method, both directly through the Data Explorer Web UI and from an Azure Storage account.

Note

If you have not already cloned the Git repository, please do so now, so you can follow the example. The repository can be found here: https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer.git.

Data ingestion is a three-step process:

  1. Ingestion Preparation: During the preparation phase, the table and mapping schemas are created.
  2. Ingestion: The file, which is temporarily stored on an internal storage account (for example, https://9qwkstrldmyerscoughadx01.blob.core.windows.net/20210614-ingestdata-e5c334ee145d4b43a3a2d3a96fbac1df-0/1623671437639_season-1819_csv.csv), is then pulled from the queue and ingested.
  3. Data Preview: Once the data has been ingested, it can be previewed and is ready for you to begin querying.

The following section...

Ingesting data using KQL management commands

In the previous section, we imported our English Premier League data and you may have noticed that over half of the columns were related to betting statistics. In this section, we will create a custom CSV mapping schema and exclude those columns.

We will also introduce some KQL management commands. Like SQL, KQL has two categories of commands – data and management. The data commands allow us to query our data and the management commands allow us to manage our clusters, databases, tables, and schemas. We will cover KQL in depth in the next chapter, Introducing Kusto Query Language.
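As a quick, illustrative contrast (a sketch, not taken from the book's own examples): management commands always start with a dot, while data commands are ordinary queries run against a table.

// Management command: list the tables in the current database.
.show tables

// Data command: query a table (the table name here is only an example).
EPL_Results
| take 10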

The first step is to create a table with the columns that we are interested in. When creating tables, we use the .create table command.

We will now specify our columns and their data types as shown in the following code snippet. Here, we are creating a table with clear column names and are not including any of the betting statistics. You may have...
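The book's own snippet is not reproduced here, but a sketch of such a .create table command, with illustrative column names and types, might look like this:

// A sketch: readable column names, no betting-statistics columns.
.create table EPL_Results (
    Division: string,
    MatchDate: datetime,
    HomeTeam: string,
    AwayTeam: string,
    FullTimeHomeGoals: int,
    FullTimeAwayGoals: int,
    FullTimeResult: string,
    Referee: string
)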

Ingesting data from Blob storage using Azure Event Grid

In our final example of data ingestion, we will enable streaming on our cluster and use Azure Event Grid and Event Hubs so we can ingest data whenever new files are placed in our storage account's blob container. A blob container is a location on the storage account used to store our data.
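As a point of reference (a sketch, not a required step for every scenario): the cluster-level streaming ingestion setting is switched on in the Azure portal or via the cluster's ARM template, and a streaming ingestion policy can then be applied with a management command along these lines (the database name is illustrative):

// Enable the streaming ingestion policy on a database (sketch; database name is illustrative).
.alter database AdxDemoDb policy streamingingestion enable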

For this section, we need to create the following Azure resources:

  • A storage account for storing files
  • An event grid to emit blob creation events
  • An event hub to deliver the notifications to Azure Data Explorer

Using JSON data, we will demonstrate how to create JSON-based mapping schemas.
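To preview what a JSON-based mapping looks like, here is a minimal sketch. The table name, mapping name, and columns are illustrative, and the source field names (Date, HomeTeam, FTHG, and so on) follow the public dataset's conventions, so check them against the actual file you ingest. Unlike a CSV mapping, each destination column is bound to a field in the incoming document by its JSON path rather than by ordinal:

// A sketch only: bind destination columns to JSON paths in the incoming documents.
.create table EPL_Results_JSON ingestion json mapping "EPL_JSON_Mapping_Sketch"
'['
'  {"Column": "MatchDate", "Properties": {"Path": "$.Date"}},'
'  {"Column": "HomeTeam",  "Properties": {"Path": "$.HomeTeam"}},'
'  {"Column": "AwayTeam",  "Properties": {"Path": "$.AwayTeam"}},'
'  {"Column": "FullTimeHomeGoals", "Properties": {"Path": "$.FTHG"}},'
'  {"Column": "FullTimeAwayGoals", "Properties": {"Path": "$.FTAG"}}'
']'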

When a file is uploaded to the storage account, a blob created event is generated and received by the event grid. The event grid then forwards the event, via the event hub, to Azure Data Explorer, which pulls the new blob from the storage account. In our example, the blob is a JSON file.

This path of data ingestion is shown in the following figure:

...

Summary

This chapter covered a lot of topics to ingest… I mean digest – pardon the pun. We started by learning about data ingestion in general, discussing the different types of data: structured, semi-structured, and unstructured. We then looked at the Data Management service in more detail to understand its role in data ingestion. We also looked at the difference between batching and streaming data. We introduced the main ingestion categories: SDKs, managed pipelines such as Azure Event Grid, connectors and plugins, and tools such as Azure Data Factory and one-click ingestion.

Then we learned about schema maps: how they map external data to the columns in our ADX tables, and how to write our own schema maps for both CSV and JSON data. We created two schema maps for the English Premier League football results data, one for the CSV-based data and one for the JSON data, where we excluded the betting information and kept the actual football match...

Questions

Before moving on to the next chapter, test your knowledge by trying these exercises. The answers can be found at the back of the book.

  1. Which is the preferred ingestion method – streaming or batch ingestion?
  2. Try to import the StormEvents CSV file using the one-click ingestion method. We will use the StormEvents table in the next chapter.
  3. Try to upload another Premier League JSON file.
  4. Update EPL_Custom_JSON_Mapping and include the referee.