Chapter 4: Ingesting Data in Azure Data Explorer

In the previous chapters, we created our Azure Data Explorer (ADX) clusters and databases, learned how to use the Data Explorer Web UI, and executed our first Kusto Query Language (KQL) query. Now, we are ready to look at data ingestion in detail and start ingesting data. Data ingestion is the process of taking data from an external source and importing it into your big data solution, in our case, ADX. As you will soon see, once the data has been ingested, we can begin to analyze it and generate visuals such as graphs and reports.

In this chapter, we will introduce data ingestion and discuss the different types of data (structured, semi-structured, and unstructured). Then we will examine the different data ingestion methods that ADX supports and learn how ADX ingests data via its Data Management service, which we introduced in Chapter 1, Introducing Azure Data Explorer.

Next, we will learn about schema mapping, which is the process...

Technical requirements

The code examples for this chapter can be found in the Chapter04 folder of the repo: https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer.git. The Chapter04 directory contains two directories: templates/, which contains our ARM templates, and datasets/, which contains the datasets we will be ingesting.

One of the challenges when writing about data analytics is finding interesting datasets that are large enough to demonstrate the features of ADX and KQL. In this chapter, we will use the English Premier League's results to demonstrate how to ingest data in CSV and JSON format. A copy of the data is included in our repository, and the original dataset can be found at https://datahub.io/sports-data/english-premier-league. The dataset provides Premier League results for the last 10 years.

Note

The infrastructure that we will deploy here will be reused later in the book. Feel free to either preserve the resources...

Understanding data ingestion

Before learning how data ingestion works with ADX, let's revisit the different types of data:

  • Structured data: When we think of structured data, we think of relational databases that are made up of tables consisting of rows and columns. Each column has a data type, such as an integer or string, and may carry additional constraints, such as a fixed length or a specific format (a postcode, for example).
  • Semi-structured data: When we think of semi-structured data, we think of JSON and XML. They have a structure defined with tags, but the format is typically less rigid than that of a relational database.
  • Unstructured data: Unstructured data has no predefined structure or constraints; examples include SMS messages, text files, emails, and social media content such as status posts, messages, and images.

As shown in Figure 4.1, ADX supports four categories of services that enable data ingestion:

Figure 4.1 – Data analysis pipeline

Introducing schema mapping

As you know, before we can ingest any data into our ADX instance, we need to create tables in our database to store the data. Similar to a SQL database, ADX tables are two-dimensional, meaning they consist of rows and columns. When we create tables, we need to declare the column name and data type. A data type refers to the type of data a column can store, such as strings, dates, and numbers. We will create our own tables later in the chapter.

How do we ensure the data we are ingesting is imported into the correct tables and columns? The destination table is specified when the data connection is created, and the columns are mapped to the incoming data using schema maps.

As you will see in the section Ingesting data from Blob storage using Azure Event Grid, it is possible for the source file to contain more columns than you are interested in. We will take a data source with over 60 columns and create a schema map to ingest only the columns we are interested...
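To give a feel for what a schema map looks like before we build the real ones, here is a minimal sketch of a CSV ingestion mapping. The table name, mapping name, columns, and ordinals are illustrative rather than the exact ones created later in the chapter. Each entry binds a destination column to the zero-based ordinal of a column in the source file; any source column without an entry is simply not ingested:

// A sketch only: map a handful of source columns (by ordinal) to destination columns.
// The destination table must already exist.
.create table EPL_Results ingestion csv mapping "EPL_CSV_Mapping_Sketch"
'['
'  {"Column": "MatchDate", "Properties": {"Ordinal": "1"}},'
'  {"Column": "HomeTeam",  "Properties": {"Ordinal": "2"}},'
'  {"Column": "AwayTeam",  "Properties": {"Ordinal": "3"}},'
'  {"Column": "FullTimeHomeGoals", "Properties": {"Ordinal": "4"}},'
'  {"Column": "FullTimeAwayGoals", "Properties": {"Ordinal": "5"}}'
']'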

Ingesting data using one-click ingestion

In this section, we are going to learn how to ingest data using the one-click ingestion method, both directly through the Data Explorer Web UI and from an Azure Storage account.

Note

If you have not already cloned the Git repository, please do so now, so you can follow the example. The repository can be found here: https://github.com/PacktPublishing/Scalable-Data-Analytics-with-Azure-Data-Explorer.git.

Data ingestion is a three-step process:

  1. Ingestion Preparation: During the preparation phase, the table and mapping schemas are created.
  2. Ingestion: The file, which is temporarily stored on an internal storage account (for example, https://9qwkstrldmyerscoughadx01.blob.core.windows.net/20210614-ingestdata-e5c334ee145d4b43a3a2d3a96fbac1df-0/1623671437639_season-1819_csv.csv), is then pulled from the queue and ingested.
  3. Data Preview: Once the data has been ingested, it can be previewed and is ready for you to begin querying.

The following section...

Ingesting data using KQL management commands

In the previous section, we imported our English Premier League data and you may have noticed that over half of the columns were related to betting statistics. In this section, we will create a custom CSV mapping schema and exclude those columns.

We will also introduce some KQL management commands. Like SQL, KQL has two categories of commands – data and management. The data commands allow us to query our data and the management commands allow us to manage our clusters, databases, tables, and schemas. We will cover KQL in depth in the next chapter, Introducing Kusto Query Language.
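As a quick, illustrative contrast (a sketch, not taken from the book's own examples): management commands always start with a dot, while data commands are ordinary queries run against a table.

// Management command: list the tables in the current database.
.show tables

// Data command: query a table (the table name here is only an example).
EPL_Results
| take 10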

The first step is to create a table with the columns that we are interested in. When creating tables, we use the .create table command.

We will now specify our columns and their data types as shown in the following code snippet. Here, we are creating a table with clear column names and are not including any of the betting statistics. You may have...
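The book's own snippet is not reproduced here, but a sketch of such a .create table command, with illustrative column names and types, might look like this:

// A sketch: readable column names, no betting-statistics columns.
.create table EPL_Results (
    Division: string,
    MatchDate: datetime,
    HomeTeam: string,
    AwayTeam: string,
    FullTimeHomeGoals: int,
    FullTimeAwayGoals: int,
    FullTimeResult: string,
    Referee: string
)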

Ingesting data from Blob storage using Azure Event Grid

In our final example of data ingestion, we will enable streaming on our cluster and use Azure Event Grid and Event Hubs so we can ingest data whenever new files are placed in our storage account's blob container. A blob container is a location on the storage account used to store our data.
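As a point of reference (a sketch, not a required step for every scenario): the cluster-level streaming ingestion setting is switched on in the Azure portal or via the cluster's ARM template, and a streaming ingestion policy can then be applied with a management command along these lines (the database name is illustrative):

// Enable the streaming ingestion policy on a database (sketch; database name is illustrative).
.alter database AdxDemoDb policy streamingingestion enable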

For this section, we need to create the following Azure resources:

  • A storage account for storing files
  • An event grid to emit blob creation events
  • An event hub to deliver the notifications to Azure Data Explorer

Using JSON data, we will demonstrate how to create JSON-based mapping schemas.
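To preview what a JSON-based mapping looks like, here is a minimal sketch. The table name, mapping name, and columns are illustrative, and the source field names (Date, HomeTeam, FTHG, and so on) follow the public dataset's conventions, so check them against the actual file you ingest. Unlike a CSV mapping, each destination column is bound to a field in the incoming document by its JSON path rather than by ordinal:

// A sketch only: bind destination columns to JSON paths in the incoming documents.
.create table EPL_Results_JSON ingestion json mapping "EPL_JSON_Mapping_Sketch"
'['
'  {"Column": "MatchDate", "Properties": {"Path": "$.Date"}},'
'  {"Column": "HomeTeam",  "Properties": {"Path": "$.HomeTeam"}},'
'  {"Column": "AwayTeam",  "Properties": {"Path": "$.AwayTeam"}},'
'  {"Column": "FullTimeHomeGoals", "Properties": {"Path": "$.FTHG"}},'
'  {"Column": "FullTimeAwayGoals", "Properties": {"Path": "$.FTAG"}}'
']'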

When a file is uploaded to the storage account, a blob created event is generated and received by the event grid. The event grid then forwards the event, via the event hub, to Azure Data Explorer, which pulls the new blob from the storage account. In our example, the blob is a JSON file.

This path of data ingestion is shown in the following figure:

...

Summary

This chapter covered a lot of topics to ingest… I mean digest – pardon the pun. We started by learning about data ingestion in general, discussing the different types of data: structured, semi-structured, and unstructured. We then looked at the Data Management service in more detail to understand its role in data ingestion. We also looked at the difference between batching and streaming data. We introduced the main ingestion categories: SDKs, managed pipelines such as Azure Event Grid, connectors and plugins, and tools such as Azure Data Factory and one-click ingestion.

Then we learned about schema maps: how they map external data to the columns in our ADX tables, and how to write our own schema maps for both CSV and JSON data. We created two schema maps for the English Premier League football results data, one for the CSV-based data and one for the JSON data, where we excluded the betting information and kept the actual football match...

Questions

Before moving on to the next chapter, test your knowledge by trying these exercises. The answers can be found at the back of the book.

  1. Which is the preferred ingestion method – streaming or batch ingestion?
  2. Try to import the StormEvents CSV file using the one-click ingestion method. We will use the StormEvents table in the next chapter.
  3. Try to upload another Premier League JSON file.
  4. Update EPL_Custom_JSON_Mapping and include the referee.