
You're reading from Azure for Architects - Second Edition

Product type: Book
Published in: Jan 2019
Publisher: Packt
ISBN-13: 9781789614503
Edition: 2nd Edition
Author (1)
Ritesh Modi

Ritesh Modi is a technologist with more than 18 years of experience. He holds a master's degree in science in AI/ML from LJMU. He has been recognized as a Microsoft Regional Director for his contributions to building tech communities, products, and services. He has published more than 10 tech books in the past and is a cloud architect, speaker, and leader who is popular for his contributions to data centers, Azure, Kubernetes, blockchain, cognitive services, DevOps, AI, and automation.

Azure Big Data Solutions Using Azure Data Lake Storage and Data Factory

Big data has gained significant traction in the last few years. Specialized tools, software, and storage are required to handle it. These tools, platforms, and storage options were not available as services a few years ago. However, with cloud technology, Azure now provides numerous tools, platforms, and resources to create big data solutions easily.

The following topics will be covered in this chapter:

  • Data integration
  • Extract-Transform-Load (ETL)
  • Data Factory
  • Data Lake Storage
  • Migrating data from Azure Storage to Data Lake Storage

Data integration

We are all well aware of how integration patterns are used for applications: applications composed of multiple services are integrated using a variety of patterns. However, there is another paradigm that many organizations require, known as data integration. The need for it has grown especially over the last decade, as the generation and availability of data have become incredibly high. The velocity, variety, and volume of data being generated have increased drastically, and there is data almost everywhere.

Every organization has many different types of applications, and they all generate data in their own proprietary format. Often, data is also purchased from the marketplace. Even during mergers and amalgamations of organizations, data needs to be migrated and combined.

Data integration refers to the process of bringing data from multiple sources...

ETL

A very popular process known as ETL helps in building a target data source to house data that is consumable by applications. Generally, the data is in a raw format, and to make it consumable, the data should go through the following three distinct phases:

  • Extract: During this phase, data is extracted from multiple places. There could be multiple sources, and each of them needs to be connected to in order to retrieve the data. The extract phase typically uses data connectors that hold connection information for a data source. It might also use temporary storage to stage data pulled from the source for faster retrieval. This phase is responsible for the ingestion of data.
  • Transform: The data that is available after the extract phase might not be consumable directly by applications. This could be for a variety of reasons. The data might have irregularities...
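The three phases can be sketched with plain Python, no Azure services required. This is a minimal, illustrative pipeline over an in-memory CSV feed; the data, field names, and target store are all hypothetical stand-ins for a real source and warehouse:

```python
import csv
import io

# Hypothetical raw feed: in practice this would come from a database,
# an API, or files landed in blob storage.
RAW_CSV = """id,name,amount
1, Alice ,100
2,Bob,
3, Carol ,250
"""

def extract(source: str) -> list:
    """Extract phase: pull raw rows from the source via a connector."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows: list) -> list:
    """Transform phase: fix irregularities -- trim whitespace,
    default missing amounts to zero, and convert types."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip(),
            "amount": int(row["amount"] or 0),
        })
    return cleaned

def load(rows: list, target: list) -> None:
    """Load phase: write the consumable rows to the target store."""
    target.extend(rows)

warehouse = []
load(transform(extract(RAW_CSV)), warehouse)
print(warehouse[1])  # {'id': 2, 'name': 'Bob', 'amount': 0}
```

The point of the sketch is the separation of concerns: each phase has a single job, so sources, cleansing rules, and targets can be swapped independently.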

A primer on Data Factory

Data Factory is a fully managed, highly available, highly scalable, and easy-to-use tool for creating integration solutions and implementing ETL phases. Data Factory helps create new pipelines in a drag-and-drop fashion using a user interface, without writing any code; however, it still provides the option to write code in your preferred language.

There are a few important concepts to learn about before using the Data Factory service, which we will be looking into in the following sections:

  • Activities: Activities are individual tasks that enable the execution and processing of logic within a Data Factory pipeline. There are multiple types of activities. There are activities related to data movement, data transformation, and control activities. Each activity has a policy through which it can decide the retry mechanism and retry interval.
  • Pipelines: Pipelines...
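The relationship between activities, their retry policy, and pipelines can be modelled with a short, framework-free sketch. The class names and policy fields below are illustrative only; they are not the Data Factory API, just a toy model of the concepts described above:

```python
import time

class Activity:
    """One unit of work, with a simple retry policy (retry count and
    retry interval) mirroring the activity policy described above."""
    def __init__(self, name, run, retries=2, retry_interval=0.0):
        self.name = name
        self.run = run                    # callable holding the task logic
        self.retries = retries
        self.retry_interval = retry_interval

    def execute(self):
        last_error = None
        for _attempt in range(self.retries + 1):
            try:
                return self.run()
            except Exception as err:      # transient failure: wait, then retry
                last_error = err
                time.sleep(self.retry_interval)
        raise RuntimeError(
            f"{self.name} failed after {self.retries + 1} attempts"
        ) from last_error

class Pipeline:
    """A pipeline groups activities and runs them in order."""
    def __init__(self, activities):
        self.activities = activities

    def run(self):
        return [activity.execute() for activity in self.activities]

# A flaky task that fails once, then succeeds -- the retry policy absorbs it.
attempts = {"copy": 0}
def flaky_copy():
    attempts["copy"] += 1
    if attempts["copy"] < 2:
        raise IOError("transient source error")
    return "copied"

pipeline = Pipeline([Activity("CopyBlob", flaky_copy, retries=2)])
print(pipeline.run())  # ['copied'] -- succeeded on the retry
```

In the real service, the equivalent knobs live on each activity's policy (retry count and interval), and the pipeline is the unit you trigger and monitor.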

A primer on Data Lake Storage

Azure Data Lake Storage provides storage for big data solutions. It is designed especially for storing the large volumes of data that big data solutions typically need. It is an Azure-provided managed service, so it is completely managed by Azure; customers need only bring their data and store it in a Data Lake.

There are two versions: version 1 (Gen1) and the current version, version 2 (Gen2). Gen2 has all the functionality of Gen1, with the difference that it is built on top of Azure Blob Storage.

As Azure Blob Storage is highly available, can be replicated multiple times, is disaster ready, and is low in cost, these benefits are transferred to Gen2 Data Lake. Data Lake can store any kind of data, including relational, non-relational, filesystem-based, and hierarchical data.
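Because Gen2 sits on Blob Storage, it exposes a dedicated endpoint (`dfs.core.windows.net`) that Gen2-aware tools such as HDInsight, Databricks, and the Hadoop ecosystem address via `abfss://` URIs. A small helper shows the URI shape; the account, filesystem (container), and path names here are hypothetical:

```python
def abfss_uri(account: str, filesystem: str, path: str) -> str:
    """Build an ABFS URI of the form
    abfss://<filesystem>@<account>.dfs.core.windows.net/<path>,
    which Gen2-aware analytics tools use to address lake data."""
    return (
        f"abfss://{filesystem}@{account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )

# Example: a CSV landed in the 'raw' filesystem of account 'mydatalake'.
print(abfss_uri("mydatalake", "raw", "/sales/2019/jan.csv"))
# abfss://raw@mydatalake.dfs.core.windows.net/sales/2019/jan.csv
```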

Creating a Data Lake Gen2 instance is as simple as creating...

Migrating data from Azure Storage to Data Lake Gen2 Storage

In this section, we will be migrating data from Azure Blob Storage to another Azure container of the same Azure Blob Storage instance, and we will also migrate data to an Azure Gen2 Data Lake instance using an Azure Data Factory pipeline. The following are the steps for creating such an end-to-end solution.

Preparing the source storage account

Before we can create Azure Data Factory pipelines and use them for migration, we need to create a new storage account, consisting of a couple of containers, and upload the data files. In the real world, these files and the storage connection would already be prepared.

...

Summary

This was another chapter on handling big data. It dealt with the Azure Data Factory service, which provides ETL services on Azure. As a PaaS offering, it delivers high scalability, high availability, and easy-to-configure pipelines, and its integration with Azure DevOps and GitHub is seamless. We also saw the features and benefits of using Azure Data Lake Gen2 storage for storing any kind of big data. It is a cost-effective, highly scalable, hierarchical data store for handling big data, with compatibility with Azure HDInsight, Databricks, and the Hadoop ecosystem.
