Chapter 13: Options for Data Integration
In the previous chapter, we looked at how to architect database solutions that are scalable and secure. This chapter will look at several options available to architects when designing solutions that must work with large datasets for analysis and reporting.
Big data is an industry term for working with terabytes (TB) or even petabytes (PB) of data to create analytical dashboards and gain insights. Specialist tools are often required to perform this kind of processing, and building and running them in your own data center would be expensive.
Azure provides some of the world's most popular data tools for loading, transforming, and analyzing data. We will examine what a data pipeline looks like and then delve deeper into some of those tools.
Specifically, this chapter will cover the following topics:
- Understanding data flows
- Comparing integration tools
- Exploring data analytics
Understanding data flows
Many organizations gather massive amounts of data, in many different forms and from various systems, and continue to amass more every day. This data can bring great value to a company.
One example may be an e-commerce company that collects sales and marketing data from its day-to-day operations. By analyzing this data, the company could identify customer behavior patterns and measure the relative success of different advertising campaigns. That insight could then be used to develop the company website to create a better customer journey, or to identify the strongest-performing marketing activities so that these can be honed while less effective ones are dropped.
Scientific organizations also make use of data to create better treatments, drugs, and methodologies.
Manufacturers can use data from internet of things (IoT) devices and sensors to optimize supply chains, increase operational efficiencies, or identify risks in products or processes.
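The e-commerce example above can be sketched in miniature. The following toy Python snippet (not a real analytics pipeline; the campaign names and figures are invented for illustration) aggregates per-campaign spend and attributed revenue, then ranks campaigns by return on ad spend, which is the kind of "strongest-performing activity" question described earlier:

```python
from collections import defaultdict

# Hypothetical daily export rows: (campaign, ad_spend, attributed_revenue).
# All names and numbers here are illustrative, not real data.
rows = [
    ("spring_sale", 500.0, 2100.0),
    ("spring_sale", 450.0, 1900.0),
    ("new_banner", 300.0, 250.0),
    ("new_banner", 320.0, 310.0),
]

# Accumulate total spend and revenue per campaign.
totals = defaultdict(lambda: [0.0, 0.0])
for campaign, spend, revenue in rows:
    totals[campaign][0] += spend
    totals[campaign][1] += revenue

# Rank campaigns by return on ad spend (revenue per unit of spend).
roas = {c: rev / spend for c, (spend, rev) in totals.items()}
for campaign in sorted(roas, key=roas.get, reverse=True):
    print(f"{campaign}: ROAS {roas[campaign]:.2f}")
```

At real scale, this aggregation would run over TBs of data in a distributed engine rather than in memory, but the shape of the analysis is the same.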
Data sources include sales...
Comparing integration tools
One of the greatest benefits of using cloud services such as Azure is the ability to create the resources you need without investing large amounts of capital up front. The tools you can choose from cover end-to-end processes and can be scaled in and out as needed.
One of the first decisions you may need to consider is where to store raw data initially. Except in the case of streaming analytics, where you continually ingest data from a source such as an IoT device (for example, a temperature sensor), you need a place to store your data files and retrieve them from.
Azure storage accounts provide storage capabilities in the form of file storage or Blob storage; however, a specific type of account called an Azure Data Lake Storage Gen2 (ADLS Gen2) account might be better suited to data analytics.
ADLS Gen2
ADLS Gen2 is an optional configuration feature of a standard storage account. One of the key differences is that it supports filesystem...
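The practical consequence of a hierarchical namespace can be sketched without any Azure SDK at all. In a flat blob namespace, "directories" are only name prefixes, so renaming a folder means rewriting every blob under it; with a true filesystem hierarchy (as ADLS Gen2 provides), the same rename is a single metadata operation. The following toy simulation, which uses plain dictionaries rather than real storage APIs, illustrates the difference:

```python
# Toy illustration (NOT the Azure SDK): a flat blob namespace is just a
# mapping of full names to data, so "folders" are only name prefixes.
flat_store = {
    "raw/2023/01/sales.csv": b"...",
    "raw/2023/01/ads.csv": b"...",
}

def rename_prefix(store, old, new):
    """Rename a 'folder' in a flat namespace: O(n) in blobs under the prefix."""
    ops = 0
    for name in list(store):
        if name.startswith(old):
            store[new + name[len(old):]] = store.pop(name)
            ops += 1
    return ops

# With a hierarchical namespace, a directory is a real node in a tree,
# so a rename is one operation regardless of how many files it contains.
hierarchical_store = {
    "raw": {"2023": {"01": {"sales.csv": b"...", "ads.csv": b"..."}}},
}

def rename_dir(store, old, new):
    store[new] = store.pop(old)
    return 1  # a single metadata operation

flat_ops = rename_prefix(flat_store, "raw/", "staged/")
tree_ops = rename_dir(hierarchical_store, "raw", "staged")
print(flat_ops, tree_ops)  # → 2 1
```

Analytics workloads frequently move and reorganize large directory trees between pipeline stages, which is one reason the hierarchical namespace matters for this scenario.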
Exploring data analytics
Once data has been ingested, transformed, and aggregated, the next step will be to analyze and explore it. There are many tools available on the market to achieve this, and one of the most popular is Databricks.
Databricks uses the Apache Spark engine, which is well suited to dealing with massive amounts of data due to its internal architecture. Whereas a traditional database server would typically run workloads on a single machine, Databricks uses Spark clusters built from multiple nodes. Data analytics processes are then distributed across those nodes and processed in parallel, as shown in the following diagram:
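The partition-and-combine pattern that a Spark cluster applies across nodes can be mimicked in miniature on a single machine. The following sketch is an analogy only, not real Spark code: it splits a dataset into partitions, aggregates each partition concurrently (standing in for worker nodes), and then combines the partial results (standing in for the driver):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # Each "worker node" aggregates its own partition independently.
    return sum(partition)

data = list(range(1_000))
n_partitions = 4

# Split the dataset into roughly equal partitions.
partitions = [data[i::n_partitions] for i in range(n_partitions)]

# Process the partitions in parallel, as a cluster would across nodes.
with ThreadPoolExecutor(max_workers=n_partitions) as pool:
    partials = list(pool.map(partial_sum, partitions))

# The "driver" combines the partial results into the final answer.
total = sum(partials)
print(total)  # → 499500, the same as sum(data)
```

In a real cluster, each partition would live on a different node and the data would never need to fit on one machine, which is what makes the model scale to the TB and PB workloads described earlier.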
Figure 13.6 – Example Spark cluster architecture
Azure Databricks is a managed Databricks service that provides excellent flexibility for creating and using Spark clusters as and when needed.
Azure Databricks
Azure Databricks provides workspaces that multiple users can use to build and run analytics jobs collaboratively. A Databricks workspace contains...
Summary
This chapter looked at a growing capability in the cloud, and in Azure in particular: data integration and analytics.
Azure provides a range of tools for creating end-to-end data pipelines for storing, ingesting, transforming, aggregating, and analyzing data. So, we started the chapter with a high-level view of what a typical pipeline might look like.
We looked at how to configure Azure Storage to use ADLS Gen2, what extra capabilities this gives you, and how Azure Data Factory can create automated and secure pipelines for data loading and transformation.
Finally, we looked at the two primary tools for exploring and analyzing data with Azure: Azure Databricks and Azure Synapse Analytics.
After reading this chapter, you should have a better understanding of the different components that comprise a data analytics solution, including the strengths of each service and where one might be a better choice over another.
In the next chapter, we conclude Part 4, Applications...
Exam scenario
MegaCorp Inc. is building a new data analytics capability to help it understand the effectiveness of its marketing campaigns and how they relate to product sales.
Marketing campaign data is exported daily and stored as flat CSV files. Sales data is exported overnight from the sales database into a normalized data warehouse database.
The management team would like data to be automatically imported and aggregated, and then modeled. It is expected that large amounts of data will be processed, and this needs to be performed relatively quickly. The data analytics teams are seasoned developers who are currently using the latest version of Spark.
Design an end-to-end solution that can accommodate the management team's requirements.