1. Introducing Analytics on Azure
According to a survey by Dresner Advisory Services in 2019, an all-time high of 48% of organizations said business intelligence in the cloud is either critical or very important to their business operations. The same Cloud Computing and Business Intelligence Market Study also showed that sales and marketing teams get the most value out of analytics.
As businesses grow, they generate massive amounts of data every day. This data comes from different sources, such as mobile phones, Internet of Things (IoT) sensors, and various Software-as-a-Service (SaaS) products such as Customer Relationship Management (CRM) systems. Enterprises and businesses need to scale and modernize their data architecture and infrastructure to cope with this demand and stay competitive in their respective industries.
Having cloud-scale analytics capability is the go-to strategy for achieving this growth. Instead of managing your own data center, harnessing the power of the cloud allows your business to be more accessible to your users. With the help of a cloud service provider such as Microsoft Azure, you can accelerate your data analytics practice without being limited by your IT infrastructure. The game has changed in terms of maintaining IT infrastructure: data lakes and cloud data warehouses can now store and maintain massive amounts of data.
Simply gathering data does not add value to your business; you need to derive insights from it and help your business grow using data analytics. Azure is more than just a hub for gathering data; it is an invaluable resource for data analytics. Data analytics provides you with the ability to understand your business and customers better. By applying various data science concepts, such as machine learning, regression analysis, classification algorithms, and time series forecasting, you can test your hypotheses and make data-driven decisions for the future. However, one of the challenges that organizations continuously face is how to derive these analytical modeling capabilities quickly when processing billions of data rows. This is where having a modern data warehouse and data pipeline can help (more on this in the next sections).
There are a number of ways in which data analytics can help your business thrive. In the case of retail, if you understand your customers better, you will have a better idea of what products you should sell, where to sell them, when to sell them, and how to sell them. In the financial sector, data analytics is helping fight crime by detecting fraudulent transactions and providing more informed risk assessments based on historical criminal intelligence.
This chapter will cover fundamental topics on the power of data with big data analytics, IoT, Machine Learning (ML) and Artificial Intelligence (AI), and DataOps. You will also learn why Microsoft Azure is the platform of choice for performing analytics in the cloud. Lastly, you will study the fundamental concepts of a modern data warehouse and data pipelines.
The Power of Data
As a consumer, you have seen how the proliferation of data has influenced our daily activities. Most popular entertainment applications, such as YouTube, now provide a customized user experience, with features such as video recommendations based on our interests and search history. It is now child's play to discover new content that's similar to our preferred content, as well as new and trending content.
With the major shift toward wearable technology, it has also become possible to keep track of our health by monitoring heart rate, blood pressure, and so on. These devices then formulate tailored recommendations based on the averages of these statistics. But these personalized health stats are only a sample of the massive data collection happening every day on a global scale, a collection to which we actively contribute.
Millions of people all over the world use social networking platforms and search engines every day. Internet giants like Facebook, Instagram, and Google use clickstream data to come up with new innovations and improve their services.
Data collection is also carried out on an extensive level under projects such as the Great Elephant Census and eBird, which aim to boost wildlife conservation. Data-driven techniques have even been adopted for tiger conservation projects in India.
Data even plays an invaluable role in global efforts to compile evidence of, causes of, and possible responses to climate change: understanding sea surface temperatures, analyzing natural calamities such as coastal flooding, and tracking global warming patterns in a collective effort to save the ecosystem.
Organizations like Global Open Data for Agriculture and Nutrition (GODAN), whose data can be used by farmers, ranchers, and consumers alike, contribute to this tireless data collection as well.
Furthermore (as with the advent of wearable technology), data analysis is contributing to pioneering advancements in the healthcare sector. Patient datasets are analyzed to identify patterns and early symptoms of diseases in order to devise better solutions to known problems.
The scale of data being talked about here is massive; hence, the popular term "big data" is used to describe the harnessing of data at this scale.
Big Data Analytics
The term big data is often used to describe massive volumes of data that traditional tools cannot handle. Big data can be characterized by five Vs:
- Volume: This indicates the volume of data that needs to be analyzed for big data analytics. We are now dealing with larger datasets than ever before, made possible by the widespread commercial adoption of electronic products such as mobile devices and IoT sensors all over the globe.
- Velocity: This refers to the rate at which data is being generated. Devices and platforms, such as those just mentioned, constantly produce data on a large scale and at rapid speed. This makes collecting, processing, analyzing, and serving data at rapid speeds necessary.
- Variety: This refers to the structure of data being produced. Data sources are inconsistent. Some are structured and some are unstructured. (You will learn more about this in the following pages.)
- Value: This refers to the value of the data being extracted. Accessible data may not always be valuable. With the right tools, you can derive value from the data in a cost-effective and scalable way.
- Veracity: This is the quality or trustworthiness of data. A raw dataset will usually contain a lot of noise and bias and will need cleaning. Having a large dataset is not useful if most of the data is not accurate.
Big data analytics is the process of finding patterns, trends, and correlations in unstructured data to derive meaningful insights that shape business decisions. This unstructured data is usually large in file size (images, videos, and social graphs, for instance).
This does not mean that relational databases are not relevant for big data. In fact, modern data warehouse platforms like Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) support structured and semi-structured data (such as JSON) and can scale to support terabytes to petabytes of data. Using Microsoft Azure, you have the flexibility to choose any platform, and these technologies can complement each other to achieve a robust data analytics pipeline.
Here are some of the best use cases of big data analytics:
- Social media analysis: Through social media sites such as Twitter, Facebook, and Instagram, companies can learn what customers are saying about their products and services. Social media analysis helps companies to target their audiences depending on user preferences and market trends. The challenges here are the massive amount of data and the unstructured nature of tweets and posts.
- Fraud prevention: This is one of the most familiar use cases of big data. One of the prominent features of big data analytics when used for fraud prevention is the ability to detect anomalies in a dataset. Validating credit card transactions by understanding transaction patterns, such as location data and categories of purchased items, is an example of this. The biggest challenge here is ensuring that the AI/ML models are clean and unbiased. If a model is trained on just one parameter, such as the user's country of origin, it will focus on determining patterns from the user's location alone and might miss other signals.
- Price optimization: Using big data analytics, you can predict which price points will yield the best results based on historical market data. This allows companies to ensure that they do not price their items too high or too low. The challenge here is that many factors can affect prices. Focusing on just one factor, such as a competitor's price, might eventually train your model to focus on that area alone and disregard other factors, such as weather and traffic data.
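To make the fraud prevention use case above concrete, here is a minimal, hypothetical sketch of anomaly detection using a simple z-score over transaction amounts. Real fraud models combine many more signals (location, merchant category, time of day) and typically run on a managed platform, but the core idea of flagging statistical outliers is the same:

```python
from statistics import mean, stdev

def find_anomalies(amounts, threshold=2.0):
    """Flag transactions whose amount deviates from the mean by more
    than `threshold` standard deviations (z-score). A threshold of 3
    is common for large samples; small samples cap the maximum z-score,
    so a lower threshold is used here for illustration."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Typical card purchases with one suspicious outlier (made-up data)
history = [12.5, 40.0, 25.0, 18.0, 33.0, 22.0, 27.5, 9500.0]
print(find_anomalies(history))  # → [9500.0]
```

The 9500.0 purchase is the only value more than two standard deviations from the mean, so it is the only one flagged for review.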
Big data for businesses and enterprises is usually accompanied by the concept of an IoT infrastructure, where hundreds, thousands, or even millions of devices are connected to a network and constantly send data to a server.
Internet of Things (IoT)
IoT plays a vital role in scaling your application to go beyond your current data sources. IoT is simply an interconnection of devices, embedded in the objects around us, that send and receive data. IoT allows us to constantly gather more data about "things" without manually entering it into a database.
A smartwatch is a good example of an IoT device: it constantly measures your body's vital signs. Instead of taking measurements with a separate device and entering them into a system, a smartwatch allows you to record your data automatically. Another good example is an asset tracker that captures location, temperature, and humidity information. This allows logistics companies to monitor their items in transit, ensuring the quality and efficiency of their services.
At scale, these IoT devices generate anywhere from gigabytes to terabytes of data. This data is usually stored in a data lake in a raw, unstructured format and is later analyzed to derive business insights. A data lake is a centralized repository of all structured, semi-structured, and unstructured data. In the example of the logistics company mentioned above, patterns (such as the best delivery routes) could be generated. The data could also be used to understand anomalies, such as data leakage or suspected fraudulent activities.
Machine Learning and Artificial Intelligence
As your data grows in size, it opens a lot of opportunities for businesses to go beyond understanding business trends and patterns. Machine learning and artificial intelligence are examples of innovations that you can exploit with your data. Building your artificial intelligence and machine learning capability is relatively easy now because of the availability of technologies and the ability to scale your storage and compute on the cloud.
ML and AI are terms that are often conflated. In a nutshell, machine learning is a subset (or application) of artificial intelligence. Machine learning aims to allow systems to learn from past datasets and adapt automatically, without human assistance. This is made possible by a series of algorithms applied to the dataset; the algorithm analyzes the data in near-real time and then comes up with possible actions based on accuracy or confidence derived from previous experience.
The word 'learning' indicates that the program is constantly learning from the data fed to it. The aim of machine learning is to strive for accuracy rather than success. There are two main categories of machine learning algorithms: supervised and unsupervised.
Supervised machine learning algorithms create a mapping function to map the input variables with the output variable. The algorithm uses the existing datasets to train itself for predicting the output. Classification is a form of supervised ML that can be used in applications such as image categorization or customer segmentation, which is used for targeted marketing campaigns.
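As a toy illustration of supervised learning, the following sketch labels a new customer with the segment of the most similar labeled example (a 1-nearest-neighbour classifier). All of the data, segment names, and feature choices here are made up for the example; a real customer segmentation model would be trained on far richer features:

```python
import math

# Labeled training data: (annual_spend, visits_per_month) -> segment.
# The examples and segment labels are hypothetical.
training = [
    ((120.0, 2), "casual"),
    ((150.0, 3), "casual"),
    ((900.0, 12), "loyal"),
    ((1100.0, 15), "loyal"),
]

def classify(point):
    """Return the segment of the nearest labeled example."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = min(training, key=lambda item: dist(item[0], point))
    return nearest[1]

print(classify((1000.0, 14)))  # → loyal
print(classify((130.0, 1)))    # → casual
```

The "supervision" is the labeled training set: the algorithm maps input variables (spend, visits) to the known output variable (segment) and uses that mapping to predict labels for unseen customers.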
Unsupervised machine learning, on the other hand, is when you let the program find patterns of its own, without any labels. A good example is understanding customer purchase patterns. You get inherent groupings (clustering) according to purchasing behavior, and the program can associate customers and products according to patterns of purchase. For instance, you may discern that customers who buy Product A tend to buy Product B too. This is an example of a user-based recommendation algorithm and market-basket analysis. For users, it would eventually mean that when they buy a particular item, such as a book, they are also encouraged to buy other books that belong to the same series, genre, or category.
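The "customers who buy Product A tend to buy Product B" idea can be sketched with a simple co-occurrence count over hypothetical shopping baskets. A full market-basket analysis would use dedicated algorithms (such as Apriori) and support/confidence measures, but the unlabeled, pattern-finding nature of the task is visible even in this minimal version:

```python
from collections import Counter
from itertools import combinations

# Each basket is one customer's purchase (hypothetical data, no labels)
baskets = [
    {"book_1", "book_2"},
    {"book_1", "book_2", "coffee"},
    {"book_1", "book_2"},
    {"coffee", "milk"},
]

# Count how often each pair of products is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests a "buys A, also buys B" recommendation
print(pair_counts.most_common(1))  # → [(('book_1', 'book_2'), 3)]
```

No one told the program that the two books belong together; the association emerges from the purchase data itself, which is the essence of unsupervised learning.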
Artificial intelligence extends beyond what machine learning can do. It is about making decisions and aiming for success rather than accuracy. One way to think of it is that machine learning aims for knowledge, while artificial intelligence aims for wisdom or intelligence. An example of AI in action is Boston Dynamics' Atlas robot, which can navigate freely in the open world and avoid obstacles without the aid of human control; the robot does not fully depend on historical map data to navigate. Machine learning, by contrast, is about creating or predicting a pattern from historical data analysis. Applied to navigation, it is about finding the most optimal route by deriving patterns from historical and crowd-sourced traffic data.
Setting up a modern data warehouse with cloud analytics is a key factor in preparing to execute ML/AI. Without migrating workloads to the cloud, deriving ML/AI models will encounter various roadblocks, limiting the business value of these emerging technologies. A modern data warehouse and analytics pipeline are the backbone that enables them.
Microsoft is a leader in machine learning and artificial intelligence, driving a lot of innovation throughout its products and tools: for instance, Windows' digital assistant, Cortana, and Office 365's live captions and subtitles. Microsoft offers a range of products, tools, and services, such as Microsoft Cognitive Services, ML Studio, Azure Machine Learning service, and ML.NET.
Microsoft is setting an example with its AI for Good initiative, which aims to make the world more sustainable and accessible through AI. One particularly interesting project is AI for Snow Leopards, in which Microsoft uses AI technology to detect snow leopards (which are almost invisible against snow) in order to protect the endangered species.
Artificial intelligence and deep learning, specifically their data science and mathematical aspects, are not the focus of this book, but you will encounter some of these concepts in later chapters (see more on this in Chapter 3, Processing and Visualizing Data).
DataOps
In order to be efficient and agile in implementing data analytics in your company, you need the right culture and processes. This is where the concept of DataOps comes in. DataOps removes the coordination barrier between data teams (analysts, engineers, and scientists) and operations teams (administrators and operations managers) in order to achieve speed and accuracy in data analytics.
DataOps is about a culture of collaboration between different roles and functions. Data scientists have access to real-time data to explore, prepare, and serve. Automated processes and flows prove invaluable to this collaborative effort between analysts and developers, providing easy access to data through visualization tools. Relevant data should be served to end users via web or mobile applications, usually through an Application Programming Interface (API). For CEOs, DataOps means faster decision-making, as it allows them to monitor their business at a high level without waiting for team leaders to report. The following figure illustrates the idea of a collaborative DataOps culture:
Figure 1.1: DataOps process
Once a team attains the desired speed and accuracy in testing its hypotheses (such as the likelihood of someone buying a product based on their characteristics and behavior), it can derive better insights. Better insights, in turn, give business stakeholders more actionable and reasonable decision points that minimize risks and maximize profits.
Why Microsoft Azure?
Microsoft Azure is an enterprise-grade set of cloud computing services created by Microsoft using their own managed data centers. Azure is the only cloud with a true end-to-end analytics solution. With Azure, analysts can derive insights in seconds from all enterprise data. Azure provides a mature and robust data flow without limitations on concurrency.
Azure supports Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Many government institutions across the world, as well as 95% of Fortune 500 companies, use Azure, across industries ranging from healthcare and financial services to retail and manufacturing.
Microsoft is a technology conglomerate that has, for decades, empowered many people to achieve more with less through its software, tools, and platforms. Azure provides flexibility: familiar Microsoft tools and infrastructure (such as SQL Server, Windows Server, Internet Information Services (IIS), and .NET), as well as MySQL, Linux, PHP, Python, Java, and other open source technologies, can all run on the Azure cloud. Gone are the days when you could only work with a walled-garden set of tools and technologies.
Azure provides you with various products and services, depending on your needs. You have the option of doing everything in a bespoke way, from managing your own IaaS by spinning up Windows Server virtual machines with Enterprise SQL Server installed, to using a managed PaaS offering such as Azure Synapse Analytics (more on this in Chapter 2, Building Your Modern Data Warehouse).
The following figure shows the wide range of data-specific Azure tools and services that can be used to create end-to-end data pipelines:
Figure 1.2: Microsoft Azure data-related services
Azure grants you the flexibility to choose the best approach to solve a problem for yourself, rather than forcing you to bend a less adaptable product to perform an unnatural function. You're not limited to SQL Server, either: you have the flexibility to choose other types of databases or storage, whether through a service installed on a Linux server, a containerized solution, or a managed platform (such as Azure Cosmos DB for your Cassandra and MongoDB instances). This is very important because, in the real world, different scenarios require different solutions, tools, and products.
Microsoft Azure provides you with an end-to-end platform, from Azure Active Directory for managing your user identity and access, to Azure IoT offerings (such as IoT Hub) for gathering data from hundreds and thousands of IoT devices. It also provides services such as development tools and cloud hosting options for getting your developers up to speed, as well as various analytics and machine learning tools which enable data scientists, data engineers, and data analysts to be more productive (more on this in Chapter 3, Processing and Visualizing Data).
The full spectrum of Azure services is too wide to cover here, so instead, this book will focus on the key data warehousing and business intelligence suite of products: Azure Data Factory, Azure Data Lake, Azure Synapse Analytics, Azure Databricks, Azure Analysis Services, Power BI, and Azure Machine Learning (refer to Chapter 2, Building Your Modern Data Warehouse, and Chapter 3, Processing and Visualizing Data).
Microsoft views security as the top priority. When it comes to data, privacy and security are non-negotiable; there will always be threats. Azure has the most advanced security and privacy features in the analytics space. Azure services support data protection through Virtual Networks (VNets) so that, even though they are in the cloud, data endpoints cannot be accessed from the public internet; only users in the same VNet can communicate with each other. For web applications, you get a Web Application Firewall (WAF) provided by Azure Application Gateway, which ensures that only valid requests can get into your network.
With role-based access control (authorization), you can ensure that only those with the right roles, such as administrators, have access to specific components and capabilities of different resources. Authentication, on the other hand, ensures that if you don't have the right credentials (such as passwords), you will not be able to access a resource. Authorization and authentication are built into various services and components of Microsoft Azure with the help of Azure Active Directory.
Azure also provides a service called Azure Key Vault. Key Vault allows you to safely store and manage secrets and passwords, create encryption keys, and manage certificates so that applications do not have direct access to private keys. By following this pattern with Key Vault, you do not have to hardcode your secrets and passwords in your source code and script repository.
Azure Synapse Analytics uses ML and AI to protect your data. In Azure SQL, Microsoft provides advanced data security to ensure that your data is protected. This includes understanding whether your database has vulnerabilities, such as publicly exposed port numbers. These capabilities also help you be more compliant with various standards, such as the GDPR (General Data Protection Regulation), by ensuring that customer data considered sensitive is classified. Azure SQL also recently announced new features, row-level security (RLS) and column-level security (CLS), to control access to rows and columns in a database table based on user characteristics.
Microsoft invests at least 1 billion dollars each year in the cybersecurity space, including the Azure platform. Azure holds various credentials and awards from independent assessment bodies, which ensures that you can trust Azure in all security aspects, from physical security (such that no unauthorized users can get physical access to data centers) to application-level security.
These are just a few of the security features that you would otherwise need to design, build, and maintain yourself if you were running your own data center.
Azure changed the industry by making data analytics cost-efficient. Before the mass adoption of cloud computing, planning for data analytics over terabytes, or even petabytes, of data required careful preparation and the capital expenditure to match: very high upfront infrastructure and professional services costs, just to get started. With Azure, you can start small (many of the services have free tiers) and scale your cloud resources effortlessly up or down, in or out, within minutes. Azure has democratized scaling capability by making it economically viable and accessible for everyone, as shown in Figure 1.3:
Figure 1.3: Microsoft Azure regions
Microsoft Azure currently has 54 regions supporting 140 countries. Some enterprises and industries require your data to be hosted in the same country as your business operations. With data centers available worldwide, it is easy for you to expand to other regions. This multi-region approach also helps make your applications highly available.
The true power of the cloud is its elasticity. This allows you to not only scale resources up, but also scale them down when necessary. This is very useful in data science, which entails variable workloads: when data scientists and engineers are analyzing a dataset, for instance, they need more computation. Azure, through Databricks (more on this in Chapter 2, Building Your Modern Data Warehouse), allows you to scale according to demand. Then, during off-peak times (such as 7 PM to 7 AM on weekdays, and weekends), when the scientists and engineers don't need the processing power, you can scale down your resources so that you don't pay for running resources 24/7. Databricks essentially offers a "pay-as-you-go" or "pay-what-you-use" service.
Azure also provides a Service Level Agreement (SLA) for its services as a commitment to uptime and connectivity for production customers. If downtime or an incident occurs, Microsoft applies service credits (rebates) to the affected resources. This gives you peace of mind that your application will be available with minimal downtime.
There are different scaling approaches and patterns that Microsoft Azure provides:
- Vertical scaling: This is when more resources are added to the same instance (server or service). An example of this is when a virtual machine is scaled up from 4 GB of RAM to 16 GB of RAM. This is a simple and straightforward approach to take when your application needs to scale. However, there is a technical maximum limit on how much an instance can be scaled up, and it is the most expensive scaling approach.
- Horizontal scaling: This is when you deploy your application to multiple instances. Logically, this means you can scale your application near-infinitely, because you are not bound to a single machine. This flexibility also introduces some complexities, which are usually addressed by applying various design patterns and using orchestration technologies such as Docker and Kubernetes.
- Geographical scaling: This is when you scale your applications to different geographical locations for two major reasons: resilience and reduced latency. Resilience allows your application to freely operate in that region without all resources being connected to a master region. Reduced latency would mean users of that region can get their web requests faster because of their proximity to the data center.
- Sharding: This is one of the techniques for distributing huge volumes of related, structured data onto multiple independent databases.
- Development, Testing, Acceptance, and Production (DTAP): This is the approach of having multiple instances living in different logical environments. This is usually done to separate development and test servers from staging and production servers. Azure DevTest Labs offers a development and testing environment that can be configured with group policies.
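Of the approaches above, sharding is the easiest to sketch in code. The following is a minimal, hypothetical illustration of hash-based shard routing: each customer key is hashed, and the hash decides which of several independent databases holds that customer's rows. The shard names are placeholders; in practice they would be separate database instances (for example, separate Azure SQL databases):

```python
import hashlib

# Hypothetical shard identifiers standing in for independent databases
SHARDS = ["customers_shard_0", "customers_shard_1", "customers_shard_2"]

def shard_for(customer_id: str) -> str:
    """Route a customer to a shard by hashing its key.
    A stable hash keeps the same customer on the same shard."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))  # always the same shard for this key
```

Because the hash is deterministic, reads and writes for a given customer always land on the same database, while the overall volume of structured data is spread across all of them.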
Another advantage of your business being cloud-scale is the availability of your services. With Azure, it is easier to make your infrastructure and resources geo-redundant—that is, available to multiple regions and data centers across the world. Say you want to expand your business from Australia to Canada. You can achieve that by making your SQL Server geo-redundant so that Canadian users do not need to query against the application and database instance in Australia.
Azure, despite being a collective suite of products and service offerings, does not force you to go "all in". This means that you can start by implementing a hybrid architecture of combined on-premises data centers and cloud (Azure). There are different approaches and technologies involved in a hybrid solution, such as using Virtual Private Networks (VPNs) and Azure ExpressRoute, if you need dedicated access.
With Azure Data Factory (there'll be more on this in Chapter 2, Building Your Modern Data Warehouse), Azure allows you to get a snapshot of data sources from your on-premises SQL Server. The same concept applies when you have other data sources from other cloud providers or SaaS products; you have the flexibility to copy that data to your Azure data lake. This flexibility is highly convenient because it does not put you in a vendor lock-in position where you need to do a full migration.
Top Business Drivers for Adopting Data Analytics on the Cloud
Different companies have different reasons for adopting data analytics using a public cloud like Microsoft Azure. But more often than not, it boils down to three major reasons: rapid growth and scale, reducing costs, and driving innovation.
Rapid Growth and Scale
Enterprises and businesses need to rapidly expand their digital footprint. With the rapid growth of mobile applications (particularly media-heavy ones handling images and videos), IoT sensors, and social media data, there is just so much data to capture. This means enterprises and businesses need to scale their infrastructure to support these massive demands. Company database sizes continuously grow from gigabytes of data to terabytes, or even petabytes.
End users are more demanding now than ever. If your application does not respond within seconds, the user is more likely to disengage with your service or product.
Scaling does not only apply to the consumers of the applications; it is also important for data scientists, data engineers, and data analysts who analyze a company's data. Scaling infrastructure is vital, as you cannot expect your data engineers to handle massive chunks of data (gigabytes to terabytes) and run scripts for testing data models on a single machine. Even if you do serve this from a single high-performance server instance, the tests are going to take days or weeks to finish, not to mention the performance bottlenecks they will cause for the end users consuming the same database.
With a modern data warehouse like Azure Synapse Analytics, you have some managed capabilities to scale, such as a dedicated caching layer. Caching will allow analysts, engineers, and scientists to query faster.
Reducing Costs
Due to scaling demands, enterprises and businesses need to have a mechanism to expand their data infrastructure in a cost-effective and financially viable way. It is too expensive to set up a data warehouse on premises. The following are just a few of the cost considerations:
- The waiting time for server delivery and associated internal procurement processes
- Networking and other physical infrastructure costs, such as hardware-cooling and data center real estate
- Professional services costs associated with setting up and maintaining these servers
- Licensing costs (if any)
- The productivity lost from people and teams who cannot ship their products faster
With a modern data warehouse, you can spin up new high-performance servers with high-performance graphics cards on demand. And with the use of a cloud provider such as Microsoft Azure, you will only need to pay for the time that you use these servers. You can shut them down if you don't need them anymore. Not only can you turn them off on demand, but if it turns out that a particular service is not suitable to your requirements, you can delete these resources and just provision a different service.
Azure also provides discounts for "reserved" instances that you commit to using for a specific period of time. These are very helpful for databases, storage solutions, and applications that need to be up 24/7 with minimal downtime.
Driving Innovation
Companies need to constantly innovate in this very competitive market; otherwise, someone else will rise up and take the market share. But obviously, no one can predict the future with 100% accuracy; hence, companies need a mechanism to explore new things based on what they know.
One good example of this is the Business Process Outsourcing (BPO) and telecommunications (telco) industries, where there are petabytes of data that may not have been explored yet. With Microsoft Azure's modern data warehouse, actors in such industries can have the infrastructure to do data exploration. With Azure Data Lake, Azure Data Factory, Azure Synapse Analytics, Azure Databricks, Power BI, and Azure Machine Learning, they can explore their data to drive business decisions. Maybe they can come up with a data model that can detect fraudulent actions or better understand their customer preferences and expectations to improve satisfaction ratings. With advanced analytics, these companies can come up with decisions that are relevant today (and possibly in the future) and are not just restricted to analyzing historical data.
What if you want to create an autonomous vehicle? You will need a robust data warehouse to store your datasets and a tremendous amount of data-processing capacity. You need to capture massive amounts of data—whether through the pictures or videos that the car is continuously capturing—and come up with a response almost instantly based on your dataset and algorithms.
Using a cloud provider like Microsoft Azure would allow you to test and validate your ideas early on, without a massive investment. With Azure, you can rapidly prototype your ideas and explore possibilities. What if it turns out that the product or service that you or your team is working on does not really gain traction? If you are doing this on-premises, you will still have high liability and operations costs since you physically own the infrastructure, in addition to any associated licensing and services costs.
Why Do You Need a Modern Data Warehouse?
A data warehouse is a centralized repository that aggregates different (often disparate) data sources. The main difference between a data warehouse and a database is that data warehouses are meant for OLAP (Online Analytical Processing), while databases are intended for OLTP (Online Transaction Processing). OLAP means that data warehouses are primarily used to generate analytics, business intelligence, and even machine learning models. OLTP means that databases are primarily used for transactions—the day-to-day operations of applications, which concurrently read and write data to databases.
A data warehouse is essential if you want to analyze your big data, as it also contains historical data (often called cold data). Much of the stored data is legacy information—data stored 5, 10, or even 15 years ago. You probably don't want the same database instance that your end users are querying against to also contain that historical data, as it might affect performance at scale.
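The OLTP/OLAP distinction can be illustrated in miniature: the same data serves single-row application writes (OLTP) and historical aggregates (OLAP). Here is a minimal sketch using SQLite as a stand-in; the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
)

# OLTP: small, frequent writes from the application's day-to-day operations.
orders = [
    ("alice", 120.0, "2018-03-01"),
    ("bob",    75.5, "2018-07-19"),
    ("alice",  99.9, "2019-02-11"),
]
conn.executemany(
    "INSERT INTO orders (customer, amount, order_date) VALUES (?, ?, ?)", orders
)

# OLAP: an analytical aggregate over the full history -- the kind of query
# you would rather run against a warehouse than the live application database.
yearly = conn.execute(
    "SELECT strftime('%Y', order_date) AS yr, ROUND(SUM(amount), 2) "
    "FROM orders GROUP BY yr ORDER BY yr"
).fetchall()
print(yearly)  # [('2018', 195.5), ('2019', 99.9)]
```

At scale, the OLAP query would run against the warehouse's copy of the data, leaving the transactional database free to serve users.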
Here are some of the advantages of having a modern data warehouse:
- Supports any data source
- Highly scalable and available
- Provides insights from analytical dashboards in real time
- Supports a machine learning environment
The various tools and services that constitute the modern data warehouse are connected with each other as follows:
Figure 1.4: Modern Data Warehouse architecture
There are a lot of emerging patterns and architectures for data warehousing, but the most popular ones are those that support separation of duties and responsibilities on different phases of the data pipeline (more on this in the Creating a Data Pipeline section).
In order to understand what it means for a data warehouse to be modern, you first need to understand how you create and manage a traditional one. It boils down to two major concepts:
- Compute: This refers to the ability to process the data and make sense out of it. It can be in the form of a database query to make the results accessible to another interface, such as web applications.
- Storage: This refers to the ability to keep data in order for it to be accessible at any time in the future.
A modern data warehouse separates compute and storage in cost-effective ways. Traditionally—as with SQL Server and SQL Server Integration Services (SSIS)—the pricing model bundled together storage capacity and the computing power used to analyze data. Azure was the first cloud provider to offer a data warehouse that separates compute and storage.
Another change in pattern is that the traditional ETL (Extract-Transform-Load) model of data warehousing has now shifted to ELT (Extract-Load-Transform). In the traditional ETL model, analysts are accustomed to waiting for the data to be transformed first, since they don't have direct access to all data sources. In a modern data warehouse, massive amounts of data can be stored in either a data lake or a data warehouse and transformed at any time by analysts, without the need to wait for data engineers or database admins to serve the data.
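The ELT shift can be sketched as a toy example: raw records are landed first, untouched, and each analyst transforms them later with a query of their own. SQLite stands in for the warehouse here, and the payload format is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw data as-is, with no up-front transformation.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [
        ("signup|alice|2020-01-05",),
        ("purchase|bob|2020-01-06",),
        ("signup|carol|2020-01-08",),
    ],
)

# Transform: analysts reshape the raw data on demand, inside the warehouse,
# without waiting for an upstream ETL job to prepare it for them.
signups = conn.execute(
    """
    SELECT substr(payload, instr(payload, '|') + 1) AS rest
    FROM raw_events
    WHERE payload LIKE 'signup|%'
    """
).fetchall()
print(len(signups))  # 2
```

In ETL, the parsing step would have happened before loading; in ELT, the raw table stays available so other analysts can apply different transformations to the same data.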
Of course, there are more factors to consider in order to modernize your data warehouse, such as extensibility, disaster recovery, and availability. However, this section will focus on compute for the time being.
Bringing Your Data Together
In the past, databases were often the only source of data for your applications. But nowadays, you have hundreds and thousands of different data sources. The data coming from these different sources has different data types—some structured, some unstructured.
Structured Data: The word 'structured' suggests that there is a pattern that can be easily interpreted. This usually comes with a predefined set of models and a schema. A Relational Database Management System (RDBMS) such as Microsoft SQL Server is a common example of a structured data storage solution, because it comes with a database schema and table columns that define the data you are storing.
Here are some examples of structured data types:
- Customer name(s)
- Date and time
- Mobile and phone numbers
- Credit card numbers
- Product names and Stock Keeping Units (SKUs)
- General transaction information such as "From" and "To" with time stamps and amount values
A good example of structured data is the information provided by users when signing up for an application for the first time. They are presented with a form that needs to be filled in. Once the user clicks the submit button, the data is sent to a database and inserted into a user table with predefined columns: names, addresses, and other details. This then allows the user to log into the application, since the system can now look up the existing record for the registered user in the database.
From there, a user can access the application and perform transactions, such as transferring money and assets. In time, users will generate a series of transactions that will eventually make your database larger. Your database schema will also expand to support different business requirements.
Once you have enough data, you can perform data exploration. This is where you start looking for patterns in data. You may identify fraudulent transactions and test hypotheses by analyzing large and repeated transaction amounts from the same user.
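As a sketch of that kind of exploration, the following flags users with repeated large transactions. The threshold and repetition rule here are invented purely for illustration:

```python
from collections import defaultdict

# (user, amount) pairs -- a stand-in for transaction records.
transactions = [
    ("alice", 50), ("alice", 9500), ("alice", 9500), ("alice", 9500),
    ("bob", 120), ("bob", 80),
    ("carol", 9999),
]

LARGE = 9000   # hypothetical "large amount" threshold
REPEATS = 3    # hypothetical repetition rule

# Count large transactions per user.
large_counts = defaultdict(int)
for user, amount in transactions:
    if amount >= LARGE:
        large_counts[user] += 1

# Users with repeated large transactions are flagged for closer review.
flagged = sorted(u for u, n in large_counts.items() if n >= REPEATS)
print(flagged)  # ['alice']
```

In practice this rule would be one hypothesis among many, tested and refined against the full transaction history in the warehouse.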
Your data exploration is limited because you can only base it on a dataset that is structured and has semantic form. What if you also want to consider other data sources that are unstructured, such as free-form text? An example is a transaction description that may state the nature or the recipient of the transaction. You don't want to manually read each transaction description and insert it into the right column of a database table. You probably want to extract only the relevant information and transform it into a structured format. This is where unstructured data comes in.
Unstructured Data: This data type is, more or less, "the rest"—that is, everything that isn't structured data. This is mainly because you are not limited to anything. Unstructured data types usually don't have a predefined data model that can fit directly into a database. Unstructured data can be "text-heavy" and is usually read per line or is space-separated. Here are some examples of unstructured data sources:
- Image files
- Email messages and documents
- Log files
- IoT devices and sensors
- NoSQL databases such as MongoDB
- Social media and Microsoft Graph
Image files and videos are classified as unstructured data because of their dynamic nature. Although their metadata is something you can consider as structured (like title, artist, filename, etc.), the content itself is unstructured. With modern tools and data analytics technology, you can now examine this data and make sense of it. The usual example is face recognition in either images or videos.
Emails, documents, and log files all have metadata, but what you're actually more interested in is the content of those files. Usually, in emails, documents, and log files, data is separated per line and the messages are unstructured. Here you would want to describe the content without manually reading everything (which could be hundreds, or even millions of files). An example is doing sentiment analysis on content to determine whether the prevailing emotion is happy, sad, or angry. For log files, you probably want to separate the error messages, time stamps (dates), and measurements (traces) between messages.
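Separating time stamps, severity, and message text from log lines is typically a pattern-matching job. A minimal sketch follows; the log format shown is an assumption for illustration, not a standard:

```python
import re

# Hypothetical log lines in the form "<timestamp> <LEVEL> <message>".
log_lines = [
    "2020-06-01T10:15:02 ERROR Payment gateway timeout",
    "2020-06-01T10:15:03 INFO Retry scheduled",
    "2020-06-01T10:16:47 ERROR Payment gateway timeout",
]

# Capture the timestamp, the severity level, and the free-text message.
pattern = re.compile(r"^(\S+)\s+(ERROR|WARN|INFO)\s+(.*)$")

errors = []
for line in log_lines:
    match = pattern.match(line)
    if match and match.group(2) == "ERROR":
        timestamp, _, message = match.groups()
        errors.append((timestamp, message))

print(errors)
# [('2020-06-01T10:15:02', 'Payment gateway timeout'),
#  ('2020-06-01T10:16:47', 'Payment gateway timeout')]
```

Once the unstructured lines are reduced to (timestamp, message) tuples like these, they fit naturally into a structured warehouse table.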
IoT devices and sensors, similar to log files, are used to capture measurements and errors about a certain item. The main difference is that these devices usually operate in large clusters (hundreds to thousands of devices) and continuously stream data. Data generated from these devices is semi-structured or unstructured, since it usually arrives in JSON or XML format. Modern technologies, such as Azure IoT services, already address these complexities with services like Azure IoT Hub, which aggregates all this data from various sensors and continuously exports it to a data source. Sometimes you can classify this data as semi-structured, since these traces and logs are things that a system can easily understand.
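As a sketch of what such semi-structured sensor payloads look like, and how a downstream consumer might flatten them into warehouse-ready rows, consider the following. The device ID, field names, and outlier threshold are all invented for illustration:

```python
import json

# A hypothetical JSON payload, as it might arrive from an IoT hub.
payload = """
{
  "device_id": "sensor-042",
  "readings": [
    {"ts": "2020-06-01T10:00:00Z", "temp_c": 21.5},
    {"ts": "2020-06-01T10:01:00Z", "temp_c": 21.7},
    {"ts": "2020-06-01T10:02:00Z", "temp_c": 35.2}
  ]
}
"""

message = json.loads(payload)

# Flatten the nested readings into rows a warehouse table could accept,
# flagging any reading above a (hypothetical) 30 degree threshold.
rows = [
    (message["device_id"], r["ts"], r["temp_c"], r["temp_c"] > 30.0)
    for r in message["readings"]
]
print(rows[-1])  # ('sensor-042', '2020-06-01T10:02:00Z', 35.2, True)
```

The nesting is what makes the payload "semi-structured": a machine can parse it reliably, but it must be flattened before it fits a fixed table schema.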
Social media platforms and Microsoft Graph both provide semi-structured data. It is classified this way because just querying all of Twitter's tweets about a topic is not enough. The results don't really make a lot of sense until you do some analysis of them. The primary focus is to discern patterns and anomalies. For example, you may want to identify trends about news and topics but also want to remove data that is irrelevant, such as tweets coming from fake accounts.
Interestingly, some Line of Business (LOB) applications provide both structured and unstructured data. For example, both Microsoft Dynamics CRM and Salesforce provide structured data that can easily be interpreted and exported to your SQL database tables, such as data for products and their amounts and value. However, they also support unstructured data such as images, videos, and "note text". Note that, even though "note text" is a string, it can still be considered as unstructured data because it is designed to be "free text". It doesn't have a proper format to follow, but it is still worth exploring. A common scenario for its use is to understand why sales were not successful.
Creating a Data Pipeline
Once you have identified your data sources, the next step is to create a data pipeline (sometimes also referred to as a data flow). At a high level, the steps involved are data ingestion, data storage, data preparation and training, data modeling and serving, and data visualization.
With this approach, you will build a highly scalable architecture that serves all the users of the system: from end users, data engineers and scientists who are doing the data exploration, and analysts who interpret the data for the business, to even the CEO if she/he wants to see what's happening with the business in real time:
Data ingestion means transferring data (structured and unstructured) from the source to your storage, data lake, or data warehouse.
This would involve something such as Azure Data Factory (more on this in Chapter 2, Building Your Modern Data Warehouse) transferring data from various sources, such as on-premises databases and SAAS products, to a data lake. This step allows you to manage your Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) workflows without the need for manual reconciliation.
This is not a one-time process. Ideally, this is something you schedule or set to be triggered so that your data lake gets a historical snapshot from time to time. An example of this is a connection from your CRM tools, such as Microsoft Dynamics CRM, to Azure Data Lake by means of Azure Data Factory. This will allow data scientists and data engineers to explore this data at different time intervals without interrupting the actual CRM application.
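As an illustrative sketch, a scheduled Azure Data Factory copy from a CRM source to a data lake is authored as a pipeline definition along these lines. This is a simplified shape, and the pipeline and dataset names here are hypothetical; consult the Data Factory documentation for the full schema:

```json
{
  "name": "CopyCrmToDataLake",
  "properties": {
    "activities": [
      {
        "name": "CopyCrmAccounts",
        "type": "Copy",
        "inputs": [
          { "referenceName": "CrmAccountsDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "DataLakeRawDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DynamicsSource" },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

A trigger attached to such a pipeline is what turns a one-off copy into the recurring snapshots described above.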
Once the data has been ingested from various data sources, all the data is stored in a data lake. The data residing within the lake will still be in a raw format and includes both structured and unstructured data formats. At this point, the data won't bring much value to drive business insights.
Data Pipeline Orchestration & Monitoring
In a modern data warehouse scenario, it is very important that data sources and services efficiently transfer data from source to destination. Azure Data Factory is an orchestrator that allows services to perform the data migration or transfer. It does not perform the actual transfer itself, but rather instructs a service to perform it—for example, instructing a Hadoop cluster to perform a Hive query.
Azure Data Factory also allows you to create alerts and metrics to notify you when the service orchestration is working. You can create an alert via email, in case a data transfer was not successful from source to destination.
In a modern data warehouse pattern, sharing data should be both seamless and secure. Often, this is done via FTP (File Transfer Protocol), email, or APIs, to name a few. There is a big management overhead if you want to share data at scale. Azure Data Share allows you to securely manage and share your big data with other parties and organizations. The data provider retains full control of who can access the datasets and the operations each party can perform. This makes it easier for dependent companies to derive insights and explore AI scenarios.
Once data is ingested, the next step is data preparation. This is the phase where data from different sources is pre-processed for data analytics purposes. An example of this is querying data from an API and inserting the results into a database table. Azure Data Factory allows you to orchestrate this data preparation. Azure Databricks can also help with data preparation, as it can run clusters concurrently to process massive amounts of data in a matter of seconds or minutes.
Data Transform, Predict, & Enrich
Sometimes, data preparation requires changes beyond a simple "copy-and-paste" scenario. This is where data transformation comes in. There are instances where you want to apply custom logic to the raw data first—applying filters, for instance—before you transfer it to a data warehouse. Azure Data Factory and Azure Databricks can also help in this scenario.
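The "filter before loading" idea can be sketched as a small transform step between raw input and warehouse-bound output. The record shape, filter rule, and derived column here are invented for illustration:

```python
# Raw records as they might land in the data lake.
raw_records = [
    {"product": "laptop",   "amount": 1200, "country": "US"},
    {"product": "mouse",    "amount": -5,   "country": "US"},  # invalid row
    {"product": "keyboard", "amount": 45,   "country": "DE"},
]

def transform(records):
    """Drop invalid rows and derive a column before loading downstream."""
    cleaned = []
    for r in records:
        if r["amount"] <= 0:  # custom filter logic applied to the raw data
            continue
        # Derive an additional column as part of the transformation.
        cleaned.append({**r, "amount_cents": r["amount"] * 100})
    return cleaned

warehouse_ready = transform(raw_records)
print(len(warehouse_ready))  # 2
```

In a real pipeline, logic like this would run inside a Data Factory data flow or a Databricks job rather than a local function, but the shape of the step is the same.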
Furthermore, you can enrich the batch data at scale by invoking an Azure Machine Learning Service that makes real-time predictions about the data. This can be performed as an added feature in your data pipeline in Azure Data Factory. To learn more about Azure Machine Learning Service, see Chapter 3, Processing and Visualizing Data.
After preparing and training your data, you'll be ready to model and serve it to the consumers. Basically, in this phase, you are modeling the data to be easily understood by systems. This usually involves performing the complex queries you generated from the data preparation and training phase, and inserting these records into a database so that the data is structured in a defined table and schema.
All of your company's analytical data is stored in a data warehouse. You potentially have hundreds to thousands of concurrent users, reports, and dashboards running off a single data warehouse.
You usually perform data modeling and service integration with a data warehouse platform such as Microsoft Azure Synapse Analytics. Complex queries over complete datasets can take hours or days. But with the power of the cloud, you can scale your Azure Synapse Analytics to perform these queries faster, turning days into hours and hours into minutes (more details on this in Chapter 2, Building Your Modern Data Warehouse).
Data visualization is an efficient way of analyzing performance through graphs and charts. This is called business intelligence. Tools such as Power BI help analysts to get the most out of data. Data visualization provides a rich and meaningful representation of your data that adds business value for you and your customers. The team can see trends, outliers, and patterns which help in making data-driven decisions.
Various stakeholders within the organization can collaborate after analyzing the different performance parameters. Is your company selling products well? In what regions do you get most of your sales? With rich data backing up your assumptions, business stakeholders, such as CEOs, can make reasonable data-driven decisions to minimize risks. What product lines should you expand? Where should you expand further? These are some of the common questions that you can answer once you have richer data analytics.
Analysts can use desktop or web application tools to create meaningful representations of their data. Below is a sample of a desktop view of Power BI where a user can perform analysis of their company's data and visualize it in graphs:
Figure 1.6: Power BI dashboard on desktop
Once the reports are generated, they can be exported to a workspace where people can work together to improve the reports. Below is a sample view of the same report in a mobile application. Users can add comments and annotations to the report, allowing a faster feedback loop for analysts:
Power BI allows you to create rich personalized dashboards that best suit your requirements and brand. Compared to having presentations with a snapshot of a graph from last week or last month, this mechanism allows you to refresh the same report over and over again.
Machine learning has helped companies build applications and products such as chatbots that perform specific tasks for end users without the need for human intervention. Some common examples are voice assistants, such as Cortana, which actively learn to help us become more productive with our day-to-day tasks.
Other examples are online games in which you can easily track your performance against everyone in the world. You can better understand how you rank against other players, what areas you excel in, what needs improvement, and how you can improve.
The number of tasks you can do with rich data are virtually limitless, but in order to perform them, you need to have the right approach and infrastructure to handle a high level of scaling.
This chapter established the importance of data analytics. It also highlighted several reasons why Microsoft Azure is an ideal platform of choice for achieving business intelligence capabilities on the cloud. It also touched on some fundamental concepts around big data, machine learning, and DataOps. You also learned some of the business drivers for adopting data analytics on the cloud. Lastly, you gained a high-level view of what it takes to have a modern data warehouse.
In the next chapter, you will see how to start building a modern data warehouse with Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Synapse Analytics, and related technologies.