1. Introducing analytics on Azure
According to a survey by Dresner Advisory Services, in 2019 an all-time high of 48% of organizations said that business intelligence in the cloud was either critical or very important to their business operations. The same Cloud Computing and Business Intelligence Market Study also showed that sales and marketing teams get the most value out of analytics.
As businesses grow, they generate massive amounts of data every day. This data comes from different sources, such as mobile phones, the Internet of Things (IoT) sensors, and various Software as a Service (SaaS) products such as Customer Relationship Management (CRM) systems. Enterprises and businesses need to scale and modernize their data architecture and infrastructure in order to cope with the demand to stay competitive in their respective industries.
Having cloud-scale analytics capabilities is the go-to strategy for achieving this growth. Instead of managing your own data center, harnessing the power of the cloud allows your business to be more accessible to your users. With the help of a cloud service provider such as Microsoft Azure, you can accelerate your data analytics practice without being limited by your IT infrastructure. The game has changed in terms of maintaining IT infrastructure, as data lakes and cloud data warehouses are now capable of storing and maintaining massive amounts of data.
Simply gathering data does not add value to your business; unless you derive insights from it through data analytics and use them to help your business grow, it will just be a data swamp. Azure is more than just a hub for gathering data; it is an invaluable resource for data analytics. Data analytics provides you with the ability to understand your business and customers better. By applying various data science concepts, such as machine learning (ML), regression analysis, classification algorithms, and time series forecasting, you can test your hypotheses and make data-driven decisions for the future. However, one of the challenges that organizations continuously face is how to derive these analytical modeling capabilities quickly when processing billions of data rows. This is where having a modern data warehouse and data pipeline can help (more on this in the next sections).
There are a number of ways in which data analytics can help your business thrive. In the case of retail, if you understand your customers better, you will have a better idea of what products you should sell, where to sell them, when to sell them, and how to sell them. In the financial sector, data analytics is helping authorities fight crime by detecting fraudulent transactions and providing more informed risk assessments based on historical criminal intelligence.
This chapter will cover the following fundamental topics on the power of data:
- Big data analytics
- Machine Learning (ML)
- Artificial Intelligence (AI)
You will also learn why Microsoft Azure is the platform of choice for performing analytics on the cloud. Lastly, you will study the fundamental concepts of a modern data warehouse and data pipelines.
The power of data
As a consumer, you have seen how the advent of data has influenced our daily activities. Most popular entertainment applications, such as YouTube, now provide a customized user experience, with features such as video recommendations based on our interests and search history. It is now child's play to discover new content that's similar to our preferred content, as well as new and popular trending content.
Due to the major shift in wearable technology, it has also become possible to keep track of our health statistics by monitoring heart rates, blood pressure, and so on. These devices then formulate a tailored recommendation based on the averages of these statistics. But these personalized health stats are only a sample of the massive data collection happening every day on a global scale, to which we actively contribute.
Millions of people all over the world use social networking platforms and search engines every day. Internet giants such as Facebook, Instagram, and Google use clickstream data to come up with innovations and improve their services. Data collection is also carried out extensively under projects such as The Great Elephant Census and eBird that aim to boost wildlife conservation. Data-driven techniques have been adopted for tiger conservation projects in India. It even plays an invaluable role in global efforts to compile evidence, causes, and possible responses to climate change—to understand sea surface temperature, analyze natural calamities such as coastal flooding, and highlight global warming patterns in a collective effort to save the ecosystem.
Organizations such as Global Open Data for Agriculture and Nutrition (GODAN), which can be used by farmers, ranchers, and consumers alike, contribute to this tireless data collection as well.
Furthermore (as with the advent of wearable technology), data analysis is contributing to pioneering advancements in the healthcare sector. Patient datasets are analyzed to identify patterns and early symptoms of diseases in order to devise better solutions to known problems.
The scale of data being talked about here is massive, hence the popular term big data, which describes the practice of harnessing the power of this data at scale.
You can read more about open data at https://www.data.gov/.
Big data analytics
The term "big data" is often used to describe massive volumes of data that traditional tools cannot handle. It can be characterized by the five Vs:
- Volume: This indicates the volume of data that needs to be analyzed for big data analytics. We are now dealing with larger datasets than ever before. This has been made possible because of the availability of electronic products such as mobile devices and IoT sensors that have been widely adopted all over the globe for commercial purposes.
- Velocity: This refers to the rate at which data is being generated. Devices and platforms, such as those just mentioned, constantly produce data on a large scale and at rapid speed. This makes collecting, processing, analyzing, and serving data at rapid speeds necessary.
- Variety: This refers to the structure of data being produced. Data sources are inconsistent, having a mix of structured, unstructured, and some semi-structured data (you will learn more about this in the Bringing your data together section).
- Value: This refers to the value of the data being extracted. Accessible data may not always be valuable. With the right tools, you can derive value from the data in a cost-effective and scalable way.
- Veracity: This is the quality or trustworthiness of data. A raw dataset will usually contain noise and bias and will need cleaning. Having a large dataset is not useful if most of the data is inaccurate.
Big data analytics is the process of finding patterns, trends, and correlations in unstructured data to derive meaningful insights that shape business decisions. This unstructured data is usually large in file size (images, videos, and social graphs, for instance).
This does not mean that relational databases are not relevant for big data. In fact, modern data warehouse platforms such as Azure Synapse Analytics (formerly known as Azure SQL Data Warehouse) support structured and semi-structured data (such as JSON) and can scale to support terabytes or even petabytes of data. Using Microsoft Azure, you have the flexibility to choose any platform, and these technologies can complement each other to achieve a robust data analytics pipeline.
Here are some of the best use cases of big data analytics:
- Social media analysis: Through social media sites such as Twitter, Facebook, and Instagram, companies can learn what customers are saying about their products and services. Social media analysis helps companies to target their audiences by utilizing user preferences and market trends. The challenges here are the massive amount of data and the unstructured nature of tweets and posts.
- Fraud prevention: This is one of the most familiar use cases of big data. One of the prominent features of big data analytics when used for fraud prevention is the ability to detect anomalies in a dataset. Validating credit card transactions by understanding transaction patterns, such as location data and categories of purchased items, is an example of this. The biggest challenge here is ensuring that the AI/ML models are clean and unbiased. If a model is trained on just one parameter, such as a user's country of origin, it will focus on patterns in the user's location alone and may miss signals in other parameters.
- Price optimization: Using big data analytics, you can predict what price points will yield the best results based on historical market data. This allows companies to ensure that they do not price their items too high or too low. The challenge here is that many factors can affect prices. Focusing on just a specific factor, such as a competitor's price, might eventually train your model to just focus on that area, and may disregard other factors such as weather and traffic data.
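The anomaly-detection idea behind the fraud-prevention use case above can be sketched with a simple statistical rule: flag any transaction that sits far from the average. The transaction amounts and the two-standard-deviation threshold below are hypothetical, and production systems use far more sophisticated models; this is only a minimal illustration of the concept.

```python
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.0):
    """Flag amounts that deviate from the mean by more than
    `threshold` standard deviations (a simple z-score rule)."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [a for a in amounts if abs(a - mu) > threshold * sigma]

# Hypothetical daily card transactions with one extreme purchase.
transactions = [23.5, 41.0, 18.2, 35.9, 27.4, 22.1, 30.0, 2500.0]
print(flag_anomalies(transactions))  # [2500.0]
```

Note that a single extreme value also inflates the standard deviation, which is one reason real fraud systems prefer robust statistics or learned models over a plain z-score.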
Big data for businesses and enterprises is usually accompanied by the concept of having an IoT infrastructure, where hundreds, thousands, or even millions of devices are connected to a network that constantly sends data to a server.
Internet of Things (IoT)
IoT plays a vital role in scaling your application beyond your current data sources. IoT is simply an interconnection of devices, embedded in the objects around us, that send and receive data. IoT allows us to constantly gather data about "things" without manually encoding it into a database.
A smartwatch is a good example of an IoT device: it constantly measures your body's vital signs. Instead of taking readings with a measuring device and entering them into a system manually, a smartwatch records your data automatically. Another good example is a device tracker that captures an asset's location, temperature, and humidity. This allows logistics companies to monitor their items in transit, ensuring the quality and efficiency of their services.
At scale, these IoT devices generate anywhere from gigabytes to terabytes of data. This data is usually stored in a data lake in a raw, unstructured format, and is later analyzed to derive business insights. A data lake is a centralized repository of all structured, semi-structured, and unstructured data. In the example of the logistics company mentioned previously, patterns (such as the best delivery routes) could be generated. The data could also be used to understand anomalies such as data leakage or suspected fraudulent activities.
As your data grows in size, it opens a lot of opportunities for businesses to go beyond understanding business trends and patterns. Machine learning and artificial intelligence are examples of innovations that you can exploit with your data. Building your artificial intelligence and ML capabilities is relatively easy now because of the availability of the requisite technologies and the ability to scale your storage and compute on the cloud.
Machine learning and artificial intelligence are terms that are often mixed up. In a nutshell, machine learning is a subset (or application) of artificial intelligence. Machine learning aims to allow systems to learn from past datasets and adapt automatically without human assistance. This is made possible by a series of algorithms being applied to the dataset; the algorithm analyzes the data in near-real-time and then comes up with possible actions based on accuracy or confidence derived from previous experience.
The word "learning" indicates that the program is constantly learning from data fed to it. The aim of machine learning is to strive for accuracy rather than success. There are three main categories of machine learning algorithms: supervised, unsupervised, and reinforcement.
Supervised machine learning algorithms create a mapping function to map input variables with an output variable. The algorithm uses existing datasets to train itself to predict the output. Classification is a form of supervised ML that can be used in applications such as image categorization or customer segmentation, which is used for targeted marketing campaigns.
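As an illustration of the mapping idea, here is a minimal supervised classifier: a nearest-centroid model for the customer-segmentation example. The features (annual spend, visits per month) and labels are hypothetical, and a real project would use a library such as scikit-learn rather than hand-rolled code; this sketch only shows how labeled examples train a mapping from inputs to outputs.

```python
def train(samples):
    """Compute one centroid (average feature vector) per label
    from labeled training data."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean
    distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical labeled customers: (annual spend, visits per month).
training = [
    ([1200.0, 8.0], "frequent"), ([1100.0, 9.0], "frequent"),
    ([150.0, 1.0], "occasional"), ([200.0, 2.0], "occasional"),
]
centroids = train(training)
print(predict(centroids, [1000.0, 7.0]))  # frequent
```

The labels in the training set are what make this supervised: the algorithm is told the correct output for each input and learns the mapping from those examples.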
Unsupervised machine learning, on the other hand, is when you let a program find patterns of its own, without any labels. A good example is understanding customer purchase patterns: you get inherent groupings (clustering) according to purchasing behaviors, and the program can associate customers and products according to patterns of purchase. For instance, you may discern that customers who buy Product A tend to buy Product B too. This is an example of a recommendation algorithm and market-basket analysis. In practice, it means that when a user buys a particular item, such as a book, they are also encouraged to buy other books in the same series, genre, or category.
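The "customers who buy Product A tend to buy Product B" observation can be illustrated with a small co-occurrence count, a simplified form of market-basket analysis. The baskets below are hypothetical, and no labels are involved, which is what makes the approach unsupervised: the structure emerges from the data itself.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        # Every unordered pair of distinct products in the basket.
        for pair in combinations(sorted(set(basket)), 2):
            pairs[pair] += 1
    return pairs

# Hypothetical purchase histories, with no labels attached.
baskets = [
    ["book_a", "book_b"],
    ["book_a", "book_b", "mug"],
    ["mug", "pen"],
    ["book_b"],
]
print(co_purchase_counts(baskets).most_common(1))
# [(('book_a', 'book_b'), 2)]
```

Real recommendation engines go further, normalizing these counts into association metrics such as support, confidence, and lift, but the underlying signal is this kind of co-occurrence.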
Reinforcement Learning (RL) derives meaningful insights and actions from rewards and punishments. The main difference between RL and supervised learning is that RL does not need labeled input and output pairs. An excellent example is the new financial trend for "robo-advisors". Robo-advisors run agents that are rewarded and punished based on their stock performance (that is, gains and losses). Over time, an agent learns whether to hold, buy, or sell stocks. This has been a game-changer: in the past, analysts had to make every single decision, whereas now most of the complicated data trends are already analyzed, and analysts can choose whether or not to listen to the agent. However, financial trading is very complex given the number of parameters involved, so not all robo-advisors' predictions are accurate.
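The reward-and-punishment loop can be sketched with a tiny epsilon-greedy agent choosing between hold, buy, and sell. The fixed per-step rewards below are hypothetical and deliberately simplified; a real trading agent would face noisy, non-stationary rewards and a far richer state space. The sketch only shows how an agent learns from reward feedback rather than from labeled examples.

```python
import random

def best_action(rewards, steps=1000, epsilon=0.1, seed=42):
    """Epsilon-greedy agent: mostly exploit the action with the best
    average reward observed so far, occasionally explore at random."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in rewards}
    counts = {a: 0 for a in rewards}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(list(rewards))  # explore
        else:  # exploit the best average reward seen so far
            action = max(rewards, key=lambda a:
                         totals[a] / counts[a] if counts[a] else 0.0)
        totals[action] += rewards[action]()  # reward or punishment
        counts[action] += 1
    return max(rewards, key=lambda a: totals[a] / max(counts[a], 1))

# Hypothetical per-step returns: "buy" is rewarded, "sell" is punished.
market = {"hold": lambda: 0.0, "buy": lambda: 0.5, "sell": lambda: -0.5}
print(best_action(market))  # buy
```

No labeled examples are provided anywhere; the agent discovers the best action purely by acting and observing the resulting gains and losses.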
Artificial intelligence (AI)
Artificial intelligence extends beyond what machine learning can do. It is about making decisions and aiming for success rather than accuracy. One way to think of it is that machine learning strives for knowledge while artificial intelligence strives for wisdom or intelligence. An example of AI in action is Boston Dynamics' Atlas robot, which can navigate freely in the open world and avoid obstacles without human control; the robot does not fully depend on historical map data to navigate. Machine learning, by contrast, is about creating or predicting patterns from historical data analysis: applied to navigation, it would mean finding the most optimal route by learning patterns from historical and crowd-sourced traffic data.
Setting up a modern data warehouse with cloud analytics is the key to preparing for ML/AI. Without migrating workloads to the cloud, you will encounter various roadblocks when trying to maximize the business value of these emerging technologies. A modern data warehouse and analytics pipeline form the backbone that enables you to clear these roadblocks.
Microsoft is a leader in machine learning and artificial intelligence, driving a lot of innovation throughout their products and tools; for instance, Windows' digital assistant, Cortana, and Office 365's live captions and subtitles. They offer a range of products, tools, and services, such as Microsoft Cognitive Services, Azure Machine Learning Studio, the Azure Machine Learning service, and ML.NET.
Microsoft is setting an example with their AI for Good initiative, which aims to make the world more sustainable and accessible through AI. One particularly interesting project is AI for Snow Leopards, in which Microsoft uses AI technology to detect snow leopards (which are almost invisible in snow) in order to protect the endangered species. Exploring artificial intelligence and deep learning (learning through multi-layered neural networks, without explicit human supervision), specifically the data science and formula aspects, is not the focus of this book, but you will tackle some concepts in later chapters (see more on this in Chapter 3, Processing and visualizing data).
In order to be efficient and agile with implementing data analytics in your company, you need the right culture and processes. This is where the concept of DataOps comes in. DataOps removes the coordination barrier between data teams (analysts, engineers, and scientists) and operations teams (administrators and operations managers) in order to achieve speed and accuracy in data analytics.
DataOps is about a culture of collaboration between different roles and functions. Data scientists have access to real-time data to explore, prepare, and serve results. Automated processes and flows prove invaluable to this collaborative effort between analysts and developers, as they provide easy access to data through visualization tools. Relevant data should be served to end users via web or mobile applications; this is usually possible with an Application Programming Interface (API). For CEOs, DataOps means faster decision-making, as it allows them to monitor their business at a high level without waiting for team leaders to report. Figure 1.1 illustrates a collaborative DataOps culture:
Figure 1.1: The DataOps process
Once a team attains the desired speed and accuracy in testing their hypotheses (such as the likelihood of someone buying a product based on their characteristics and behavior), they are able to derive better insights. Once there are better insights, there are more actionable and reasonable decision points for business stakeholders that minimize risks and maximize profits.
Why Microsoft Azure?
Microsoft Azure is an enterprise-grade set of cloud computing services created by Microsoft using their own managed data centers. Azure is the only cloud with a true end-to-end analytics solution. With Azure, analysts can derive insights in seconds from all enterprise data. Azure provides a mature and robust data flow without limitations on concurrency.
Azure supports Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and SaaS. Many government institutions across the world, as well as 95% of Fortune 500 companies, use Azure, ranging from industries such as healthcare and financial services to retail and manufacturing.
Microsoft is a technology conglomerate that has empowered many people to achieve more with less for decades with their software, tools, and platforms. Azure provides flexibility. Familiar Microsoft tools and infrastructures (such as SQL Server, Windows Server, Internet Information Services (IIS), and .NET) or tools such as MySQL, Linux, PHP, Python, Java, or any other open source technologies can all run on the Azure cloud. Gone are the days when you could only work on a walled-garden set of tools and technologies.
Azure provides you with various products and services, depending on your needs. You have the option of doing everything in a bespoke way, from managing your IaaS by spinning up Windows Server virtual machines with Enterprise SQL Server installed, to using a managed PaaS offering such as Azure Synapse Analytics.
Figure 1.2 shows the wide range of data-specific Azure tools and services that can be used to create end-to-end data pipelines:
Figure 1.2: Microsoft Azure data-related services
Azure grants you the flexibility to choose the best approach to solve a problem for yourself, rather than being forced to bend a less adaptable product to perform an unnatural function. You're not just limited to SQL Server, either. You also have the flexibility to choose other types of databases or storage, whether through a service installed on a Linux server or containerized solution, or a managed platform (such as Azure Cosmos DB for your Cassandra and MongoDB instances). This is very important because, in the real world, different scenarios require different solutions, tools, and products.
Microsoft Azure provides you with an end-to-end platform, from Azure Active Directory for managing your user identity and access to Azure IoT offerings (such as IoT Hub) for gathering data from hundreds and thousands of IoT devices. It also provides services such as development tools and cloud hosting options for getting your developers up to speed, as well as various analytics and machine learning tools that enable data scientists, data engineers, and data analysts to be more productive (more on this in Chapter 3, Processing and visualizing data).
The full spectrum of Azure services is too wide to cover here, so instead, this book will focus on the key data warehousing and business intelligence suite of products: Azure Data Lake, Azure Synapse Analytics, Power BI, and Azure Machine Learning.
Microsoft views security as the top priority. When it comes to data, privacy and security are non-negotiable; there will always be threats. Azure has some of the most advanced security and privacy features in the analytics space. Azure services support data protection through Virtual Networks (VNets), so that even though the services run in the cloud, they cannot be accessed from the public internet; only resources in the same VNet can communicate with each other. For web applications, you get a Web Application Firewall (WAF) provided by Azure Application Gateway, which ensures that only valid requests can get into your network.
With role-based access control (authorization), you can ensure that only those with the right roles, such as administrators, have access to specific components and the capabilities of different resources. Authentication, on the other hand, ensures that if you don't have the right credentials (such as passwords), you will not be able to access a resource. Authorization and authentication are built into various services and components of Microsoft Azure with the help of Azure Active Directory.
Azure also provides a service called Azure Key Vault. Key Vault allows you to safely store and manage secrets and passwords, create encryption keys, and manage certificates so that applications do not have direct access to private keys. By following this pattern with Key Vault, you do not have to hardcode your secrets and passwords in your source code and script repository.
Azure Synapse Analytics uses ML and AI to protect your data. In Azure SQL, Microsoft provides advanced data security to ensure that your data is protected. This includes identifying whether your database has vulnerabilities, such as publicly exposed port numbers. These capabilities also help you comply with various standards, such as the General Data Protection Regulation (GDPR), by ensuring that customer data considered sensitive is classified. Azure SQL also recently announced new features, row-level security (RLS) and column-level security (CLS), which control access to the rows and columns in a database table based on user characteristics.
Microsoft invests at least $1 billion each year in the cybersecurity space, including the Azure platform. Azure holds various credentials and awards from independent assessment bodies, which means that you can trust Azure in all security aspects, from physical security (such that no unauthorized users can get physical access to data centers) to application-level security.
These are just a few of the security features that you would otherwise need to build and maintain yourself if you were running your own data center.
Azure changed the industry by making data analytics cost-efficient. Before the mass adoption of cloud computing, in order to plan for data analytics with terabytes, or even petabytes, of data, you needed to properly plan things and ensure that you had the capital expenditure to do it. This would mean very high upfront infrastructure and professional services costs, just to get started. But with Azure, you can start small (many of the services have free tiers). You can scale your cloud resources effortlessly up or down, in or out, in minutes. Azure has democratized scaling capability by making it economically viable and accessible for everyone, as shown in Figure 1.3:
Figure 1.3: Microsoft Azure regions
Microsoft Azure currently has over 60 data center regions and is available in over 140 countries. Some enterprises and industries require that data is hosted in the same country as business operations. With data centers available worldwide, it is easy for you to expand to other regions. This multi-region approach is also beneficial in terms of making your applications highly available.
The true power of the cloud is its elasticity. This allows you to not only scale resources up but also scale them down when necessary. In data science, this is very useful because data science entails variable workloads. When data scientists and engineers are analyzing a dataset, for instance, there is a need for more computation. Azure, through services such as Azure Machine Learning (more on this in Chapter 3, Processing and visualizing data), allows you to scale according to demand. Then, during off-peak times (such as weekends, and 7 PM to 7 AM on weekdays), when the scientists and engineers don't need the processing power to analyze data, you can scale down your resources so that you don't have to pay for running resources 24/7. Azure basically offers a pay-as-you-go or pay-for-what-you-use service.
Azure also provides a Service Level Agreement (SLA) for their services as their commitments to ensure uptime and connectivity for their production customers. If downtime or an incident occurs, they will apply service credits (rebates) to the resources that were affected. This will give you peace of mind as your application will always be available with a minimal amount of downtime.
There are different scaling approaches and patterns that Microsoft Azure provides:
- Vertical scaling: This is when more resources are added to the same instance (server or service). An example of this is when a virtual machine is scaled up from 4 GB of RAM to 16 GB of RAM. This is a simple and straightforward approach to take when your application needs to scale. However, there is a technical maximum limit on how much an instance can be scaled up, and it is the most expensive scaling approach.
- Horizontal scaling: This is when you deploy your application to multiple instances. In principle, this means you can scale your application almost without limit, because you are not bound to a single machine. This flexibility also introduces some complexities, which are usually addressed by using various patterns and orchestration technologies, such as Docker and Kubernetes.
- Geographical scaling: This is when you scale your application to different geographical locations, for two major reasons: resilience and reduced latency. Resilience allows your application to operate in a region without all resources being connected to a master region. Reduced latency means that users in that region get responses to their web requests faster because of their proximity to the data center.
- Sharding: This is one of the techniques for distributing huge volumes of related, structured data onto multiple independent databases.
- Development, Testing, Acceptance, and Production (DTAP): This is the approach of having multiple instances living in different logical environments. This is usually done to separate development and test servers from staging and production servers. Azure DevTest Labs offers a development and testing environment that can be configured with group policies.
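The sharding approach above can be illustrated with a hash-based router that sends each record to one of several independent databases by key. The customer IDs and shard count below are hypothetical; real systems often use consistent hashing instead, so that adding a shard does not reshuffle every existing key.

```python
import hashlib

def shard_for(key, num_shards=4):
    """Route a record to a shard by hashing its key; the same key
    always maps to the same shard, spreading data across databases."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical customer IDs spread across four database shards.
for customer_id in ["cust-001", "cust-002", "cust-003"]:
    print(customer_id, "-> shard", shard_for(customer_id))
```

Because the routing is deterministic, any application instance can compute the right shard for a key without a central lookup, which is what lets each shard remain an independent database.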
Another advantage of your business being in the cloud is the availability of your services. With Azure, it is easier to make your infrastructure and resources geo-redundant—that is, available to multiple regions and data centers across the world. Say you want to expand your business from Australia to Canada. You can achieve that by making your SQL Server geo-redundant so that Canadian users do not need to query against the application and database instance in Australia.
Azure, despite being a collective suite of products and service offerings, does not force you to go "all in." This means that you can start by implementing a hybrid architecture of combined on-premises data centers and cloud (Azure). There are different approaches and technologies involved in a hybrid solution, such as using Virtual Private Networks (VPNs) and Azure ExpressRoute, if you need dedicated access.
With Azure Synapse Analytics, through data integrations, Azure allows you to get a snapshot of data sources from your on-premises SQL Server. The same concept applies when you have other data sources from other cloud providers or SaaS products; you have the flexibility to get a copy of that data to your Azure data lake. This flexibility is highly convenient because it does not put you in a vendor lock-in position where you need to do a full migration.
Top business drivers for adopting data analytics in the cloud
Different companies have different reasons for adopting data analytics using a public cloud such as Microsoft Azure. But more often than not, it boils down to three major reasons: rapid growth and scale, reducing costs, and driving innovation.
Rapid growth and scale
Enterprises and businesses need to rapidly expand their digital footprint. With the rapid growth of mobile applications—particularly media types (such as images and videos), IoT sensors, and social media data—there is just so much data to capture. This means enterprises and businesses need to scale their infrastructure to support these massive demands. Company database sizes continuously grow from gigabytes of data to terabytes, or even petabytes, of data.
End users are more demanding now than ever. If your application does not respond within seconds, the user is more likely to disengage with your service or product.
Scaling does not only apply to the consumers of the applications; it is also important for data scientists, data engineers, and data analysts in order to analyze a company's data. Scaling an infrastructure is vital, as you cannot expect your data engineers to handle massive chunks of data (gigabytes to terabytes) and run scripts to test your data models on a single machine. Even if you do serve this on a single high-performance server instance, it's going to take days or weeks to finish the tests, not to mention causing performance bottlenecks for the end users who are consuming the same database.
With a modern data warehouse like Azure Synapse Analytics, you have some managed capabilities to scale, such as a dedicated caching layer. Caching will allow analysts, engineers, and scientists to query faster.
Due to scaling demands, enterprises and businesses need to have a mechanism to expand their data infrastructure in a cost-effective and financially viable way. It is too expensive to set up an on-premises data warehouse. The following are just a few of the cost considerations:
- The waiting time for server delivery and associated internal procurement processes
- Networking and other physical infrastructure costs, such as hardware cooling and data center real estate
- Professional services costs associated with setting up and maintaining these servers
- Licensing costs (if any)
- The productivity lost from people and teams who cannot ship their products faster
With a modern data warehouse, you can spin up new high-performance servers with high-performance graphics cards on demand. And with the use of a cloud provider such as Microsoft Azure, you will only need to pay for the time that you use these servers. You can shut them down if you don't need them anymore. Not only can you turn them off on demand, but if it turns out that a particular service is not suitable for your requirements, you can delete these resources and just provision a different service.
Azure also provides a discount for "reserved" instances that you are committing to use for a specific amount of time. These are very helpful for those databases, storage solutions, and applications that need to be up 24/7 with minimal downtime.
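To make the trade-off concrete, here is a small worked example of the two billing models. The hourly rate and the 40% reservation discount below are hypothetical placeholders for illustration, not real Azure prices:

```python
# Hypothetical cost comparison between pay-as-you-go and reserved billing.
# The hourly rate and discount are illustrative, not real Azure prices.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, hours_used):
    """Pay-as-you-go: billed only for the hours the resource actually runs."""
    return hourly_rate * hours_used

def reserved_monthly_cost(hourly_rate, discount):
    """Reserved: committed to run 24/7, but at a discounted hourly rate."""
    return hourly_rate * HOURS_PER_MONTH * (1 - discount)

rate = 1.20  # hypothetical $/hour for a compute instance

# A dev/test server used 8 hours a day, 22 days a month (176 hours):
payg = monthly_cost(rate, 8 * 22)
# A production database that must run 24/7, with a 40% reservation discount:
reserved = reserved_monthly_cost(rate, 0.40)

print(f"Pay-as-you-go (176 h): ${payg:.2f}")
print(f"Reserved 24/7 (40% off): ${reserved:.2f}")
```

The pattern to notice: pay-as-you-go wins for workloads you can switch off, while the reserved discount wins for always-on databases and storage.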
Companies need to constantly innovate in this very competitive market; otherwise, someone else will rise up and take their market share. But obviously, no one can predict the future with 100% accuracy, so companies need a mechanism to explore new opportunities based on what they know.
One good example of this is the Business Process Outsourcing (BPO) and telecommunications (telco) industries, where there are petabytes of data that may not have been explored yet. With Microsoft Azure's modern data warehouse, actors in such industries can have the infrastructure to do data exploration. With Azure Synapse Analytics, Power BI, and Azure Machine Learning, they can explore their data to drive business decisions. Maybe they can come up with a data model that can detect fraudulent actions or better understand their customer preferences and expectations to improve satisfaction ratings. With advanced analytics, these companies can come up with decisions that are relevant today (and possibly in the future) and are not just restricted to analyzing historical data.
What if you want to create an autonomous vehicle? You will need a robust data warehouse to store your datasets and a tremendous amount of data processing. You need to capture massive amounts of data—whether through pictures or videos that the car is continuously capturing—and need to come up with a response almost instantly based on your dataset and algorithms.
Using a cloud provider such as Microsoft Azure would allow you to test and validate your ideas early on, without a massive investment. With various Azure services and related tools such as GitHub and Visual Studio, you can rapidly prototype your ideas and explore possibilities. What if it turns out that the product or service that you or your team is working on does not really gain traction? If you are doing this on-premises, you will still have high liability and operations costs since you physically own the infrastructure, in addition to any associated licensing and services costs.
Why do you need a modern data warehouse?
A data warehouse is a centralized repository that aggregates different (often disparate) data sources. The main difference between a data warehouse and a database is that data warehouses are designed for Online Analytical Processing (OLAP), while databases are intended for Online Transaction Processing (OLTP). OLAP means that data warehouses are primarily used to generate analytics, business intelligence, and even machine learning models. OLTP means that databases are primarily used for transactions: the day-to-day operations of applications, which concurrently read and write data to databases.
A data warehouse is essential if you want to analyze your big data, as it also contains historical data (often called cold data). Much of this stored data is legacy information, such as records from 5, 10, or even 15 years ago. You probably don't want the same database instance that your end users are querying against to also contain that historical data, as it might affect performance at scale.
Here are some of the advantages of having a modern data warehouse:
- Supports any data source
- Highly scalable and available
- Provides insights from analytical dashboards in real time
- Supports a machine learning environment
Microsoft offers the following tools and services that collectively create a modern data warehouse:
Figure 1.4: Modern data warehouse using Azure Synapse Analytics
There are a lot of emerging patterns and architectures for data warehousing, but the most popular ones are those that support the separation of duties and responsibilities in different phases of the data pipeline (more on this in the Creating a data pipeline section).
In order to understand what it means for a data warehouse to be modern, you first need to understand how you create and manage a traditional one. It boils down to two major concepts:
- Compute: This refers to the ability to process the data and make sense out of it. It can be in the form of a database query to make the results accessible to another interface, such as web applications.
- Storage: This refers to the ability to keep data in order for it to be accessible at any time in the future.
A modern data warehouse separates compute and storage in cost-effective ways. Traditionally, with SQL Server and SQL Server Integration Services (SSIS), the pricing model bundles storage capacity with the computing power needed to analyze data; in a modern data warehouse, the two are priced and scaled independently. Azure was one of the first cloud providers to offer a data warehouse that separates compute and storage.
Another change in pattern is that the traditional Extract-Transform-Load (ETL) model of data warehousing has now changed to Extract-Load-Transform (ELT). In the traditional ETL model, analysts are accustomed to waiting for the data to be transformed first, since they don't have direct access to all data sources. In a modern data warehouse, massive amounts of data can be stored in either a data lake or data warehouse, and can be transformed anytime by analysts without the need to wait for data engineers or database admins to serve the data.
Of course, there are more factors to consider in order to modernize your data warehouse, such as extensibility, disaster recovery, and availability. However, this section will focus on compute for the time being.
Bringing your data together
In the past, databases were often the only source of data for your applications. But nowadays, you have hundreds and thousands of different data sources. The data coming from these different sources is of different data types—some structured, some unstructured, some semi-structured.
Structured data: The word "structured" suggests that there is a pattern that can be easily interpreted. This usually comes with a predefined set of models and a schema. A relational database management system (RDBMS) such as Microsoft SQL Server is a common example of a data storage solution that is structured. This is because it comes with a database schema and table columns that define the data that you are storing.
Here are some examples of structured data types:
- Customer names
- Date and time
- Mobile and phone numbers
- Credit card numbers
- Product names and Stock Keeping Units (SKUs)
- General transaction information such as "From" and "To" with time stamps and amount values
A good example of structured data is the information provided by users when signing up to an application for the first time. They are presented with a form that needs to be filled in. Once the user clicks the submit button, the form sends the data to a database, which inserts it into a user table with predefined columns: name, address, and other details. The user can then log into the application, since the system can look up the registered user's record in the database.
From there, a user can access the application and perform transactions, such as transferring money and assets. In time, users will generate a series of transactions that will eventually make your database larger. Your database schema will also expand to support different business requirements.
Once you have enough data, you can perform data exploration. This is where you start looking for patterns in data. You may identify fraudulent transactions and test hypotheses by analyzing large and repeated transaction amounts from the same user.
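A toy version of that hypothesis test might look like the following; the transaction amounts and the "large transaction" threshold are made up for illustration:

```python
# Hypothetical exploration sketch: flag users with repeated large
# transactions. The threshold and data below are illustrative only.
from collections import Counter

LARGE = 10_000  # illustrative threshold for a "large" transaction
transactions = [
    ("alice", 50), ("alice", 12_000), ("alice", 15_000), ("alice", 11_000),
    ("bob", 200), ("bob", 10_500),
]

# Count large transactions per user, then flag users with 3 or more.
large_counts = Counter(user for user, amount in transactions if amount >= LARGE)
suspicious = [user for user, n in large_counts.items() if n >= 3]
print(suspicious)  # ['alice']
```

A real fraud model would of course use far more signals than a simple count, but this is the shape of the exploratory queries a data scientist runs against a warehouse.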
Your data exploration is limited, however, because you can only base it on a dataset that is structured and has a defined schema. What if you also want to consider other data sources that are unstructured, such as free-form text? An example is a transaction description, which may state the nature or the recipient of the transaction. You don't want to manually read each transaction description and insert it into the right column of a database table. You probably want to extract only the relevant information and transform it into a structured format. This is where unstructured data comes in.
Unstructured data: This data type is, more or less, the rest—everything that isn't structured data. This is mainly because it is not limited to any particular storage format or data type.
Unstructured data types usually don't have a predefined data model that can fit directly into a database. Unstructured data can be text-heavy and is usually read per line or is space-separated.
Here are some examples of unstructured data sources:
- Image files
- Email messages and documents
- Log files
- IoT devices and sensors
- NoSQL databases such as MongoDB
- Social media and Microsoft Graph
Image files and videos are classified as unstructured data because of their dynamic nature. Although their metadata is something you can consider as structured (such as title, artist, filename, and so on), the content itself is unstructured. With modern tools and data analytics technology, you can now examine this data and make sense of it. The usual example is face recognition in either images or videos.
Emails, documents, and log files all have metadata, but what you're actually more interested in is the content of those files. Usually, in emails, documents, and log files, data is separated per line and the messages are unstructured. You would want to describe the content without manually reading everything (which could be hundreds or even millions of files). An example is doing sentiment analysis on content to determine whether the prevailing emotion is happy, sad, or angry. For log files, you probably want to separate the error messages, time stamps (dates), and measurements (traces) between messages.
IoT devices and sensors, similarly to log files, are used to capture measurements and errors about a certain item. The main difference is that these devices usually operate in large clusters (hundreds to thousands of devices) and continuously stream data. Data generated from these devices is usually semi-structured: formats such as JSON and XML carry some structure that a system can readily parse, but no fixed schema. Modern technologies, such as Azure IoT services, already solve these complexities with services such as Azure IoT Hub, which aggregates all this data from various sensors and continuously exports it to a data store.
Social media platforms and Microsoft Graph both provide semi-structured data. It is classified this way because just querying all of Twitter's tweets about a topic is not enough. The results don't really make a lot of sense until you do some analysis of them. The primary focus is to discern patterns and anomalies. For example, you may want to identify trends about news and topics but also want to remove data that is irrelevant, such as tweets coming from fake accounts.
Interestingly, some line-of-business (LOB) applications provide both structured and unstructured data. For example, both Microsoft Dynamics CRM and Salesforce provide structured data that can easily be interpreted and exported to your SQL database tables, such as data for products and their amounts and value. However, they also support unstructured data such as images, videos, and text notes. Note that even though text notes are stored as the string data type, they can still be considered unstructured data because they are designed to be free text. They don't have a proper format to follow, but they are still worth exploring. A common scenario is using this unstructured data to understand why a sale was not successful.
Creating a data pipeline
Once you have identified your data sources, the next step is to create a data pipeline (sometimes also referred to as a data flow). At a high level, the steps involved are data ingestion, data storage, data preparation and training, data modeling and serving, and data visualization.
With this approach, you will build a highly scalable architecture that serves all the users of the system: from end users, data engineers and scientists doing data exploration, and analysts who interpret the data for the business, to even the CEO if they want to see what's happening with the business in real time:
Figure 1.5: Modern data pipeline
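The stages above can be sketched as a chain of plain functions. Everything here, from the function names to the hardcoded records, is illustrative; in a real pipeline, each stage maps to a managed service:

```python
# Toy sketch of the high-level pipeline stages:
# ingest -> store -> prepare -> model/serve (-> visualize).
# All names and data here are illustrative.
def ingest():
    # Pull raw records from a source system (hardcoded for the sketch).
    return [{"product": "widget", "qty": "3"}, {"product": "gadget", "qty": "5"}]

def store(records, lake):
    # Land the raw data in the "lake" unchanged.
    lake.extend(records)

def prepare(lake):
    # Clean and shape the raw data for analysis.
    return [{"product": r["product"], "qty": int(r["qty"])} for r in lake]

def serve(prepared):
    # Model the data into an aggregate a dashboard could consume.
    return {"total_qty": sum(r["qty"] for r in prepared)}

lake = []
store(ingest(), lake)
print(serve(prepare(lake)))  # {'total_qty': 8}
```

The separation of stages is the point: each one can be scaled, scheduled, and monitored independently, which is exactly what the managed Azure services in the following sections provide.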
Data ingestion means transferring data (structured, semi-structured, or unstructured) from the source to your storage, data lake, or data warehouse.
This would involve something such as Azure Synapse Analytics using data integration to transfer data from various sources such as on-premises databases and SaaS products to a data lake. This step allows you to manage your ETL and ELT workflows without the need for manual reconciliation.
This is not a one-time process. Ideally, this is something you schedule or set to be triggered so that your data lake gets a historical snapshot from time to time. An example of this is a connection from your CRM tools, such as Microsoft Dynamics CRM, to Azure Data Lake by means of Azure Synapse Analytics with data integration. This will allow data scientists and data engineers to explore this data at different time intervals without interrupting the actual CRM application.
Once data has been ingested from various data sources, all the data is stored in a data lake. The data residing within the lake will still be in a raw format and includes both structured and unstructured data formats. At this point, the data won't bring much value to drive business insights.
Data pipeline orchestration and monitoring
In a modern data warehouse scenario, it is very important that data sources and services efficiently transfer data from source to destination. Azure Synapse Analytics with data integration is an orchestrator that allows services to perform data migrations or transfers. It is not the thing performing the actual transfer, but rather instructs a service to perform it—for example, it can tell a Hadoop cluster to perform a Hive query.
Azure Synapse Analytics with data integration also lets you create alerts and metrics to monitor whether the orchestration is working as expected. For example, you can create an email alert that fires when a data transfer from source to destination fails.
In a modern data warehouse pattern, sharing data should be both seamless and secure. Often, this is done via the File Transfer Protocol (FTP), emails, or APIs, to name a few. There is a big management overhead if you want to share data at scale. Azure Data Share allows you to securely manage and share your big data with other parties and organizations. The data provider has full control over who can access the datasets and what operations each party can perform. This makes it easier for dependent companies to derive insights and explore AI scenarios.
Once data is ingested, the next step is data preparation. This is a phase where the data from different data sources is pre-processed for data analytics purposes. An example of this is querying data from an API and inserting it into a database table. Azure Synapse Analytics with data integration allows you to orchestrate this data preparation. Azure Synapse Analytics, through a hosted Apache Spark instance, can also help with data preparation, as it can run clusters concurrently to process massive amounts of data in a matter of seconds or minutes.
Data transform, predict, and enrich
Sometimes, data preparation requires further changes beyond a simple copy-and-paste scenario. This is where data transformation comes in. There are instances wherein you want to apply custom logic to the raw data first—applying filters, for instance—before you decide to transfer it to a data warehouse. Azure Synapse Analytics (through data integration), Apache Spark, and SQL Analytics can also help in this scenario. If data in a data lake is not properly transformed into meaningful insights, it will eventually become a data swamp.
Furthermore, you can enrich the batch data at scale by invoking Azure Machine Learning, which makes real-time predictions about data. This can be an added feature in your data pipeline in Azure Synapse Analytics. To learn more about Azure Machine Learning, see Chapter 3, Processing and visualizing data.
After preparing and training your data, you'll be ready to model and serve it to the consumers. Basically, in this phase, you are modeling the data to be easily understood by systems. This usually involves performing the complex queries you generated from the data preparation and training phase and inserting these records into a database so that the data is structured in a defined table and schema.
All of your company's analytical data is stored in a data warehouse. You potentially have hundreds to thousands of concurrent users, reports, and dashboards running off a single data warehouse.
You usually perform data modeling and serving with a data warehouse platform such as Azure Synapse Analytics. Completing complex queries can take hours or days. But with the power of the cloud, you can scale Azure Synapse Analytics to perform these queries faster, turning days into hours and hours into minutes.
Data visualization is an efficient way of analyzing performance through graphs and charts. This is called business intelligence. Tools such as Power BI help analysts to get the most out of data. Working with Azure, you're not just limited to Power BI, but can also use other visualization services such as Tableau. Data visualization provides a rich and meaningful representation of your data that adds business value for you and your customers. The team can see trends, outliers, and patterns that help in making data-driven decisions.
Various stakeholders within the organization can collaborate after analyzing the different performance parameters. Is your company selling products well? In what regions do you get most of your sales? With rich data backing up your assumptions, business stakeholders, such as CEOs, can make reasonable data-driven decisions to minimize risks. What product lines should you expand? Where should you expand further? These are some of the common questions that you can answer once you have richer data analytics.
Analysts can use desktop or web application tools to create meaningful representations of their data. Here is an example of a desktop view of Power BI where a user can analyze their company's data and visualize it in graphs:
Figure 1.6: Power BI Desktop-dashboard
Once the reports are generated, they can be exported to a workspace where people can work together to improve the reports. Here is an example view of the same report in a mobile application. Users can add comments and annotations to the report, allowing a faster feedback loop for analysts:
Figure 1.7: Power BI dashboard for mobile
Power BI allows you to create rich personalized dashboards that best suit your requirements and brand. Compared to having presentations with a snapshot of a graph from last week or last month, this mechanism allows you to refresh the same report over and over again.
Machine learning has helped companies build applications and products such as chatbots that perform specific tasks for end users without the need for human intervention. Some common examples are voice assistants, such as Cortana, which actively learn to help us become more productive with our day-to-day tasks.
Other examples are online games in which you can easily track your performance against everyone in the world. You can see how you rank against other players, what areas you excel in, and how you can improve.
The number of tasks you can perform with rich data is virtually limitless, but to perform them, you need the right approach and infrastructure to handle a high level of scaling.
This chapter established the importance of data analytics. It also highlighted several reasons why Microsoft Azure is an ideal platform for achieving business intelligence capabilities in the cloud. It touched on some fundamental concepts around big data, ML, and DataOps. You also learned about some of the business drivers for adopting data analytics on the cloud. Lastly, you gained a high-level view of what it takes to have a modern data warehouse.
In the next chapter, you will see how to start building a modern data warehouse with Azure Synapse Analytics and related technologies.