Introduction to Data Architectures
With data quickly becoming an essential asset of any business, the demand for cloud data architects has never been greater. The key role these professionals fulfill is to provide the technical blueprints of any cloud data project and expertise on data architectures as a whole. A skilled data architect is proficient in every step of the end-to-end data process, such as data ingestion, data warehousing, data transformation, and visualization.
It is of utmost importance that data architects are familiar with the benefits and drawbacks of individual resources as well as platform-wide design patterns. Typically, aspiring data architects have a background as business intelligence (BI) developers, data engineers, or data scientists. They are often specialized in one or more tools but lack experience in architecting solutions according to best practices.
Compared to a developer profile, an architect is more focused on the long term and the bigger picture. The architect must keep in mind the overarching business strategy and prioritize certain aspects of the architecture accordingly. To equip you with the necessary skills to do so, you will be introduced to methods of getting business value from your data, to solidify any long-term data strategy.
This chapter will also introduce you to a wide-purpose referential data architecture. This architecture will be used as a guideline throughout this entire book and will become more and more defined as the chapters go on.
Finally, on-premises data architectures nowadays face a variety of challenges. You will explore these challenges and look at how a business can benefit from either a cloud or a hybrid cloud solution.
In this chapter, we’re going to cover the following main topics:
- Understanding the value of data
- A data architecture reference diagram
- Challenges of on-premises architectures
Understanding the value of data
To put the scale of today's data in perspective, 1 zettabyte is equal to 1 million petabytes. This scale requires data architects to deal with the complexity of big data, but it also introduces an opportunity. Industry analyst Doug Laney defines big data with the popular three Vs framework: Volume, Variety, and Velocity. In this section, we would like to explore a fourth V: Value.
Types of analytics
Data empowers businesses to look back into the past, giving insights into established and emerging patterns, and to make informed decisions for the future. Gartner splits analytical solutions that support decision-making into four categories: descriptive, diagnostic, predictive, and prescriptive analytics. Each successive category is more complex to implement but can also add more value to your business.
Let’s go through each of these categories next:
- Descriptive analytics is concerned with answering the question, “What is happening in my business?” It describes the past and current state of the business by creating static reports on top of data. The data used to answer this question is often modeled in a data warehouse, which models historical data in dimension and fact tables for reporting purposes.
- Diagnostic analytics tries to answer the question, “Why is it happening?” It drills down into the historical data with interactive reports and diagnoses the root cause. Interactive reports are still built on top of a data warehouse, but additional data may be added to support this type of analysis. A broader view of your data estate allows for more root causes to be found.
- Predictive analytics learns from historical trends and patterns to make predictions for the future. It deals with answering the question, “What will happen in the future?” This is where machine learning (ML) and artificial intelligence (AI) come into play, drawing data from the data warehouse or raw data sources to learn from.
- Prescriptive analytics answers the question, “What should I do?” and prescribes the next best action. When we know what will happen in the future, we can act on it. This can be done by using different ML methods such as recommendation systems or explainable AI. Recommendation systems recommend the next best product to customers based on similar products or what similar customers bought. Think, for instance, about Netflix recommending new series or movies you might like. Explainable AI will identify which factors were most important to output a certain prediction, which allows you to act on those factors to change the predicted outcome.
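To make the recommendation idea concrete, here is a toy item-based recommender in Python: it suggests the catalog item whose purchase pattern is most similar to a given one, using cosine similarity. The data and item names are invented for illustration; real recommendation systems such as Netflix's are far more sophisticated.

```python
# A toy item-based recommender: recommend the item most similar to one a
# customer already consumed, using cosine similarity over purchase vectors.
# All data here is made up for illustration.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Rows: items; columns: whether each of four customers watched the item
purchases = {
    "thriller_series": [1, 1, 0, 1],
    "crime_drama":     [1, 1, 0, 0],
    "cooking_show":    [0, 0, 1, 1],
}

def next_best(item):
    """Return the item whose audience overlaps most with the given item."""
    others = {k: v for k, v in purchases.items() if k != item}
    return max(others, key=lambda k: cosine(purchases[item], others[k]))

print(next_best("thriller_series"))  # crime_drama shares the most viewers
```

In practice, explainable AI techniques would complement this by telling you *why* an item was recommended, so the business can act on the underlying factors.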
The following diagram shows the value-extracting process, going from data to analytics, decisions, and actions:
Figure 1.1 – Extracting value from data
Just as with humans, ML models need to learn from their mistakes, which can be done with the help of a feedback loop. A feedback loop allows a teacher to correct the outcomes of the ML model and add them as training labels for the next learning cycle. Learning cycles allow the ML model to improve over time and combat data drift. Data drift occurs when the data on which the model was trained isn’t representative anymore of the data the model predicts. This will lead to inaccurate predictions.
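The data drift idea can be illustrated with a minimal check: flag a feature whenever the mean of incoming data strays too far from the training mean. The three-standard-deviation threshold and the numbers are illustrative assumptions, not a production drift detector.

```python
# A minimal data-drift check: flag a feature when the mean of live data
# deviates from the training mean by more than a few training standard
# deviations. Threshold and data are illustrative, not production values.
from statistics import mean, stdev

def drifted(train_values, live_values, n_stdevs=3.0):
    """True if the live mean falls outside train mean +/- n_stdevs * sigma."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(live_values) - mu) > n_stdevs * sigma

train = [10.0, 10.5, 9.8, 10.2, 9.9]
stable = [10.1, 9.9, 10.3]
shifted = [14.0, 14.5, 13.8]   # the distribution has moved

print(drifted(train, stable))   # False
print(drifted(train, shifted))  # True
```

A real feedback loop would trigger retraining with freshly labeled data whenever such a check fires.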
As ML models improve over time, it is best practice to have human confirmation of predictions before automating the decision-making process. Even when an ML model has matured, we can’t rely on the model being right 100 percent of the time. This is why ML models often work with confidence scores, stating how confident they are in the prediction. If the confidence score is below a certain threshold, human intervention is required.
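The confidence-score threshold described above can be sketched as a simple routing function; the 0.80 threshold is an assumed value for illustration.

```python
# Sketch of confidence-based routing: predictions below a threshold are
# sent for human review instead of being acted on automatically.
# The threshold value is an assumption for illustration.
REVIEW_THRESHOLD = 0.80

def route(prediction, confidence):
    """Decide whether a prediction can be automated or needs a human."""
    if confidence >= REVIEW_THRESHOLD:
        return ("auto", prediction)
    return ("human_review", prediction)

print(route("churn", 0.95))  # ('auto', 'churn')
print(route("churn", 0.55))  # ('human_review', 'churn')
```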
To get continuous value out of data, it is necessary to build a data roadmap and strategy. A complexity-value matrix is a mapping tool to help prioritize which data projects need to be addressed first. This matrix will be described more in detail in the following section.
A complexity-value matrix
A complexity-value matrix has four quadrants on which to plot future data projects. These range from low to high value and from low to high complexity. Projects that are high-value and low-complexity are called “quick wins” or “low-hanging fruit” and should be prioritized first. These are often Software-as-a-Service (SaaS) applications or third-party APIs that can quickly be integrated into your data platform for immediate value. Data projects with high complexity and low value should not be pursued, as they have a low Return on Investment (ROI). In general, the more difficult our analytical questions become, the more complex the projects may be, but also the more value we may get out of them.
A visualization of the four quadrants of the matrix can be seen as follows:
Figure 1.2 – The four quadrants of a complexity-value matrix
We often think of the direct value data projects bring, but consider the indirect value as well. Data engineering projects often do not have direct value, as they merely move data from one system to another, but they may indirectly open up a world of new opportunities.
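As a sketch, the quadrant logic of the matrix can be expressed in a few lines. The 0-10 scoring scale, the midpoint, and all quadrant labels other than “quick win” are our own illustrative shorthand, not fixed terminology.

```python
# Classify projects into the four quadrants of a complexity-value matrix.
# Scores are on an assumed 0-10 scale; 5 is the assumed quadrant midpoint.
def quadrant(value, complexity, midpoint=5):
    if value >= midpoint and complexity < midpoint:
        return "quick win"          # prioritize first
    if value >= midpoint:
        return "strategic project"  # high value, but high complexity
    if complexity < midpoint:
        return "fill-in"            # easy, but little payoff
    return "avoid"                  # low ROI: do not pursue

# Hypothetical projects with (value, complexity) scores
projects = {
    "SaaS CRM connector":    (8, 2),
    "custom ML forecasting": (9, 8),
    "legacy log archive":    (2, 9),
}
for name, (v, c) in projects.items():
    print(f"{name}: {quadrant(v, c)}")
```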
To extract value from data, a solid data architecture needs to be in place. In the following section, we’ll define an abstract data architecture diagram that will be referenced throughout this book to explain data architecture principles.
A data architecture reference diagram
Figure 1.3 – A typical structure of an end-to-end data platform in a (hybrid) cloud
This reference diagram shows the key components of most modern cloud data platforms. There are limitless possible adaptations, such as accommodating streaming data, but the diagram in Figure 1.3 serves as the basis for more advanced data architectures. It’s like the Pizza Margherita of data architectures! The architecture diagram in Figure 1.3 shows four distinct layers in the end-to-end architecture, along with three components that span multiple layers, as follows:
- The ingestion layer
- The storage layer
- The serving layer
- The consumption layer
- Data orchestration and processing
- Advanced analytics
- Data governance and compliance
Let’s cover the first layer next.
The ingestion layer
The ingestion layer serves as the data entrance to the cloud environment. Here, data from various sources is pulled into the cloud. These sources include on-premises databases, SaaS applications, other cloud environments, Internet of Things (IoT) devices, and many more. Let’s look at this layer in more detail:
- First, the number of data sources can vary greatly between businesses and can bring a variety of challenges to overcome. In enterprise-scale organizations, where the number of data sources can reach extraordinary levels, it is exceptionally important to maintain a clear overview and management of these sources.
- Secondly, the sheer variety of sources is another common issue to deal with. Different data sources can have distinct methods of ingesting data into the cloud and, in some cases, require architectural changes to accommodate.
- Thirdly, managing authentication for data sources can be cumbersome. Authentication happens in a multitude of ways and is often unique to the data source. Every source requires its own tokens, keys, or other types of credentials that must be managed and refreshed regularly to maintain security.
From a design perspective, there are a few other aspects to keep in mind. The architect should consider the following:
- Data speed: Will incoming data from the source be ingested periodically (that is, batch ingestion) or continuously (that is, data streaming)?
- Level of the structure of the data: Will the incoming data be unstructured, semi-structured, or structured?
Regarding data speed, data will be ingested in batches in the vast majority of cases. This translates to periodic requests made to an application programming interface (API) to pull data from the data source. For the less common cases of streaming data, architectural changes are required to provide an environment to store and process the continuous flow of data. In later chapters, you will discover how the platform architecture differs to accommodate streaming data.
Finally, the level of structure of the data will determine the amount of required data transformations, the methods of storing the data, or the destination of data movements. Unstructured data, such as images and audio files, will require different processing compared to semi-structured key-value pairs or structured tabular files.
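The batch-ingestion pattern can be sketched as follows. The `sales_api` source name, the landing-path convention (a raw zone partitioned by date), and the injected fetch function are all invented for illustration; a real pipeline would call an actual API and write to a data lake rather than return the payload.

```python
# A minimal batch-ingestion sketch: pull one batch of records from a
# source and prepare it for a date-partitioned raw landing zone.
# Path convention and source name are illustrative assumptions.
import json
from datetime import date, datetime, timezone

def ingest_batch(fetch, landing_dir, source_name):
    """Pull one batch and build a date-partitioned landing path for it."""
    records = fetch()                      # e.g. a periodic HTTP GET in practice
    partition = date.today().isoformat()
    path = f"{landing_dir}/raw/{source_name}/{partition}/batch.json"
    payload = json.dumps({
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    })
    # For the sketch, return what would be written instead of touching storage.
    return path, payload

fake_fetch = lambda: [{"id": 1, "amount": 42}]   # stand-in for a real API call
path, payload = ingest_batch(fake_fetch, "abfss://datalake", "sales_api")
print(path)
```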
(Add what data ingestion services will be discussed later in the book).
The storage layer
The definitions of the following layers can vary. Over the course of this book, the storage layer refers to the central (often large-scale) storage of data. Data lakes are the most common method for massive storage of data, due to their capacity and relatively low cost. Alternatives are graph-based databases, relational databases, NoSQL databases, flat file-based databases, and so on. The data warehouse, which holds business-ready data and is optimized for querying and analytics, does not belong to the storage layer but will fall under the serving layer instead.
Decisions made by the architect in the storage layer can have a great effect on costs, performance, and the data platform in its entirety. Here, the architect will have to consider redundancy, access tiers, and security. In the case of a data lake, a tier system needs to be considered for raw, curated, and enriched data, as well as a robust and scalable folder structure.
(Add what data storage services will be discussed later in the book).
The serving layer
In the serving layer, preprocessed and cleansed data is stored in a data warehouse, often regarded as the flagship of the data platform. This is a type of structured storage that is optimized for large-scale queries and analytics. The data warehouse forms one of the core components of BI.
The major difference between a data warehouse and the aforementioned data lake is the level of structure. A data warehouse is defined by schemas and enforces data types and structures. Conversely, a data lake can be seen as a massive dump of all kinds of data, with little to no regard for the enforcement of specific rules. The strong level of enforcement makes a data warehouse significantly more homogeneous, which results in far better performance for analytics.
The cloud data architect has various decisions to make in the serving layer. There are quite a few options for data warehousing on the Azure cloud, as follows:
- First, the architect should think about whether they want an Infrastructure-as-a-Service (IaaS), a Platform-as-a-Service (PaaS), or a SaaS solution. In short, this results in a trade-off between management responsibilities, development efforts, and flexibility. This will be discussed more in later chapters.
- Next, different services on Azure come with their own advantages and disadvantages. The architect could, for example, opt for a very cost-effective serverless SQL solution or leverage massive processing power in highly performant dedicated SQL pools, among numerous other options.
After deciding on the most fitting service, there are still decisions to be made within the data warehouse. The architect will have to determine structures to organize the data in the data warehouse, also known as schemas. Common schemas are star and snowflake schemas, which also come with their own benefits and drawbacks.
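To make the star schema concrete, here is a minimal sketch using SQLite as a stand-in for a cloud data warehouse: one fact table surrounded by two dimension tables, queried with the typical join-and-aggregate pattern. All table and column names are invented for illustration.

```python
# A minimal star schema: fact_sales references dim_product and dim_date.
# SQLite is used here only as a lightweight stand-in for a warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date    VALUES (20240101, 2024);
INSERT INTO fact_sales  VALUES (1, 20240101, 10.0), (1, 20240101, 5.0),
                               (2, 20240101, 7.5);
""")

# Typical analytical query: aggregate facts, sliced by dimension attributes.
rows = con.execute("""
    SELECT p.name, d.year, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.name, d.year
    ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 2024, 7.5), ('widget', 2024, 15.0)]
```

A snowflake schema would further normalize the dimension tables (for example, splitting product category out of `dim_product`), trading simpler maintenance for extra joins.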
Chapter 6, Data Warehousing, will teach you all the necessary skills to confidently decide on the right solution. Chapter 7, The Semantic Layer, will introduce you to the concept of data marts, subsets of a data warehouse ready for business consumption.
The consumption layer
The consumption layer is the final layer of an end-to-end data architecture and typically follows the serving layer by extracting data from the data warehouse. There are numerous ways of consuming the data, which has been prepared and centralized in earlier stages.
The most common manner of consumption is through data visualization. This can happen through dashboarding and building reports. The combination of a data warehouse and a visualization service is often referred to as BI. Many modern dashboarding tools allow for interactivity and drill-down functionality within the dashboard itself. Although technically it is not a part of the Azure stack, Power BI is the preferred service for data visualization for Azure data platforms. However, Microsoft allows other visualization services to connect conveniently as well.
Another way to consume data is by making the data available to other applications or platforms using APIs.
Chapter 8, Visualizing Data Using Power BI, will teach you how to extract data from the data warehouse in various ways and visualize it using interactive dashboarding. In this chapter, you will also discover methods to perform self-service BI, allowing end users to create their own ad hoc dashboards and reports to quickly perform data analysis.
Data orchestration and processing
Data orchestration refers to moving data from one place to another, often using data pipelines. This process is often done by data engineers. When data is moved from one stage to the next, data undergoes transformations in the form of joining data, deriving new columns, computing aggregations, and so on. For example, when data is moved from a data lake to a data warehouse, it must be transformed to match the data model, which is enforced by the data warehouse. Another example is when moving data between tiers (raw, curated, and enriched tiers) in the data lake, where the data becomes more and more ready for business use whenever it moves up a tier.
Data pipelines allow data engineers to automate and scale the orchestration and processing of data. These components are critical to the performance and health of the data platform and must be monitored accordingly. Pipelines typically follow one of two patterns:
- Extract-Transform-Load (ETL)
- Extract-Load-Transform (ELT)
In both cases, data is extracted from a source and loaded to a destination. The main difference between both methods is the location where the transformations take place. These will be further discussed in Chapter 4, Transforming Data on Azure. This chapter will also teach you how to create and monitor data pipelines according to best practices.
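As a preview, the difference can be sketched on toy in-memory data: ETL transforms before loading into the destination, while ELT loads the raw data first and transforms it inside the destination. The cents-to-euros conversion is purely illustrative.

```python
# Contrast ETL and ELT on toy in-memory data. In reality the destination
# would be a warehouse and the ELT transform would run as SQL inside it.
def transform(rows):
    """Derive a new column from the raw data (illustrative transform)."""
    return [{**r, "amount_eur": r["amount_cents"] / 100} for r in rows]

def etl(source_rows, destination):
    destination.extend(transform(source_rows))   # transform, then load

def elt(source_rows, destination):
    destination.extend(source_rows)              # load the raw data first...
    destination[:] = transform(destination)      # ...transform in the destination

src = [{"id": 1, "amount_cents": 250}]
dest_etl, dest_elt = [], []
etl(src, dest_etl)
elt(src, dest_elt)
print(dest_etl[0]["amount_eur"])  # 2.5 -- both patterns yield the same result
```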
Advanced analytics
For analyses that may be too complex to perform in the serving layer, an analytics suite or data science environment can be added to the architecture to perform advanced analytics and unlock ML capabilities. This component can often be added in a later stage of platform development, as it will mostly not influence the core workings of the other layers. A data platform in an early phase of development can exist perfectly well without this component.
One option for the advanced analytics suite is an ML workspace where data scientists can preprocess data, perform feature engineering, and train and deploy ML models. The latter may require additional components such as a container registry for storing and managing model deployments. The Azure Machine Learning workspace allows users to create and run ML pipelines to scale their data science processes. It also enables citizen data scientists to train models using no-code and low-code features.
Apart from an environment for data scientists and ML engineers to build and deploy custom models, the Azure cloud also provides users with a wide array of pre-trained ML models. Azure Cognitive Services encompass many models for computer vision (CV), speech recognition, text analytics, search capabilities, and so on. These models are available through ready-to-use API endpoints. They often involve niche cases but, when used correctly, bring a lot of value to the solution and are exceptionally fast to implement.
Chapter 9, Advanced Analytics Using AI, will go deeper into end-to-end ML workflows, such as the connection to data storages, performing preprocessing, model training, and model deployments. This chapter will also introduce the concepts of ML operations, often referred to as MLOps. This encompasses continuous integration and continuous delivery (CI/CD) for ML workflows.
Data governance and compliance
The more a data platform scales, the harder it becomes to maintain a clear overview of existing data sources, data assets, transformations, data access control, and compliance. To avoid a build-up of technical backlog, it is strongly recommended to start the setup of governance and compliance processes from an early stage of development and have it scale with the platform.
To govern Azure data platforms, Microsoft developed Microsoft Purview, formerly known as Azure Purview. This tool, which is covered in Chapter 10, Enterprise-Level Data Governance and Compliance, allows users to gain clear insights into the governance and compliance of the platform. Therefore, it is essential to the skill set of any aspiring Azure data architect. In this chapter, you will learn how to do the following:
- Create a data map by performing scans on data assets
- Construct a data catalog to provide an overview of the metadata of data assets
- Build a business glossary to establish clear definitions of possibly ambiguous business terms
- Gain executive insights on the entire data estate
With the rise of harmful cyber-attacks, security is another indispensable component of a data platform. Improper security configurations may lead to tremendous costs for the business. Investing in robust security to prevent attacks will typically be vastly cheaper than dealing with the damage afterward.
Cybersecurity can be very complex and should therefore be configured and managed with the help of a cybersecurity architect. However, certain aspects of security fall within the responsibilities of the data architect as well. The data architect should have the appropriate skill set to establish data security. Examples include row- and column-level security, data encryption at rest and in transit, masking sensitive data, and so on.
Chapter 11, Introduction to Data Security, will teach you all that is necessary to ensure data is always well protected and access is always limited to a minimum.
Monitoring
Disruptions such as failing data pipelines, breaking transformations, and unhealthy deployments can shut down the workings of an entire data platform. To limit downtime to an absolute minimum, these processes and deployments should be monitored continuously.
Azure provides monitoring and health reports on pipeline runs, Spark and SQL jobs, ML model deployments, data asset scans, and more. The monitoring of these resources will be further discussed in their own respective chapters.
Challenges of on-premises architectures
Cloud computing has seen a steep rise in adoption during the last decade. Nevertheless, a significant share of businesses still keep their servers and data on-premises. There are several reasons why a business may prefer on-premises over the cloud. Some businesses perceive increased security when keeping data on their own servers. Others, generally smaller businesses, may not feel the need to optimize their IT landscape or are simply not keen on change. Organizations in strictly regulated industries can be bound to on-premises infrastructure for compliance reasons. Whatever the reason, on-premises architectures nowadays come with certain challenges.
These challenges include, among other things, the following:
- Scalability
- Cost optimization
- Agility
- Flexibility
Let’s go through these challenges in detail.
Scalability
Organizations with a rapidly enlarging technological landscape will struggle the most to overcome the challenge of scalability. As the total volume of business data keeps growing, an organization faces a constant need to find new ways to expand its on-premises server farm. It is not always as simple as just adding extra servers. After a while, extra building infrastructure is needed, new personnel must be hired, energy consumption soars, and so on.
Here, the benefit of cloud computing is the enormous pool of available servers and computing resources. For the business, this means it can provision any additional capacity without having to worry about the intricate organization and planning of its own servers.
Cost optimization
Let’s take a look at two scenarios:
- When usage increases: When the usage increases, the need for extra capacity arises. A business is not going to wait until its servers are used to their limits, risking heavy throttling and bottleneck issues, before starting to expand its capacity. Although the risk of full saturation of its servers is hereby avoided, the computing and storage capacity is never fully made use of. While usage can grow linearly or exponentially, costs will rise in discrete increments, referring to distinct expansions of server capacity.
- When usage decreases: When the usage decreases, the additional capacity simply stands there, unused. Even if the decrease in usage lasts for longer periods of time, it is not that simple to just sell the hardware, free up the physical space, and get rid of the extra maintenance personnel. In most situations, this results in costs remaining unchanged despite the reduced usage.
Cloud computing usually follows a pay-as-you-go (PAYG) business model. This solves the two challenges of cost optimization during variable usage. PAYG allows businesses to match their costs to their usage, avoiding disparities, as can be seen in the following diagram:
Figure 1.4 – Cost patterns depending on usage for on-premises and cloud infrastructure
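The cost patterns in Figure 1.4 can be sketched with a toy model: on-premises capacity (and therefore cost) grows in discrete server-sized steps and is paid for whether used or not, while pay-as-you-go cost tracks usage directly. All prices and capacities are made-up illustration values.

```python
# Toy cost model: on-premises cost rises in discrete server-sized steps;
# PAYG cloud cost scales with actual usage. All numbers are illustrative.
import math

SERVER_CAPACITY = 100   # units of usage one server can absorb
SERVER_COST = 1000      # cost per provisioned server (paid regardless of use)
PAYG_UNIT_COST = 8      # cloud cost per unit of actual usage

def on_prem_cost(usage):
    servers = max(1, math.ceil(usage / SERVER_CAPACITY))
    return servers * SERVER_COST

def cloud_cost(usage):
    return usage * PAYG_UNIT_COST

for usage in (10, 90, 110):
    print(usage, on_prem_cost(usage), cloud_cost(usage))
# At usage 110 the on-prem cost jumps to a second server (2000),
# while the PAYG cost has only grown to 880.
```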
Let’s cover the next challenge now.
Agility
Whereas scalability concerns whether it is possible to make a certain change at all, agility refers to the speed at which businesses can implement new changes. Expanding or reducing capacity, changing the types of processing power, and so on takes time in an on-premises environment. In most cases, this involves acquiring new hardware, installing the new compute, and configuring security, all of which can be extremely time-consuming in a business context.
Here, cloud architectures benefit from far superior agility over on-premises architectures. Scaling capacity up or down, changing memory-optimized processors for compute-optimized processors: all of this is performed in a matter of seconds or minutes.
Flexibility
The challenge of flexibility can be interpreted very broadly and intersects with the other challenges. Difficulties with scalability and agility can be seen as types of flexibility issues.
Apart from difficulties regarding scalability and agility, on-premises servers face the issue of constant hardware modernization. In this case, we could compare on-premises and cloud infrastructure to a purchased car or a rental car respectively. There is not always the need to make use of cutting-edge technology, but if the need is present, think about which option will result in having a more modern car in most situations.
In other cases, specialized hardware such as field-programmable gate arrays (FPGAs) might be required for a short period of time—for example, during the training of an extraordinarily complex ML model. To revisit the car example, would you rather purchase a van when you occasionally have to move furniture or rent a van for a day while moving?
Let’s summarize the chapter next.
Summary
In this chapter, we first discussed how to extract value from your data by asking the right analytical questions. Questions increase in complexity from descriptive, diagnostic, and predictive to prescriptive, but may also hold more value. A complexity-value matrix helps prioritize data projects and build a data roadmap. A crucial thing to remember is to capture data as soon as possible, even if you don’t have a data strategy or roadmap yet. Any data that you do not capture now cannot be used to extract value in the future. Next, we introduced a reference architecture diagram. Over time, you will get familiar with every component of the diagram and how they interact with each other.
Four layers of cloud architectures were explained. The ingestion layer is used to pull data into the central cloud data platform. The storage layer is capable of holding massive amounts of data, often in a tiered system, where data gets more business-ready as it moves through the tiers. In the serving layer, the data warehouse is located, which holds data with a strictly enforced schema and is optimized for analytical workloads. Lastly, the consumption layer allows end users and external systems to consume the data in reports and dashboards or to be used in other applications.
Some components of the data platform span across multiple layers. Data orchestration and processing refers to data pipelines that ingest data into the cloud, move data from one place to another, and orchestrate data transformations. Advanced analytics leverages Azure’s many pre-trained ML models and a data science environment to perform complex calculations and provide meaningful predictions. Data governance tools bring data asset compliance, flexible access control, data lineage, and overall insights into the entire data estate. Impeccable security of individual components as well as the integrations between them takes away many of the worries regarding harmful actions being made by third parties. Finally, the extensive monitoring capabilities in Azure allow us to get insights into the health and performance of the processes and data storage in the platform.
Finally, we discussed the drawbacks that on-premises architectures face, such as scalability, cost optimization, agility, and flexibility. These challenges are often conveniently dealt with by leveraging the benefits of cloud-based approaches.
In the next chapter, we will look at two Microsoft frameworks that ease the move to the cloud.