Is the Data Warehouse dead with the advent of Data Lakes? There is disagreement everywhere about the need for Data Warehousing in a modern data estate. With the rise of Data Lakes and Big Data technology, many people use other, newer technologies compared to databases for their analytical efforts. Establishing a data-driven company seems to be possible without all those narrow definitions and planned structures, the ETL/ELT, and all the indexing for performance. But when we examine the technology carefully, when we compare the requirements that are formulated in analytical projects, free of prejudice to the functionality that the chosen services or software packages can deliver, we often find gaps on both ends. This chapter discusses the capabilities of Data Warehousing and Data Lakes and introduces the concept of the Modern Data Warehouse.
With all the innovations that have been brought to us in the last few years, such as faster hardware, new technologies, and new dogmas such as the Data Lake, older concepts and methods are being questioned and challenged. In this chapter, I would like to explore the evolution of the analytical world and try to answer the question, is the Data Warehouse really obsolete?
We'll find out by covering the following topics:
- Distinguishing between Data Warehouses and Data Lakes
- Understanding the opportunities of modern cloud computing
- Exploring the benefits of AI and ML
- Answering the question
Distinguishing between Data Warehouses and Data Lakes
There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as the database and the model used in the database; the wider descriptions look at the term as a method, and a suitable collection of all organizational and technological components that make up a BI solution. They talk about everything from the Extract Transform Load tool (ETL tool) to the database, the model, and, of course, the reporting and dashboarding solution.
Understanding Data Warehouse patterns
When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time. Facts are normally quite narrow objects, but with a lot of rows stored.
In the Dimensions, we will find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on their report or dashboard to aggregate and group the fact data, filter it, and view it is collected in the Dimensions. All the data related to customer information, such as Product, Contract, Address, and so on, that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, that look at the given modeling method.
Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystalized over the years of its evolution:
- Star-Join/Snowflake: This is probably the most famous method for Data Warehouse modeling. Fact tables are put in the center of the model, while Dimension tables are arranged around them, inheriting their Primary Key into the Fact table. In the Star-Join method, we find a lot of redundancy in the tables since as all the Dimension data, including all hierarchical information (such as Product Group -> Product SubCategory -> Product Category) regarding a certain artifact (Product, Customer, and so on), is stored in one table. In a Snowflake schema, hierarchies are spread in additional tables per hierarchy level and are linked over relationships with each other. This, when expressed in a graph, turns out to show a kind of snowflake pattern.
- Data Vault: This is a newer method that reflects the rising structural volatility of data sources, which offers higher flexibility and speed for developing. Entities that need to be analyzed are stored over Hubs, Satellites, and Links. Hubs simply reflect the presence of an entity by storing its ID and some audit information such as its data source, create times, and so on. Each hub can have one or more Satellite(s). These Satellites store all the descriptive information about the entity. If we need to change the system and add new information about an entity, we can add another Satellite to the model, reflecting just the new data. This will bring the benefit of non-destructive deployments to the productive system in the rollout. In the Data Vault, the Customer data will be stored in one Hub (the CustomerID and audit columns) and one or more Satellites (the rest of the customer information). The structure in the model is finally brought by Links. They are provided with all the Primary Keys of the Hubs and, again, some metadata. Additionally, Links can have Satellites of their own that describe the relationships of the Link content. Therefore, the connection between a Customer and the Products they bought will be reflected in a Link, where the Customer-Hub-Key and the Product-Hub-Key are stored together with audit columns. A Link Satellite can, for example, reflect some characteristics of the relationship, such as the amount bought, the date, or a discount. Finally, we can even add a Star-Join-View schema to abstract all the tables of the Data Vault and make it easier to understand for the users.
- 3rd Normal form: This is the typical database modeling technique that is also used in (and first created for) so-called Online Transactional Processing (OLTP) databases. Artifacts are broken up into their atomic information and are spread over several tables so that no redundancy is stored in either table. The Product information of a system might be split in separate tables for the product's name, color, size, price information, and many more. To derive all the information of a 3rd Normal Form model, a lot of joining is necessary.
But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), but depending on the sequence we implement it in, it may be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. This can be done by implementing specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration services (SSIS), Informatica, Talend, or IBM Data Stage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include its graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which is very easy to understand and maintain.
There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
Understanding Data Warehouse layers
Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
- Landing or staging area: Data is imported into this layer in its rawest format. Nothing is changed on its way to this area and only audit information is added to track down all the loads.
- QS, transient, or cleansing area: This is where the work is done. You will only find a few projects where data is consistent. Values, sometimes even mandatory ones from the source, may be missing, content might be formatted incorrectly or even corrupted, and so on. In this zone, all the issues with the source data are taken care of, and the data is curated and cleansed.
- Core Data Warehouse: In this part of the database, the clean data is brought into the data model of choice. The model takes care of data historization, if required. This is also one possible point of security – or, even better, the central point of security. Data is secured so that all participating users can only see what they can see. The Core Data Warehouse is also the place where performance is boosted. By using all the available database techniques, such as indexes, partitioning, compression, and so on, the Data Warehouse will be tuned to suit the performance needs of the reporting users. The Core Data Warehouse is often also seen as the presentation layer of the system, where all kinds of reporting and dashboarding and self-service BI tools can connect and create their analysis.
Additionally, Data Warehouses have also brought in other types of layers. There might be sublayers for the cleansing area, for example, or the Data Marts, which are used to slice data semantically for the needs of certain user groups, or the Operational Data Store (ODS), which has different definitions, depending on who you ask. Sometimes, it is used as an intermediary data store that can be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is also used as a kind of archived data pool that the Data Warehouse can always be reproduced from. The definitions vary.
Implementing reporting and dashboarding
In terms of the reporting/dashboarding solution, we will also find several approaches to visualize the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Service or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.
The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.
Loading bigger amounts of data
You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.
Starting with Data Lakes
It's funny that we find a similar mix or maybe even call it confusion when we examine the term Data Lake. Many people refer to a Data Lake as a Hadoop Big Data implementation that delivers one or more clusters of computers, where a distributed filesystem and computational software is installed. It can deal with a distributed Input and Output (I/O) on the one hand but can also do distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it just SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, as we mentioned while discussing Data Warehouses discussion, the narrower definition of the Data Lake only refers to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.
With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, a system is delivered that is even capable of storing data distributed over cheap legacy hardware clusters. HDFS splits the files into blocks and will also replicate those blocks within the system. This not only delivers a failsafe environment, but it also generates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several data blocks in parallel, and with increased speed, because it can access several blocks of a file with several parallel threads.
With the possibility to hand files to a filesystem and start analyzing them just where they are, we enter another dogma.
MapReduce, the programming paradigm, supports parallel processing many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. Different to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed, right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.
In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file will reflect an event, and each row also consists of, let's say, 100 columns. Using Schema-On-Read, you can just decide to only count the rows of all the files. You won't need to cut the rows into the columns for this process. For other purposes, you might need to split the rows during the reading process into their own columns to access the detailed information. You might need to predict a machine failure based on the content of the columns. Using a database, you can also mimic this behavior, but you would need to create different objects and selectively store data for each intention.
Understanding the Data Lake ecosystem
HDFS is the center of a Data Lake system. But we can't get further with just a filesystem, even one as sophisticated as this one. The Hadoop open source world around HDFS has therefore developed many services to interact and process the content that's kept in distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark as a large-scale, in-memory analytical engine; Storm as a distributed real-time streaming analysis engine; Mahout for machine learning; Hbase as a database; Oozie, for workflows; and many more.
Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in the recent years. Scala, one of the main programming languages used with Spark, is a Java-based programming language. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python also made it into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. Dealing with data has been important for so many years and guess what – SQL is still a language that was not possible to skip in such an environment.
These services and languages are making the open source Hadoop ecosystem a rich, complex engine for processing and analyzing excessive amounts of data. They are available all over the clusters and can interact with the HDFS to make use of the distributed files and compute.
Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined project. Just starting it from an "I want that too!" position would be the worst driver you can have, and unfortunately happens far too often. When you start with purpose and find the right tool for that purpose, at least that selection might create the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.
Comparing Data Lake zones
Very similar to the Data Warehouse layers, as we discussed previously, a Data Lake is also structured in different layers. These layers form the stages that the data must go through on its way to be formed into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written to "as-is" in the exact format as it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse. Data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other (similar) concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, the Data Lakes can include other additional zones. For example, zones for master data, user sandboxes, or logging areas for your routines. We can have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll need to decide which attributes you want to analyze or are needed by your machine learning models for training. This will, after all, force you to structure your sources and will therefore force you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.
If you're only analyzing tabular-oriented data, maybe it's worth checking if your hardware, your ETL-tool, and your company databases can scale to the needed volume.
This question, along with the nature of the source's data, complexity, and format, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database. SQL does not offer language elements for this (you can add programming extensions to your database). But here, we have a definitive marker for the possible usage of technologies other than databases.
Once you have analyzed the so-called unstructured data, you will structure the results back into a tabular-oriented result set as an array (a tabular representation) of data. This needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?
The typical reporting tools still require tables or table-like source data to work with. Often, they will import the source data into their internal database to be able to quickly answer reporting and dashboarding operations by their users. Experiences with these visualization tools have shown that it is not very performant to directly report from a vast number of files from a Data Lake. Data always needs to be condensed into digestible chunks for these purposes and should then also be stored in a suitable and accessible way.
Funnily enough, many of the services in the Hadoop ecosystem are equipped with similar functionality to the ones that databases have offered for ages now and are optimized for. Data Warehouse databases are more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it still can't update the data and can't run queries with nested subqueries in Hive-QL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).
The Data Lake approaches are still missing a fine-grained security model that can deliver centralized security mechanisms to control Row-Level-Security and Column-Level-Security over data while also allowing an easy-to-implement mechanism for Data Masking.
Understanding the opportunities of modern cloud computing
Hyperscale cloud vendors such as Microsoft, AWS, and Google, where you can just swipe your credit card and get storage, a web server, or provision a VM of your choice, are a real game changer in the IT world and started an unstoppable revolution. The speed at which new applications are developed and rolled out gets faster and faster, and these cloud technologies reduce time-to-market dramatically. And the cloud vendors don't stop there. Let's examine the different types of offerings they provide you as their customer.
Infrastructure-as-a-Service (IaaS) was only the start. Foremost, we are talking about delivering VMs You can set up a system of VMs just as you would within your own data center, with domain controllers, database and application servers, and so on. Another example for IaaS would be storage that can be integrated just like a virtual network drive into the customers' workspaces.
Sure, a VM with a pre-installed OS and maybe some other software components is a nice thing to have. And you even can scale it if you need more power. But still, you would need to take care of patching the operating system and all the software components that have been installed, the backup and disaster recovery measurements, and so on.
Nowadays, cloud vendors are making these activities easier and help automate this process. But still, this is not the end of the speed, automation, and convenience that a cloud offering could bring:
We'll now check out PaaS.
Platform-as-a-Service (PaaS) soon gained the attention of cloud users. Need a database to store data from your web app? You go to your cloud portal, click the necessary link, fill in some configuration information and click OK, and there, you are done... your web app now runs on a service that can offer far better Service-Level Agreements (SLAs) than the tower PC under the desk at your home.
For example, databases, as one interesting representative of a PaaS, offer automated patching with new versions, backup, and disaster recovery deeply integrated into the service. This is also available for a queue service or the streaming component offered on the cloud.
For databases, there are even point-in-time restores from backups that go quite far back in time, and automated geo redundancy and replication are possible, among many more functions. Without PaaS support, an administrator would need to invest a lot of time into providing a similar level of maintenance and service.
No wonder that other use cases made it to the Hyperscalers' (the big cloud vendors, which can offer nearly unlimited resources) offerings. This is also why they all (with differing levels and completeness, of course) offer PaaS capabilities to implement a Modern Data Warehouse/Data Lakehouse in their data centers nowadays.
Some of these offerings are as follows:
- Storage components that can store nearly unlimited volumes of data.
- Data integration tools that can access on-premises, cloud, and third-party software that reaches compute components.
- The ability to scale and process data in a timely manner.
- Databases that can hold the amount of data the customer needs and answer the queries against that data with enough power to fulfill the time requirements set by users.
- Reporting/dashboarding tools, or the connectivity for tools of other vendors, to consume the data in an understandable way.
- DevOps life cycle support is available to help you develop, version, deploy, and run an application and gain even more speed with a higher quality.
The next evolutionary step was into Software-as-a-Service (SaaS), where you can get an "all-you-can-eat" software service to fulfill your requirements, without implementing the software at all. Only the configuration, security, and connection need to be managed with a SaaS solution. Microsoft Dynamics 365 is an example of such a system. You can use an Enterprise Resource Planning (ERP) suite from the cloud without the need to have it licensed and installed in your data center. Sure, the right configuration and the development of the ERP in the cloud still needs to be done, but the Hyperscaler cloud vendor and/or the SaaS provider relieves you of which server hardware, backup and restore, SLAs, and so on to use.
However, not all cloud users want to or can rely on predefined software to solve their problems. The need for individual software is still high and, as we discussed previously, the virtualization options that are available on the cloud platforms are very attractive in this case.
Examining the possibilities of virtual machines
Development can follow a few different options. Microsoft's Azure, for example, offers different capabilities to support application development on this platform. Starting with VMs, you can make use of any setup and configuration for this application that you can think of. Spin the VM, install and run the app, and off you go.
But don't forget – everything that you need to run the app needs to be installed and configured upfront. Maybe there are VM images available that come predefined in a certain configuration. Still, you as the developer are, after spinning it up, responsible for keeping the operating system and all the installed components up-to-date and secure.
The cloud vendor, however, supports you with automation features to help you keep up with changes – to back up the VM, for example – but like every individual server or PC, a VM is a complete computer and brings that level of complexity.
Let's assume you need to run a small piece of software to offer a function or one or more microservices; a cloud-based VM may be overkill. Even with all the cloud advantages of having a VM without the need of spinning up and maintaining your own hardware and data center, this can be quite costly and more complex than what is really needed.
Again, the Hyperscaler can help as it has different offerings to overcome this complexity. Looking at Microsoft's example, there are different approaches to solving this issue.
Understanding Serverless Functions
Serverless, in this case, means that the underlying infrastructure will scale with increasing and decreasing requests against the function, without needing the developer or the administrator to manually scale the backing components. This can be configured using different consumption plans, with different limits to keep control of the cost.
Cost, besides implementation speed and flexibility, is one of the most important factors when moving to cloud platforms. With the example of Serverless Functions, we find a great template for the elasticity of cloud services and the possibilities of saving money in comparison to on-premises projects.
Using a seamless scaling service for your functionality gives you the possibility to start small and start experimenting, without spending too much money. Should you find out that your development trajectory points to a dead end, you can directly stop all efforts. You can then archive or even delete the developed artifacts and restart your development with a new approach, without spending a fortune purchasing new hardware and software licenses.
Time is the investment here. Maybe these possibilities will lead to more and better judgement about the status and viability of a project that is underway. And maybe, in the future, less time will be wasted because of investments that need to be justified and followed. However, this may lead to quick-started, half-baked artifacts that are rolled out too early just because they can be started easily and cheaply. This is a situation that might lead to expensive replacement efforts and a "running in circles" situation. Therefore, a cost-effective runtime environment and a quick and easy development and deployment situation should still be approached carefully, and designed and planned to a certain degree.
Looking at the importance of containers
Container technologies are another chance to benefit from the cloud's elastic capabilities. They come with some more detailed possibilities for development and configuration in comparison to Serverless Functions. Every Hyperscaler nowadays offers one or more container technologies.
Containers enable you to deploy modularized functionalities into a runtime environment. They abstract from the OS, support all the necessary libraries and dependencies, and can hold any routine or executable that you might need. In comparison to VMs, containers can easily be ported and deployed from one OS to another. The containers that are running in a certain environment use the kernel of the OS and share it with each other.
VMs run their own OSes, apart from each other's, so they will also need far more maintenance and care. Within a container, you can concentrate on the logic and the code you want to implement, without the need to organize the "computer" around that. This also leads to a far smaller footprint for a container compared to VMs. Containers also boot far quicker than VMs and can be available quickly in an on-demand fashion. This leads to systems that can react nearly instantly to all kinds of events. A container failing, for example, can be mitigated by spinning up the image and redirecting the connection to it internally.
Therefore, in terms of their basic usage, containers are stateless to ensure quick. This adds up to the elasticity of cloud services. Developing and versioning of container modules eases their usage and increases the stability of applications based on this technology. You can roll back a buggy deployment easily, and deployments can be modularized down to small services (microservices approach).
Containers can be developed and implemented on your laptop and then deployed to a cloud repository, where they can be instantiated by the container runtime of choice.
In Microsoft Azure, the offering for containers comes from the container registry, where the typical container technologies can be placed and instantiated from. To run containers on Azure, you can use different offerings, such as Azure Container Instances or Azure Kubernetes Service (AKS). Azure Red Hat OpenShift and Azure Batch can also be used to run containers in some situations. It is also worth mentioning Azure Service Fabric and Azure App Service, which can host, use, or orchestrate containers on Azure:
Exploring the advantages of scalable environments
Modern Data Warehouse requirements may reach beyond the functionalities of the typical out-of-the-box components of the ETL/ELT tools. As serverless functions and containers can run a wide variety of programming languages, they can be used to add nearly any function that is not available by default. A typical example is consuming streaming data as it pours into the platform to the queuing endpoints. As some of the streaming components might not offer extended programming functionality, serverless functions and/or containers can add the needed features, and can also add improved performance while keeping the overall solution simple and easy to implement.
Implementing elastic storage and compute
Talking about the possibilities created by cloud computing in the field of data, analytics, and AI, I need to mention storage and compute and the distinction between the two. To scale cloud components to fulfill volume or performance requirements, there are often services where the two are closely coupled. Looking at VMs, for example, higher performance measures such as virtual CPUs automatically require more local storage, which will always cause higher costs in both dimensions. This is also true for scaling many databases. Adding vCores to a database will also add memory and will scale diskspace ranges. Looking at the databases on Azure, for example, the diskspace can be influenced in certain ranges, but these ranges are still coupled to the amount of compute that the database consumes.
The development of serverless services in the data and AI sector is leading to new patterns in storage and compute. Many use cases benefit from the separate scaling possibilities that this distinction offers. There are also complex computations on smaller datasets whose internal complexity needs some significant computational power, be it for a high amount of iterations or wider join requirements. Elastically increasing compute for a particular calculation is an opportunity to save money. Why?
As you spin up or scale a computational cluster and take it back down once the computation is done, you only pay for what you consume. The rest of the time, you run the clusters in "keep-the-lights-on" mode at a lower level to fulfill routine requests. If you don't need a compute component, you can take it down and switch it off:
More and more, we even see automatic scaling for compute components, such as in Spark clusters (as shown in the preceding screenshot). You can decide on lower and upper limits for the cluster to instantly react on computational peaks when they occur during a computation. The clusters will even go into hibernation mode if there is no computation going on. This can also be accomplished with the database technology that makes up the Modern Data Warehouse, for example, on Microsoft Azure.
As soon as a database is not experiencing high demand, you can scale it back down to a minimum to keep it working and answer random requests. Or, if it is not needed at all, you just switch it off and avoid any compute costs during that time. This comes in handy when you're working in development or test systems, or databases that are created for certain purposes and aren't needed 24/7. You can even do this with your production systems, which are only used during office times for an ELT/ETL load.
Hyperscaler cloud vendors don't limit you in terms of the amount and formats of data you can put into your storage service. They also often offer different storage tiers, where you can save money by deciding how much data needs to be in hot mode, a slightly more expensive mode for recurring read/write access, or "cool" mode, which is cheaper for longer periods without access. Even archive modes are available. They are the cheapest modes and intended for data that must be kept but will be accessed only in rare cases.
Cloud storage systems act just like a hard drive would in your laptop, but with far greater capacity. Looking at a Modern Data Warehouse that wants to access the storage, you would need sufficient performance when writing to or reading from such a storage. And performance is always worked on at the big cloud vendors.
Talking about Azure specifically, many measures are taken to improve performance, but also stability and reliance. One example is the fact that files are tripled and spread over different discs in the data center; they are also split into blocks, such as on HDFS, for parallel access and performance gains. This, with other, far more sophisticated techniques, increases the reading speed for analysis significantly and adds to the stability, availability, and reliability of the system.
The cloud vendors, with their storage, computational capabilities, and the elasticity of their service offerings, will lay the foundation for you to build successful and financial competitive systems. The nature of the pay-as-you-go models will make it easy for you to get started with a cloud project, pursue a challenge, and succeed with a good price/performance ratio. And, looking at the PaaS offerings, a project can be equipped with the required components in hours instead of weeks or even months. You can react, when the need arises, instead of purchasing new hardware and software over lengthy processes. If a project becomes obsolete or goes down the wrong path for any reason, deleting the related artifacts and eliminating the related cost can be done very easily.
Cloud technology can help you get things done more quickly, efficiently, and at a far lower cost. We will be talking about the advantages of cloud security later in this book.
Exploring the benefits of AI and ML
Companies start building AI projects and massively rely on those statistical functions and the math behind them to predict customer churn, recognize images, detect fraud, mine knowledge, and so much more. ML projects often begin as part of a big data or Data Warehouse project, but this can also be the other way around; that is, the start of an AI or machine learning project often leads to the development of an analytical system.
As my data scientist colleagues tell me, if you want to be able to really predict an event based on incoming data, a machine learning model needs to be trained on quite a large amount of data. The more data you can bring to train a model, the better and more accurate the model will be in the end.
A wonderful experiment to do with image recognition, for example, can be done at www.customvision.ai. You can start by examining one of the example projects there. I like the "Chocolate or Dalmatian" example.
This is a nice experiment that did not need too much of input to enable the image recognizer to distinguish between Stracciatella Chocolate ice cream and Dalmatian dogs. When you try to teach the system on different images and circumstances, you might find out that you need far more training images than six per group.
Understanding ML challenges
I have experimented with the service and uploaded images of people in an emergency versus images of people relaxing or doing Yoga or similar. I used around 50 – 60 images for each group and still didn't reach a really satisfying accuracy (74%).
With this experiment, I even created a model with a bias that I first didn't understand myself. There were too many "emergency" cases being interpreted incorrectly as "All good" cases. By discussing this with my data scientist colleagues and examining the training set, we found out why.
There were too many pictures in the "All good" training set that showed people on grass or in nature, with lots of green around them. This, in turn, led the system to interpret green as a signifier of "All good," no matter how big the emergency obviously was. Someone with their leg at a strange angle and a broken arm, in a meadow? The model would interpret it as "All good."
In this sandbox environment, I did no harm at all. But imagine a case where a system is used to help detect emergency situations, and hopefully kickstart an alarm those vital seconds earlier. This is only a very basic example of how the right tool in the wrong hands might cause a lot of damage.
There are so many different cases where machine learning can help increase the accuracy of processes, increase their speed, or save them money because of the right predictions – such as predicting when and why a machine will fail before it actually fails, helping to mine information from massive data, and more.
Sorting ML into the Modern Data Warehouse
How does this relate to the Modern Data Warehouse? As we stated previously, the Modern Data Warehouse does not only offer scalable, fast, and secure storage components. It also offers at least one (and, in the context of this book, six) compute component(s) that can interact with the storage services and can be used to create and run machine learning models at scale. The "run" can be implemented in batch mode, near-real time or even in real time, depending on the streaming engine used. That Modern Data Warehouse can then store the results of ML calculations into a suitable presentation layer to provide this data to the downstream consumers, who will process the data further, visualize it, draw their insights from it, and take action. The system can close the loop using the enterprise service bus and the integration services available to feed back insights as parameters for the surrounding systems.
Understanding responsible ML/AI
A responsible data scientist will need tools that support them conducting this work properly. The buzzword of the moment in that area is machine learning operations (MLOps). One of the most important steps of creating a responsible AI is having complete traceability/auditability of the source data, and the versions of the datasets used to train and retrain a certain model at a certain timestamp. With this information, the results of the model can also be audited and interpreted. This information is vital when it comes to audits and traceability regarding legal questions, for instance. The collaborative aspects of an MLOps-driven environment are another important factor.
We can find a definition of Responsible AI, following the principles of Fairness, Reliability and Safety, Privacy and Security, Inclusiveness, and Transparency and Accountability, at https://www.microsoft.com/en-us/ai/responsible-ai.
An MLOps environment embedded in the Modern Data Warehouse will be another puzzle part of the bigger picture and helps integrate those principles into the analytical estate of any company. With the tight interconnectivity between data services, storage, compute components, streaming services, IoT technology, ML and AI, and visualization tools, the world of analytics today offers a wide range of possibilities at a far lower cost than ever before. The ease of use of these services and their corresponding productivity is constantly growing.
Answering the question
Given everything we have learned, is the Data Warehouse dead because of the availability of Data Lakes?
Not at all! What if you could combine the two philosophies and benefit from the advantages of both? The terms that are used for this new approach, which is powered by the availability, scalability, and elasticity offered by the cloud vendors, are "the Data Lakehouse" and, the term that I like more, the Modern Data Warehouse.
We will explore this concept in more detail in Chapter 2, Connecting Requirements and Technology.
In this chapter, we examined the differences between Data Warehouses and Data Lakes and the advantages of both approaches. We have the structured model of the Data Warehouse with its security, tuning possibilities, and accessibility for Self-Service BI on the one hand, and the capabilities of Data Lake systems to process vast amounts of data in high performance to support machine learning on the other.
Both concepts, when implemented in isolation, can help solve certain problems. However, even with the growing data, the disparate source systems, their various formats, and the required speed of delivery, as well as the requirements for security and usability, neither can succeed on their own: there is life in the old Data Warehouse yet!
The combination of the two concepts, together with the extended offerings of the Hyperscaler cloud vendors such as virtualization, container offerings, and serverless functions can open new opportunities in terms of the flexibility, agility, and speed of implementation. We are getting the best of both worlds.
In the next chapter, we will discuss a generic architecture sketch. You will learn about the different building blocks of a Modern Data Warehouse approach and how to ask the right questions during your requirements engineering process. In the second part of the next chapter, we will examine Azure Data Services and PaaS components. We'll explore alternative components for different sizes and map the Azure Services to the Modern Data Warehouse architecture.