Is the Data Warehouse dead with the advent of Data Lakes? There is disagreement everywhere about the need for Data Warehousing in a modern data estate. With the rise of Data Lakes and Big Data technology, many people use other, newer technologies compared to databases for their analytical efforts. Establishing a data-driven company seems to be possible without all those narrow definitions and planned structures, the ETL/ELT, and all the indexing for performance. But when we examine the technology carefully, when we compare the requirements that are formulated in analytical projects, free of prejudice to the functionality that the chosen services or software packages can deliver, we often find gaps on both ends. This chapter discusses the capabilities of Data Warehousing and Data Lakes and introduces the concept of the Modern Data Warehouse.
With all the innovations that have been brought to us in the last few years, such as faster hardware, new technologies, and new dogmas such as the Data Lake, older concepts and methods are being questioned and challenged. In this chapter, I would like to explore the evolution of the analytical world and try to answer the question, is the Data Warehouse really obsolete?
We'll find out by covering the following topics:
There are several definitions of Data Warehousing on the internet. The narrower ones characterize a warehouse as the database and the model used in the database; the wider descriptions look at the term as a method, and a suitable collection of all organizational and technological components that make up a BI solution. They talk about everything from the Extract Transform Load tool (ETL tool) to the database, the model, and, of course, the reporting and dashboarding solution.
When we look at the Data Warehousing method in general, at its heart, we find a database that offers a certain table structure. We almost always find two main types of artifacts in the database: Facts and Dimensions.
Facts provide all the measurable information that we want to analyze; for example, the quantities of products sold per customer, per region, per sales representative, and per time. Facts are normally quite narrow objects, but with a lot of rows stored.
In the Dimensions, we will find all the descriptive information that can be linked to the Facts for analysis. Every piece of information that a user puts on their report or dashboard to aggregate and group the fact data, filter it, and view it is collected in the Dimensions. All the data related to customer information, such as Product, Contract, Address, and so on, that might need to be analyzed and correlated is stored here. Typically, these objects are stored as tables in the database and are joined using their key columns. Dimensions are normally wide objects, sometimes with controlled redundancy, that look at the given modeling method.
Three main methods for modeling the Facts and Dimensions within a Data Warehouse database have crystalized over the years of its evolution:
But how does data finally land in the Data Warehouse database? The process and the related tools are named Extract, Transform, Load (ETL), but depending on the sequence we implement it in, it may be referred to as ELT. You'll find several possible ways to implement data loading into a Data Warehouse. This can be done by implementing specialized ETL tools such as Azure Data Factory in the cloud, SQL Server Integration services (SSIS), Informatica, Talend, or IBM Data Stage, for example.
The biggest advantage of these tools is the availability of wide catalogues of ready-to-use source and target connectors. They can connect directly to a source, query the needed data, and even transform it while being transported to the target. In the end, data is loaded into the Data Warehouse database. Other advantages include its graphical interfaces, where complex logic can be implemented on a "point-and-click" basis, which is very easy to understand and maintain.
There are other options as well. Data is often pushed by source applications and a direct connection for the data extraction process is not wanted at all. Many times, files are provided that are stored somewhere near the Data Warehouse database and then need to be imported. Maybe there is no ETL tool available. Since nearly every database nowadays provides loader tools, the import can be accomplished using those tools in a scripted environment. Once the data has made its way to the database tables, the transformational steps are done using Stored Procedures that then move the data through different DWH stages or layers to the final Core Data Warehouse.
Talking about the Data Warehouse layers, we nearly always find several steps that are processed before the data is provided for reporting or dashboarding. Typically, there are at least the following stages or layers:
Additionally, Data Warehouses have also brought in other types of layers. There might be sublayers for the cleansing area, for example, or the Data Marts, which are used to slice data semantically for the needs of certain user groups, or the Operational Data Store (ODS), which has different definitions, depending on who you ask. Sometimes, it is used as an intermediary data store that can be used for reporting. Other definitions speak of a transient zone that stores data for integration and further transport to the DWH. Sometimes, it is also used as a kind of archived data pool that the Data Warehouse can always be reproduced from. The definitions vary.
In terms of the reporting/dashboarding solution, we will also find several approaches to visualize the information that's provided by the Data Warehouse. There are the typical reporting tools, such as SQL Server Reporting Service or Crystal Reports, for example. These are page-oriented report generators that access the underlying database, get the data, and render it according to the template being used.
The more modern approach, however, has resulted in tools that can access a certain dataset, store it in an internal caching database, and allow interactive reporting and dashboarding based on that data. Self-Service BI tools such as Power BI, QLIK, and Tableau allow you to access the data, create your own visuals, and put them together in dashboards. Nowadays, these tools can even correlate data from different data sources and allow you to analyze data that might not have made it to the Data Warehouse database yet.
You can scale databases with newer, faster hardware, more memory, and faster disks. You can also go from Symmetric Multi-Processing (SMP) to Massively Parallel Processing (MPP) databases. However, the usual restrictions still apply: the more data we need to process, the longer it will take to do so. And there are also workloads that databases will not support, such as image processing.
It's funny that we find a similar mix or maybe even call it confusion when we examine the term Data Lake. Many people refer to a Data Lake as a Hadoop Big Data implementation that delivers one or more clusters of computers, where a distributed filesystem and computational software is installed. It can deal with a distributed Input and Output (I/O) on the one hand but can also do distributed and parallel computation. Such a system adds specialized services for all kinds of workloads, be it just SQL queries against the stored data, in-memory computation, streaming analytics – you name it. Interestingly, as we mentioned while discussing Data Warehouses discussion, the narrower definition of the Data Lake only refers to the storage solution, and less to the method and the collection of services. To be honest, I like the wider definition far better, as it describes the method more holistically.
With the Hadoop Distributed File System (HDFS) from the Apache Software Foundation, a system is delivered that is even capable of storing data distributed over cheap legacy hardware clusters. HDFS splits the files into blocks and will also replicate those blocks within the system. This not only delivers a failsafe environment, but it also generates the biggest advantage of HDFS: parallel access to the data blocks for the consuming compute components. This means that a Spark cluster, for example, can read several data blocks in parallel, and with increased speed, because it can access several blocks of a file with several parallel threads.
With the possibility to hand files to a filesystem and start analyzing them just where they are, we enter another dogma.
MapReduce, the programming paradigm, supports parallel processing many files in this distributed environment. Every participating node runs calculations over a certain subset of all the files that must be analyzed. In the end, the results are aggregated by a driver node and returned to the querying instance. Different to the Data Warehouse, we can start analyzing data on a Schema-On-Read basis. This means we decide on the structure of the files that are processed, right when they are being processed. The Data Warehouse, in comparison, is based on a Schema-On-Write strategy, where the tables must be defined before the data is loaded into the database.
In the Data Lake world, you, as the developer, do not need to plan structures weeks ahead, as you would in the Data Warehouse. Imagine you have stored a vast quantity of files, where each row in a file will reflect an event, and each row also consists of, let's say, 100 columns. Using Schema-On-Read, you can just decide to only count the rows of all the files. You won't need to cut the rows into the columns for this process. For other purposes, you might need to split the rows during the reading process into their own columns to access the detailed information. You might need to predict a machine failure based on the content of the columns. Using a database, you can also mimic this behavior, but you would need to create different objects and selectively store data for each intention.
HDFS is the center of a Data Lake system. But we can't get further with just a filesystem, even one as sophisticated as this one. The Hadoop open source world around HDFS has therefore developed many services to interact and process the content that's kept in distributed storage. There are services such as Hive, a SQL-like Data Warehousing service; Pig, for highly parallel analysis jobs; Spark as a large-scale, in-memory analytical engine; Storm as a distributed real-time streaming analysis engine; Mahout for machine learning; Hbase as a database; Oozie, for workflows; and many more.
Spark clusters, with their ability to run distributed, in-memory analytical processes on distributed datasets, have influenced the market heavily in the recent years. Scala, one of the main programming languages used with Spark, is a Java-based programming language. It can be used to write high-performance routines for ETL/ELT, data cleansing, and transformation, but also for machine learning and artificial intelligence on massive datasets. Python also made it into this technology and is one of the go-to languages here. R, as the statistical programming language of choice for many data scientists, was also a must for the Spark offering. Dealing with data has been important for so many years and guess what – SQL is still a language that was not possible to skip in such an environment.
These services and languages are making the open source Hadoop ecosystem a rich, complex engine for processing and analyzing excessive amounts of data. They are available all over the clusters and can interact with the HDFS to make use of the distributed files and compute.
Tip
Jumping into a big data challenge should, just like a Data Warehouse project, always be a well-prepared and intensively examined project. Just starting it from an "I want that too!" position would be the worst driver you can have, and unfortunately happens far too often. When you start with purpose and find the right tool for that purpose, at least that selection might create the right environment for a successful project. Maybe we will find some alternatives for you throughout this book.
Very similar to the Data Warehouse layers, as we discussed previously, a Data Lake is also structured in different layers. These layers form the stages that the data must go through on its way to be formed into information. The three major zones are pretty much comparable. We can form a landing zone in the Data Lake, where data is written to "as-is" in the exact format as it comes from the source. The transient zone then compares to the cleansing area of the Data Warehouse. Data is transformed, wrangled, and massaged to suit the requirements of the analyzing users. As no one should access either the landing zone or the transient zone, the curated zone is where you're heading, and what you, as the developer, will allow your users to access. Other (similar) concepts talk about Bronze, Silver, and Gold areas, or use other terms. Just like in the Data Warehouse concept, the Data Lakes can include other additional zones. For example, zones for master data, user sandboxes, or logging areas for your routines. We can have several user- or group-related folder structures in the curated zone where data is aggregated, just like the Data Marts of the DWH.
So, yes, you can be flexible and agile. But in a Data Lake with Schema-On-Read, you'll need to decide which attributes you want to analyze or are needed by your machine learning models for training. This will, after all, force you to structure your sources and will therefore force you into a certain project life cycle. You will go through user stories, requirements analysis, structuring and development, versioning, and delivering artifacts.
If you're only analyzing tabular-oriented data, maybe it's worth checking if your hardware, your ETL-tool, and your company databases can scale to the needed volume.
This question, along with the nature of the source's data, complexity, and format, should be taken into account when you're deciding on the technology to use. If you need to process sound files, images, PDFs, or similar, then this is no job for a database. SQL does not offer language elements for this (you can add programming extensions to your database). But here, we have a definitive marker for the possible usage of technologies other than databases.
Once you have analyzed the so-called unstructured data, you will structure the results back into a tabular-oriented result set as an array (a tabular representation) of data. This needs to be delivered somehow to the recipients of your analysis. But how is that done in a Data Lake?
The typical reporting tools still require tables or table-like source data to work with. Often, they will import the source data into their internal database to be able to quickly answer reporting and dashboarding operations by their users. Experiences with these visualization tools have shown that it is not very performant to directly report from a vast number of files from a Data Lake. Data always needs to be condensed into digestible chunks for these purposes and should then also be stored in a suitable and accessible way.
Funnily enough, many of the services in the Hadoop ecosystem are equipped with similar functionality to the ones that databases have offered for ages now and are optimized for. Data Warehouse databases are more mature in many respects: when we look at Hive as the Data Warehouse service in Hadoop, for example, it still can't update the data and can't run queries with nested subqueries in Hive-QL. But over the years, it has been extended with all kinds of database-like functionality (views, indexes, and many more).
Important Note
The Data Lake approaches are still missing a fine-grained security model that can deliver centralized security mechanisms to control Row-Level-Security and Column-Level-Security over data while also allowing an easy-to-implement mechanism for Data Masking.
Hyperscale cloud vendors such as Microsoft, AWS, and Google, where you can just swipe your credit card and get storage, a web server, or provision a VM of your choice, are a real game changer in the IT world and started an unstoppable revolution. The speed at which new applications are developed and rolled out gets faster and faster, and these cloud technologies reduce time-to-market dramatically. And the cloud vendors don't stop there. Let's examine the different types of offerings they provide you as their customer.
Infrastructure-as-a-Service (IaaS) was only the start. Foremost, we are talking about delivering VMs You can set up a system of VMs just as you would within your own data center, with domain controllers, database and application servers, and so on. Another example for IaaS would be storage that can be integrated just like a virtual network drive into the customers' workspaces.
Sure, a VM with a pre-installed OS and maybe some other software components is a nice thing to have. And you even can scale it if you need more power. But still, you would need to take care of patching the operating system and all the software components that have been installed, the backup and disaster recovery measurements, and so on.
Nowadays, cloud vendors are making these activities easier and help automate this process. But still, this is not the end of the speed, automation, and convenience that a cloud offering could bring:
We'll now check out PaaS.
Platform-as-a-Service (PaaS) soon gained the attention of cloud users. Need a database to store data from your web app? You go to your cloud portal, click the necessary link, fill in some configuration information and click OK, and there, you are done... your web app now runs on a service that can offer far better Service-Level Agreements (SLAs) than the tower PC under the desk at your home.
For example, databases, as one interesting representative of a PaaS, offer automated patching with new versions, backup, and disaster recovery deeply integrated into the service. This is also available for a queue service or the streaming component offered on the cloud.
For databases, there are even point-in-time restores from backups that go quite far back in time, and automated geo redundancy and replication are possible, among many more functions. Without PaaS support, an administrator would need to invest a lot of time into providing a similar level of maintenance and service.
No wonder that other use cases made it to the Hyperscalers' (the big cloud vendors, which can offer nearly unlimited resources) offerings. This is also why they all (with differing levels and completeness, of course) offer PaaS capabilities to implement a Modern Data Warehouse/Data Lakehouse in their data centers nowadays.
Some of these offerings are as follows:
The next evolutionary step was into Software-as-a-Service (SaaS), where you can get an "all-you-can-eat" software service to fulfill your requirements, without implementing the software at all. Only the configuration, security, and connection need to be managed with a SaaS solution. Microsoft Dynamics 365 is an example of such a system. You can use an Enterprise Resource Planning (ERP) suite from the cloud without the need to have it licensed and installed in your data center. Sure, the right configuration and the development of the ERP in the cloud still needs to be done, but the Hyperscaler cloud vendor and/or the SaaS provider relieves you of which server hardware, backup and restore, SLAs, and so on to use.
However, not all cloud users want to or can rely on predefined software to solve their problems. The need for individual software is still high and, as we discussed previously, the virtualization options that are available on the cloud platforms are very attractive in this case.
Development can follow a few different options. Microsoft's Azure, for example, offers different capabilities to support application development on this platform. Starting with VMs, you can make use of any setup and configuration for this application that you can think of. Spin the VM, install and run the app, and off you go.
But don't forget – everything that you need to run the app needs to be installed and configured upfront. Maybe there are VM images available that come predefined in a certain configuration. Still, you as the developer are, after spinning it up, responsible for keeping the operating system and all the installed components up-to-date and secure.
The cloud vendor, however, supports you with automation features to help you keep up with changes – to back up the VM, for example – but like every individual server or PC, a VM is a complete computer and brings that level of complexity.
Let's assume you need to run a small piece of software to offer a function or one or more microservices; a cloud-based VM may be overkill. Even with all the cloud advantages of having a VM without the need of spinning up and maintaining your own hardware and data center, this can be quite costly and more complex than what is really needed.
Again, the Hyperscaler can help as it has different offerings to overcome this complexity. Looking at Microsoft's example, there are different approaches to solving this issue.
One of these approaches is Azure Serverless Functions, which you can use to implement functionality in different languages. You can choose between C#, Java, JavaScript, PowerShell, or Python to create the function. Then, you can deploy it to the environment and just rely on the runtime environment. Seamless integration with the rest of the services, especially the data services on the platform, is available to you.
Serverless, in this case, means that the underlying infrastructure will scale with increasing and decreasing requests against the function, without needing the developer or the administrator to manually scale the backing components. This can be configured using different consumption plans, with different limits to keep control of the cost.
Cost, besides implementation speed and flexibility, is one of the most important factors when moving to cloud platforms. With the example of Serverless Functions, we find a great template for the elasticity of cloud services and the possibilities of saving money in comparison to on-premises projects.
Using a seamless scaling service for your functionality gives you the possibility to start small and start experimenting, without spending too much money. Should you find out that your development trajectory points to a dead end, you can directly stop all efforts. You can then archive or even delete the developed artifacts and restart your development with a new approach, without spending a fortune purchasing new hardware and software licenses.
Time is the investment here. Maybe these possibilities will lead to more and better judgement about the status and viability of a project that is underway. And maybe, in the future, less time will be wasted because of investments that need to be justified and followed. However, this may lead to quick-started, half-baked artifacts that are rolled out too early just because they can be started easily and cheaply. This is a situation that might lead to expensive replacement efforts and a "running in circles" situation. Therefore, a cost-effective runtime environment and a quick and easy development and deployment situation should still be approached carefully, and designed and planned to a certain degree.
Container technologies are another chance to benefit from the cloud's elastic capabilities. They come with some more detailed possibilities for development and configuration in comparison to Serverless Functions. Every Hyperscaler nowadays offers one or more container technologies.
Containers enable you to deploy modularized functionalities into a runtime environment. They abstract from the OS, support all the necessary libraries and dependencies, and can hold any routine or executable that you might need. In comparison to VMs, containers can easily be ported and deployed from one OS to another. The containers that are running in a certain environment use the kernel of the OS and share it with each other.
VMs run their own OSes, apart from each other's, so they will also need far more maintenance and care. Within a container, you can concentrate on the logic and the code you want to implement, without the need to organize the "computer" around that. This also leads to a far smaller footprint for a container compared to VMs. Containers also boot far quicker than VMs and can be available quickly in an on-demand fashion. This leads to systems that can react nearly instantly to all kinds of events. A container failing, for example, can be mitigated by spinning up the image and redirecting the connection to it internally.
Therefore, in terms of their basic usage, containers are stateless to ensure quick. This adds up to the elasticity of cloud services. Developing and versioning of container modules eases their usage and increases the stability of applications based on this technology. You can roll back a buggy deployment easily, and deployments can be modularized down to small services (microservices approach).
Containers can be developed and implemented on your laptop and then deployed to a cloud repository, where they can be instantiated by the container runtime of choice.
In Microsoft Azure, the offering for containers comes from the container registry, where the typical container technologies can be placed and instantiated from. To run containers on Azure, you can use different offerings, such as Azure Container Instances or Azure Kubernetes Service (AKS). Azure Red Hat OpenShift and Azure Batch can also be used to run containers in some situations. It is also worth mentioning Azure Service Fabric and Azure App Service, which can host, use, or orchestrate containers on Azure:
Modern Data Warehouse requirements may reach beyond the functionalities of the typical out-of-the-box components of the ETL/ELT tools. As serverless functions and containers can run a wide variety of programming languages, they can be used to add nearly any function that is not available by default. A typical example is consuming streaming data as it pours into the platform to the queuing endpoints. As some of the streaming components might not offer extended programming functionality, serverless functions and/or containers can add the needed features, and can also add improved performance while keeping the overall solution simple and easy to implement.
Talking about the possibilities created by cloud computing in the field of data, analytics, and AI, I need to mention storage and compute and the distinction between the two. To scale cloud components to fulfill volume or performance requirements, there are often services where the two are closely coupled. Looking at VMs, for example, higher performance measures such as virtual CPUs automatically require more local storage, which will always cause higher costs in both dimensions. This is also true for scaling many databases. Adding vCores to a database will also add memory and will scale diskspace ranges. Looking at the databases on Azure, for example, the diskspace can be influenced in certain ranges, but these ranges are still coupled to the amount of compute that the database consumes.
The development of serverless services in the data and AI sector is leading to new patterns in storage and compute. Many use cases benefit from the separate scaling possibilities that this distinction offers. There are also complex computations on smaller datasets whose internal complexity needs some significant computational power, be it for a high amount of iterations or wider join requirements. Elastically increasing compute for a particular calculation is an opportunity to save money. Why?
As you spin up or scale a computational cluster and take it back down once the computation is done, you only pay for what you consume. The rest of the time, you run the clusters in "keep-the-lights-on" mode at a lower level to fulfill routine requests. If you don't need a compute component, you can take it down and switch it off:
More and more, we even see automatic scaling for compute components, such as in Spark clusters (as shown in the preceding screenshot). You can decide on lower and upper limits for the cluster to instantly react on computational peaks when they occur during a computation. The clusters will even go into hibernation mode if there is no computation going on. This can also be accomplished with the database technology that makes up the Modern Data Warehouse, for example, on Microsoft Azure.
As soon as a database is not experiencing high demand, you can scale it back down to a minimum to keep it working and answer random requests. Or, if it is not needed at all, you just switch it off and avoid any compute costs during that time. This comes in handy when you're working in development or test systems, or databases that are created for certain purposes and aren't needed 24/7. You can even do this with your production systems, which are only used during office times for an ELT/ETL load.
Hyperscaler cloud vendors don't limit you in terms of the amount and formats of data you can put into your storage service. They also often offer different storage tiers, where you can save money by deciding how much data needs to be in hot mode, a slightly more expensive mode for recurring read/write access, or "cool" mode, which is cheaper for longer periods without access. Even archive modes are available. They are the cheapest modes and intended for data that must be kept but will be accessed only in rare cases.
Cloud storage systems act just like a hard drive would in your laptop, but with far greater capacity. Looking at a Modern Data Warehouse that wants to access the storage, you would need sufficient performance when writing to or reading from such a storage. And performance is always worked on at the big cloud vendors.
Talking about Azure specifically, many measures are taken to improve performance, but also stability and reliance. One example is the fact that files are tripled and spread over different discs in the data center; they are also split into blocks, such as on HDFS, for parallel access and performance gains. This, with other, far more sophisticated techniques, increases the reading speed for analysis significantly and adds to the stability, availability, and reliability of the system.
The cloud vendors, with their storage, computational capabilities, and the elasticity of their service offerings, will lay the foundation for you to build successful and financial competitive systems. The nature of the pay-as-you-go models will make it easy for you to get started with a cloud project, pursue a challenge, and succeed with a good price/performance ratio. And, looking at the PaaS offerings, a project can be equipped with the required components in hours instead of weeks or even months. You can react, when the need arises, instead of purchasing new hardware and software over lengthy processes. If a project becomes obsolete or goes down the wrong path for any reason, deleting the related artifacts and eliminating the related cost can be done very easily.
Cloud technology can help you get things done more quickly, efficiently, and at a far lower cost. We will be talking about the advantages of cloud security later in this book.
Companies start building AI projects and massively rely on those statistical functions and the math behind them to predict customer churn, recognize images, detect fraud, mine knowledge, and so much more. ML projects often begin as part of a big data or Data Warehouse project, but this can also be the other way around; that is, the start of an AI or machine learning project often leads to the development of an analytical system.
As my data scientist colleagues tell me, if you want to be able to really predict an event based on incoming data, a machine learning model needs to be trained on quite a large amount of data. The more data you can bring to train a model, the better and more accurate the model will be in the end.
A wonderful experiment to do with image recognition, for example, can be done at www.customvision.ai. You can start by examining one of the example projects there. I like the "Chocolate or Dalmatian" example.
This is a nice experiment that did not need too much of input to enable the image recognizer to distinguish between Stracciatella Chocolate ice cream and Dalmatian dogs. When you try to teach the system on different images and circumstances, you might find out that you need far more training images than six per group.
I have experimented with the service and uploaded images of people in an emergency versus images of people relaxing or doing Yoga or similar. I used around 50 – 60 images for each group and still didn't reach a really satisfying accuracy (74%).
With this experiment, I even created a model with a bias that I first didn't understand myself. There were too many "emergency" cases being interpreted incorrectly as "All good" cases. By discussing this with my data scientist colleagues and examining the training set, we found out why.
There were too many pictures in the "All good" training set that showed people on grass or in nature, with lots of green around them. This, in turn, led the system to interpret green as a signifier of "All good," no matter how big the emergency obviously was. Someone with their leg at a strange angle and a broken arm, in a meadow? The model would interpret it as "All good."
In this sandbox environment, I did no harm at all. But imagine a case where a system is used to help detect emergency situations, and hopefully kickstart an alarm those vital seconds earlier. This is only a very basic example of how the right tool in the wrong hands might cause a lot of damage.
There are so many different cases where machine learning can help increase the accuracy of processes, increase their speed, or save them money because of the right predictions – such as predicting when and why a machine will fail before it actually fails, helping to mine information from massive data, and more.
How does this relate to the Modern Data Warehouse? As we stated previously, the Modern Data Warehouse does not only offer scalable, fast, and secure storage components. It also offers at least one (and, in the context of this book, six) compute component(s) that can interact with the storage services and can be used to create and run machine learning models at scale. The "run" can be implemented in batch mode, near-real time or even in real time, depending on the streaming engine used. That Modern Data Warehouse can then store the results of ML calculations into a suitable presentation layer to provide this data to the downstream consumers, who will process the data further, visualize it, draw their insights from it, and take action. The system can close the loop using the enterprise service bus and the integration services available to feed back insights as parameters for the surrounding systems.
A responsible data scientist will need tools that support them conducting this work properly. The buzzword of the moment in that area is machine learning operations (MLOps). One of the most important steps of creating a responsible AI is having complete traceability/auditability of the source data, and the versions of the datasets used to train and retrain a certain model at a certain timestamp. With this information, the results of the model can also be audited and interpreted. This information is vital when it comes to audits and traceability regarding legal questions, for instance. The collaborative aspects of an MLOps-driven environment are another important factor.
Note
We can find a definition of Responsible AI, following the principles of Fairness, Reliability and Safety, Privacy and Security, Inclusiveness, and Transparency and Accountability, at https://www.microsoft.com/en-us/ai/responsible-ai.
An MLOps environment embedded in the Modern Data Warehouse will be another puzzle part of the bigger picture and helps integrate those principles into the analytical estate of any company. With the tight interconnectivity between data services, storage, compute components, streaming services, IoT technology, ML and AI, and visualization tools, the world of analytics today offers a wide range of possibilities at a far lower cost than ever before. The ease of use of these services and their corresponding productivity is constantly growing.
Given everything we have learned, is the Data Warehouse dead because of the availability of Data Lakes?
Not at all! What if you could combine the two philosophies and benefit from the advantages of both? The terms that are used for this new approach, which is powered by the availability, scalability, and elasticity offered by the cloud vendors, are "the Data Lakehouse" and, the term that I like more, the Modern Data Warehouse.
We will explore this concept in more detail in Chapter 2, Connecting Requirements and Technology.
In this chapter, we examined the differences between Data Warehouses and Data Lakes and the advantages of both approaches. We have the structured model of the Data Warehouse with its security, tuning possibilities, and accessibility for Self-Service BI on the one hand, and the capabilities of Data Lake systems to process vast amounts of data in high performance to support machine learning on the other.
Both concepts, when implemented in isolation, can help solve certain problems. However, even with the growing data, the disparate source systems, their various formats, and the required speed of delivery, as well as the requirements for security and usability, neither can succeed on their own: there is life in the old Data Warehouse yet!
The combination of the two concepts, together with the extended offerings of the Hyperscaler cloud vendors such as virtualization, container offerings, and serverless functions can open new opportunities in terms of the flexibility, agility, and speed of implementation. We are getting the best of both worlds.
In the next chapter, we will discuss a generic architecture sketch. You will learn about the different building blocks of a Modern Data Warehouse approach and how to ask the right questions during your requirements engineering process. In the second part of the next chapter, we will examine Azure Data Services and PaaS components. We'll explore alternative components for different sizes and map the Azure Services to the Modern Data Warehouse architecture.
Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.
If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.
Please Note: Packt eBooks are non-returnable and non-refundable.
Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:
If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:
Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.
You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.
Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.
When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.
For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.