You're reading from Simplifying Data Engineering and Analytics with Delta

Product type: Book
Published in: Jul 2022
Publisher: Packt
ISBN-13: 9781801814867
Edition: 1st Edition

Author: Anindita Mahapatra

Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space, helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of its Extension School program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata; before that, she managed the development of algorithmic app discovery and promotion for both the Nokia and Microsoft app stores. She holds a Master's degree in Liberal Arts and Management from Harvard Extension School, a Master's in Computer Science from Boston University, and a Bachelor's in Computer Science from BITS Pilani, India.

Evolution of data systems

We have been collecting data for decades. The flat files of the 1960s gave way to the data warehouses of the 1980s, then to Massively Parallel Processing (MPP) and NoSQL databases, and eventually to data lakes. New paradigms continue to be coined, but it is fair to say that most enterprise organizations have settled on some variation of a data lake:

Figure 1.8 – Evolution of big data systems

Cloud adoption continues to grow, with even highly regulated industries such as healthcare and fintech embracing the cloud as a cost-effective way to keep pace with innovation; otherwise, they risk being left behind. Those who have cited security as a reason for staying off the cloud should note that many of the massive data breaches splashed across the media in recent years originated from on-premises setups. Cloud architectures face more scrutiny and are, in some ways, more governed and secure.

Rise of cloud data platforms

The data challenges remain the same. However, over time, the three major shifts in architecture offerings have been due to the introduction of the following:

  • Data warehouses
  • Hadoop heralding the start of data lakes
  • Cloud data platforms refining the data lake offerings

The use cases that we've been trying to solve for all three generations can be placed into three categories, as follows:

  • SQL-based BI Reporting
  • Exploratory data analytics (EDA)
  • ML

Data warehouses were good at handling modest volumes of structured data and excelled at BI reporting use cases, but they had limited support for semi-structured data and practically none for unstructured data. Their workloads supported only batch processing. Once ingested, the data was locked in a proprietary format, and the systems were expensive, so older data would be dropped to make room for new data. Also, because these systems ran at capacity, interactive queries had to wait for ingestion workloads to finish to avoid straining them. They had no built-in ML capabilities.

Hadoop came with the promise of handling large volumes of data of all types, along with streaming capabilities. In theory, all the use cases were feasible; in practice, they weren't. Schema on read meant that the ingestion path was greatly simplified, so people dumped their data in, but the consumption paths became more difficult. Managing a Hadoop cluster was complex, which made upgrading software versions a challenge. Hive, being SQL-like, was the most popular of the Hadoop stack offerings, but its access performance was slow. As a result, part of the curated data was pushed into data warehouses for fast, structured access. This left data personas stitching together two systems, which brought its fair share of fragility and increased end-to-end latency.
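
The schema-on-read trade-off described above can be illustrated with a small sketch (the table, field names, and data here are invented for illustration): a schema-on-write system rejects malformed rows at ingestion time, while a schema-on-read path accepts anything and defers the problem to each consumer.

```python
import json
import sqlite3

# Schema on write: the warehouse declares a schema up front and rejects
# rows that violate it at ingestion time.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE events (user TEXT NOT NULL, amount REAL NOT NULL)")
try:
    wh.execute("INSERT INTO events VALUES (?, ?)", ("ada", None))
except sqlite3.IntegrityError:
    pass  # the bad row never lands; the problem surfaces immediately

# Schema on read: raw records are dumped as-is; structure is imposed only
# when the data is consumed, so gaps go unnoticed until then.
raw_lines = ['{"user": "ada", "amount": 9.5}',
             '{"user": "lin"}']              # missing field is accepted silently
parsed = [json.loads(line) for line in raw_lines]
amounts = [rec.get("amount", 0.0) for rec in parsed]  # each consumer handles gaps
```

The ingestion path is trivially simple in the second case, but every downstream reader must now carry its own defensive handling, which is exactly why unmanaged data lakes grew difficult to consume.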

Cloud data platforms were the next entrants; they simplified infrastructure manageability and governance and delivered on the original promise of Hadoop. Extra attention went into preventing data lakes from turning into data swamps. The elasticity and scalability of the cloud helped contain costs and made the investment worthwhile, and these simplification efforts led to greater adoption by data personas.

The following diagram summarizes the end-to-end flow of big data, along with its requirements in terms of volume, variety, and velocity. The process varies on each platform as the underlying technologies are different. The solutions have evolved from warehouses to Hadoop to cloud data platforms to help serve the three main types of use cases across different industry verticals:

Figure 1.9 – The rise of modern cloud data platforms

SQL and NoSQL systems

SQL databases came first; NoSQL databases arose later, created with different semantics. There are several categories of NoSQL stores, which can roughly be classified as follows:

  • Key-Value Stores: For example, AWS S3 and Azure Blob Storage
  • Big Table Stores: For example, DynamoDB, HBase, and Cassandra
  • Document Stores: For example, CouchDB and MongoDB
  • Full Text Stores: For example, Solr and Elastic (both based on Lucene)
  • Graph Data Stores: For example, Neo4j
  • In-memory Data Stores: For example, Redis and MemSQL
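
A toy illustration of how two of these access models differ (the code uses in-memory stand-ins with invented names, not real client APIs): a key-value store supports lookups only by exact key, whereas a document store can query inside the documents themselves.

```python
# Key-value model: opaque blobs addressed by key; the store cannot look
# inside a value, so the only access path is the exact key.
kv_store = {}
kv_store["user:42"] = b'{"name": "ada", "city": "Boston"}'
blob = kv_store["user:42"]            # retrieval requires knowing the key

# Document model: records are semi-structured documents whose fields the
# store understands and can filter on.
doc_store = [
    {"_id": 1, "name": "ada", "city": "Boston"},
    {"_id": 2, "name": "lin", "city": "Austin"},
]

def find(docs, **criteria):
    """Minimal analogue of a document-store query by field value."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

boston_users = find(doc_store, city="Boston")  # matched by field, no key needed
```

The trade-off follows directly: key-value stores scale simply because they promise so little, while document stores take on indexing and query machinery in exchange for richer access.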

While relational systems honor ACID properties, NoSQL systems were designed primarily for scale and flexibility and honor BASE properties, where data consistency and integrity are not the highest concerns.

ACID properties are honored in a transaction, as follows:

  • Atomicity: The transaction either completes in full or has no effect at all; partial results are never committed.
  • Consistency: Every transaction takes the database from one valid state to another, preserving all integrity constraints.
  • Isolation: In a multi-tenant setup with numerous concurrent operations, transactions are demarcated to avoid collisions; each behaves as if it ran alone.
  • Durability: Once a transaction commits, its changes persist, even across system failures.

ACID properties suit use cases with highly structured data and predictable inputs and outputs, such as the money transfer process in a financial system, where consistency is the main requirement.
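
A minimal sketch of atomicity in such a transfer, using SQLite (the table, account names, and balances are invented for illustration): either both balance updates commit, or the failed transaction leaves the data untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # transaction scope: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # fails: the debit is rolled back
```

The second call debits alice before discovering the shortfall, yet her balance is unchanged afterwards, because the whole transaction was rolled back; that is atomicity in practice.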

BASE properties are honored as follows:

  • Basically Available: The system remains available and responds to requests even in the event of failures, though possibly with stale data.
  • Soft State: The state of the system may change over time, even without new input, because of inconsistencies across nodes.
  • Eventual Consistency: All the nodes eventually reconcile to the last state, but there may be a period of inconsistency.

This applies to less structured scenarios involving changing schemas, such as a Twitter application scanning words to determine user sentiment, where high availability despite failures is the main requirement.
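
A toy simulation of these properties (all class and key names are invented): a write is accepted by one replica and propagated asynchronously, so reads are briefly inconsistent (soft state) until an anti-entropy pass reconciles the nodes (eventual consistency).

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

class Cluster:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.clock = 0
        self.pending = []  # deferred replication: (replica_index, key, entry)

    def write(self, key, value):
        self.clock += 1
        entry = (self.clock, value)
        self.replicas[0].data[key] = entry      # accepted by one node immediately
        for i in range(1, len(self.replicas)):  # others get it later
            self.pending.append((i, key, entry))

    def anti_entropy(self):
        """Deliver queued updates; last writer wins per key."""
        for i, key, entry in self.pending:
            current = self.replicas[i].data.get(key)
            if current is None or entry[0] > current[0]:
                self.replicas[i].data[key] = entry
        self.pending.clear()

cluster = Cluster(3)
cluster.write("sentiment", "positive")
stale = cluster.replicas[1].read("sentiment")  # soft state: not replicated yet
cluster.anti_entropy()
fresh = cluster.replicas[1].read("sentiment")  # eventually consistent
```

The cluster stays available for reads throughout; the price is that `stale` and `fresh` disagree until reconciliation runs, which is acceptable for sentiment counts but not for bank balances.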

OLTP and OLAP systems

It is useful to classify operational data versus analytical data. Data producers typically push data into Operational Data Stores (ODS). Previously, this data was sent to data warehouses for analytics. In recent times, the trend is changing to push the data into a data lake. Different consumers tap into the data at various stages of processing. Some may require a portion of the data from the data lake to be pushed to a data warehouse or a separate serving layer (which can be NoSQL or in-memory).

Online Transaction Processing (OLTP) systems are transaction-oriented, with continuous updates supporting business operations. Online Analytical Processing (OLAP) systems, on the other hand, are designed for decision support, processing ad hoc and complex queries that analyze the transactions and produce insights:

Figure 1.10 – OLTP and OLAP systems
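
The OLTP/OLAP contrast can be sketched on a single data set (the schema and values are invented); in practice the two workloads run on separate, differently optimized systems, but the access patterns differ as shown: OLTP issues small keyed writes, while OLAP scans and aggregates many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total REAL)")

# OLTP pattern: many small inserts and single-row, key-targeted updates
# that keep the operational state current.
conn.executemany("INSERT INTO orders (region, total) VALUES (?, ?)",
                 [("east", 10.0), ("west", 25.0), ("east", 5.0)])
conn.execute("UPDATE orders SET total = 12.0 WHERE id = 1")  # keyed update

# OLAP pattern: an ad hoc scan-and-aggregate query over the whole table
# to produce an insight rather than to serve a single transaction.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(total) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
# rows -> [('east', 2, 17.0), ('west', 1, 25.0)]
```

This is why the two are separated in practice: running heavy scans like the `GROUP BY` above against a busy transactional store contends with the keyed updates it must serve with low latency.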

Data platform service models

Depending on the skill set of the data team, the timelines, the capabilities, and the flexibility being sought, a decision needs to be made regarding the right service model. The following table summarizes the model offerings and the questions you should ask to decide on the best fit:

Figure 1.11 – Service model offerings

The following table further expands on the main value proposition of each service offering, highlighting the key benefits and guidelines on when to adopt them:

Figure 1.12 – How to find the right service model fit
