
Simplifying Data Engineering and Analytics with Delta

By Anindita Mahapatra
About this book
Delta helps you generate reliable insights at scale and simplifies architecture around data pipelines, allowing you to focus primarily on refining the use cases being worked on. This is especially important when you consider that existing architecture is frequently reused for new use cases. In this book, you’ll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You’ll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you’ll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products. By the end of this Delta book, you’ll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
Publication date: July 2022
Publisher: Packt
Pages: 334
ISBN: 9781801814867

 

Chapter 1: Introduction to Data Engineering

"Water, water, everywhere, nor any drop to drink...

Data data everywhere, not a drop of insight!"

With the vast deluge of data around us, it is important to crunch it meaningfully and promptly to extract value from all the noise. This is where data engineering steps in. If collecting data is the first step, drawing useful insights is the next. Data engineering encompasses several personas that come together with their unique individual skill sets and processes to bring this to fruition. Data usually outlives the technology, and it continues to grow. New tools and frameworks come to the forefront to solve a lot of old problems. It is important to understand business requirements, the accompanying tech challenges, and typical shifts in paradigms to solve these age-old problems in a better manner.

By the end of this chapter, you should have an appreciation of the data landscape, the players, and the advances in distributed computing and cloud infrastructure that make it possible to support the high pace of innovation.

In this chapter, we will cover the following topics:

  • The motivation behind data engineering
  • Data personas
  • Big data ecosystem
  • Evolution of data stores
  • Trends in distributed computing
  • Business justification for tech spending
 

The motivation behind data engineering

Data engineering is the process of converting raw data into analytics-ready data that is more accessible, usable, and consumable than its raw format. Modern companies are increasingly becoming data-driven, which means they use data to make business decisions to give them better insights into their customers and business operations. They can use these to improve profitability, reduce costs, and give them a competitive edge in the market. Behind the scenes, a series of tasks and processes are performed by a host of data personas who build reliable pipelines to source, transform, and analyze data so that it is a repeatable and mostly automated process.

Different systems produce different datasets that need to function as individual units but must also be brought together to provide a holistic view of the state of the business – for example, a customer buying merchandise through different channels such as the web, in-app, or in-store. Analyzing activity in all the channels will help predict the next customer purchase and possibly the next channel type as well. In other words, having all the datasets in one place can help answer questions that couldn't be answered by the individual systems. So, data consolidation is an industry trend that breaks down individual silos. However, each of the systems may have been designed differently, with different requirements and service-level agreements (SLAs), and now all of that needs to be normalized and consolidated in a single place to facilitate better analytics.
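As a minimal sketch of why consolidation matters, the following Python snippet merges purchase records from three channels into a single customer view. The record layout, the sample values, and the `consolidate` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical purchase records, one list per source system (channel).
web = [{"customer": "c1", "amount": 30.0}, {"customer": "c2", "amount": 12.5}]
in_app = [{"customer": "c1", "amount": 8.0}]
in_store = [{"customer": "c2", "amount": 40.0}, {"customer": "c3", "amount": 5.0}]

def consolidate(**channels):
    """Merge per-channel records into one view keyed by customer."""
    view = defaultdict(lambda: {"total": 0.0, "channels": set()})
    for channel, records in channels.items():
        for rec in records:
            entry = view[rec["customer"]]
            entry["total"] += rec["amount"]
            entry["channels"].add(channel)
    return dict(view)

customer_view = consolidate(web=web, in_app=in_app, in_store=in_store)

# A question no single system can answer: who shops across multiple channels?
multi_channel = sorted(c for c, v in customer_view.items() if len(v["channels"]) > 1)
print(multi_channel)
```

Each channel on its own only knows about its own purchases; the cross-channel question becomes answerable only once the records sit in one place.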

The following diagram compares the process of farming to that of processing and refining data. In both setups, there are different producers and consumers and a series of refining and packaging steps:

Figure 1.1 – Farming compared to a data pipeline


In this analogy, there is a farmer, and the process consists of growing crops, harvesting them, and making them available in a grocery store. This produce eventually becomes a ready-to-eat meal. Similarly, a data engineer is responsible for creating ready-to-consume data so that each consumer does not have to invest in the same heavy lifting. Each cook taps into different points of the pipeline and makes different recipes based on the specific needs of the use cases that need to be catered for. However, the freshness and quality of the produce are what make for a delightful meal, irrespective of the recipe that's used.

We are at the interesting conjunction of big data, the cloud, and artificial intelligence (AI), all of which are fueling tremendous innovation in every conceivable industry vertical and generating data exponentially. Data engineering is increasingly important as data drives business use cases in every industry vertical. You may argue that data scientists and machine learning practitioners are the unicorns of the industry, and they can work their magic for business. That is certainly a stretch of the imagination. Simple algorithms and a lot of good reliable data produce better insights than complicated algorithms with inadequate data. Some examples of how pivotal data is to the very existence of some of these businesses are listed in the following section.

Use cases

In this section, we've taken a popular use case from a few industry verticals to highlight how data is being used as a driving force for their everyday operations and the scale of data involved:

  • Security Information and Event Management (SIEM) cyber security systems for threat detection and prevention.

This involves user activity monitoring and auditing for suspicious activity patterns and entails collecting a large volume of logs across several devices and systems, analyzing them in real time, correlating data, and reporting on findings via alerts and dashboard refreshes.

  • Genomics and drug development in health and life sciences.

The Human Genome project took almost 15 years to complete. A single human genome requires about 100 gigabytes of storage, and it is estimated that by 2025, 40 exabytes of data will be required to process and store all the sequenced genomes. This data helps researchers understand and develop cures that are more targeted and precise.

  • Autonomous vehicles.

Autonomous vehicles use a lot of unstructured image data that's been generated from cameras on the body of the car to make safe driving decisions. It is estimated that an active vehicle generates about 5 TB every hour. Some of it will be thrown away after a decision has been made, but a part of it will be saved both locally as well as transmitted to a data center for long-term trend monitoring.

  • IoT sensors in Industry 4.0 smart factories in manufacturing.

Smart manufacturing and the Industry 4.0 revolution, which are powered by advances in IoT, are enabling a lot of efficiencies in machine and human utilization on the shop floor. Data is at the forefront of scaling these smart factory initiatives with real-time monitoring, predictive maintenance, early alerting, and digital twin technology to create closed-loop operations.

  • Personalized recommendations in retail.

In an omnichannel experience, personalization helps retailers engage better with their customers, irrespective of the channel they choose to engage with, all while picking up the relevant state from the previous channel they may have used. They can address concerns before the customer churns to a competitor. Personalization at scale can not only deliver a percentage lift in sales but can also reduce marketing and sales costs.

  • Gaming/entertainment.

Games such as Fortnite and Minecraft have captivated children and adults alike who spend several hours in a multi-player online game session. It is estimated that Fortnite generates 100 MB of data per user, per hour. Music and video streaming also rely a lot on recommendations for new playlists. Netflix receives more than a million new ratings every day and uses several parameters to bin users to understand similarities in their preferences.

  • Smart agriculture.

The agriculture market in North America is estimated to be worth 6.2 billion US dollars and uses big data to understand weather patterns for smart irrigation and crop planting, as well as to check soil conditions for the right fertilizer dose. John Deere uses computer vision to detect weeds and can localize the use of sprays to help preserve the quality of both the environment and the produce.

  • Fraud detection in the Fintech sector.

Detecting and preventing fraud is a constant effort as fraudsters find new ways to game the system. Because we are constantly transacting online, a lot of digital footprints are left behind. By some estimates, about 10% of insurance company payments are made due to fraud. AI techniques that combine biometric data with ML algorithms can detect unusual patterns, which leads to better monitoring and risk assessment so that the user can be alerted before a lot of damage is done.

  • Forecasting use cases across a wide variety of verticals.

Every business has some need for forecasting, either to predict sales, stock inventory, or supply chain logistics. It is not as straightforward as projection – other patterns influence this, such as seasonality, weather, and shifts in micro or macro-economic conditions. Data that's been augmented over several years by additional data feeds helps create more realistic and accurate forecasts.

How big is big data?

90% of the data that's generated thus far has been generated in the last 2 years alone. At the time of writing, it is estimated that 2.5 quintillion (18 zeros) bytes of data is produced every day. A typical commercial aircraft generates 20 terabytes of data per engine every hour it's in flight.
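To put these magnitudes in perspective, the following sketch converts the figures quoted above into decimal (SI) units. The `UNITS` table and the `to_unit` helper are illustrative names, not part of any library.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}

def to_unit(num_bytes, unit):
    """Convert a raw byte count into the given decimal unit."""
    return num_bytes / UNITS[unit]

# 2.5 quintillion bytes per day (a quintillion is 10**18).
daily_bytes = 2.5 * 10**18
daily_exabytes = to_unit(daily_bytes, "EB")

# 20 terabytes per engine per flight hour, expressed in gigabytes.
engine_hourly_gb = to_unit(20 * UNITS["TB"], "GB")

print(f"Daily data produced: {daily_exabytes} EB")
print(f"Per-engine hourly data: {engine_hourly_gb} GB")
```

In other words, 2.5 quintillion bytes per day is 2.5 exabytes per day.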

We are just at the beginning stages of autonomous vehicles, which rely on data points to operate. The world's population is about 7.7 billion. The number of connected devices is about 10 billion, with portions of the world not yet connected to the internet. So, this number will only grow as the number of IoT sensors and other connected devices grows. People have an appetite to use apps and services that generate data, including search functionalities, social media, communication, services such as YouTube and Uber, photo and video services such as Snapchat and Facebook, and more. The following statistics give you a better idea of the data that's generated all around us and how we need to swim effectively through all the waves and turbulence it creates to digest the most useful nuggets of information.

Every minute, the following occurs (approximately):

  • 16 million text messages
  • 1 million Tinder swipes
  • 160 million emails
  • 4 million YouTube videos
  • 0.5 million tweets
  • 0.5 million Snapchat shares

With so much data being generated, there is a need for robust data engineering tools and frameworks and reliable data and analytics platforms to harness this data and make sense of it. This is where data engineering comes to the rescue. Data is as important an asset as code is, so there should be governance around it. Structured data only accounts for 5-10% of enterprise data; semi-structured and unstructured data needs to be added to complete this picture.

Data is the new oil and is at the heart of every business. However, raw data by itself is not going to make a dent in a business. It is the useful insights that are generated from curated data that are the refined consumable oil that businesses aspire to. Data drives ML, which, in turn, gives businesses their competitive advantage. This is the age of digitization, where most successful businesses see themselves as tech companies first. Start-ups have the advantage of selecting the latest digital platforms while traditional companies are all undergoing digital transformations. You may still ask: why should I care so much about the underlying data when I have highly qualified ML practitioners, the unicorns of the industry, who can use sophisticated algorithms and their special skill sets to make magic?

In this section, we established the importance of curating data since raw data by itself isn't going to make a dent in a business. In the next section, we will explore the influence that curated data has on the effectiveness of ML initiatives.

But isn't ML and AI all the rage today?

AI and ML are catchy buzzwords, and everybody wants to be on the bandwagon and use ML to differentiate their product. However, the hardest part about ML is not ML – it is managing everything else around ML creation. This was shown by Google in a paper published in 2015 (https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf). The adage "garbage in, garbage out" holds true. The magic wand of ML will only work if the boxes surrounding it are well developed, and most of them are data engineering tasks. In short, high-quality curated data is the foundational layer of any ML application, and the data engineering practices that curate this data are the backbone that holds it all together:

Figure 1.2 – The hardest part about ML is not ML, but rather everything else around it


Technologies come and go, so understanding the core challenges around data is critical. As technologists, we create more impact when we align solutions with business challenges. Speed to insights is what all businesses demand and the key to this is data. The data and IT functional areas within an organization that were traditionally viewed as cost centers are now being viewed as revenue-generating sources. Organizations where business and tech cooperate, instead of competing with each other, are the ones most likely to succeed with their data initiatives. Building data services and products involves several personas. In the next section, we will articulate the varying skill sets of these personas within an organization.

 

Understanding the role of data personas

Since data engineering is such a crucial field, you may be wondering who the main players are and what skill sets they possess. Building a data product involves several folks, all of whom need to come together with seamless handoffs to ensure a successful end product or service is created. It would be a mistake to create silos and increase both the number and complexity of integration points as each additional integration is a potential failure point. Data engineering has a fair overlap with software engineering and data science tasks:

Figure 1.3 – Data engineering requires multidisciplinary skill sets


All these roles require an understanding of data engineering:

  • Data engineers focus on building and maintaining the data pipelines that ingest and transform data. This has a lot in common with a software engineering role, coupled with lots of data.
  • BI analysts focus on SQL-based reporting and can be operational or domain-specific subject-matter experts (SMEs) such as financial or supply chain analysts.
  • Data scientists and ML practitioners are statisticians who explore and analyze the data (via Exploratory Data Analysis (EDA)) and use modeling techniques at various levels of sophistication.
  • DevOps and MLOps focus on the infrastructure aspects of monitoring and automation. MLOps is DevOps coupled with the additional task of managing the life cycle of analytic models.
  • ML engineers refer to folks who can span across both the data engineer and data scientist roles.
  • Data leaders are chief data officers – that is, data stewards who are at the top of the food chain in terms of the ultimate governors of data.

The following diagram shows the typical placement of the four main data personas working collaboratively on a data platform to produce business insights to give the company a competitive advantage in the industry:

Figure 1.4 – Data personas working in collaboration


Let's take a look at a few of these points in more detail:

  1. DevOps is responsible for ensuring all operational aspects of the data platform and traditionally does a lot of scripting and automation.
  2. Data/ML engineers are responsible for building the data pipeline and taking care of the extract, transform, load (ETL) aspects of the pipeline.
  3. Data scientists of varying skill levels build models.
  4. Business analysts create reporting dashboards from aggregated curated data.
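The extract, transform, load (ETL) responsibility mentioned in the list above can be sketched in a few lines of Python. This is a toy illustration using only the standard library; the sample CSV, the field names, and the three function names are assumptions made for the example, and a real pipeline would read from and write to actual data stores.

```python
import csv
import io
import json

# Hypothetical raw input: a small CSV export from a source system.
RAW_CSV = """order_id,amount,currency
1,19.99,usd
2,,usd
3,5.00,eur
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and normalize types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        })
    return cleaned

def load(rows):
    """Load: serialize to JSON lines; a real pipeline would write to a data store."""
    return "\n".join(json.dumps(r) for r in rows)

curated = transform(extract(RAW_CSV))
output = load(curated)
print(output)
```

Keeping the three stages as separate functions makes each one easy to test and rerun, which is what makes a pipeline repeatable.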
 

Big data ecosystem

The big data ecosystem has a fairly large footprint that's contributed by several infrastructures, analytics (BI and AI) technologies, data stores, and apps. Some of these are open source, while others are proprietary. Some are easy to wield, while others have steeper learning curves. Big data management can be daunting as it brings in another layer of challenges over existing data systems. So, it is important to understand what qualifies as a big data system and know what set of tools should be used for the use case at hand.

What characterizes big data?

Big data was initially characterized by three Vs (volume, velocity, and variety). This involves processing a lot of data coming into a system at high velocity with varying data types. Two more Vs were subsequently added (veracity and value). This list continues to grow and now includes variability and visibility. Let's look at the top five and see what each of them means:

  • Volume: This is measured by the size of data, both historical and current:
    • The number of records in a file or table
    • The size of the data in gigabytes, terabytes, and so on
  • Velocity: This refers to the frequency at which new data arrives:
    • The batches have a well-defined interval, such as daily or hourly.
    • Real time is either continuous or micro-batch, typically in seconds.
  • Variety: This refers to the structural nature of the data:
    • Structured data is usually relational and has a well-defined schema.
    • Semi-structured data has a self-describing schema that can evolve, such as the XML and JSON formats.
    • Unstructured data refers to free-text documents, audio, and video data that's usually in binary format.
  • Veracity: This refers to the trustworthiness and reliability of the data:
    • Lineage refers to not just the source but also the subsequent systems where transformations took place to ensure that data fidelity is maintained and can be audited. To guarantee such reliability, data lineage must be maintained.
  • Value: This refers to the business impact that the dataset has – that is, how valuable the data is to the business.
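To make the variety dimension from the list above concrete, the following sketch shows one tiny sample of each kind of data using only the Python standard library. The sample records and field names are invented for illustration.

```python
import csv
import io
import json

# Structured: a fixed schema known up front; every row has the same columns.
structured = io.StringIO("id,name,age\n1,Ada,36\n")
row = next(csv.DictReader(structured))

# Semi-structured: the schema travels with the data and can vary per record.
semi_structured = [
    json.loads('{"id": 1, "name": "Ada"}'),
    json.loads('{"id": 2, "name": "Lin", "tags": ["vip"]}'),  # a new field appears
]
all_fields = sorted({key for rec in semi_structured for key in rec})

# Unstructured: opaque bytes; only external metadata (such as size) is known.
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])  # the first four bytes of a PNG file
metadata = {"size_bytes": len(unstructured)}

print(row, all_fields, metadata)
```

Note how the second JSON record introduces a `tags` field without breaking the first one, which is the kind of schema evolution that structured formats resist.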

Classifying data

Different classification gauges can be used. The common ones are based on the following aspects:

  • As the volume of data increases, we move from regular systems to big data systems. Big data is typically terabytes of data that cannot fit on a single computer node.
  • As the velocity of the data increases, we move toward big data systems specialized in streaming. In batch systems, irrespective of when data arrives, it is processed at a predefined regular interval. In streaming systems, there are two flavors. If it's set to continuous, data is processed as it comes. If it's set to micro-batch, data is aggregated in small batches, typically a few seconds or milliseconds.
  • As the variety of the data increases, that is, as its structure becomes looser, we also move toward the realm of big data systems. In structured data, the schema is well known and stable, so it's assumed to be fairly static and rigid to the definition. With semi-structured data, the schema is built into the data and can evolve. In unstructured data such as images, audio, and video, there is some metadata but no real schema to the binary data that's sent.
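The difference between batch, micro-batch, and continuous processing described in the list above can be sketched as three ways of grouping the same events. This is a simplified, in-memory illustration; the event tuples and function names are hypothetical, and real streaming engines handle arrival time, ordering, and state far more carefully.

```python
from itertools import groupby

# Hypothetical events as (arrival_time_in_seconds, payload) pairs, ordered by time.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

def continuous(stream):
    """Continuous mode: handle every event individually as it arrives."""
    return [[payload] for _, payload in stream]

def micro_batch(stream, window_seconds):
    """Micro-batch mode: group events that arrive within the same small window."""
    keyed = groupby(stream, key=lambda e: int(e[0] // window_seconds))
    return [[payload for _, payload in group] for _, group in keyed]

def batch(stream):
    """Batch mode: process everything accumulated since the last scheduled run."""
    return [[payload for _, payload in stream]]

print(continuous(events))
print(micro_batch(events, window_seconds=1.0))
print(batch(events))
```

The trade-off is latency versus overhead: continuous mode reacts fastest but does the most per-event work, while batch mode amortizes work over many events at the cost of freshness.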

The following diagram shows what trends in data characteristics signal a move toward big data systems. For example, demographic data is fairly structured with predefined fields, operational data moves toward the semi-structured realm as schemas evolve, and the most voluminous is behavioral data as it encompasses user sentiment, which is constantly changing and is best captured by unstructured data such as text, audio, and images:

Figure 1.5 – Classifying data


Now that we have covered the different types of data, let's see how much processing needs to be done before it can be consumed.

Reaping value from data

As data is refined and moves further along the pipeline, there is a tradeoff between the value that's added and the cost of the data. In other words, more time, effort, and resources are used, which is why the cost increases, but the value of the data increases as well:

Figure 1.6 – The layers of data value


The analogy we're using here is that of cutting carbon to create a diamond. The raw data is the carbon, which gets increasingly refined. The more processing layers the data passes through, the more refined and curated it becomes. However, the resulting artifact is also more time-consuming and expensive to produce.

Top challenges of big data systems

People, technology, and processes are the three prongs that every enterprise has to keep up with. Technology changes around us at a pace that is hard to keep up with and gives us better tools and frameworks. Tools are great but until you train people to use them effectively, you cannot create solutions, which is what a business needs. Sound and effective business processes help you pass information quickly and break data silos.

According to Gartner, the three main challenges of big data systems are as follows:

  • Data silos
  • Fragmented tools
  • People with the skill sets to wield them

The following diagram shows these challenges:

Figure 1.7 – Big data challenges


Any imbalance or immaturity in these areas results in poor insights. These challenges around data quality and data staleness lead to inaccurate, delayed, and hence unusable insights.

 

Evolution of data systems

We have been collecting data for decades. The flat file storage of the 60s gave way to the data warehouses of the 80s, then to Massively Parallel Processing (MPP) and NoSQL databases, and eventually to data lakes. New paradigms continue to be coined, but it would be fair to say that most enterprise organizations have settled on some variation of a data lake:

Figure 1.8 – Evolution of big data systems


Cloud adoption continues to grow, with even highly regulated industries such as healthcare and Fintech embracing the cloud for cost-effective alternatives to keep pace with innovation; otherwise, they risk being left behind. People who have used security as the reason for not going to the cloud should be reminded that many of the massive data breaches that have made headlines in recent years involved on-premises setups. Cloud architectures have more scrutiny and are in some ways more governed and secure.

Rise of cloud data platforms

The data challenges remain the same. However, over time, the three major shifts in architecture offerings have been due to the introduction of the following:

  • Data warehouses
  • Hadoop heralding the start of data lakes
  • Cloud data platforms refining the data lake offerings

The use cases that we've been trying to solve for all three generations can be placed into three categories, as follows:

  • SQL-based BI Reporting
  • Exploratory data analytics (EDA)
  • ML

Data warehouses were good at handling modest volumes of structured data and excelled at BI Reporting use cases, but they had limited support for semi-structured data and practically no support for unstructured data. Their workloads could only support batch processing. Once ingested, the data was in a proprietary format, and the systems were expensive, so older data would be dropped to accommodate new data. Also, because they were running at capacity, interactive queries had to wait for ingestion workloads to finish to avoid putting strain on the system. There were no ML capabilities built into these systems.

Hadoop came with the promise of handling large volumes of data and could support all types of data, along with streaming capabilities. In theory, all the use cases were feasible. In practice, they weren't. Schema on read meant that the ingestion path was greatly simplified, and people dumped their data, but the consumption paths were more difficult. Managing the Hadoop cluster was complex, so it was a challenge to upgrade versions of software. Hive was SQL-like and was the most popular of all the Hadoop stack offerings. However, access performance was slow. So, part of the curated data was pushed into data warehouses due to their structure. This meant that data personas were left to stitch two systems that had their fair share of fragility and increased end-to-end latency.
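The schema-on-read trade-off described above can be illustrated with a small sketch: ingestion accepts anything, and the expected structure is only enforced when the data is consumed. The sample lines, the expected schema, and the `read_with_schema` helper are assumptions made for this example.

```python
import json

# Ingestion is trivial: raw lines are stored as-is with no validation.
raw_lines = [
    '{"user": "u1", "amount": "10.5"}',
    '{"user": "u2"}',                    # missing field
    '{"user": "u3", "amount": "oops"}',  # wrong type
]

def read_with_schema(lines):
    """Apply the expected schema (user: str, amount: float) at read time."""
    good, bad = [], []
    for line in lines:
        record = json.loads(line)
        try:
            good.append({"user": str(record["user"]), "amount": float(record["amount"])})
        except (KeyError, ValueError):
            bad.append(record)  # the consumer discovers the problem, not the producer
    return good, bad

good, bad = read_with_schema(raw_lines)
print(f"{len(good)} usable record(s), {len(bad)} rejected at read time")
```

Every consumer has to repeat this validation, which is one reason the consumption path becomes harder when nothing is checked at ingestion.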

Cloud data platforms were the next entrants. They simplified infrastructure manageability and governance and delivered on the original promise of Hadoop. Extra attention was paid to preventing data lakes from turning into data swamps. The elasticity and scalability of the cloud helped contain costs and made it a worthwhile investment. Simplification efforts led to more adoption by data personas.

The following diagram summarizes the end-to-end flow of big data, along with its requirements in terms of volume, variety, and velocity. The process varies on each platform as the underlying technologies are different. The solutions have evolved from warehouses to Hadoop to cloud data platforms to help serve the three main types of use cases across different industry verticals:

Figure 1.9 – The rise of modern cloud data platforms

SQL and NoSQL systems

SQL databases were the forerunners; NoSQL databases arose later and were created with different semantics. There are several categories of NoSQL stores, and they can roughly be classified as follows:

  • Key-Value Stores: For example, AWS S3, and Azure Blob Storage
  • Big Table Stores: For example, DynamoDB, HBase, and Cassandra
  • Document Stores: For example, CouchDB and MongoDB
  • Full Text Stores: For example, Solr and Elastic (both based on Lucene)
  • Graph Data Stores: For example, Neo4j
  • In-memory Data Stores: For example, Redis and MemSQL

While relational systems honor ACID properties, NoSQL systems were designed primarily for scale and flexibility and honored BASE properties, where data consistency and integrity are not the highest concerns.

ACID properties are honored in a transaction, as follows:

  • Atomicity: Either every operation in the transaction succeeds or none of them takes effect.
  • Consistency: Every transaction takes the data from one valid state to another, honoring all defined rules and constraints.
  • Isolation: Concurrent transactions are kept separate so that they do not interfere with each other's intermediate results.
  • Durability: Once a transaction is committed, its changes persist, even after a system failure.

This applies to use cases that contain highly structured data with predictable inputs and outputs, such as a financial system with a money transfer process where consistency is the main requirement.
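
The money transfer scenario can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than a production design; the accounts table, the account names, and the helper functions open_bank, transfer, and balances are hypothetical and chosen only for this example. A CHECK constraint stands in for the consistency rule, and the connection's context manager supplies atomicity.

```python
import sqlite3


def open_bank():
    """Create a hypothetical in-memory ledger (table and names are illustrative)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft, so the credit is undone too.
        return False


def balances(conn):
    """Return the current balance of every account as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = open_bank()
print(transfer(conn, "alice", "bob", 30))   # True
print(transfer(conn, "alice", "bob", 500))  # False, the overdraft is rolled back
print(balances(conn))                       # {'alice': 70, 'bob': 80}
```

The failed transfer credits bob before the debit violates the constraint, yet bob's balance is unchanged afterward, which is the all-or-nothing behavior that atomicity describes.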

BASE properties are honored, as follows:

  • Basically Available: The system is guaranteed to be available in the event of a failure.
  • Soft State: The state could change because of multi-node inconsistencies.
  • Eventual Consistency: All the nodes will eventually reconcile on the last state but there may be a period of inconsistency.

This applies to less structured scenarios involving changing schemas, such as a Twitter application scanning words to determine user sentiment. High availability despite failures is the main requirement.
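
Eventual consistency can be illustrated with a toy simulation. The sketch below assumes a last-write-wins rule based on a version number and a hypothetical anti_entropy step that reconciles replicas; real NoSQL systems use more elaborate mechanisms, so treat the Replica class and its methods as illustrative names only.

```python
class Replica:
    """A node that holds its own copy of key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: keep whichever entry has the higher version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Reconcile replicas so that each one holds the newest version of every key."""
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in merged.items():
            replica.write(key, value, version)


nodes = [Replica("a"), Replica("b"), Replica("c")]
nodes[0].write("sentiment", "positive", version=1)  # accepted by one node only
print([n.read("sentiment") for n in nodes])  # ['positive', None, None]
anti_entropy(nodes)
print([n.read("sentiment") for n in nodes])  # ['positive', 'positive', 'positive']
```

Between the write and the reconciliation, readers on different nodes see different answers (soft state), but every node stays available and all of them converge afterward.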

OLTP and OLAP systems

It is useful to distinguish operational data from analytical data. Data producers typically push data into Operational Data Stores (ODS). Previously, this data was sent to data warehouses for analytics. More recently, the trend has shifted toward pushing the data into a data lake. Different consumers tap into the data at various stages of processing. Some may require a portion of the data from the data lake to be pushed to a data warehouse or a separate serving layer (which can be NoSQL or in-memory).

Online Transaction Processing (OLTP) systems are transaction-oriented, with continuous updates supporting business operations. Online Analytical Processing (OLAP) systems, on the other hand, are designed for decision support systems that are processing several ad hoc and complex queries to analyze the transactions and produce insights:

Figure 1.10 – OLTP and OLAP systems
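
The difference in access patterns can be sketched with a small, hypothetical orders table in SQLite. The schema, the sample rows, and the function names are assumptions made for illustration; in practice, OLTP and OLAP workloads usually run on separate systems.

```python
import sqlite3


def build_orders():
    """Create a hypothetical orders table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, region TEXT, amount REAL, status TEXT)"
    )
    rows = [
        (1, "east", 20.0, "open"),
        (2, "west", 35.0, "open"),
        (3, "east", 15.0, "open"),
        (4, "west", 10.0, "open"),
    ]
    with conn:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    return conn


def ship_order(conn, order_id):
    """OLTP style: a short transaction that touches a single row by key."""
    with conn:
        conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))


def revenue_by_region(conn):
    """OLAP style: a read-only query that scans and aggregates many rows."""
    query = (
        "SELECT region, SUM(amount), COUNT(*) "
        "FROM orders GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()


conn = build_orders()
ship_order(conn, 2)
print(revenue_by_region(conn))  # [('east', 35.0, 2), ('west', 45.0, 2)]
```

The update is small, frequent, and keyed on one order, whereas the aggregate reads the whole table to answer a business question.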

Data platform service models

Depending on the skill set of the data team, the timelines, the capabilities, and the flexibilities being sought, a decision needs to be made regarding the right service model. The following table summarizes the model offerings and the questions you should ask to decide on the best fit:

Figure 1.11 – Service model offerings

The following table further expands on the main value proposition of each service offering, highlighting the key benefits and guidelines on when to adopt them:

Figure 1.12 – How to find the right service model fit

 

Distributed computing

Scalability refers to a system's ability to adapt to an increase in load without degrading performance. There are two ways to scale a system – vertically and horizontally. Vertical scaling refers to using a bigger instance type with more compute horsepower, while horizontal scaling refers to using more of the same node type to distribute the load.

In general terms, a process is an instance of a program that's being executed. It consists of several activities and each activity is a series of tasks. In the big data space, there is a lot of data to crunch, so there's a need to improve computing speeds by increasing the level of parallelization. There are several multiprocessor architectures, and it is important to understand the nuances to pick linearly scalable architectures that can not only accommodate present volumes but also future increases.

SMP and MPP computing

Both symmetric multi-processing (SMP) and massively parallel processing (MPP) are multiprocessor systems.

As data volumes grow, SMP architectures tend to give way to MPP ones. MPP is designed so that several processing units handle multiple operations simultaneously. Each processing unit works independently with its own resources, including its operating system and dedicated memory. Let's take a closer look:

  • SMP: All the processing units share the same resources (operating system, memory, and disk storage) and are connected by a system bus. These shared resources become the bottleneck that prevents the architecture from scaling linearly:
Figure 1.13 – SMP

  • MPP: Each processor has its own set of resources and is fully independent and isolated from other processors. Examples of popular MPP databases include Teradata, Greenplum, Vertica, and Amazon Redshift:
Figure 1.14 – MPP

In the next section, we'll explore Hadoop and Spark, which are newer entrants to the space, and the map/reduce and Resilient Distributed Datasets (RDDs) concepts, which mimic the parallelism constructs of MPP databases.

Parallel and distributed computing

Advances in distributed computing have pushed the envelope on compute speeds and made processing at this scale possible. It is important to note that parallel processing can be viewed as a tightly coupled form of distributed processing. Let's take a closer look:

  • Parallel Processing:

In parallel processing, all the processors have access to a single shared memory (https://en.wikipedia.org/wiki/Shared_memory_architecture) instead of having to exchange information by passing messages between the processors:

Figure 1.15 – Parallel processing

  • Distributed Processing:

In distributed processing, the processors have access to their own memory pool:

Figure 1.16 – Distributed computing
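
The two communication models can be contrasted in a short sketch. Both functions below use Python threads purely for convenience, so this illustrates how workers exchange results, not actual speedup or a real multi-machine setup; the function names are hypothetical. In the first, workers update one shared variable; in the second, each worker keeps private state and sends a message to a coordinator.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Parallel style: all workers update a single shared total under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # the shared variable must be protected from races
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Distributed style: workers keep private state and send results as messages."""
    inbox = queue.Queue()

    def worker(chunk):
        local_total = sum(chunk)  # visible only to this worker
        inbox.put(local_total)    # the only way results leave the worker

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The coordinator combines exactly one message per worker.
    return sum(inbox.get() for _ in chunks)


chunks = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(chunks))    # 21
print(message_passing_sum(chunks))  # 21
```

The shared-memory version needs a lock because every worker touches the same state, while the message-passing version needs none because nothing is shared.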

The two most popular distributed architectures are Hadoop and Spark. Let's look at them in more detail.

Hadoop

Hadoop is an Apache open source project that started as a Yahoo! project in 2006. It promises to provide an inexpensive, reliable, and scalable framework. Several distributions, such as Cloudera, Hortonworks, MapR, and EMR, have offered packaging variations. It is compatible with many types of hardware where it runs as an appliance. It works with scalable distributed filesystems such as S3, HFTP FS, and HDFS, with data replicated multiple times on commodity-grade hardware, and has a service-oriented architecture with many open source components.

It has a master-slave architecture that follows the map/reduce model. The three main components of the Hadoop framework are HDFS for storage, YARN for resource management, and MapReduce as the application layer. HDFS data is broken into blocks, replicated a certain number of times, and distributed to worker nodes, where the blocks are processed in parallel. A job consists of a series of map and reduce tasks. The NameNode keeps track of the filesystem metadata, including which blocks are stored on which nodes. As the resource manager, YARN allocates resources in a multi-tenant environment. In the original Hadoop 1 design, JobTracker and TaskTracker monitor the progress of a job; with YARN, these duties move to the ResourceManager, the NodeManagers, and a per-application ApplicationMaster. All the results from the MapReduce stage are then aggregated and written back to disk in HDFS:

Figure 1.17 – Hadoop map/reduce architecture
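
The map/reduce model can be sketched in plain Python with the classic word count. This is a single-process illustration under simplifying assumptions, not Hadoop's API: each string stands in for an HDFS block, and the functions map_phase, shuffle, and reduce_phase are hypothetical names for the three logical steps.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_blocks):
    """Shuffle: group all emitted values by key across every block."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(blocks):
    # On a cluster, each block would be mapped on a different worker node.
    mapped = [map_phase(block) for block in blocks]
    return reduce_phase(shuffle(mapped))


print(word_count(["big data big", "data lake"]))
# {'big': 2, 'data': 2, 'lake': 1}
```

Because each block is mapped independently and each key is reduced independently, both phases can be spread across many nodes.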

Spark

Spark is an Apache open source project that started in 2009 at AMPLab (https://amplab.cs.berkeley.edu/) at UC Berkeley and was donated to the Apache Software Foundation in 2013. It is written in Scala and provides support for the Scala, Java, Python, R, and SQL languages. It has connectors for a wide range of data sources and sinks. In Spark lingo, a job is broken into several stages, and each stage is broken into several tasks that are run by executors on their cores. Data is broken into partitions that are processed in parallel on worker node cores. So, being able to partition effectively and having sufficient cores is what enables Spark to be horizontally scalable:

Figure 1.18 – Spark distributed computing architecture
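
The relationship between partitions, tasks, and cores can be sketched without Spark itself. The code below is a plain-Python analogy, not Spark's API: partition and run_stage are hypothetical helpers, and a thread pool stands in for executor cores, with one task created per partition.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split the data round-robin into `num_partitions` partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_stage(partitions, task, cores):
    """Run one task per partition, with at most `cores` tasks at a time."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(task, partitions))


parts = partition(list(range(10)), num_partitions=4)
print(parts)                  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
partial_sums = run_stage(parts, sum, cores=2)
print(partial_sums)           # [12, 15, 8, 10]
print(sum(partial_sums))      # 45
```

With four partitions and two cores, the stage needs two waves of tasks; adding cores or rebalancing partitions changes how much of the work runs at once, which is the intuition behind horizontal scaling.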

Spark is a favorite tool in the world of big data, not only for its speed but also its multifaceted capabilities. This makes it favorable for a wide variety of data personas working on a wide range of use cases. It is no wonder that it is regarded as a Swiss Army knife for data processing:

Figure 1.19 – Spark is a Swiss Army knife in the world of data

Hadoop versus Spark

Spark can be up to ~100x faster than Hadoop MapReduce for in-memory workloads. This is on account of more disk operations in Hadoop, where each map and reduce operation in a job chain goes to disk. Spark, on the other hand, processes and retains data in memory for subsequent steps in a Directed Acyclic Graph (DAG). Spark processes data in RAM using a concept known as a Resilient Distributed Dataset (RDD), which is immutable. Every transformation therefore adds a node to the DAG, and the DAG is evaluated lazily, only when an explicit action is encountered. Although Spark is a standalone technology, it was also packaged with the Hadoop ecosystem to provide an alternative to MapReduce. Hadoop is losing favor and is on the decline, whereas Spark continues to be an industry favorite.
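
Lazy evaluation over an immutable dataset can be sketched in a few lines of plain Python. The LazyDataset class below is a hypothetical stand-in for an RDD, not Spark's implementation: its lineage is a simple linear chain rather than a general DAG, and collect plays the role of an action.

```python
class LazyDataset:
    """Immutable dataset that records transformations and runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)
        self._lineage = tuple(lineage)

    def map(self, fn):
        # A transformation returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # An action: replay the recorded lineage over the source data.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


base = LazyDataset([1, 2, 3, 4])
result = base.map(double).filter(lambda x: x > 4)
print(calls)             # [] because no action has been invoked yet
print(result.collect())  # [6, 8]
print(calls)             # [1, 2, 3, 4]
print(base.collect())    # [1, 2, 3, 4], the original dataset is unchanged
```

Because each dataset keeps its lineage instead of its results, a lost result can be recomputed from the source, which is the idea behind the "resilient" in RDD.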

 

Business justification for tech spending

Tech enthusiasts with their love for bleeding-edge tools sometimes forget why they are building a data product. Research and exploration are important for innovation, but it needs to be disciplined and controlled. Not keeping the business counterparts in the loop results in miscommunication and misunderstandings regarding where the effort is going. Ego battles hinder project progress and result in wasted money, time, and people resources, which hurts the business. Tech should always add value and growth to a business rather than being viewed as a cost allocation. So, it is important to demonstrate the value of tech investment.

A joint business-technology strategy helps clarify the role of technology in driving business value to provide a transformation agenda. Key performance indicators (KPIs) and metrics including growth, return on investment (ROI), profitability, market share, earnings per share, margins, and revenue help quantify this investment.

The execution time of these projects is usually significant, so it is important to achieve the end goal in an agile manner in well-articulated baby steps. Some of the benefits may not be immediately realized, so it is important to balance infrastructure gains with productivity and capability gains and consider capital expenditure on initial infrastructure investment (CAPEX) versus ongoing operating expenses (OPEX) over a certain period. In addition, it is always good to do frequent risk assessments and have backup plans. Despite the best projections, costs can escalate to uncomfortable and unpredictable heights, so it is important to invest in a platform with tunable costs so that it can easily be monitored and adjusted when needed. Data is an asset and must be governed and protected from inappropriate access or breaches. Not only are such threats expensive, but they also damage the reputation of the organization:

Figure 1.20 – Mapping the impact of technology on business outcomes

Strategy for business transformation to use data as an asset

Data-driven organizations exhibit a culture of analytics. This culture cannot be confined to just a few premier groups; it must extend to the entire organization. There are both cultural and technical challenges to overcome, and this is where people, processes, and tools need to come together to bring about sustainable change. Every business needs a strategy for business transformation. Here are some best practices for managing a big data initiative:

  • Understand the objectives and goals to come up with an overall enterprise strategy.
  • Assess the current state and document all the use cases and data sources.
  • Develop a roadmap that can be shared for collaborating and deciding on the main tools and frameworks to leverage organization-wide.
  • Design for data democratization so that people can easily reach the data they are authorized to access.
  • Establish rules around data governance so that workflows can be automated correctly without fear of data exfiltration.
  • Manage change as a continuous cycle of improvement. This means that there should be a center of excellence team that serves as the hub of a hub-and-spoke model and interfaces with the individual lines of business. Adequate emphasis should be placed on training to engage and educate the team.

Big data trends and best practices

"The old order changeth, yielding place to new…"

(The Passing of Arthur, Alfred Lord Tennyson, 1809–1892)

We are living in an age of fast innovation, with technology changes happening in the blink of an eye. We can learn from history and from the mistakes of those before us. We don't have the luxury of analyzing everything around us, but understanding the top trends will give us a better appreciation of the landscape and help us gravitate toward the right technology for our needs.

The top trends are as follows:

  • Increased adoption of cloud infrastructure:
      • It provides affordable and scalable storage.
      • It offers an elastic distributed compute infrastructure with pay-as-you-go flexibility.
      • Organizations pursue a multi-cloud strategy, with some on-premises presence, to hedge risks.
  • Increased data consolidation in data lakes to break down individual data silos:
      • Other data stores such as data warehouses continue to live on, while newer ones such as lakehouses and data meshes are being introduced.
      • Unstructured data usage is on the rise.
  • Improved speed to insights:
      • Big data and ML are converging.
      • Pattern signals are detected and responded to in real time as opposed to batch.
      • Analytics has moved from simple BI reporting to ML and AI as industries move from descriptive analytics to predictive and finally prescriptive.
  • Improved governance and security:
      • Data discovery using business and operational enterprise-level metastores.
      • Data governance to control who has access to what data.
      • Data lineage and data quality to determine how reliable the data is.

Let's summarize some of the best practices for building robust and reliable data platforms:

  • Build decoupled (storage and compute) systems because storage is far cheaper than compute. So, the ability to turn off compute when it's not in use will be a big cost saving. Having a microservices architecture will help manage such changes.
  • Leverage cloud storage, preferably in an open format.
  • Use the right tool for the right job.
  • Break down the data silos and create a single view of the data so that multiple use cases can leverage the same data with different tools.
  • Design data solutions with due consideration to use case-specific trade-offs such as latency, throughput, and access patterns.
  • Use log-based design patterns in which you maintain immutable logs for audit, compliance, and traceability requirements.
  • Expose multiple views of the data for consumers with different access privileges instead of copying the datasets multiple times to make slight changes to the data access requirements.
  • There will always be a point where a team has to decide whether to build or buy. Speed to insights should guide this decision, irrespective of how smart the team is. If there is a window of opportunity, you should not lose it in the pursuit of tech pleasures. The cost of building a solution to cater to an immediate need should be compared with the cost of a missed opportunity.
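
The immutable log practice from the list above can be sketched as an append-only structure. The AuditLog class is hypothetical, and the hash chaining it uses to make tampering detectable is an assumption added for illustration; the text only calls for immutable logs and does not prescribe a mechanism.

```python
import hashlib
import json


class AuditLog:
    """Append-only log in which each entry's hash covers the previous entry."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        """Add an event; existing entries are never updated or deleted."""
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self._entries.append({"event": payload, "prev": prev_hash, "hash": digest})

    def events(self):
        """Return a read-only view of the recorded events, oldest first."""
        return tuple(json.loads(entry["event"]) for entry in self._entries)

    def verify(self):
        """Recompute the hash chain; any edited entry breaks it."""
        prev_hash = ""
        for entry in self._entries:
            expected = hashlib.sha256((prev_hash + entry["event"]).encode()).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append({"user": "analyst", "action": "read", "table": "sales"})
log.append({"user": "engineer", "action": "write", "table": "sales"})
print(len(log.events()))  # 2
print(log.verify())       # True
```

Corrections are recorded as new events rather than edits to old ones, so the full history remains available for audit and traceability.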
 

Summary

In this chapter, we covered the role of data engineering in building data products to solve a host of use cases across diverse industry verticals. A whole village of data personas comes together to create a sound, robust data application that provides valuable insights for the business. We briefly looked at the big data landscape and the metamorphosis it has gone through over the years to arrive at present-day modern cloud data platforms. We talked about the various distributed architectures that we can use to crunch data at scale. We emphasized that only by pulling business and tech together can we create a symbiotic and data-driven culture that spurs innovation to put a company ahead of its competitors.

In the next chapter, we will explore data modeling and data formats so that your storage and retrieval operations are optimized for the use case at hand.

About the Author
  • Anindita Mahapatra

    Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space, helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of their extension school program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata, prior to which she managed the development of algorithmic app discovery and promotion for both Nokia and Microsoft AppStores. She holds a Master's degree in Liberal Arts and Management from Harvard Extension School, a Master's in Computer Science from Boston University, and a Bachelor's in Computer Science from BITS Pilani, India.
