Essential PySpark for Scalable Data Analytics

Product type Book
Published in Oct 2021
Publisher Packt
ISBN-13 9781800568877
Pages 322 pages
Edition 1st Edition
Author Sreeram Nudurupati

Table of Contents (19 chapters)

Preface
Section 1: Data Engineering
Chapter 1: Distributed Computing Primer
Chapter 2: Data Ingestion
Chapter 3: Data Cleansing and Integration
Chapter 4: Real-Time Data Analytics
Section 2: Data Science
Chapter 5: Scalable Machine Learning with PySpark
Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
Chapter 7: Supervised Machine Learning
Chapter 8: Unsupervised Machine Learning
Chapter 9: Machine Learning Life Cycle Management
Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Section 3: Data Analysis
Chapter 11: Data Visualization with PySpark
Chapter 12: Spark SQL Primer
Chapter 13: Integrating External Tools with Spark SQL
Chapter 14: The Data Lakehouse
Other Books You May Enjoy

Change Data Capture

Operational systems generally do not retain historical data for extended periods of time. It is therefore essential to maintain an exact replica of the transactional system's data, along with its history, in the data lake. This has a few advantages, including a historical audit log of all your transactional data. The accumulated history can also help you uncover new data patterns and business use cases.

Maintaining an exact replica of a transactional system in the data lake means capturing every change to every transaction in the source system and replicating those changes in the data lake. This process is generally called change data capture (CDC). CDC requires you not only to capture all new transactions and append them to the data lake but also to capture any updates or deletes to existing transactions in the source system. This is not an ordinary feat to achieve on data...
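
To make the insert, update, and delete semantics described above concrete, here is a minimal sketch in plain Python rather than PySpark. It is an illustration only: the `ChangeEvent` record shape, the `apply_changes` helper, and the sample data are all hypothetical and not taken from the book, and a real data lake pipeline would apply the same logic to tables rather than in-memory dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change from the source system (hypothetical shape)."""
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected transaction
    row: Optional[dict]   # new column values; None for a delete
    ts: int               # change timestamp, used for ordering


def apply_changes(replica: Dict[int, dict],
                  history: List[ChangeEvent],
                  events: List[ChangeEvent]) -> None:
    """Apply events in timestamp order and keep every event in the history log."""
    for ev in sorted(events, key=lambda e: e.ts):
        history.append(ev)                 # the audit log is append-only
        if ev.op == "delete":
            replica.pop(ev.key, None)      # remove the row from the replica
        elif ev.op in ("insert", "update"):
            replica[ev.key] = dict(ev.row) # add or overwrite the row
        else:
            raise ValueError(f"unknown operation: {ev.op}")


replica: Dict[int, dict] = {}
history: List[ChangeEvent] = []
apply_changes(replica, history, [
    ChangeEvent("insert", 1, {"amount": 100}, ts=1),
    ChangeEvent("insert", 2, {"amount": 250}, ts=2),
    ChangeEvent("update", 1, {"amount": 120}, ts=3),
    ChangeEvent("delete", 2, None, ts=4),
])
print(replica)       # {1: {'amount': 120}}
print(len(history))  # 4
```

The replica ends up matching the current state of the source, while the history list retains all four changes, including the deleted transaction, which is what allows it to serve as an audit log.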
