Simplifying Data Engineering and Analytics with Delta

Product type: Book
Published: Jul 2022
Publisher: Packt
ISBN-13: 9781801814867
Pages: 334
Edition: 1st Edition
Author: Anindita Mahapatra

Table of Contents

Preface
Section 1 – Introduction to Delta Lake and Data Engineering Principles
  • Chapter 1: Introduction to Data Engineering
  • Chapter 2: Data Modeling and ETL
  • Chapter 3: Delta – The Foundation Block for Big Data
Section 2 – End-to-End Process of Building Delta Pipelines
  • Chapter 4: Unifying Batch and Streaming with Delta
  • Chapter 5: Data Consolidation in Delta Lake
  • Chapter 6: Solving Common Data Pattern Scenarios with Delta
  • Chapter 7: Delta for Data Warehouse Use Cases
  • Chapter 8: Handling Atypical Data Scenarios with Delta
  • Chapter 9: Delta for Reproducible Machine Learning Pipelines
  • Chapter 10: Delta for Data Products and Services
Section 3 – Operationalizing and Productionalizing Delta Pipelines
  • Chapter 11: Operationalizing Data and ML Pipelines
  • Chapter 12: Optimizing Cost and Performance with Delta
  • Chapter 13: Managing Your Data Journey
Other Books You May Enjoy

How to choose the right data format

Not every tool supports every data format. Every tool reads data off disk in blocks (KB, MB, or GB in size), so minimizing the number of block fetches speeds up access to data. Conversely, a read for a single record brings back a whole block, which is far more data than you may want, so caching that block may help with subsequent queries. Different systems have different default block sizes. To choose the right data format, you need to consider several factors, such as the following:

  • What is the optimal tradeoff between cost, performance, and throughput, given your ingestion and access patterns?
  • Are you constrained by storage, memory, CPU, or I/O?
  • How large is each file? If your data is not splittable, you lose the parallelism that allows fast queries.
  • How many columns are being stored, and how many columns are used for the analysis?
  • Does your data change over time? If it does, how often does it happen, and how does it change?...
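
To make the questions about block reads and column usage concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the book: the table, the column widths, the 1 MiB block size, and the helper names `blocks_fetched` and `bytes_scanned` are all hypothetical, and the model deliberately ignores compression, encoding, and file metadata. It only compares how much data a query touches when it needs two of five columns under a row-oriented layout versus a column-oriented one.

```python
def blocks_fetched(bytes_needed: int, block_size: int) -> int:
    """Whole blocks a reader must fetch to cover bytes_needed (ceiling division)."""
    return (bytes_needed + block_size - 1) // block_size


def bytes_scanned(num_rows: int, col_widths: dict, cols_used: list, layout: str) -> int:
    """Rough estimate of bytes touched by a query (simplified, uncompressed model)."""
    if layout == "row":
        # Row-oriented: each record is stored contiguously, so every column is read.
        return num_rows * sum(col_widths.values())
    if layout == "column":
        # Column-oriented: only the columns the query references are read.
        return num_rows * sum(col_widths[c] for c in cols_used)
    raise ValueError(f"unknown layout: {layout!r}")


# Hypothetical table: average bytes per value for each column (assumed numbers).
col_widths = {"id": 8, "ts": 8, "country": 2, "amount": 8, "notes": 230}
num_rows = 1_000_000
block_size = 1024 * 1024  # assumed 1 MiB block; real systems use different defaults
cols_used = ["country", "amount"]  # the analysis needs 2 of the 5 stored columns

row_bytes = bytes_scanned(num_rows, col_widths, cols_used, "row")
col_bytes = bytes_scanned(num_rows, col_widths, cols_used, "column")
row_blocks = blocks_fetched(row_bytes, block_size)
col_blocks = blocks_fetched(col_bytes, block_size)

print(f"row layout:    {row_bytes:>12,} bytes in {row_blocks} blocks")
print(f"column layout: {col_bytes:>12,} bytes in {col_blocks} blocks")
```

Under these assumed numbers, the fewer columns the analysis uses relative to the columns stored, the more a columnar layout reduces block fetches. A workload that reads whole records would see no such benefit.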