You're reading from Data Engineering with Scala and Spark

Product type: Book

Published: Jan 2024

Publisher: Packt

ISBN-13: 9781804612583

Pages: 300

Edition: 1st Edition

Concepts: Data Engineering

Authors (3):

Eric Tome

Rupam Bhattacharjee

David Radford

Table of Contents (21) Chapters

Preface

1. Part 1 – Introduction to Data Engineering, Scala, and an Environment Setup

2. Chapter 1: Scala Essentials for Data Engineers

3. Chapter 2: Environment Setup

4. Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark

5. Chapter 3: An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

6. Chapter 4: Working with Databases

7. Chapter 5: Object Stores and Data Lakes

8. Chapter 6: Understanding Data Transformation

9. Chapter 7: Data Profiling and Data Quality

10. Part 3 – Software Engineering Best Practices for Data Engineering in Scala

11. Chapter 8: Test-Driven Development, Code Health, and Maintainability

12. Chapter 9: CI/CD with GitHub

13. Part 4 – Productionalizing Data Engineering Pipelines – Orchestration and Tuning

14. Chapter 10: Data Pipeline Orchestration

15. Chapter 11: Performance Tuning

16. Part 5 – End-to-End Data Pipelines

17. Chapter 12: Building Batch Pipelines Using Spark and Scala

18. Chapter 13: Building Streaming Pipelines Using Spark and Scala

19. Index

20. Other Books You May Enjoy

Understanding data skewing, indexing, and partitioning

As with any data processing system, even the best hardware will produce only mediocre results when the data is poorly laid out. There is no magic bullet for a poor data layout: the fastest disks, processors, and network will not remove the need for well-thought-out indexing and partitioning strategies. Data skew can also sneak into processing pipelines or queries and slow them to a crawl. All three aspects (skew, indexing, and partitioning) need to be planned for and monitored so that processing and query performance does not degrade. We’ll learn more about them in the following sections.

Data skew

Data skew is a common problem in distributed data systems such as Apache Spark. It shows up when some partitions are significantly larger than others, so the tasks for the small partitions finish quickly and then sit idle while the stage waits for the tasks working on the large partitions to complete. This can result in under-utilized compute, long processing times, and out-of-memory errors. Joins...
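To make the idea of uneven partition sizes concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not code from the book: `SkewSketch`, `partitionOf`, `partitionSizes`, and `skewRatio` are hypothetical names. The hash-modulo assignment is a simplified stand-in for hash partitioning, and the sample keys are made up for illustration.

```scala
object SkewSketch {
  // Simplified stand-in for hash partitioning (an assumption for this sketch):
  // non-negative hash of the key, modulo the number of partitions.
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Count how many records land in each partition.
  def partitionSizes(keys: Seq[String], numPartitions: Int): Vector[Int] = {
    val counts = Array.fill(numPartitions)(0)
    keys.foreach(k => counts(partitionOf(k, numPartitions)) += 1)
    counts.toVector
  }

  // Largest partition divided by the mean partition size.
  // 1.0 means perfectly even; larger values mean more skew.
  def skewRatio(sizes: Seq[Int]): Double = {
    val mean = sizes.sum.toDouble / sizes.size
    if (mean == 0.0) 1.0 else sizes.max / mean
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: one "hot" key accounts for 900 of 1,000 records.
    val keys  = Seq.fill(900)("US") ++ (1 to 100).map(i => s"C$i")
    val sizes = partitionSizes(keys, numPartitions = 4)
    println(s"partition sizes: $sizes")
    println(f"skew ratio: ${skewRatio(sizes)}%.2f")
  }
}
```

Because every record with the key `US` hashes to the same partition, that partition holds at least 900 of the 1,000 records however the remaining keys are distributed. The task processing it does most of the work while the other three finish early. In a real Spark job, a similar check could be made by comparing per-partition record counts or per-task durations for a stage.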
