Part 2 – Data Ingestion, Transformation, Cleansing, and Profiling Using Scala and Spark
In this part, Chapter 3 introduces Apache Spark as a scalable data processing framework, covering its basics, Scala application development, and the Dataset/DataFrame APIs. Chapter 4 explores the role of relational databases in data pipelines, highlighting Spark’s JDBC API. Chapter 5 discusses the rise of data lakes and lakehouses, while Chapter 6 delves into advanced data transformation with Spark. Chapter 7 focuses on data quality, using the Deequ library for checks and metrics.
This part has the following chapters: