In this code-intensive chapter, we present key data munging techniques used to transform raw data into a format usable for analysis. We start with some general data munging steps that apply in a wide variety of scenarios. Then we shift our focus to specific types of data, including time-series and text data, and to the data preprocessing steps for Spark MLlib-based machine learning pipelines. We will use several Datasets to illustrate these techniques.
In this chapter, we will cover the following topics:
- What is data munging?
- Explore data munging techniques
- Combine data using joins
- Munging on textual data
- Munging on time-series data
- Dealing with variable length records
- Data preparation for machine learning pipelines