Grouping data in Spark and different Spark joins
We will start with one of the most important data manipulation techniques: grouping and joining data. During data exploration, grouping data by different criteria is essential to analysis. We will look at how to group data using groupBy.
Using groupBy in a DataFrame
We can group data in a DataFrame based on different criteria; for example, we can group by one or more columns. We can then apply aggregations, such as sum or average, to the grouped data to get a holistic view of each data slice.
For this purpose, Spark provides the groupBy operation. It is similar to GROUP BY in SQL in that we can perform group-wise operations on the grouped data. Moreover, we can specify multiple grouping columns in a single groupBy statement. The following example shows how to use groupBy in PySpark. We will use the DataFrame salary data we created...
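Because the salary DataFrame from the earlier section is not reproduced here, the sketch below builds a small, hypothetical salary DataFrame (the column names department, gender, and salary are assumptions) and then groups it first by a single column and then by multiple columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("groupby-example").getOrCreate()

# Hypothetical salary data standing in for the DataFrame created earlier.
salary_df = spark.createDataFrame(
    [
        ("Sales", "F", 55000),
        ("Sales", "M", 52000),
        ("IT", "F", 70000),
        ("IT", "M", 68000),
        ("IT", "F", 72000),
    ],
    ["department", "gender", "salary"],
)

# Group by a single column and apply aggregations to each group.
salary_df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.sum("salary").alias("total_salary"),
).show()

# Multiple grouping criteria in a single groupBy statement.
salary_df.groupBy("department", "gender").count().show()
```

The agg method lets us compute several aggregations per group in one pass, while convenience methods such as count, sum, or avg can be called directly on the grouped data when only a single aggregation is needed.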