How joins work in Spark
In this recipe, you will learn how the Spark optimizer executes query joins using different join algorithms, such as SortMerge and BroadcastHash joins. You will learn how to identify which algorithm has been used by looking at the DAG that Spark generates. You will also learn how to use query hints to influence the optimizer to use a specific join algorithm.
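Before looking at Spark itself, it may help to see the two algorithms in miniature. The following is a simplified, single-machine sketch in plain Python, not Spark's actual implementation: the sample rows and the function names broadcast_hash_join and sort_merge_join are hypothetical and exist only for illustration. It shows that a broadcast hash join builds a hash table from the small side and streams the large side past it, while a sort-merge join sorts both sides on the join key and walks them in step.

```python
from collections import defaultdict

# Hypothetical sample rows: (customer_id, name) and (order_id, customer_id)
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(100, 2), (101, 1), (102, 2), (103, 4)]


def broadcast_hash_join(small, large):
    """Build a hash table on the small side, then stream the large side past it."""
    table = defaultdict(list)
    for key, name in small:
        table[key].append(name)
    return [(key, name, order_id)
            for order_id, key in large
            for name in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both sides on the join key, then advance two cursors in step."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[1])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][1]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of rows on each side that share this key
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][1] == lkey:
                j_end += 1
            # Emit every pairing of the two runs
            for lrow in left[i:i_end]:
                for rrow in right[j:j_end]:
                    out.append((lkey, lrow[1], rrow[0]))
            i, j = i_end, j_end
    return out


bhj = broadcast_hash_join(customers, orders)
smj = sort_merge_join(customers, orders)
print(sorted(bhj) == sorted(smj))
```

Both functions return the same inner-join rows and differ only in how they find the matches. In Spark, the same trade-off applies across a cluster: the hash-based approach avoids sorting and shuffling the large side but needs the small side to fit in memory on every executor, which is why join hints such as BROADCAST and MERGE (available in Spark 3.0 and later) can change the physical plan.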
Getting ready
To follow along with this recipe, run the cells in the 3-5.Joins notebook, which you can find in your local cloned repository, in the Chapter03 folder (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).
Create two folders called Customer and Orders in the rawdata filesystem of your ADLS Gen-2 account. Then upload the csvFiles folders, which can be found in the Common/Customer and Common/Orders folders in your local cloned repository, to the corresponding folders in the rawdata filesystem:
Figure...