Performing aggregations with Apache Spark
In this recipe, we will discuss how to perform aggregations on DataFrames in Apache Spark. Using Python as our primary programming language and the PySpark API, we will go over various techniques for aggregating your data.
How to do it...
- Import the libraries: Import the required libraries and create a SparkSession object:

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, max, count, min, approx_count_distinct
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

      spark = (SparkSession.builder
          .appName("perform-aggregations")
          .master("spark://spark-master:7077")
          .config("spark.executor.memory", "512m")
          .getOrCreate())
      spark.sparkContext.setLogLevel("ERROR")
- Read the file: Read the netflix_titles.csv file using the spark.read API, supplying an explicit schema so the column types are known up front, as sketched after this step.
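A minimal sketch of this step, reusing the SparkSession and imports from step 1. The field names mirror the public Netflix titles dataset, and the file path and dateFormat pattern are assumptions for illustration:

    # Schema for the Netflix titles dataset (field names are assumptions
    # based on the public dataset, not confirmed by this recipe).
    schema = StructType([
        StructField("show_id", StringType(), True),
        StructField("type", StringType(), True),
        StructField("title", StringType(), True),
        StructField("director", StringType(), True),
        StructField("cast", StringType(), True),
        StructField("country", StringType(), True),
        StructField("date_added", DateType(), True),
        StructField("release_year", IntegerType(), True),
        StructField("rating", StringType(), True),
        StructField("duration", StringType(), True),
        StructField("listed_in", StringType(), True),
        StructField("description", StringType(), True),
    ])

    # Read the CSV with a header row; dateFormat tells Spark how to parse
    # values such as "September 9, 2021" into DateType.
    df = (spark.read
        .option("header", "true")
        .option("dateFormat", "MMMM d, yyyy")
        .schema(schema)
        .csv("netflix_titles.csv"))
    df.printSchema()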
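With the data loaded, the aggregation functions imported in step 1 can be applied. The following is a minimal sketch rather than the recipe's original code: it groups titles by type and computes a count, the earliest and latest release years, and an approximate distinct count of countries (column names are assumptions from the public Netflix titles dataset):

    # Group by content type (e.g., Movie / TV Show) and aggregate.
    agg_df = (df.groupBy("type")
        .agg(
            count("show_id").alias("title_count"),
            min("release_year").alias("earliest_release"),
            max("release_year").alias("latest_release"),
            approx_count_distinct("country").alias("approx_countries"),
        ))
    agg_df.show()

Note that approx_count_distinct trades exactness for speed by using a HyperLogLog-based estimate, which makes it much cheaper than an exact distinct count on large datasets.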