A DataFrame can be created in several ways:
- By executing SQL queries
- By loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
- By converting RDDs to DataFrames (see the sketch after this list)
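As a brief illustration of the first and last of these, the following sketch assumes a spark-shell session (where `spark` and its implicits are already available); the column names and the view name `states` are illustrative, not prescribed:

scala> import spark.implicits._
scala> val rdd = spark.sparkContext.parallelize(Seq(("Alabama", 2010, 4785492), ("Alaska", 2010, 714031)))
scala> val df = rdd.toDF("State", "Year", "Population") // convert the RDD of tuples to a DataFrame with named columns
scala> df.createOrReplaceTempView("states")             // register the DataFrame so SQL queries can reference it
scala> val sqlDF = spark.sql("SELECT State, Population FROM states WHERE Year = 2010") // a SQL query also yields a DataFrame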
A DataFrame can also be created by loading a CSV file. We will look at a CSV file, statesPopulation.csv, and load it as a DataFrame.
The CSV contains US state populations for the years 2010 to 2016, in the following format:
| State      | Year | Population |
|------------|------|------------|
| Alabama    | 2010 | 4785492    |
| Alaska     | 2010 | 714031     |
| Arizona    | 2010 | 6408312    |
| Arkansas   | 2010 | 2921995    |
| California | 2010 | 37332685   |
Since this CSV has a header row, we can load it into a DataFrame quickly, letting Spark infer the schema automatically.
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
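Once loaded, it is worth checking what schema inference actually produced. The two calls below are standard DataFrame methods; the first prints the column names with their inferred types, and the second displays a sample of rows:

scala> statesDF.printSchema() // shows each column and the type Spark inferred for it
scala> statesDF.show(5)       // displays the first five rows of the DataFrame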