Using Spark with Parquet files


Apache Parquet is a columnar storage format used in many big data applications in the Hadoop ecosystem. Parquet supports very efficient compression and encoding schemes that can significantly boost the performance of such applications. In this section, we show how simply you can read Parquet files directly into a standard Spark SQL DataFrame.
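For instance, the Parquet compression codec can be chosen when writing a DataFrame, either per write or globally via a Spark configuration setting. The following is a minimal sketch, assuming a DataFrame named df and an output path of your choosing (both are hypothetical and not part of the book's example):

scala> df.write.option("compression", "snappy").parquet("file:///path/to/output")
scala> spark.conf.set("spark.sql.parquet.compression.codec", "gzip")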

Here, we take the reviewsDF created previously from the Amazon reviews contained in a JSON-formatted file and write it out in the Parquet format to create the Parquet file. We use coalesce(1) to produce a single output file:

scala> reviewsDF.filter("overall < 3").coalesce(1).write.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet")
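By default, this write fails if the output directory already exists. If you need to rerun the step, you can specify a save mode; the overwrite mode shown in this sketch is our own addition, not part of the book's example:

scala> reviewsDF.filter("overall < 3").coalesce(1).write.mode("overwrite").parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet")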

In the next step, we create a DataFrame from the Parquet file using just one statement:

scala> val reviewsParquetDF = spark.read.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet/part-00000-3b512935-ec11-48fa-8720-e52a6a29416b...
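Note that the part file name above is truncated. In practice, you can also point spark.read.parquet at the output directory itself and Spark will read all the part files it contains. A minimal sketch of reading the directory and querying the result (the temporary view name reviews_parquet is our own choice):

scala> val reviewsParquetDF = spark.read.parquet("file:///Users/aurobindosarkar/Downloads/amazon_reviews/parquet")
scala> reviewsParquetDF.createOrReplaceTempView("reviews_parquet")
scala> spark.sql("SELECT overall, count(*) FROM reviews_parquet GROUP BY overall").show()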