Writing large datasets
In this recipe, you will explore how the choice of file format can impact overall write and read performance. You will explore Parquet, Optimized Row Columnar (ORC), and Feather, and compare their performance to other popular file formats such as JSON and CSV.
All three formats (ORC, Feather, and Parquet) are columnar file formats, which makes them efficient for analytical workloads, where queries typically read a subset of columns. They are also supported by Apache Arrow (PyArrow), which offers a standardized in-memory columnar data format for optimized data analysis performance. To persist this in-memory columnar data to disk, you can use the pandas to_orc, to_feather, and to_parquet methods.
Arrow provides the in-memory representation of the data as a columnar format, while Feather, ORC, and Parquet allow us to store this representation on disk.
Getting ready
In this recipe, we will be working with the...