Reading and writing pandas dataframes to parquet

Parquet is a columnar data storage format that is part of the Hadoop ecosystem.

If you are in the habit of saving large CSV files to disk as part of your data processing workflow, it can be worth switching to parquet for these kinds of tasks. It will result in smaller files that are quicker to load.

With large datasets or expensive computations it’s convenient to dump the resulting dataframes to parquet so that you can load them again quickly later. For example, if after initially loading your data you pass it through a series of (sometimes time-consuming) steps to clean and transform it, it can be useful to dump the resulting dataframe to parquet so that you can load it easily the next time you want to use it, or share it with someone else.
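As a rough sketch of that workflow (the file names and the clean_and_transform function are just placeholders for your own data and processing steps):

import pandas as pd

# Load the raw data and run the slow cleaning steps once.
raw = pd.read_csv('raw_data.csv')           # placeholder input file
clean = clean_and_transform(raw)            # stand-in for your own cleaning/transform steps

# Cache the result as parquet for later sessions.
clean.to_parquet('clean.parquet.gzip', compression='gzip')

# In a later session, skip the expensive steps and load the cached result.
clean = pd.read_parquet('clean.parquet.gzip')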

To save a dataframe to parquet

df.to_parquet('df.parquet.gzip', compression='gzip')
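A note on the compression argument: if you leave it out, pandas defaults to snappy compression, which is faster to write but usually produces larger files than gzip.

df.to_parquet('df.parquet')                             # snappy compression by default
df.to_parquet('df.parquet.gzip', compression='gzip')    # smaller files, slower to write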

To load a dataframe from parquet

If you have a dataframe saved in parquet format, you can do

pd.read_parquet('df.parquet.gzip')
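read_parquet returns a new dataframe, so assign the result to a variable. Because parquet is columnar, you can also read just the columns you need, which can be much faster for wide tables (the column names below are hypothetical):

import pandas as pd

df = pd.read_parquet('df.parquet.gzip')

# Only read the columns you actually need (hypothetical column names).
subset = pd.read_parquet('df.parquet.gzip', columns=['user_id', 'timestamp'])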

Parquet engines

There are a couple of parquet libraries that pandas can use under the hood. The default is pyarrow (with engine='auto', pandas tries pyarrow first and falls back to fastparquet). However, if any of your dataframe columns contain complex objects such as dicts, you may want to switch to fastparquet.

pip install fastparquet

df.to_parquet('df.parquet.gzip', compression='gzip', engine='fastparquet')
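If you want to be explicit about which library reads the file back, read_parquet accepts the same engine argument:

pd.read_parquet('df.parquet.gzip', engine='fastparquet')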