Best practise to update a parquet dataset? #7012

yohplala · 2020-12-26T21:26:15Z

yohplala
Dec 26, 2020

Hi,

I have seen several times this question raised, but without a clean, neat answer.
I am thus witing it here, hoping for some answer (even one saying that dask may not be the appropriate tool for that).

Basically, I intend to store data in a parquet dataset, and append/update data on a daily basis.
Because there can possibly be duplicate, I have in mind to use the merge function.

Let's write a 1st dataset on our filesystem.

from dask import dataframe as dd
import pandas as pd
a=list('aabbccddeef')
b=list(range(len(a)))
df = pd.DataFrame(dict(a=a, b=b),
                  index = pd.date_range(start='20200101',end='20200106', periods=len(a)))
df.index.name = 'idx'
ddf = dd.from_pandas(df, npartitions=5) # bug in dask, getting 4 partitions
path = '~/Documents/code/draft/data/'
file = path + 'example.parquet'
ddf.to_parquet(file)

4 parquet files have normally been written to the disk, with 2 metadata files, _common_metadata and _metadata.

Let's now create a new DataFrame to be added to this existing parquet dataset.

df2 = pd.DataFrame(dict(a=['z'], b=[4]),
                   index = [pd.Timestamp('2020/01/02 02:00:00')])
df2.index.name = 'idx'

Now, in my craziest dreams, the magic command I am looking for, looking like ddf = ddf.merge(df2, on=['idx', 'a', 'b' ]), would:

check df2 index, and from the min and max, load only in ddf the required partitions.
proceed to the merge
when calling ddf.to_parquet(), overwrite only modified parquet files (here, it basically is the 2nd file part.1.parquet).

Instead, running

ddf = ddf.merge(df2, how='outer', on=['idx', 'a', 'b'])
ddf = ddf.repartition(npartitions=4) # to reset to 4 partitions as in the initial parquet dataset before merging
ddf.to_parquet(file)

I can see that the 4 files I initially obtained are fully rewritten.

Please, what is recommnded way of doing such an appending of data?
Is dask suited for that, or should fastparquet be considered instead?

Thanks for your help.
Have a good day,
Bests,

martindurant · 2020-12-29T14:25:34Z

martindurant
Dec 29, 2020

Ref dask/fastparquet#542
cc @rjzamora

Short answer: the "append" route is relatively uncommon for dask-parquet, or, I believe, parquet in general. Adding new data files to a dataset may be OK (and a reason for the likes of spark moving away from a global _metadata file), but updating existing ones - I don't know if any framework has code around this. I left comments on the fastparquet issue on how I would go about implementing such behaviour.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best practise to update a parquet dataset? #7012

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Best practise to update a parquet dataset? #7012

yohplala Dec 26, 2020

Replies: 1 comment

martindurant Dec 29, 2020

yohplala
Dec 26, 2020

martindurant
Dec 29, 2020