Replies: 1 comment
Ref dask/fastparquet#542. Short answer: the "append" route is relatively uncommon for dask-parquet or, I believe, for parquet in general. Adding new data files to a dataset may be OK (and a reason why the likes of Spark are moving away from a global `_metadata` file), but updating existing ones - I don't know of any framework that has code around this. I left comments on the fastparquet issue on how I would go about implementing such behaviour.
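For what it's worth, the first route (adding new files without touching existing ones) is something dask's `to_parquet` can already do via its `append=True` flag. A minimal sketch, with made-up path and data:

```python
import pandas as pd
import dask.dataframe as dd

# One new batch of rows; its index range lies beyond what is already in
# the dataset, so the existing divisions are not violated.
new_day = dd.from_pandas(
    pd.DataFrame(
        {"a": [100, 101], "b": ["q", "r"]},
        index=pd.Index([8, 9], name="idx"),
    ),
    npartitions=1,
)

# append=True adds new part files (and updates _metadata) instead of
# rewriting the files already there. Deduplication against data already
# on disk still has to happen before this call.
new_day.to_parquet("dataset.parquet", engine="fastparquet", append=True)
```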
-
Hi,
I have seen this question raised several times, but without a clean, neat answer.
I am thus writing it here, hoping for some answer (even one saying that dask may not be the appropriate tool for that).
Basically, I intend to store data in a parquet dataset, and append/update data on a daily basis.
Because there can possibly be duplicates, I have in mind to use the merge function.
Let's write a 1st dataset on our filesystem.
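(The original snippet is not shown here, so below is a minimal sketch of what this setup could look like; the column names `idx`, `a` and `b` are taken from the merge call further down, while the values and path are made up.)

```python
import pandas as pd
import dask.dataframe as dd

# Small example frame: index 'idx' plus two columns 'a' and 'b'
# (names from the merge call below; values are made up).
df = pd.DataFrame(
    {"a": range(8), "b": list("xyzwxyzw")},
    index=pd.Index(range(8), name="idx"),
)
ddf = dd.from_pandas(df, npartitions=4)

# Writes part.0.parquet ... part.3.parquet, plus the _common_metadata
# and _metadata files when a global metadata file is requested.
ddf.to_parquet("dataset.parquet", engine="fastparquet", write_metadata_file=True)
```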
4 parquet files have normally been written to the disk, along with 2 metadata files, `_common_metadata` and `_metadata`.
Let's now create a new DataFrame to be added to this existing parquet dataset.
Now, in my craziest dreams, the magic command I am looking for, looking like `ddf = ddf.merge(df2, on=['idx', 'a', 'b'])`, would, when running `ddf.to_parquet()`, overwrite only the modified parquet files (here, it basically is the 2nd file, `part.1.parquet`).
Instead, when actually running this (see the sketch below), I can see that the 4 files I initially obtained are fully rewritten.
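For concreteness, a sketch of the update step just described, continuing from the made-up dataset above (`df2`'s values are again assumptions); writing the merged result produces a full set of new part files rather than updating only `part.1.parquet`:

```python
import pandas as pd
import dask.dataframe as dd

# Read the dataset back with 'idx' as a plain column so it can be used
# as a merge key.
ddf = dd.read_parquet("dataset.parquet", engine="fastparquet", index=False)

# New daily batch: one row duplicating data already on disk (idx=3)
# and one genuinely new row (idx=8).
df2 = pd.DataFrame({"idx": [3, 8], "a": [3, 8], "b": ["w", "q"]})

# Outer merge on all shared columns collapses exact duplicates.
ddf = ddf.merge(df2, how="outer", on=["idx", "a", "b"])

# Written to a fresh path here to avoid clobbering files still being
# read; either way, every part file is written anew - nothing is
# updated in place.
ddf.to_parquet("dataset_updated.parquet", engine="fastparquet", write_metadata_file=True)
```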
Please, what is the recommended way of doing such an appending of data?
Is dask suited for that, or should fastparquet be considered instead?
Thanks for your help.
Have a good day,
Bests,