
[Question] pyiceberg 0.6.0 #350

Open
gui-elastic opened this issue Feb 28, 2024 · 8 comments

Comments


gui-elastic commented Feb 28, 2024

Hello,

Recently, pyiceberg 0.6.0 was released, which allows writing Iceberg tables without tools like Spark or Trino.

I was about to write a custom plugin to implement the write feature. However, I see that with an external materialization and a custom plugin, the output data is first stored locally and then read back and ingested into the final destination. For Iceberg and Delta that does not seem like a good solution. Instead of storing the data on disk, it would be better to simply load an Arrow table and write it to the final destination (e.g., S3 in Iceberg format).
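For illustration, a minimal sketch of that flow with pyiceberg 0.6.0 (the catalog URI and table identifier below are placeholders, and an existing Iceberg table is assumed):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# In-memory result set (e.g. handed over by DuckDB as Arrow) -- no local file involved.
arrow_table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Catalog name, URI, and table identifier are placeholders.
catalog = load_catalog("default", uri="http://localhost:8181")
iceberg_table = catalog.load_table("demo.my_table")

# pyiceberg 0.6.0 can append a pyarrow Table directly, writing data files
# to the table's configured location (e.g. S3).
iceberg_table.append(arrow_table)
```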

I saw this thread: #332 (comment), so I would like to ask whether there is any ETA for implementing this feature. It would be amazing to use, even for production workloads with a data lakehouse architecture.

This comment explains well what needs to change to use the Iceberg writer in the best way possible: #284 (comment)

@milicevica23 (Contributor)

Hi @gui-elastic,
What you explained above is roughly what I already described in #284 (comment), and it is the reason I started implementing the refactoring. The problem is that we have to make sure we don't break the current process, and there are also some open questions I still have to figure out.

That said, in the next few days I will try to fill the gaps in the refactoring; I am also waiting for feedback on the general code flow. The Iceberg implementation afterwards should be straightforward, because you will get the Arrow table / record batch passed directly into the plugin's store function.
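To make that concrete, here is a purely hypothetical sketch (not the current dbt-duckdb plugin API; the class and method signatures are assumptions for illustration) of what an Iceberg plugin's store hook could look like if it received the Arrow data directly:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog


class IcebergWritePlugin:
    """Hypothetical plugin that appends a model's result to an Iceberg table."""

    def __init__(self, catalog_name: str = "default", **catalog_props):
        # Catalog name and properties are placeholders.
        self.catalog = load_catalog(catalog_name, **catalog_props)

    def store(self, table_identifier: str, arrow_table: pa.Table) -> None:
        # The adapter would hand over the Arrow table directly (no temp file
        # on disk); the plugin appends it to the target Iceberg table.
        self.catalog.load_table(table_identifier).append(arrow_table)
```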

Happy to hear your feedback

@gui-elastic (Author)

Hey @milicevica23

I simply love the idea. I already think dbt-duckdb is an amazing project, but with this improvement it will be on another level, usable for data lakehouse architectures and for interacting with Delta and Iceberg tables (reading and writing).

When this refactoring is merged, please let me know; I will be glad to test it and, if needed, to help with writing custom plugins.

@milicevica23 (Contributor)

Yes, I think so too, and I believe this improvement will enable a bunch of new use cases, since everything that speaks Arrow can be integrated.
Maybe you would be interested in our blog post, where we describe how we imagine the possible “futuristic” direction:

https://georgheiler.com/2023/12/11/dagster-dbt-duckdb-as-new-local-mds/

I would encourage you to subscribe to the refactoring pull request and look into the code; I am happy to chat about and discuss it. You can find me on the dbt Slack.

@gui-elastic (Author)

Thank you so much!

Just to confirm, the refactoring PR is #332, correct?

I will take a look at the blog post right now. Thx!


MRocholl commented Jun 4, 2024

Hey @milicevica23!
Any update on this? I see that the PR has been stale/draft for some time. Is there any way to advance and push this forward?
Thanks for the amazing work :)

@milicevica23 (Contributor)

Hi @MRocholl, I am not working on this feature right now because I am swamped privately. When I was doing the refactoring, I didn't have time to go through all the options and breaking changes introduced by the pull request and to guarantee that all existing use cases would keep working as expected.
I am not sure how to proceed here. I could try to find time to take another look, but we have to find a proper way to ensure that no breaking changes are introduced, and that is the real challenge.
The open points and current status should already be documented in the PR, so there is not much more to add from that side.
I am happy to hear suggestions and discuss how to move forward.
cc @jwills

jwills (Collaborator) commented Jun 5, 2024

Yeah, I think the ideal here is always to rely on DuckDB plus its extensions to do this reading/writing itself as much as possible, rather than having dbt-duckdb do it (and in the process turn into its own sort of data-catalog-type thing, which is really not what I was going for when I started down this path, but here we are). This pattern seems to work well for e.g. Postgres and MySQL via the ATTACH functionality, and I'm hopeful that we will have the same support in place over time for external systems like Iceberg and Delta.
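As a sketch of that ATTACH pattern (connection details are placeholders and assume the DuckDB postgres extension plus a reachable Postgres instance):

```python
import duckdb

con = duckdb.connect()

# Attach an external Postgres database; DuckDB itself handles reads and writes.
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg (TYPE POSTGRES)")

# Writing goes through the attached catalog rather than through dbt-duckdb.
con.execute("CREATE TABLE pg.public.my_table AS SELECT 1 AS id, 'a' AS value")
```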

Just like @milicevica23, I'm super busy with the actual job I am paid to do (which unfortunately doesn't involve all that much DuckDB.)


MRocholl commented Jun 6, 2024

Thank you both for the fast reply. As @jwills said, I believe a lot can already be done with the extensions DuckDB ships itself, by adding a post-hook with COPY statements or by using the ATTACH functionality.
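As one sketch of the COPY idea (the bucket path is a placeholder; exporting to S3 assumes the httpfs extension and configured credentials, and the same statement could sit in a dbt post-hook):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE my_model AS SELECT 1 AS id, 'a' AS value")

# Export the model's result right after it is built; in dbt-duckdb this COPY
# could be run as a post-hook. Path and credential setup are placeholders.
con.execute("""
    COPY (SELECT * FROM my_model)
    TO 's3://my-bucket/my_model/data.parquet' (FORMAT PARQUET)
""")
```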
I might take a stab at an Iceberg plugin via pyiceberg in a PR, but I will have to see if I can make time.
The easiest path would be to wait for the DuckDB team to eventually ship extensions for all of these options.
Thanks anyway, @milicevica23 and @jwills, for the work you have already put into this.
