
[Question] pyiceberg 0.6.0 #350

Open
gui-elastic opened this issue Feb 28, 2024 · 8 comments

Comments


gui-elastic commented Feb 28, 2024

Hello,

Recently, pyiceberg 0.6.0 was released, which allows writing Iceberg tables without tools like Spark or Trino.

I was about to write a custom plugin to implement the write feature. However, I see that with an external materialization and a custom plugin, the output data is first stored locally and then read back and ingested into the final destination. For Iceberg and Delta that does not seem like a good solution. Instead of storing the data on disk, it would be better to simply load an Arrow table and write it to the final destination (e.g., S3 in Iceberg format).
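For illustration, a minimal sketch of that flow with pyiceberg 0.6.0 (the catalog URI and table identifier below are placeholders, and an existing Iceberg table is assumed):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# In-memory result set (e.g. handed over by DuckDB as Arrow) -- no local file involved.
arrow_table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Catalog name, URI, and table identifier are placeholders.
catalog = load_catalog("default", uri="http://localhost:8181")
iceberg_table = catalog.load_table("demo.my_table")

# pyiceberg 0.6.0 can append a pyarrow Table directly, writing data files
# to the table's configured location (e.g. S3).
iceberg_table.append(arrow_table)
```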

I saw this thread: #332 (comment), so I would like to ask whether there is any ETA for implementing this feature. It would be amazing to use, even for production workloads with a data lakehouse architecture.

This comment explains well what needs to change to use the Iceberg writer in the best way possible: #284 (comment)

@milicevica23 (Contributor)

Hi @gui-elastic,
What you explained above is roughly what I already described in #284 (comment), and it is the reason I started implementing the refactoring. The problem is that we have to make sure we don't break the current process, and there are also some open questions I still have to figure out.

That said, in the next few days I will try to fill the gaps in the refactoring; I am also waiting for feedback on the general code flow. The Iceberg implementation afterwards should be straightforward, because you will get the Arrow table / record batch passed directly into the plugin's store function.
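To make that concrete, here is a purely hypothetical sketch (not the current dbt-duckdb plugin API; the class and method signatures are assumptions for illustration) of what an Iceberg plugin's store hook could look like if it received the Arrow data directly:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog


class IcebergWritePlugin:
    """Hypothetical plugin that appends a model's result to an Iceberg table."""

    def __init__(self, catalog_name: str = "default", **catalog_props):
        # Catalog name and properties are placeholders.
        self.catalog = load_catalog(catalog_name, **catalog_props)

    def store(self, table_identifier: str, arrow_table: pa.Table) -> None:
        # The adapter would hand over the Arrow table directly (no temp file
        # on disk); the plugin appends it to the target Iceberg table.
        self.catalog.load_table(table_identifier).append(arrow_table)
```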

Happy to hear your feedback

@gui-elastic (Author)

Hey @milicevica23

I simply love the idea. I already think dbt-duckdb is an amazing project, but with this improvement it will be on another level, usable for data lakehouse architectures and for interacting with Delta and Iceberg tables (reading and writing).

When this refactoring is merged, please let me know; I will be glad to test it and, if needed, to help with writing custom plugins.

@milicevica23 (Contributor)

Yes, I think so too, and I believe this improvement will enable a bunch of new use cases, since everything that speaks Arrow can be integrated.
Maybe you would be interested in our blog post, where we describe how we imagine the possible “futuristic” direction:

https://georgheiler.com/2023/12/11/dagster-dbt-duckdb-as-new-local-mds/

I would encourage you to subscribe to the refactoring pull request and look into the code; I am happy to chat about and discuss it. You can find me on the dbt Slack.

@gui-elastic (Author)

Thank you so much!

Just to confirm, the refactoring PR is #332, correct?

I will take a look at the blog post right now. Thx!


MRocholl commented Jun 4, 2024

Hey @milicevica23!
Any update on this? I see that the PR has been stale/draft for some time. Is there any way to advance and push this forward?
Thanks for the amazing work :)

@milicevica23 (Contributor)

Hi @MRocholl, I am not working on this feature right now because I am swamped privately. When I was doing the refactoring, I didn't have time to go through all the options and breaking changes introduced by the pull request and to guarantee that all existing use cases would keep working as expected.
I am not sure how to proceed here. I could try to find time to take another look, but we have to find a proper way to ensure that no breaking changes are introduced, and that is the real challenge.
The open points and current status should already be documented in the PR, so there is not much more to add from that side.
I am happy to hear suggestions and discuss how to move forward.
cc @jwills

jwills (Collaborator) commented Jun 5, 2024

Yeah, I think the ideal here is always to rely on DuckDB plus its extensions to do this reading/writing itself as much as possible, rather than having dbt-duckdb do it (and in the process turn into its own sort of data-catalog-type thing, which is really not what I was going for when I started down this path, but here we are). This pattern seems to work well for e.g. Postgres and MySQL via the ATTACH functionality, and I'm hopeful that we will have the same support in place over time for external systems like Iceberg and Delta.
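As a sketch of that ATTACH pattern (connection details are placeholders and assume the DuckDB postgres extension plus a reachable Postgres instance):

```python
import duckdb

con = duckdb.connect()

# Attach an external Postgres database; DuckDB itself handles reads and writes.
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg (TYPE POSTGRES)")

# Writing goes through the attached catalog rather than through dbt-duckdb.
con.execute("CREATE TABLE pg.public.my_table AS SELECT 1 AS id, 'a' AS value")
```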

Just like @milicevica23, I'm super busy with the actual job I am paid to do (which unfortunately doesn't involve all that much DuckDB.)


MRocholl commented Jun 6, 2024

Thank you both for the fast reply. As @jwills said, I believe a lot can already be done with the extensions DuckDB ships itself, by adding a post-hook with COPY statements or by using the ATTACH functionality.
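As one sketch of the COPY idea (the bucket path is a placeholder; exporting to S3 assumes the httpfs extension and configured credentials, and the same statement could sit in a dbt post-hook):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE my_model AS SELECT 1 AS id, 'a' AS value")

# Export the model's result right after it is built; in dbt-duckdb this COPY
# could be run as a post-hook. Path and credential setup are placeholders.
con.execute("""
    COPY (SELECT * FROM my_model)
    TO 's3://my-bucket/my_model/data.parquet' (FORMAT PARQUET)
""")
```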
I might take a stab at an Iceberg plugin via pyiceberg in a PR, but I will have to see if I can make time.
The easiest path would be to wait for the DuckDB team to eventually ship extensions for all of these options.
Thanks anyway, @milicevica23 and @jwills, for the work you have already put into this.
