Delta support #71

Open
jbadeau opened this issue Aug 12, 2021 · 3 comments

Comments

jbadeau commented Aug 12, 2021

Hi,

Is there any further progress on Delta support? If not, would using the raw DataFrameWriter/DataFrameReader be an option? That would make the pipeline non-portable, but it would solve my immediate problems using Delta with Beam.

Any pointers or tips?

Cheers

iemejia (Owner) commented Aug 12, 2021

Hi, I approached Delta support by matching the Parquet files returned by the Delta standalone reader. However, I hit some dependency conflicts that I did not have time to tackle. An alternative would be to write a thin client for this. What matters for Beam are the links to the Parquet files; once you have those, you can read them via ParquetIO.readFiles.

If you are up for working on the task, this is my WIP branch. I can help with the review, but I am too booked to work on this myself at the moment.
https://github.com/iemejia/beam/tree/deltaio

jbadeau (Author) commented Aug 13, 2021

Hi,

Thanks for the reply. I will try to get this going next week. Beam is pretty new to me (but I like it a lot), so I am going to need to dig into the API a bit.

What exactly do you mean by a "thin client"?

iemejia (Owner) commented Aug 18, 2021

Oversimplifying, reading Delta means reading an index of which Parquet files contain the data, plus the associated schema(s). The Delta standalone reader library that I used in the PR has some limitations, but the biggest issue from the Beam point of view is that it leaks its JSON library (Jackson) into the classpath, which usually causes problems when running with Beam. So one alternative would be to write a micro client just for reads that does not leak classes.
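
To make the "index of Parquet files" idea concrete, here is a minimal sketch of such a read-only micro client. It is written in Python with only the standard library to keep it short; it is not the code from the WIP branch. It assumes the table lives on a local filesystem and that the log follows the Delta transaction log layout: a `_delta_log` directory of zero-padded, newline-delimited JSON commit files whose actions include `add`, `remove`, and `metaData`. It ignores checkpoint files, so it only works while every JSON commit is still present, and it does not URL-decode paths or handle partition values. The function name `active_parquet_files` is hypothetical.

```python
import json
import os


def active_parquet_files(table_path):
    """Replay the JSON commits of a Delta table.

    Returns (sorted list of live Parquet file paths, latest schema string).
    Sketch only: checkpoints, path decoding and partitions are not handled.
    """
    log_dir = os.path.join(table_path, "_delta_log")
    # Commit names are zero-padded versions, so lexical order is version order.
    commits = sorted(n for n in os.listdir(log_dir) if n.endswith(".json"))

    active = set()
    schema_string = None
    for name in commits:
        with open(os.path.join(log_dir, name), encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                action = json.loads(line)  # one action object per line
                if "add" in action:
                    active.add(action["add"]["path"])
                elif "remove" in action:
                    active.discard(action["remove"]["path"])
                elif "metaData" in action:
                    # Later commits may replace the schema; keep the newest.
                    schema_string = action["metaData"].get("schemaString")

    files = sorted(os.path.join(table_path, p) for p in active)
    return files, schema_string
```

The returned paths are the "links to the Parquet files" mentioned earlier in the thread; in a Beam pipeline they would be the input handed to ParquetIO.readFiles, with the schema string converted to whatever schema type the reader expects.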
