Delta support #71
Comments
Hi, I approached Delta support by matching the Parquet files reported by the Delta standalone reader. However, I hit some dependency conflicts that I did not have the time to tackle. An alternative would be to just write a thin client for this. What matters for Beam are the links to the Parquet files, which you can then read via ParquetIO.readFiles. If you are up for working on the task, this is my WIP branch (I can help with the review, but I am too booked to work on this at the moment).
Hi, Thanks for the reply. I will try to get this going next week. Beam is pretty new to me (but I like it a lot), so I am going to need to dig into the API a bit. What exactly do you mean by "thin-client"?
Oversimplifying, reading Delta means reading an index of which Parquet files contain the data, plus the associated schema(s). The Delta standalone reader library that I used in the PR has some limitations, but the biggest issue (from the Beam PoV) is that it leaks its JSON reading library (Jackson) into the classpath, which usually causes problems when running it with Beam. So one alternative would be to write a micro client just for reads that does not leak classes.
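To make the "index of Parquet files" idea concrete, here is a rough sketch of what such a read-only micro client could look like. It is not the code from the WIP branch; it is written in Python with only the standard library, purely for illustration. It assumes the table keeps its commits as newline-delimited JSON files under `_delta_log/`, where each line is an action such as `add` or `remove` carrying a `path` relative to the table root. It ignores checkpoint files, schema and partition metadata, and any URL-encoding of paths, so it only fits small, simple tables. The function name `active_parquet_files` is made up for this example.

```python
import json
from pathlib import Path


def active_parquet_files(table_root):
    """Replay the JSON commits under _delta_log and return the live data files.

    Hypothetical helper: handles only `add` and `remove` actions and
    skips checkpoints, so it is a sketch rather than a full Delta reader.
    """
    log_dir = Path(table_root) / "_delta_log"
    live = set()
    # Commit file names are zero-padded version numbers,
    # so sorting by name replays them in version order.
    for commit in sorted(log_dir.glob("*.json")):
        with commit.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    # Paths in the log are relative to the table root.
    return sorted(str(Path(table_root) / p) for p in live)
```

The resulting list of paths is the part Beam cares about: it could be turned into a collection of file matches and handed to ParquetIO.readFiles, with no JSON library leaking onto the pipeline classpath.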
Hi,
Is there any further progress on Delta support? If not, would using the raw DataFrameWriter/Reader be an option? This would make it non-portable, but it would solve my immediate issues with using Delta with Beam.
Any pointers, tips?
Cheers