Extend Databricks Operators to DBFS Interaction #39262

Open
2 tasks done
nivangio opened this issue Apr 25, 2024 · 6 comments

Comments

@nivangio
Contributor

nivangio commented Apr 25, 2024

Description

Create operators and a hook to interact with Databricks' DBFS (https://docs.databricks.com/api/workspace/dbfs)

Use case/motivation

As of the latest Databricks plugin (https://github.com/apache/airflow/tree/main/airflow/providers/databricks), there is no way to interact with the DBFS API.

As I had to do it in my job (and my implementation is fairly far along), I thought it'd be a good idea to share it with the community.

So far, I've got:

  • An operator that uploads files to DBFS
  • A hook that interacts with the DBFS API, respecting Databricks' Hooks logic and inheriting from BaseDatabricksHook

As part of the PR, I'd add:

  • Some more operators (getting files, getting files metadata, deleting files)
  • Tests in line with Airflow's test suite
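For illustration only, here is a minimal sketch of the request bodies such a hook could build for upload, metadata lookup, and deletion. The endpoint paths and field names follow the public DBFS API reference; the function names are hypothetical and are not taken from the proposed PR. The HTTP verb is deliberately left out, since that would be decided by the hook.

```python
import base64
from typing import Any

# Endpoint paths as listed in the DBFS API reference.
DBFS_PUT = "api/2.0/dbfs/put"
DBFS_GET_STATUS = "api/2.0/dbfs/get-status"
DBFS_DELETE = "api/2.0/dbfs/delete"


def build_put_payload(dbfs_path: str, data: bytes, overwrite: bool = False) -> dict[str, Any]:
    """Body for a single-request upload; contents must be base64-encoded."""
    return {
        "path": dbfs_path,
        "contents": base64.b64encode(data).decode("ascii"),
        "overwrite": overwrite,
    }


def build_get_status_payload(dbfs_path: str) -> dict[str, Any]:
    """Parameters for fetching the metadata of a file or directory."""
    return {"path": dbfs_path}


def build_delete_payload(dbfs_path: str, recursive: bool = False) -> dict[str, Any]:
    """Body for deleting a file, or a directory when recursive is True."""
    return {"path": dbfs_path, "recursive": recursive}
```

An operator would then hand one of these dictionaries, together with the matching endpoint, to the hook's API-call method. Larger files would likely need the streaming endpoints instead of a single upload request.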

Please LMK if you consider this a relevant contribution or not

Related issues

As one of the DBFS API endpoints uses PUT as its verb, I'd need to modify BaseDatabricksHook, because it does not support PUT ATM (see https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks_base.py#L584)
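To make the kind of change concrete, here is a rough, standalone sketch of verb dispatch that accepts PUT alongside other methods. It is not the code in `databricks_base.py`: the `build_request` helper and the `SUPPORTED_METHODS` tuple are hypothetical, the set of verbs is illustrative, and the standard library's `urllib.request` is used only so the example runs without extra dependencies.

```python
import json
import urllib.request
from typing import Any, Optional

# Illustrative allow-list; the point is only that "PUT" is included.
SUPPORTED_METHODS = ("GET", "POST", "PATCH", "DELETE", "PUT")


def build_request(method: str, url: str, payload: Optional[dict[str, Any]] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for the given verb."""
    method = method.upper()
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"Unexpected HTTP method: {method}")

    data = None
    headers = {}
    # Send a JSON body for every verb except GET.
    if payload is not None and method != "GET":
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"

    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Rejecting unknown verbs up front keeps the failure mode explicit, so supporting a new verb is a one-line change to the allow-list.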

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

nivangio added the kind:feature and needs-triage labels on Apr 25, 2024
@Taragolis
Contributor

Maybe it is also a good idea to implement DBFS over Object Storage

@Taragolis
Contributor

And just wondering why not implement it over the official SDK?

Note about production usage from https://docs.databricks.com/en/dev-tools/sdk-python.html

Note

This feature is in Beta and is okay to use in production.

During the Beta period, Databricks recommends that you pin a dependency on the specific minor version of the Databricks SDK for Python that your code depends on.

@nivangio
Contributor Author

Hi @Taragolis. Thx for your reply! TBH, I wasn't aware of the existence of Object Storage. It seems as if many of the things I've implemented were already there. The only thing I cannot find is some sort of cp that enables uploading/downloading data from DBFS. At this point I wonder if it wouldn't be better to extend this and then use it within ObjectStoragePath.

With respect to the SDK, sounds good to me. However, the whole plugin is done pointing directly to the REST Endpoints. I think it may be better in that sense to stick to one strategy (either change everything to point to the SDK or extend it using the REST API)

@Taragolis
Contributor

> if it wouldn't be better to extend this and then use it within ObjectStoragePath

Airflow ObjectStorage is built on top of fsspec and, I guess, extends some methods, like copy

> With respect to the SDK, sounds good to me. However, the whole plugin is done pointing directly to the REST Endpoints

Small nit: this is an Airflow Provider, not an Airflow Plugin; those are slightly different things.

> I think it may be better in that sense to stick to one strategy (either change everything to point to the SDK or extend it using the REST API)

In the long run SDK should replace internal solutions, that is why I propose to use SDK over the direct call to the API
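As a rough idea of what going through the SDK could look like, here is a small sketch. It assumes the Databricks SDK for Python exposes `WorkspaceClient().dbfs.upload(path, src, overwrite=...)`; please check that against the SDK version you pin. The `upload_bytes_via_sdk` wrapper is hypothetical and takes the client as an argument so it can be exercised without a workspace.

```python
import io
from typing import Any


def upload_bytes_via_sdk(client: Any, data: bytes, dbfs_path: str, overwrite: bool = False) -> None:
    """Upload in-memory bytes to DBFS through an SDK-style client.

    `client` is assumed to behave like `databricks.sdk.WorkspaceClient`,
    i.e. to expose `client.dbfs.upload(path, src, overwrite=...)`.
    """
    # The SDK call is assumed to take a binary file-like object as the source.
    client.dbfs.upload(dbfs_path, io.BytesIO(data), overwrite=overwrite)


# Hypothetical usage, requires the SDK and workspace credentials:
# from databricks.sdk import WorkspaceClient
# upload_bytes_via_sdk(WorkspaceClient(), b"col1,col2\n1,2\n", "/tmp/example.csv", overwrite=True)
```

With this shape, a hook would mostly be responsible for building the client from the Airflow connection, and the file-transfer logic would stay inside the SDK.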

eladkal added the area:providers, good first issue, and provider:databricks labels and removed the needs-triage label on Apr 26, 2024
@nivangio
Contributor Author

nivangio commented Apr 26, 2024

> In the long run SDK should replace internal solutions, that is why I propose to use SDK over the direct call to the API

Absolutely agree on the idea! I think that's quite a deep change, though, and I'm not sure how that's handled or whether it should actually be part of a separate ticket (i.e., more of a refactor ticket than a feature one).

@nivangio
Contributor Author

nivangio commented May 1, 2024

@Taragolis @eladkal should I move forward with this as originally posted or do you have sth different in mind?
