datasets: delta lake and huggingface #10363

Open
dberenbaum opened this issue Mar 19, 2024 · 1 comment
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove) · feature request (Requesting a new feature) · p2-medium (Medium priority, should be done, but less important)

Comments

@dberenbaum
Contributor

Following up on #10313 and the related new features for specifying datasets as dependencies, we can add support for more dataset types: Delta Lake tables and Hugging Face datasets.

This would let DVC track these datasets as dependencies through their own native versioning, without downloading or caching any data.

Delta Lake example:

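from dvc.api import get_dataset
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

ds_info = get_dataset("mytable")
# Read the table as of the timestamp recorded by DVC (Delta time travel).
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])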

Hugging Face example:

from datasets import load_dataset
from dvc.api import get_dataset

ds_info = get_dataset("mydataset")
# Load the dataset at the Hub revision recorded by DVC.
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])
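
Tracking a dataset by its native version, without downloading anything, might reduce to comparing version identifiers. Below is a minimal sketch of that idea, not DVC's implementation: `DatasetRef`, `dependency_changed`, and the `"delta"`/`"hf"` kind labels are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical record of a dataset pinned to its native version."""
    name: str
    kind: str     # assumed labels, e.g. "delta" or "hf"
    version: str  # a Delta Lake timestamp or a Hugging Face revision


def dependency_changed(locked: Optional[DatasetRef], current: DatasetRef) -> bool:
    """Return True when a stage depending on the dataset should rerun.

    Only the lightweight references are compared; no data is fetched.
    """
    if locked is None:
        # Nothing recorded yet, so treat the dependency as changed.
        return True
    return locked != current


locked = DatasetRef("mydataset", "hf", "abc123")
same = DatasetRef("mydataset", "hf", "abc123")
bumped = DatasetRef("mydataset", "hf", "def456")
```

Here `dependency_changed(locked, same)` reports no change, while `dependency_changed(locked, bumped)` flags the new revision. The comparison only touches metadata, which is the property the proposal relies on.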
@skshetry

This comment was marked as off-topic.
