Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimport checkout #10388

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Reimport checkout #10388

wants to merge 2 commits into from

Conversation

dberenbaum
Copy link
Contributor

@dberenbaum dberenbaum commented Apr 12, 2024

Fixes #10255. This is a partial reversion of #9246. I'm not sure this is the cleanest or best way to do it, but it solves the problem.

It builds the hashfile object and tries to checkout from that, falling back to download if it fails.

Note that the build step is still slow for 2.x imports.

Copy link

codecov bot commented Apr 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.46%. Comparing base (dd2b515) to head (d510ad6).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10388      +/-   ##
==========================================
- Coverage   90.72%   90.46%   -0.27%     
==========================================
  Files         501      501              
  Lines       38865    38885      +20     
  Branches     5618     5619       +1     
==========================================
- Hits        35262    35176      -86     
- Misses       2961     3038      +77     
- Partials      642      671      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dberenbaum
Copy link
Contributor Author

Ping @skshetry

@skshetry
Copy link
Member

I'll need to investigate. But we set cache.dir to point to the local cache, which should have made the DVCFileSystem use that cache.

config["cache"]["dir"] = self.repo.cache.local_cache_dir

I am a bit hesitant of this PR as it goes against #9246, and adds another (complex) codepath.

@dberenbaum
Copy link
Contributor Author

Yes, I looked into it, and the local cache dir is set, but it's not enough because it's doing download/copy and not checkout.

download() is defined here:

dvc/dvc/fs/__init__.py

Lines 47 to 76 in 7df8b0c

def download(
fs: "FileSystem", fs_path: str, to: str, jobs: Optional[int] = None
) -> int:
from dvc.scm import lfs_prefetch
from .callbacks import TqdmCallback
with TqdmCallback(desc=f"Downloading {fs.name(fs_path)}", unit="files") as cb:
# NOTE: We use dvc-objects generic.copy over fs.get since it makes file
# download atomic and avoids fsspec glob/regex path expansion.
if fs.isdir(fs_path):
from_infos = [
path for path in fs.find(fs_path) if not path.endswith(fs.flavour.sep)
]
if not from_infos:
localfs.makedirs(to, exist_ok=True)
return 0
to_infos = [
localfs.join(to, *fs.relparts(info, fs_path)) for info in from_infos
]
else:
from_infos = [fs_path]
to_infos = [to]
if isinstance(fs, DVCFileSystem):
lfs_prefetch(fs, from_infos)
cb.set_size(len(from_infos))
jobs = jobs or fs.jobs
generic.copy(fs, from_infos, localfs, to_infos, callback=cb, batch_size=jobs)
return len(to_infos)

It calls generic.copy() (https://github.com/iterative/dvc-objects/blob/ae8b95d7ea3bbcf46fc5a09eddc7c4f614023d7b/src/dvc_objects/fs/generic.py#L69), which runs fs.get_file() operations on paths and knows nothing about the cache.

I don't think there is a much simpler way to do it than this PR, and I tried to make it as narrow as possible. It uses #9246 to rely on the local cache, building the object and passing it to dvc-data's checkout. We don't have imported_file.dvc, so there is no way to do a higher-level checkout operation AFAICT.

@skshetry
Copy link
Member

skshetry commented Apr 29, 2024

which runs fs.get_file() operations on paths and knows nothing about the cache.

fs is a DVCFileSystem, and knows about the cache.

@dberenbaum
Copy link
Contributor Author

fs is a DVCFileSystem, and knows about the cache.

Yes, but it is doing a copy from that fs to localfs. It is not linking, and even if I try to enable linking, it fails because it is working across filesystems.

@Honzys
Copy link

Honzys commented Jun 5, 2024

Hello, is there any update with this merge request? It would help us a ton if it gets merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

import: local cache is ignored when importing data that already exist in cache
3 participants