Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import: hangs when pulling many files from GCS remote and one fails #10417

Open
dluks opened this issue May 7, 2024 · 2 comments
Open

import: hangs when pulling many files from GCS remote and one fails #10417

dluks opened this issue May 7, 2024 · 2 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push performance improvement over resource / time consuming tasks

Comments

@dluks
Copy link

dluks commented May 7, 2024

Bug Report

import/pull: hangs when pulling many files from GCS and one (or a few) fails

Description

I have a directory in a data registry that is tracked with DVC which contains ~1.6k files. I pushed it to a Google Cloud Storage remote, and now I am trying to either dvc import or dvc pull the data from the GCS remote to a new machine. This works well for ~99% of the files, but sometimes a few seem to simply fail silently and never actually download.

When this happens, the entire process hangs once all successful downloads have completed, leaving only the hung/failed/paused (not sure) downloads remaining. They never seem to resume and I am forced to Ctrl+C out of the process.

This is less of an issue when using dvc pull as I can simply re-run the command and it will only download the missing files, but with dvc import I am forced to re-run the process.

Reproduce

  1. dvc init (data registry project)
  2. dvc add <large directory containing many moderately-sized files (each around 30MB)>
  3. dvc remote add <google cloud storage bucket>
  4. dvc push <google cloud storage>
  5. git commit -am "commit msg" && git push
  6. (new machine or repository) git init <new-project>
  7. dvc import https://github.com/<data-registry-repo> path/to/large-dir

Most files finish downloading, but some fail (or hang) silently and prevent the entire dvc import from completing:

image

I eventually have to Ctrl-C to exit the process or else it hangs for hours. Here is the full output before and after the Ctrl-C interrupt:

$ (traits) ➜  cit-sci-traits git:(wip) ✗ dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote gcs --force --verbose   
2024-05-07 22:49:38,950 DEBUG: v3.50.1 (pip), CPython 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35                                                                                             
2024-05-07 22:49:38,950 DEBUG: command: /home/daniel/.local/bin/dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote
 gcs --force --verbose                                                                                                                                                                                  
2024-05-07 22:49:39,321 DEBUG: Removing output 'data/raw/all_tracheophyta_non-cult_2024-04-10' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc'.                                           
2024-05-07 22:49:39,322 DEBUG: Removing '/home/daniel/projects/ufr/cit-sci-traits/data/raw/all_tracheophyta_non-cult_2024-04-10'                                                                        
Importing 'data/gbif/all_tracheophyta_non-cult_2024-04-10 (https://github.com/GeoSense-Freiburg/panops-data-registry)' -> 'data/raw/all_tracheophyta_non-cult_2024-04-10'                               
2024-05-07 22:49:39,406 DEBUG: Computed stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' md5: '4f2d2baf24cfc67c5da509212106b2af'                                                              
2024-05-07 22:49:39,406 DEBUG: 'md5' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' changed.                                                                                             
2024-05-07 22:49:39,407 DEBUG: Creating external repo https://github.com/GeoSense-Freiburg/panops-data-registry@None                                                                                    
2024-05-07 22:49:39,407 DEBUG: erepo: git clone 'https://github.com/GeoSense-Freiburg/panops-data-registry' to a temporary dir                                                                          
2024-05-07 22:49:40,906 DEBUG: Lockfile '../../../../../dvc.lock' needs to be updated.                                                                                                                  
2024-05-07 22:49:41,101 DEBUG: failed to load ('data', 'gbif', 'all_tracheophyta_non-cult_2024-04-10') from storage local (/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5) - [Errno 2] No
 such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'                                                                          
Traceback (most recent call last):                                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage                                                                       
    _load_from_object_storage(trie, entry, storage)                                                                                                                                                     
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage                                                                
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)                                                                                                                 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load                                                                                   
    with obj.fs.open(obj.path, "r") as fobj:                                                                                                                                                            
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open                                                                                      
    return self.fs.open(path, mode=mode, **kwargs)                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open                                                                                     
    return open(path, mode=mode, encoding=encoding)  # noqa: SIM115                                                                                                                                     
FileNotFoundError: [Errno 2] No such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'                                           
                                                                                                                                                                                                        
 98% Downloading all_tracheophyta_non-cult_2024-04-10|████████████████████████████████████████████████████████████████████████████████████████████████████████▎  |1.54k/1.58k [54:11<00:53, 1.37s/files]
2024-05-07 23:47:02,580 ERROR: interrupted by the user                                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                                      
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 264, in _get                                                                                   
    list(callback.wrap(it))                                                                                                                                                                             
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/fsspec/callbacks.py", line 157, in wrap                                                                                         
    for item in iterable:
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 37, in imap_unordered                                                                           
    done, tasks = futures.wait(tasks, return_when=futures.FIRST_COMPLETED)phyta_non-cult_2024-04-10.parquet/000000                                                14.2M/72.8M [13:40<07:01,     146kB/s]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 307, in wait                         
    waiter.event.wait(timeout)                    
  File "/usr/lib/python3.10/threading.py", line 607, in wait                                        
    signaled = self._cond.wait(timeout)           
  File "/usr/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/commands/imp.py", line 15, in run
    self.repo.imp(
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp.py", line 44, in imp
    return self.imp_url(path, out=out, erepo=erepo, frozen=True, **kwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/scm_context.py", line 143, in run
    return method(repo, *args, **kw)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp_url.py", line 86, in imp_url
    stage.run(jobs=jobs, no_download=no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
    return call()
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs) 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 603, in run
    self._sync_import(dry, force, kwargs.get("jobs", None), no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
    return call()                                                                                                                                                                               [4/1937]
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs) 
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 640, in _sync_import
    sync_import(self, dry, force, jobs, no_download)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/imports.py", line 61, in sync_import
    stage.deps[0].download(stage.outs[0], jobs=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/dependency/base.py", line 48, in download
    fs_download(self.fs, self.fs_path, to.fs_path, jobs=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/fs/__init__.py", line 75, in download
    generic.copy(fs, from_infos, localfs, to_infos, callback=cb, batch_size=jobs)
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 97, in copy
    return _get(
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 262, in _get
    with executor:
  File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 44, in __exit__
    self.shutdown(wait=True, cancel_futures=cancel_futures)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

2024-05-07 23:47:02,586 DEBUG: Analytics is enabled.                                                
2024-05-07 23:47:02,644 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v']    
2024-05-07 23:47:02,678 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v'] with pid 314508                                                                                                
2024-05-07 23:47:02,682 DEBUG: Removing '/tmp/tmp2elzelv6dvc-clone'                                 
2024-05-07 23:47:02,688 DEBUG: Removing '/tmp/tmp47j66_8ldvc-cache'                                 
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000048                                               687k/57.1M [57:19<1:08:10,    14.5kB/s]
  4%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000023                                               482k/12.9M [57:19<3:50:00,      947B/s]
 15%|█▍        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000025                                                5.72M/38.4M [57:20<03:16,     175kB/s]
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000058                                             460k/39.3M [57:19<106:54:50,      106B/s]
  0%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000007                                              276k/79.7M [57:20<14:38:23,    1.58kB/s]
 16%|█▋        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000005                                              4.48M/27.2M [57:19<6:26:13,    1.03kB/s]
 34%|███▍      |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000010                                                11.5M/33.8M [57:19<01:55,     202kB/s]
 37%|███▋      |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000008                                                13.5M/36.7M [57:19<03:14,     125kB/s]
 20%|██        |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000000                                                14.9M/72.8M [57:20<06:56,     146kB/s]
  3%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000055                                             1.08M/39.3M [57:19<15:07:16,      736B/s]
  0%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000013                                             266k/67.4M [57:20<165:16:44,      118B/s]
  3%|▎         |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000001                                             1.96M/60.1M [57:20<22:28:28,      754B/s]
  1%|          |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000059                                               251k/39.4M [57:19<9:11:19,    1.24kB/s]

Expected

I expect dvc import to detect download timeouts and either restart the hanging download or to complete the process with a warning and to then allow for a partial dvc import to attempt to gather only the missing files instead of requiring a complete re-download of the data.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        gs (gcsfs = 2024.3.1),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2024.4.1)
Config:
        Global: /home/daniel/.config/dvc
        System: /etc/xdg/xdg-cinnamon/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Caches: local
Remotes: local, gs, gs
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1545d141359143110c3029289a778e59

Additional Information (if any):

@dluks
Copy link
Author

dluks commented May 8, 2024

Update

I found a workaround by running dvc import --no-downloadto first import the .dvcfile only, and then using dvc pull several times to download the remaining data after the above-mentioned failures.

Note: this does not solve the underlying bug, but it is a workaround.

@dberenbaum
Copy link
Contributor

Thanks for the report @dluks and that workaround sounds like a good way around it for now. I'll keep this open but not sure it's actionable at the moment without a more consistent way to reproduce.

@dberenbaum dberenbaum added performance improvement over resource / time consuming tasks A: data-sync Related to dvc get/fetch/import/pull/push p2-medium Medium priority, should be done, but less important and removed p2-medium Medium priority, should be done, but less important labels May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push performance improvement over resource / time consuming tasks
Projects
None yet
Development

No branches or pull requests

2 participants