Edge case: Large datalad saves with tight ulimits on many-core machines can fail #7568

Open
effigies opened this issue Mar 7, 2024 · 0 comments

effigies commented Mar 7, 2024

What is the problem?

We are running fMRIPrep on some OpenNeuro datasets. Due to some issues with handling branches across compute nodes, we ended up settling on "run everything, then datalad save". This leads to a situation where we run datalad save, and datalad passes 8.5k files to git annex as command line arguments.

This leads to the error message (first line truncated with [...]):

[WARNING] Received an exception CommandError(CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex add --json --json-error-messages -c annex.dotfiles=true -- sourcedata/freesurfer/sub-26/label/rh.perirhinal_exvivo.thresh.label [...] sub-30/ses-action01/func/sub-30_ses-action01_task-action_run-12_from-orig_to-boldref_mode-image_desc-hmc_xfm.txt' failed with exitcode 1 under /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds004488-fmriprep [info keys: stdout_json] [err: 'git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: git-annex: failed to create OS threadfailed to create OS thread: Resource temporarily unavailable
: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: failed to create OS thread: Resource temporarily unavailable
git-annex: internal error: Itimer: Failed to spawn thread: Resource temporarily unavailable
    (GHC version 9.0.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
fatal: the remote end hung up unexpectedly
git-annex: internal error: Itimer: Failed to spawn thread: Resource temporarily unavailable
    (GHC version 9.0.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
error: git-annex filter-process died of signal 6
fatal: the remote end hung up unexpectedly
error: git-annex filter-process died of signal 6
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
git-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
fatal: the remote end hung up unexpectedly
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)             git-annex: git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)

git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
git-annex: failed to create OS thread: Resource temporarily unavailable
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
error: external filter 'git-annex filter-process' failed
git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)
error: external filter 'git-annex filter-process' failed
git-annex: failed to create OS thread: Resource temporarily unavailable
fatal: the remote end hung up unexpectedly
add: 1 failed']). Canceling not-yet running jobs and waiting for completion of running. You can force earlier forceful exit by Ctrl-C. 
[INFO] Canceled 0 out of 0 jobs. 0 left running.

Using git annex add . before calling datalad save resolves the issue, and that's what we will do, but I figured I should make a report that can be referenced and searched.
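The workaround can be sketched as a two-step helper run from a script. This is only an illustration of the order of operations, not DataLad code: `workaround_commands` and `add_then_save` are hypothetical names, the commit message is a placeholder, and both commands are assumed to run from the dataset root.

```python
import subprocess
from pathlib import Path


def workaround_commands(message: str) -> list[list[str]]:
    """Return the two commands in order: annex everything, then save."""
    return [
        # git-annex walks the tree itself, so no file list ends up in argv.
        ["git", "annex", "add", "."],
        # Save the dataset state afterwards.
        ["datalad", "save", "-m", message],
    ]


def add_then_save(dataset: Path, message: str) -> None:
    """Run both steps from the dataset root, stopping at the first failure."""
    for cmd in workaround_commands(message):
        subprocess.run(cmd, cwd=dataset, check=True)
```

Something like `add_then_save(Path("ds004488-fmriprep"), "Add fMRIPrep outputs")` would then replace the bare `datalad save` call.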

I believe the argv is exceeding the page size, so that spawning each thread copies (or allocates) more pages than usual and runs up against the virtual memory limit.
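The first half of that hypothesis is easy to check in isolation. The sketch below assumes Linux and counts only the argument strings plus their terminating NUL bytes, ignoring the environment and the pointer array; `argv_bytes` is a hypothetical helper, and the file names mirror the reproduction script in the next section. It shows how many pages the argv spans, not what thread creation does with them.

```python
import os


def argv_bytes(argv: list[str]) -> int:
    """Bytes needed for the argument strings, each with a trailing NUL."""
    return sum(len(os.fsencode(arg)) + 1 for arg in argv)


base = ("git -c diff.ignoreSubmodules=none -c core.quotepath=false "
        "annex add --json --json-error-messages -c annex.dotfiles=true --").split()
cmd = base + [f"{i:04d}" for i in range(10000)]

page_size = os.sysconf("SC_PAGE_SIZE")  # bytes per memory page
arg_max = os.sysconf("SC_ARG_MAX")      # kernel limit on argv + environment
size = argv_bytes(cmd)
print(f"argv: {size} bytes, about {size / page_size:.1f} pages; ARG_MAX: {arg_max}")
```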

What steps will reproduce the problem?

This script aims to simulate datalad's behavior by passing a large number of command-line arguments. I used Python for the call because I've seen argument list limits in bash in the past, but I didn't actually try running the command directly in bash:

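#!/bin/bash

# Start from a clean repository; annexed content is write-protected, so restore write access first.
if [ -d newrepo ]; then
  chmod -R u+w newrepo
  rm -rf newrepo
fi

git init newrepo
pushd newrepo
git annex init

# Create 10,000 small files named 0000 through 9999.
for i in {0000..9999}; do
  echo "$i" > "$i"
done

# Cap virtual memory (in kbytes) for everything started after this point.
ulimit -v 380000

# Pass all 10,000 file names as command-line arguments, as datalad save does.
python <<END
from subprocess import run

base = "git -c diff.ignoreSubmodules=none -c core.quotepath=false annex add --json --json-error-messages -c annex.dotfiles=true --".split()
cmd = base + [f"{i:04d}" for i in range(10000)]
print(cmd[:5] + ["..."] + cmd[-5:])
run(cmd)
END

popd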

Using datalad save instead of the python <<END ... block also reproduces the issue. You may want to tweak the ulimit for your system. I found that with a limit 10MB higher the command succeeded, while with a limit 10MB lower things other than thread spawning failed. Nothing failed in quite the same way as seen above.
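For finding the threshold on a given system, it may be more convenient to apply the limit per command than to edit `ulimit -v` in the script. A minimal sketch, assuming Linux and Python's standard `resource` module; `run_with_vmem_limit` and `kbytes_to_bytes` are hypothetical helpers, not part of datalad or git-annex.

```python
import resource
import subprocess


def kbytes_to_bytes(kbytes: int) -> int:
    """ulimit -v takes kbytes; RLIMIT_AS is expressed in bytes."""
    return kbytes * 1024


def run_with_vmem_limit(cmd: list[str], kbytes: int) -> subprocess.CompletedProcess:
    """Run cmd with its address space capped, similar to `ulimit -v` in a subshell."""
    wanted = kbytes_to_bytes(kbytes)

    def apply_limit() -> None:
        # Runs in the child just before exec; stay within the inherited hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_limit, capture_output=True, text=True)
```

Calling `run_with_vmem_limit(cmd, 380000)` with the `cmd` list from the script, then stepping the value up or down by roughly 10000, should show where the failure mode changes.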

DataLad information

# WTF
## configuration <SENSITIVE, report disabled by configuration>
## credentials
  - keyring:
    - active_backends:
      - PlaintextKeyring with no encyption v.1.0 at /home1/03201/jbwexler/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home1/03201/jbwexler/.config/python_keyring/keyringrc.cfg
    - data_root: /home1/03201/jbwexler/.local/share/python_keyring
## datalad
  - version: 0.19.3
## dependencies
  - annexremote: 1.6.0
  - boto: 2.49.0
  - cmd:annex: 10.20230215-gd24914f2a
  - cmd:bundled-git: 2.39.2
  - cmd:git: 2.39.2
  - cmd:ssh: 8.5p1
  - cmd:system-git: 2.40.1
  - cmd:system-ssh: 8.5p1
  - humanize: 4.4.0
  - iso8601: 1.1.0
  - keyring: 23.13.1
  - keyrings.alt: 4.2.0
  - msgpack: 1.0.4
  - platformdirs: 2.6.0
  - requests: 2.29.0
## environment
  - GIT_EXEC_PATH: /opt/apps/git/2.24.1/libexec/git-core
  - GIT_TEMPLATE_DIR: /opt/apps/git/2.24.1/share/git-core/templates
  - LANG: en_US.UTF-8
  - PATH: /work2/01329/poldrack/software/fsl/fsl-6.0.4/bin:/work2/01329/poldrack/software/fsl/fsl-6.0.4/bin:/work2/03201/jbwexler/frontera/anaconda3/envs/main/bin:/work2/03201/jbwexler/frontera/anaconda3/condabin:/opt/apps/xalt/xalt/bin:/work/01329/poldrack/software/nodejs/node-v17.1.0-linux-x64/bin:/work/01329/poldrack/tacc-software/launch:/opt/apps/tacc-apptainer/1.1.8/bin:/opt/apps/hwloc/1.11.12/bin:/opt/apps/cmake/3.24.2/bin:/opt/apps/intel19/python3/3.7.0/bin:/opt/apps/autotools/1.2/bin:/opt/apps/git/2.24.1/bin:/opt/intel/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin:/opt/intel/compilers_and_libraries_2020.1.217/linux/bin/intel64:/opt/apps/gcc/8.3.0/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ddn/ime/bin:.:/opt/ddn/ime/bin
  - PYTHONPATH: /opt/apps/intel19/impi19_0/python3/3.7.0/lib/python3.7/site-packages
## extensions
  - container:
    - description: Containerized environments
    - entrypoints:
      - datalad_container.containers_add.ContainersAdd:
        - class: ContainersAdd
        - module: datalad_container.containers_add
        - names:
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList:
        - class: ContainersList
        - module: datalad_container.containers_list
        - names:
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove:
        - class: ContainersRemove
        - module: datalad_container.containers_remove
        - names:
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun:
        - class: ContainersRun
        - module: datalad_container.containers_run
        - names:
          - containers-run
          - containers_run
    - module: datalad_container
    - version: 1.1.8
  - next:
    - description: What is next in DataLad
    - entrypoints:
      - datalad_next.create_sibling_webdav.CreateSiblingWebDAV:
        - class: CreateSiblingWebDAV
        - module: datalad_next.create_sibling_webdav
        - names:
          - create-sibling-webdav
      - datalad_next.credentials.Credentials:
        - class: Credentials
        - module: datalad_next.credentials
        - names:
      - datalad_next.tree.TreeCommand:
        - class: TreeCommand
        - module: datalad_next.tree
        - names:
          - tree
    - module: datalad_next
    - version: 0.6.3
  - osf:
    - description: DataLad extension for OSF support
    - entrypoints:
      - datalad_osf.create_sibling_osf.CreateSiblingOSF:
        - class: CreateSiblingOSF
        - module: datalad_osf.create_sibling_osf
        - names:
          - create-sibling-osf
          - create_sibling_osf
      - datalad_osf.credentials.OSFCredentials:
        - class: OSFCredentials
        - module: datalad_osf.credentials
        - names:
          - osf-credentials
          - osf_credentials
    - module: datalad_osf
    - version: 0.2.3.1
## git-annex
  - build flags:
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Benchmark
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions:
    - aws-0.22.1
    - bloomfilter-2.0.1.0
    - cryptonite-0.29
    - DAV-1.3.4
    - feed-1.3.2.1
    - ghc-9.0.2
    - http-client-0.7.13.1
    - persistent-sqlite-2.13.1.0
    - torrent-10000.1.1
    - uuid-1.3.15
    - yesod-1.6.2.1
  - key/value backends:
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - operating system: linux x86_64
  - remote types:
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - borg
    - hook
    - external
  - supported repository versions:
    - 8
    - 9
    - 10
  - upgrade supported from repository versions:
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
    - 8
    - 9
    - 10
  - version: 10.20230215-gd24914f2a
## location
  - path: /scratch1/03201/jbwexler/openneuro_derivatives
  - type: directory
## metadata.extractors
## metadata.filters
## metadata.indexers
## python
  - implementation: CPython
  - version: 3.10.11
## system
  - distribution: centos/7/Core
  - encoding:
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - filesystem:
    - CWD:
      - path: /scratch1/03201/jbwexler/openneuro_derivatives
    - HOME:
      - path: /home1/03201/jbwexler
    - TMP:
      - path: /tmp
  - max_path_length: 302
  - name: Linux
  - release: 3.10.0-1160.90.1.el7.x86_64
  - type: posix
  - version: #1 SMP Thu May 4 15:21:22 UTC 2023

Additional context

We were running on a Frontera login node, which may have been unwise. The ulimits are:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 766769
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 300
virtual memory          (kbytes, -v) 8388608
file locks                      (-x) unlimited

I believe it's the virtual memory that's the problem, not the user processes, though there may be an interaction. When attempting to replicate on a local machine, constraining virtual memory led to thread spawning errors (among other things; it's fiddly), while limiting processes led to forking errors.
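The two limits in question, along with the core count, can also be read from inside Python, which is handy when comparing nodes. A small sketch using only the standard library on Linux; `format_limit` is a hypothetical helper that mimics the `ulimit -a` presentation.

```python
import os
import resource


def format_limit(value: int, unit: int = 1) -> str:
    """Render a limit the way ulimit does, dividing by unit (e.g. 1024 for kbytes)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // unit)


as_soft, _ = resource.getrlimit(resource.RLIMIT_AS)        # ulimit -v, in bytes
nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)  # ulimit -u

print("virtual memory (kbytes):", format_limit(as_soft, 1024))
print("max user processes:", format_limit(nproc_soft))
print("cores:", os.cpu_count())
```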

This node has 112 cores, which may be causing problems for Haskell. I found haskell/cabal#2576 where someone was having similar issues with cabal on a 32-node system with tight ulimits.

When I change the above "reproduction" to use ls -1 | git annex add [ARGS] --batch, the problem is resolved. @yarikoptic pointed me at #6977, which seems like it should improve the situation by using stdin instead of argv to pass files.
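The same stdin-based call can be sketched in Python to mirror the reproduction script. This is an illustration under assumptions, not what #6977 implements: the helper names are hypothetical, only a subset of the original options is passed, and file names containing newlines are not handled.

```python
import subprocess


def batch_add_command() -> list[str]:
    """A short, fixed-size argv; the file names are not part of it."""
    return ["git", "annex", "add", "--json", "--json-error-messages", "--batch"]


def batch_input(paths: list[str]) -> str:
    """One file name per line, as read by --batch from stdin."""
    return "".join(f"{path}\n" for path in paths)


def annex_add_batch(paths: list[str], cwd: str) -> subprocess.CompletedProcess:
    """Feed the file names to git-annex over stdin instead of argv."""
    return subprocess.run(
        batch_add_command(),
        input=batch_input(paths),
        text=True,
        cwd=cwd,
        check=True,
    )
```

With the files from the reproduction, `annex_add_batch([f"{i:04d}" for i in range(10000)], "newrepo")` keeps the argv at six entries regardless of how many files are added.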

Have you had any success using DataLad before?

Yes!
