What is the problem?
We are running fMRIPrep on some OpenNeuro datasets. Due to some issues with handling branches across compute nodes, we ended up settling on "run everything, then datalad save". This leads to a situation where we run `datalad save`, and datalad passes 8.5k files to git annex as command line arguments.
This leads to the error message (first line truncated with `[...]`):
[WARNING] Received an exception CommandError(CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false annex add --json --json-error-messages -c annex.dotfiles=true -- sourcedata/freesurfer/sub-26/label/rh.perirhinal_exvivo.thresh.label [...] sub-30/ses-action01/func/sub-30_ses-action01_task-action_run-12_from-orig_to-boldref_mode-image_desc-hmc_xfm.txt' failed with exitcode 1 under /scratch1/03201/jbwexler/openneuro_derivatives/derivatives/fmriprep/ds004488-fmriprep [info keys: stdout_json] [err: 'git-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: git-annex: failed to create OS threadfailed to create OS thread: Resource temporarily unavailable: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: failed to create OS thread: Resource temporarily unavailablegit-annex: internal error: Itimer: Failed to spawn thread: Resource temporarily unavailable (GHC version 9.0.2 for x86_64_unknown_linux) Please report this as a GHC bug: https://www.haskell.org/ghc/reportabugfatal: the remote end hung up unexpectedlygit-annex: internal error: Itimer: Failed to spawn thread: Resource temporarily unavailable (GHC version 9.0.2 for x86_64_unknown_linux) Please report this as a GHC bug: https://www.haskell.org/ghc/reportabugerror: git-annex filter-process died of signal 6fatal: the remote end hung up unexpectedlyerror: git-annex filter-process died of signal 6fatal: the remote end hung up 
unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlygit-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)error: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failedgit-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: fork: resource exhausted (Resource temporarily unavailable)fatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlyfatal: the remote end hung up unexpectedlygit-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)error: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failedgit-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)error: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failedgit-annex: git: createProcess: posix_spawnp: 
resource exhausted (Resource temporarily unavailable) git-annex: git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)error: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failedgit-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)git-annex: failed to create OS thread: Resource temporarily unavailableerror: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failederror: external filter 'git-annex filter-process' failedgit-annex: git: createProcess: posix_spawnp: resource exhausted (Resource temporarily unavailable)error: external filter 'git-annex filter-process' failedgit-annex: failed to create OS thread: Resource temporarily unavailablefatal: the remote end hung up unexpectedlyadd: 1 failed']). Canceling not-yet running jobs and waiting for completion of running. You can force earlier forceful exit by Ctrl-C. [INFO] Canceled 0 out of 0 jobs. 0 left running.
Using `git annex add .` before calling `datalad save` resolves the issue, and that's what we will do, but I figured I should make a report that can be referenced and searched.
I believe that what is happening is that the argv is exceeding the page size, causing thread spawning to copy (or allocate) more pages than usual, and running up against virtual memory limits.
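To get a rough sense of how large such an argument list is, the sketch below adds up the bytes the argument strings would occupy and prints that next to the page size and the kernel's argument limit. The helper name `argv_bytes` and the example file names are made up for illustration. It only looks at the size side of the hypothesis and does not confirm the thread-spawning explanation.

```python
import os


def argv_bytes(args):
    """Bytes the argument strings occupy: each one plus its NUL terminator."""
    return sum(len(os.fsencode(a)) + 1 for a in args)


# Hypothetical stand-in for the ~8.5k paths passed to `git annex add`.
paths = [f"sub-{i:02d}/func/sub-{i:02d}_task-x_bold.nii.gz" for i in range(8500)]
total = argv_bytes(["git", "annex", "add", "--"] + paths)

print("argv bytes:", total)
print("page size: ", os.sysconf("SC_PAGE_SIZE"))
print("ARG_MAX:   ", os.sysconf("SC_ARG_MAX"))
```

The count covers the strings only, not the pointer array or the environment, so the real footprint is somewhat larger.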
What steps will reproduce the problem?
This script aimed to simulate datalad's behavior by passing a large number of command line arguments. I used Python because in the past I've seen argument list limits in bash, but I didn't actually try running it in bash:
Using `datalad save` instead of the `python <<END ...` also produces the issue. You may want to tweak the ulimit for your system. I found that if I went 10MB higher it succeeded, and if I went 10MB lower, things other than thread spawning failed. Nothing failed in quite the same way as seen above.
Additional context
We were running on a Frontera login node, which may have been unwise. The ulimits are:
I believe it's the virtual memory that's the problem, not the user processes, though there may be an interaction. When attempting to replicate on a local machine, constraining virtual memory led to thread spawning errors (among other things; it's fiddly), while limiting processes led to forking errors.
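For anyone wanting to try the two constraints separately, here is a minimal sketch of running a command under a lowered virtual memory or process limit from Python. It assumes Linux; `run_limited` is a hypothetical helper, the 4 GiB value is arbitrary, and the child here is just a probe that prints the limit it sees, not git-annex.

```python
import resource
import subprocess
import sys


def run_limited(cmd, as_bytes=None, nproc=None):
    """Run cmd with optional soft limits on address space and process count."""

    def apply_limits():
        # Runs in the child just before exec; only soft limits are changed.
        for which, value in ((resource.RLIMIT_AS, as_bytes),
                             (resource.RLIMIT_NPROC, nproc)):
            if value is None:
                continue
            _, hard = resource.getrlimit(which)
            if hard != resource.RLIM_INFINITY:
                value = min(value, hard)
            resource.setrlimit(which, (value, hard))

    return subprocess.run(cmd, preexec_fn=apply_limits,
                          capture_output=True, text=True)


# Show the limit the child actually sees (4 GiB here, an arbitrary value).
probe = "import resource; print(resource.getrlimit(resource.RLIMIT_AS)[0])"
result = run_limited([sys.executable, "-c", probe], as_bytes=4 << 30)
print(result.stdout.strip())
```

`RLIMIT_AS` corresponds to `ulimit -v` (which takes KiB rather than bytes) and `RLIMIT_NPROC` to `ulimit -u`, so swapping the probe for the failing command and lowering one value at a time should show which limit produces which kind of error.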
This node has 112 cores, which may be causing problems for Haskell. I found haskell/cabal#2576 where someone was having similar issues with `cabal` on a 32-node system with tight ulimits.
Testing the above "reproduction" by switching to `ls -1 | git annex add [ARGS] --batch`, the problem is resolved. @yarikoptic pointed me at #6977, which seems like it should improve the situation by using stdin instead of argv to pass files.
Have you had any success using DataLad before?
Yes!
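For reference, the stdin-based workaround mentioned under Additional context can also be driven from Python rather than a shell pipe. This is a minimal sketch with hypothetical helper names (`batch_input`, `annex_add_batch`); it is not how datalad or #6977 implement it, and it assumes newline-delimited input, so paths containing newlines are rejected.

```python
import subprocess


def batch_input(paths):
    """Encode paths as newline-delimited stdin for a --batch command."""
    for p in paths:
        if "\n" in p:
            raise ValueError(f"newline in path not supported here: {p!r}")
    return "".join(p + "\n" for p in paths)


def annex_add_batch(paths, cwd="."):
    """Feed paths to `git annex add --batch` on stdin instead of argv."""
    return subprocess.run(
        ["git", "annex", "add", "--batch"],
        input=batch_input(paths), text=True, cwd=cwd, check=True,
    )
```

Calling `annex_add_batch(list_of_paths, cwd=dataset_dir)` keeps the argument list at four entries no matter how many files are added.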