Remote auto-detection does not work with SSH scheduler #3119

Open
jack-morrison opened this issue Feb 17, 2024 · 1 comment

Comments

@jack-morrison
Contributor

Opening after discussion with @vkarak. It sounds like we've both hit some issues with auto-detection over SSH at one time or another.

It appears that when using the SSH scheduler, if the remote execution of rfm-detect-job.sh fails, the rsync back to the launching host is not run.

Here's a recent log+traceback where I observe rfm-detect-job.sh failing on the remote host over SSH (for self-inflicted reasons):

[2024-02-17T12:39:51] debug: reframe: Initializing runtime
[2024-02-17T12:39:51] debug: reframe: Initializing system partition 'my-partition'
[2024-02-17T12:39:51] debug: reframe: Initializing system 'test_system'
[2024-02-17T12:39:51] debug: reframe: Initializing modules system 'nomod'
[2024-02-17T12:39:51] debug: reframe: detecting topology info for test_system:my-partition
[2024-02-17T12:39:51] debug: reframe: > no topology file found; auto-detecting...
[2024-02-17T12:39:51] debug: reframe: [CMD] 'rsync --version'
[2024-02-17T12:39:51] info: reframe: Detecting topology of remote partition 'test_system:my-partition': this may take some time...
[2024-02-17T12:39:51] debug: reframe: submitting detection script
[2024-02-17T12:39:51] debug: reframe: --- /home/jmorrison/testing/rfm.tn195dvi/rfm-detect-job.sh ---
#!/bin/bash

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

python3 -m venv venv.reframe
source venv.reframe/bin/activate
pip install --upgrade pip
pip install reframe-hpc==4.5.0
reframe --detect-host-topology=topo.json
deactivate

--- /home/jmorrison/testing/rfm.tn195dvi/rfm-detect-job.sh ---
[2024-02-17T12:39:51] debug: reframe: [CMD] 'ssh -o BatchMode=yes hostA.cornelisnetworks.com mktemp -td rfm.XXXXXXXX'
[2024-02-17T12:39:51] debug: reframe: [CMD] 'rsync -az -e "ssh -o BatchMode=yes " /home/jmorrison/testing/rfm.tn195dvi/ hostA.cornelisnetworks.com:/tmp/rfm.rJDogAsn/'
[2024-02-17T12:39:51] debug: reframe: [CMD] 'ssh -o BatchMode=yes hostA.cornelisnetworks.com "cd /tmp/rfm.rJDogAsn && bash -l rfm-detect-job.sh"'
[2024-02-17T12:40:00] debug: reframe: job finished
[2024-02-17T12:40:00] warning: reframe: WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: '/home/jmorrison/testing/rfm.tn195dvi/rfm-detect-job.out'
[2024-02-17T12:40:00] debug: reframe: Traceback (most recent call last):
  File "/home/jmorrison/testing/venv/lib/python3.10/site-packages/reframe/frontend/autodetect.py", line 175, in _remote_detect
    _log_contents(job.stdout)
  File "/home/jmorrison/testing/venv/lib/python3.10/site-packages/reframe/frontend/autodetect.py", line 37, in _log_contents
    f'{_contents(filename)}\n'
  File "/home/jmorrison/testing/venv/lib/python3.10/site-packages/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jmorrison/testing/rfm.tn195dvi/rfm-detect-job.out'

I'm not sure what the desired behavior should be when remote execution of rfm-detect-job.sh fails. Maybe an error, reflecting the Python FileNotFoundError, instead of the warning that ReFrame currently writes (shown below)? Or maybe the returning rsync should still happen?

Detecting topology of remote partition 'test_system:test_partition': this may take some time...
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: '/home/jmorrison/testing/rfm.aiv3xlc1/rfm-detect-job.out'
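To illustrate the second option, here is a minimal sketch of running the copy-back step in a `finally` block, so partial output (stdout, stderr, topo.json) comes back even when the remote script fails. This is hypothetical, not ReFrame's actual code: `run` is an injected command runner (e.g. `subprocess.run`), and `exec_and_pull` is an invented name.

```python
def exec_and_pull(run, host, remote_dir, local_dir):
    """Hypothetical sketch: run the remote detection script, then always
    attempt the pull, even if the remote command failed."""
    try:
        run(['ssh', '-o', 'BatchMode=yes', host,
             f'cd {remote_dir} && bash -l rfm-detect-job.sh'])
    finally:
        # Copy back whatever the job produced, regardless of its outcome,
        # so rfm-detect-job.out/.err are available for diagnosis.
        run(['rsync', '-az', '-e', 'ssh -o BatchMode=yes',
             f'{host}:{remote_dir}/', local_dir])
```

The original error still propagates after the pull, so the failure is not swallowed; the logs just arrive first.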
@vkarak
Contributor

vkarak commented Feb 17, 2024

I suspect that the exec step of the SSH scheduler fails and the pull is ignored silently. Maybe this part is at fault:

def wait(self, job):
    for step in job.steps.values():
        if step.started():
            step.wait()

Each subsequent step is launched only if its predecessor has succeeded. However, this function returns as soon as the last started step has finished and does not check whether a remaining step was never run. The _poll_job() below treats that properly.
