
Hydra-submitit launcher does not log error messages to log files. #2100

Closed
abhiskk opened this issue Mar 18, 2022 · 5 comments

abhiskk commented Mar 18, 2022

🚀 Feature Request

Log error messages from hydra-submitit launches to the relevant log files. Currently, error messages are ignored by the hydra-submitit setup and are only piped to the terminal.

Motivation

I have been using the hydra-submitit launcher for multi-node training and have observed the behavior above. Logging messages to the relevant log files would consolidate the log and error messages and make it easier to track experiments. Logging only to the terminal is also painful when you launch a hyperparameter search with a single command, because the messages from all of the hyperparameter-search runs arrive interleaved on the same terminal.

For example, if we launch the runs we get a message like this on the terminal:

INFO: launch command: python train_d2_net.py -m         --config-path=../configs         --config-name=dummy.yaml         +mode=cluster         hydra.sweep.dir=/checkpoint/akadian/exps/20220316T012225/20220316T012225LOGS         use_wandb=true hydra.launcher.timeout_min=4320 D2.SOLVER.BASE_LR=0.02,0.002,0.0002,0.00002
[2022-03-16 01:23:26,351][HYDRA] Submitit 'slurm' sweep output dir : /checkpoint/akadian/exps-w3-reasoning/20220316T012225/20220316T012225LOGS
[2022-03-16 01:23:26,352][HYDRA]  #0 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.02
[2022-03-16 01:23:26,358][HYDRA]  #1 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.002
[2022-03-16 01:23:26,363][HYDRA]  #2 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.0002
[2022-03-16 01:23:26,367][HYDRA]  #3 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=2e-05

If any of these experiments crashes, the traceback is piped to the terminal from which the jobs were submitted, but it is not present in the log files where everything else is logged, such as .submitit/54459356/54459356_0_log.out and .submitit/54459356/54459356_0_log.err.

For example, this error message appeared on the terminal used to submit the jobs, but was never written to the relevant log files.

Error executing job with overrides: ['+mode=cluster', 'use_wandb=True', 'wandb_project=lvis_debug']
Traceback (most recent call last):
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Pitch

As discussed with @jieru-hu, this can be implemented either as native functionality or via Hydra callbacks.
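A callback-based version could look roughly like the sketch below. This is only an illustration of the idea, not the fix that was eventually shipped: it assumes Hydra's Callback interface from hydra.experimental.callback and the JobReturn/JobStatus types from hydra.core.utils, peeks at the private _return_value attribute to get the raised exception, and relies on Hydra's per-job logging configuration to route the logger's output into each job's .log file.

# Illustrative sketch only: a Hydra callback that logs every job's outcome so
# that failures end up in the job's own log file rather than only the terminal.
import logging

from omegaconf import DictConfig

from hydra.core.utils import JobReturn, JobStatus
from hydra.experimental.callback import Callback

log = logging.getLogger(__name__)


class LogErrorsCallback(Callback):
    """Log each job's result; on failure, log the full traceback."""

    def on_job_end(self, config: DictConfig, job_return: JobReturn, **kwargs) -> None:
        if job_return.status == JobStatus.COMPLETED:
            log.info("Job completed successfully.")
        elif job_return.status == JobStatus.FAILED:
            # _return_value holds the exception raised by the task function;
            # passing it as exc_info makes logging emit the full traceback.
            log.error("Job failed.", exc_info=job_return._return_value)
        else:
            log.error("Job finished with unknown status: %s", job_return.status)

Such a callback would then be registered under hydra.callbacks in the primary config (with _target_ pointing at its import path), analogous to the log_job_return YAML snippet shown later in this thread.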

@abhiskk abhiskk added the enhancement (Enhancement request) label Mar 18, 2022
@jieru-hu jieru-hu added this to the Hydra 1.1.2 milestone Mar 23, 2022
@jieru-hu jieru-hu self-assigned this Mar 23, 2022
@jieru-hu (Contributor)

Thank you @abhiskk. We will address this as part of 1.1.2.

@jieru-hu (Contributor)

Thanks @abhiskk for helping test the changes with submitit/slurm. This is addressed on both Hydra 1.2 and 1.1 and will be released along with Hydra 1.1.2 soon.

@jieru-hu jieru-hu modified the milestones: Hydra 1.1.2, Hydra 1.2.0 Mar 30, 2022

Ubadub commented May 28, 2023

@jieru-hu @abhiskk Could you explain how this was addressed/completed? As far as I can tell, error logs are still not being produced.

I am having an issue where my jobs fail immediately after starting (and unpredictably: sometimes they do, sometimes they don't), and the only error message I see, in the console, is:

submitit.core.utils.UncompletedJobError: Job 12649156_1 (task: 0) with path /mmfs1/gscratch/stf/abhinavp/573-SADTech/src/multirun/05-28-02-43/.submitit/12649156_1/12649156_1_0_result.pkl
has not produced any output (state: FAILED)

I suspect some sort of SLURM issue. Normally, SLURM would publish an actual error message to the error log file, but the plugin is simply swallowing this output. I don't know where it is getting redirected. Please advise.


pipme commented Aug 22, 2023

This issue still happens to me with Hydra 1.3.2. The error messages are not written to .submitit/54459356/54459356_0_log.out, .submitit/54459356/54459356_0_log.err, or Hydra's .log file. They are only piped to the terminal.

@abhiskk @jieru-hu Could you please elaborate on how to resolve this? A related issue is #2479, where one wants to exit the process that launches the MULTIRUN experiments.

It seems one needs to add a callback, right?

hydra:
  callbacks:
    log_job_return:
      _target_: hydra.experimental.callbacks.LogJobReturnCallback
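For reference: hydra.experimental.callbacks.LogJobReturnCallback is Hydra's built-in callback that, as far as I can tell, logs each job's return value on success and the full traceback on failure through the job's logger, so with this snippet in place the error should end up in the job's .log file alongside the regular log output.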


pipme commented Aug 22, 2023

Specifically for @Ubadub (not related to this issue): I also encountered this strange behavior. After investigating, I think the problem is that submitit by default specifies #SBATCH --error=xxx/.submitit/%A_%a/%A_%a_0_log.err when submitting to SLURM, while the %A_%a folders do not yet exist at submission time. The folders are created almost immediately after the SLURM job arrays are submitted. SLURM can therefore fail silently, since before v23.02 it does not support writing output to non-existent folders. (Also, submitit assumes the .pkl files to execute are in those subfolders, which might not be true at the time of submission.)

TL;DR: a workaround is to postpone the SLURM jobs by a few seconds (60 here).

# @package _global_
# requires hydra-submitit-launcher (available via pip)

defaults:
  - override /hydra/launcher: submitit_slurm
 
hydra:
  launcher:
    additional_parameters:
      error: /dev/null  
      output: /dev/null
      begin: now+60

Pointing the error and output files to /dev/null may or may not be necessary, but since the output and error paths for the individual array jobs are set correctly by submitit internally with srun --output .submitit/%A_%a/%A_%a_%t_log.out ..., it does no harm to add them. Below is an sbatch script example created by submitit before the above modification.

#!/bin/bash

# Parameters
#SBATCH --array=0-1%2
#SBATCH --cpus-per-task=1
#SBATCH --error=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.err
#SBATCH --job-name=__main__
#SBATCH --mem-per-cpu=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.out
#SBATCH --partition=short
#SBATCH --signal=USR2@120
#SBATCH --time=480
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.out --error xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.err python -u -m submitit.core._submit xxx/multiruns/2023-08-14_18-19/.submitit/%j
