Problem submitting jobs from the web console #1157

Open
liu-shaobo opened this issue Mar 8, 2024 · 6 comments
Comments

liu-shaobo commented Mar 8, 2024

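  • Submitting a job through the web console with the command "srun -n 2 hostname" produces this error log:
srun: error: Unable to create step for job 65: More processors requested than permitted
  • Submitting the same command from the command line works without problems.

Jobs submitted through the web console do not seem to recognize the SLURM_NPROCS variable.

  • Submitting a job through the web console with the command "srun -n $SLURM_NPROCS hostname" produces this error log: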
srun: error: Invalid numeric value "" for --ntasks.
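
The empty string in that message suggests that `$SLURM_NPROCS` expanded to nothing inside the generated job script. As far as I know, Slurm exports SLURM_NTASKS (and its older alias SLURM_NPROCS) to a batch job only when a task count was requested, so one possible cause, not confirmed in this thread, is that the web form does not pass a task count. A minimal sketch of a defensive fallback for the job command; the helper name `resolve_ntasks` and the default of 1 are assumptions for illustration:

```shell
#!/bin/bash
# Hypothetical helper (not part of SCOW or Slurm): choose a task count for srun.
# Prefers SLURM_NTASKS, then the legacy alias SLURM_NPROCS, then falls back to 1.
resolve_ntasks() {
  local n="${SLURM_NTASKS:-${SLURM_NPROCS:-}}"
  # Empty or non-numeric values fall back to the assumed default of 1.
  case "$n" in
    ''|*[!0-9]*) n=1 ;;
  esac
  printf '%s\n' "$n"
}

ntasks="$(resolve_ntasks)"
# Only prints the command; inside a real job you would run it instead.
echo "would run: srun -n ${ntasks} hostname"
```

With this guard, srun never receives an empty `-n` argument, at the cost of silently running a single task when no count is available.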

Environment

- OS: CentOS 7.8
- Scheduler: Slurm 20.11.9
- Docker: 24.0.7
- Docker-compose: 1.25.5
- SCOW cli: v1.4.2
- SCOW: v1.4.2 and v1.2.3
- Adapter: slurm-adapter v1.5.0
@vanstriker
Member

What does running echo $SLURM_NPROCS directly in the shell print?

@liu-shaobo
Author

liu-shaobo commented Mar 8, 2024

The output in the console's shell is normal.

@vanstriker
Member

Can you reproduce this on the official demo cluster?
srun -n 2 hostname should run fine on the demo cluster.

@liu-shaobo
Author

liu-shaobo commented Mar 12, 2024

  • On the official demo cluster, running srun -n $SLRUM_NPROCS hostname gives
    srun: error: Invalid numeric value "" for --ntasks.

  • On the official demo cluster, running srun -n 2 hostname works normally:

srun: defined options
srun: -------------------- --------------------
srun: (null)              : hpc01_cn01
srun: jobid               : 3017
srun: job-name            : job-20240312-152400
srun: nodes               : 1
srun: ntasks              : 2
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 3017: nodes(1):`hpc01_cn01', cpu counts: 2(x1)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=3017.0 on host hpc01_cn01, 2 tasks: [0-1]
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node hpc01_cn01, 2 tasks started
srun: launch/slurm: _task_finish: Received task exit notification for 2 tasks of StepId=3017.0 (status=0x0000).
srun: launch/slurm: _task_finish: hpc01_cn01: tasks 0-1: Completed
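  • On my own cluster, running srun -n 2 hostname fails.
    This seems to be caused by cpus-per-task, the variable that holds the per-node core count chosen when submitting the job. The problem does not occur on the command line, and srun -n 2 -c 1 hostname runs without error.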
srun: -------------------- --------------------
srun: (null)              : node001
srun: cpus-per-task       : 2
srun: jobid               : 91
srun: job-name            : job-20240312-152309
srun: mem-per-cpu         : 1G
srun: nodes               : 1
srun: ntasks              : 2
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 91: nodes(1):`node001', cpu counts: 2(x1)
srun: error: Unable to create step for job 91: More processors requested than permitted
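
The options in the log above point to a likely cause: the step asks for ntasks × cpus-per-task = 2 × 2 = 4 CPUs, while the allocation reports `cpu counts: 2(x1)`, that is, 2 CPUs on one node. A small sketch of that check with the numbers copied from the log; the function name `step_fits` is made up for illustration, and it mirrors the arithmetic only, not Slurm's actual step validation:

```shell
#!/bin/bash
# Hypothetical helper: does a step of NTASKS x CPUS_PER_TASK fit in ALLOC_CPUS?
# This is only the arithmetic, not how Slurm really validates a step.
step_fits() {
  local ntasks="$1" cpus_per_task="$2" alloc_cpus="$3"
  local needed=$(( ntasks * cpus_per_task ))
  if [ "$needed" -le "$alloc_cpus" ]; then
    echo "fits: needs ${needed} of ${alloc_cpus} CPUs"
  else
    echo "too large: needs ${needed} of ${alloc_cpus} CPUs"
    return 1
  fi
}

# Values from the failing job 91: ntasks=2, cpus-per-task=2, 2 CPUs allocated.
step_fits 2 2 2 || true
# Same job with "-c 1" on the srun line, as in the workaround.
step_fits 2 1 2
```

This would also be consistent with the demo cluster log, where no cpus-per-task entry appears among the defined options and the same `srun -n 2 hostname` succeeds.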

@liu-shaobo liu-shaobo changed the title from "Error submitting jobs from the web console" to "Problem submitting jobs from the web console" May 7, 2024
@liu-shaobo
Author

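From the source code, the core count is passed as -c: the per-node core count is mapped to "--cpus-per-task". sbatch scripts more commonly use the per-node task count, --ntasks-per-node, or --ntasks. For MPI jobs this parameter does not seem like a good fit.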

@vanstriker
Member

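Slurm has many parameters, and SCOW will not expose all of them as options. In practice you can write these parameters in the command box; parameters given in the command box take the highest priority.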
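
Following that suggestion, the command typed into the command box can set both the task count and the CPUs per task explicitly, as in the `srun -n 2 -c 1 hostname` workaround reported earlier in this thread. A minimal sketch that assembles such a command line for a single-node job; it assumes SLURM_CPUS_ON_NODE is set by Slurm inside the job and falls back to 1 otherwise, and `build_srun_cmd` is a hypothetical name:

```shell
#!/bin/bash
# Hypothetical helper: print an srun command that uses one CPU per task and
# as many tasks as there are CPUs allocated on this node (single-node jobs only).
build_srun_cmd() {
  # SLURM_CPUS_ON_NODE is set by Slurm inside a job; 1 is an assumed fallback.
  local cpus="${SLURM_CPUS_ON_NODE:-1}"
  printf 'srun -n %s -c 1 %s\n' "$cpus" "$*"
}

# Only prints the command line; in a real job the command box would hold that line itself.
build_srun_cmd hostname
```

Passing `-c 1` on the srun line is what keeps the step from picking up the job-level cpus-per-task value seen in the failing log.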
