Memorize job nodes and exclude broken nodes #15

Open
cmeesters opened this issue Jan 9, 2024 · 0 comments

Comments

@cmeesters
Collaborator

Motivation:

  1. When jobs are submitted on a cluster with persistent node-local storage, it can be advantageous to re-submit jobs to the very same node(s) to avoid the overhead of stage-ins or downloads. Note that this may be of limited use: persistent node-local disk space is rare and only helpful on clusters without much competition (otherwise other users will likely use the node in between and all scratch data will be deleted anyway).
  2. When jobs fail due to broken cluster nodes (on most clusters such nodes might be detected automatically and closed for submission), re-scheduled jobs are likely to end up on those same nodes, because they are always empty and able to accept new jobs, thereby creating a "black hole". The new feature can memorize the nodes of failed jobs and attempt to exclude those nodes.

Implementation:

Keep a persistent list of preferred nodes. Detect possibly broken nodes and remove them from the list of preferred nodes.

Submit to preferred nodes (optionally, as this might lead to longer wait times!). Exclude possibly broken nodes from submission. Report possibly broken nodes.

Will only work in the context of ONE workflow.
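
A minimal sketch of how this bookkeeping could look, not an actual implementation. It assumes a Slurm cluster, where `sbatch` accepts `--exclude` and `--nodelist`. The class `NodeMemory`, its methods, and the JSON file layout are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


class NodeMemory:
    """Hypothetical per-workflow record of preferred and possibly broken nodes."""

    def __init__(self, path):
        # The file at `path` keeps the lists persistent between submissions.
        self.path = Path(path)
        self.preferred = set()
        self.broken = set()
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.preferred = set(data.get("preferred", []))
            self.broken = set(data.get("broken", []))

    def _save(self):
        self.path.write_text(
            json.dumps(
                {"preferred": sorted(self.preferred), "broken": sorted(self.broken)}
            )
        )

    def record_success(self, nodes):
        # Nodes of finished jobs become preferred, unless already flagged as broken.
        self.preferred |= set(nodes) - self.broken
        self._save()

    def record_failure(self, nodes):
        # Nodes of failed jobs are flagged and erased from the preferred nodes.
        self.broken |= set(nodes)
        self.preferred -= self.broken
        self._save()

    def sbatch_args(self, use_preferred=False):
        # Always exclude possibly broken nodes; preferring nodes is optional,
        # as it might lead to longer wait times.
        args = []
        if self.broken:
            args.append("--exclude=" + ",".join(sorted(self.broken)))
        if use_preferred and self.preferred:
            args.append("--nodelist=" + ",".join(sorted(self.preferred)))
        return args
```

Note that Slurm's `--nodelist` requests all listed nodes for the job, so a real implementation would probably pass only the node(s) a particular job previously ran on rather than the whole preferred set. Reporting possibly broken nodes could be as simple as logging `sorted(memory.broken)` at the end of the workflow.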

Idea by @johanneskoester and @cmeesters

@cmeesters added the enhancement (New feature or request) label Jan 9, 2024
@cmeesters self-assigned this Jan 9, 2024