New input files are not detected when run config uses globs #645

Open
marxide opened this issue Mar 11, 2022 · 1 comment · May be fixed by #646

Comments

marxide commented Mar 11, 2022

When a user wishes to add images to an existing pipeline run, they modify the config to include the new inputs and relaunch the run. A check is performed to ensure that:

  1. New inputs have been added to the config, and
  2. No other settings have been changed.

Both of these conditions must be true for a pipeline run to be re-run in "add mode". The pipeline checks if the inputs have changed by reading the previous config file config_prev.yml and comparing it with the updated config.yml file. Both config files are parsed, validated, and all glob expressions are resolved.

Suppose that the config inputs are a simple glob expression, e.g.

```yaml
inputs:
  image:
    glob: /data/vast-survey/VAST/release/EPOCH*/COMBINED/STOKESI_IMAGES/*.fits
```

If new files that match this expression are added to the filesystem, the pipeline will fail to detect that the inputs have changed. It will read both config_prev.yml and config.yml, which would contain the same glob expression in this case, and compare them. Since the globs are resolved when the config file is read, both config files will end up with the same list of inputs even though new files matching the glob were added since the run was executed.

The problem is that the config diff check re-resolves the globs in the previous config file against the current filesystem and doesn't look at which images the previous run actually used.
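
The failure can be reproduced outside the pipeline. The sketch below is only an illustration: `resolve_inputs` and `inputs_changed` are hypothetical stand-ins for the pipeline's config parsing and diff check, and the configs are plain dicts rather than real config objects.

```python
import glob
import tempfile
from pathlib import Path


def resolve_inputs(config: dict) -> list[str]:
    """Expand the glob in a parsed config into a sorted file list (hypothetical helper)."""
    return sorted(glob.glob(config["inputs"]["image"]["glob"]))


def inputs_changed(prev_config: dict, new_config: dict) -> bool:
    """Naive diff: resolve both configs *now* and compare the results (hypothetical helper)."""
    return resolve_inputs(prev_config) != resolve_inputs(new_config)


with tempfile.TemporaryDirectory() as tmp:
    pattern = str(Path(tmp) / "*.fits")
    # Both config files contain the same glob expression.
    config_prev = {"inputs": {"image": {"glob": pattern}}}
    config_new = {"inputs": {"image": {"glob": pattern}}}

    # State of the filesystem when the run was first executed.
    Path(tmp, "a.fits").touch()
    images_used_by_run = resolve_inputs(config_prev)

    # A new file matching the glob arrives after the run.
    Path(tmp, "b.fits").touch()

    # Both globs are resolved against the current filesystem, so no change is seen.
    changed = inputs_changed(config_prev, config_new)
    n_new_files = len(resolve_inputs(config_new)) - len(images_used_by_run)
```

Here `changed` is `False` even though `n_new_files` is 1, because the previous config is resolved against today's filesystem rather than the state at the time of the run.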

A potential solution would be to extend the config diff check to also compare the number of resolved inputs in config.yml with the number of images stored in the Run object (i.e. Run.n_images). If the number of resolved inputs is greater than the number of images in the Run object, then the run should be re-run in add mode. This won't work if images were removed, but that isn't allowed for "add mode" anyway.
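
A minimal sketch of that extra check, not the actual fix: the `Run` dataclass below is a stand-in for the pipeline's Run object (only `n_images` comes from this issue), and `should_run_in_add_mode` and `other_settings_changed` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Stand-in for the pipeline's Run object; only n_images is taken from the issue."""
    n_images: int


def should_run_in_add_mode(
    resolved_inputs: list[str], run: Run, other_settings_changed: bool
) -> bool:
    """Return True if the run qualifies for add mode (hypothetical helper).

    Add mode requires that no other settings changed and that the config
    now resolves to more inputs than the run has already processed.
    """
    if other_settings_changed:
        return False
    return len(resolved_inputs) > run.n_images
```

For example, `should_run_in_add_mode(["a.fits", "b.fits", "c.fits"], Run(n_images=2), False)` returns `True`. Because only counts are compared, the check cannot tell which files are new, and it would miss the case where the same number of files were removed and added.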

@github-actions github-actions bot added this to To do in Pipeline Backlog Mar 11, 2022

marxide commented Mar 11, 2022

By the way, the context of this issue is that I found 15 low-band images that weren't included in the combined run. The inputs are specified with a glob expression per epoch, e.g.

```yaml
inputs:
  image:
    epoch00:
      glob: /data/vast-survey/VAST/release/EPOCH00/COMBINED/STOKESI_IMAGES/*.fits
    epoch01:
      glob: /data/vast-survey/VAST/release/EPOCH01/COMBINED/STOKESI_IMAGES/*.fits
    ...
```

I don't think there's a way I can add the new images to this config without fixing the config diff check. If I add the new files to the config explicitly, they'll show up twice when the globs are resolved.

marxide added a commit that referenced this issue Mar 16, 2022
New input files are now detected when the config uses glob expressions. Fixes #645.
@marxide marxide linked a pull request Mar 16, 2022 that will close this issue