run config diff check now considers number of images #646

marxide · 2022-03-16T18:52:07Z

New input files are now detected when the config uses glob expressions. Fixes #645.

ajstewart · 2022-03-17T10:10:32Z

Something I thought of on seeing this and #645: does a missing image negatively impact the run at all? I can't remember whether images.parquet is appended to or rebuilt from the images input, on the assumption that everything from the previous run is still there.

I kind of feel like there should be a refusal to run if a previous image that has already been processed in the run is not found. Given that this issue stems from the glob input, the user may not be aware that they have mucked up their glob expression, or that something has gone wrong with a file they expected the glob to match. My concern is also with the new source and forced extractions, which go back and use the images through the run: these could fail or give incorrect results if an image is unintentionally missing. Again, I can't quite remember how the images dataframe is reconstructed in add image mode, but I wouldn't be surprised if some part of it relies on the input from the 'add mode config file'.

marxide · 2022-03-17T20:41:51Z

> I kind of feel like there should be a refusal to run if a previous image that has already been processed in the run is not found.

I agree. I'm also not yet sure what the consequences of missing inputs are during a re-run but regardless I think it would be better if the pipeline raised an error.

I'm also now thinking about the restore run functionality and how that would work with this. Suppose a run using globs is completed, then new images are added to the filesystem without changing the run config, and the run is re-run and fails. A check is run during the restorepiperun command that compares the previous config inputs with the images.parquet.bak file:

vast-pipeline/vast_pipeline/management/commands/restorepiperun.py

Lines 63 to 80 in c681f96

    
    # check images match
    img_f_list = prev_config["inputs"]["image"]
    if isinstance(img_f_list, dict):
        img_f_list = [
            item for sublist in img_f_list.values() for item in sublist
        ]
    img_f_list = [os.path.basename(i) for i in img_f_list]
    prev_images = pd.read_parquet(
        bak_files['images'], columns=['id', 'name', 'measurements_path']
    )
    if sorted(prev_images['name'].tolist()) != sorted(img_f_list):
        raise CommandError(
            'Images in previous config file does not'
            ' match those found in the previous images.parquet.bak.'
            ' Cannot restore pipeline run.'
        )

I think this check would fail. The previous config is the same as the new config (it contains the same globs) but the parsed file list will be different since the filesystem changed. Perhaps checking that the parsed file list is a superset of the inputs given in images.parquet.bak would work?
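A minimal sketch of what such a superset check could look like, assuming the resolved inputs are a plain list of file paths and the previous image names are a plain list of base names. The helper name `missing_previous_images` is hypothetical and is not part of vast-pipeline.

```python
import os


def missing_previous_images(resolved_inputs, previous_names):
    """Return the previously processed image names absent from the resolved inputs.

    Hypothetical helper for illustration only; not part of vast-pipeline.
    An empty result means the resolved inputs are a superset of the
    previously processed images.
    """
    current_names = {os.path.basename(path) for path in resolved_inputs}
    return sorted(set(previous_names) - current_names)


# Extra files picked up by the glob are tolerated; missing ones are reported.
resolved = ["/data/epoch1/a.fits", "/data/epoch1/b.fits", "/data/epoch2/c.fits"]
print(missing_previous_images(resolved, ["a.fits", "b.fits"]))
print(missing_previous_images(resolved, ["a.fits", "z.fits"]))
```

With a check like this, the restore command could raise only when the returned list is non-empty, rather than requiring the two sorted lists to be identical.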

ajstewart · 2022-03-17T21:31:16Z

> I think this check would fail. The previous config is the same as the new config (it contains the same globs) but the parsed file list will be different since the filesystem changed. Perhaps checking that the parsed file list is a superset of the inputs given in images.parquet.bak would work?

Uff yes I think you're right, the glob input is really not handled very well in these modes is it 😬, my bad!

I think each of these methods needs an explicit image check that resolves any globs and, yes, confirms with a set comparison that all of the expected images are contained in the resolved config inputs for the respective mode.
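As a rough sketch of the glob-resolving step, assuming the image inputs are either a list of path strings or an epoch-keyed dict of such lists (the two shapes handled in the snippet quoted earlier), and that each entry may be a literal path or a glob expression. The function name `resolve_image_inputs` is hypothetical and the real pipeline config handling may differ.

```python
import glob


def resolve_image_inputs(image_inputs):
    """Flatten a list or epoch-keyed dict of paths/globs into a sorted file list.

    Hypothetical helper for illustration only; not part of vast-pipeline.
    Entries that match nothing on disk (including literal paths that do not
    exist) contribute no files, so a later comparison can flag them as missing.
    """
    if isinstance(image_inputs, dict):
        patterns = [p for sublist in image_inputs.values() for p in sublist]
    else:
        patterns = list(image_inputs)

    resolved = set()
    for pattern in patterns:
        # glob.glob returns only paths that currently exist on the filesystem
        resolved.update(glob.glob(pattern))
    return sorted(resolved)
```

Because the result reflects the filesystem at the time of the call, running it at the start of each mode gives a concrete file list that can be compared against the image names recorded for the run.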

run config diff check now considers number of images

e84eb7a


marxide added the bug and python labels Mar 16, 2022

marxide self-assigned this Mar 16, 2022

github-actions bot added this to In progress in Nimbus Production Mar 16, 2022

github-actions bot added this to In progress in Pipeline Backlog Mar 16, 2022

marxide added 3 commits March 17, 2022 13:55

updated changelog

9f3e1e7

moved pyfakefs from dev deps to regular deps

231c4b5

linter fix: remove whitespace

c681f96
