Stages with conditional dependency #10418

Open
EvanKomp opened this issue May 7, 2024 · 4 comments
Labels: A: pipelines (related to the pipelines feature), feature request (requesting a new feature), p3-nice-to-have (it should be done this or next sprint)

Comments

@EvanKomp

EvanKomp commented May 7, 2024

Correct me if this already exists. I see some merges from 2018 that may be related (#646), but I cannot find any examples.

Essentially, I have a stage that prepares a model, and I would like to choose among several models via a parameter. Each model has a potentially unique preprocessing step, but some models share an additional preprocessing step.

For example, the param model modulates the stage predict, which for some models requires no previous stage but for others requires a stage preprocess. How can I ensure that preprocess runs for the models that require it, without rerunning it unnecessarily, since it is expensive? If I also condition the preprocess stage on the param model, it will rerun even when I switch between models for which it does not need to be rerun.
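To make the rerun problem concrete, here is a toy sketch of parameter-based invalidation. It is not DVC's implementation; `stage_signature` and `needs_rerun` are hypothetical helpers, and the only assumption is that a stage is considered changed when any parameter it declares changes value.

```python
import hashlib
import json


def stage_signature(tracked_params, params):
    """Hash only the parameter values that a stage declares (toy model)."""
    relevant = {name: params[name] for name in sorted(tracked_params)}
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def needs_rerun(tracked_params, old_params, new_params):
    """A stage is stale if the signature of its declared params changed."""
    return stage_signature(tracked_params, old_params) != stage_signature(
        tracked_params, new_params
    )


old = {"model": "A"}
new = {"model": "B"}  # A and B share the same preprocessing

# preprocess declared with the param: switching A -> B invalidates it
print(needs_rerun(["model"], old, new))  # True
# preprocess declared without the param: it stays valid
print(needs_rerun([], old, new))  # False
```

In this toy model, tracking the param on preprocess forces a rerun for A -> B even though both models need the same preprocessed data, which is the behavior I want to avoid.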

Thanks for any wisdom.

@dberenbaum
Contributor

Could you provide a simplified dvc.yaml to clarify how your pipeline is set up?

dberenbaum added the "awaiting response" label May 8, 2024
@EvanKomp
Author

EvanKomp commented May 8, 2024

@dberenbaum

stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/       # THIS ONLY NEEDS TO BE A DEPENDENCY WHEN `model_type` IS IN [A, B]
    outs:
      - ./data/predictions/

@dberenbaum
Contributor

Unfortunately, I can't think of a good way to do it without creating separate stages/pipelines. If you have some idea of what you would want it to look like, feel free to suggest it here.

dberenbaum added the "feature request", "p3-nice-to-have", and "A: pipelines" labels and removed the "awaiting response" label May 10, 2024
@EvanKomp
Author

Affirmative. Thanks for your work. I think extending the YAML with a nested key on the dependency, as you would with a cache tag, would be best, e.g.:

stages:
  preprocess:
    cmd: ./prepare.sh
    outs:
      - ./data/preprocessing/
  predict:
    cmd: ./predict.sh
    params:
      - model.model_type       # One of A, B, C
    deps:
      - ./data/preprocessing/:
          # conditioning syntax
          conditions: # these are executable strings with params as local namespace
            - 'model.model_type in ["A", "B"]'
    outs:
      - ./data/predictions/
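
A rough sketch of how such condition strings might be evaluated, assuming the params are loaded as a nested dict. Nothing here is existing DVC API: `to_namespace` and `dependency_is_active` are hypothetical names, and the `conditions` key is only the syntax proposed above.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively wrap dicts so nested params read as model.model_type."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


def dependency_is_active(conditions, params):
    """Return True only if every condition string is truthy for these params."""
    local_names = {key: to_namespace(val) for key, val in params.items()}
    # eval() on config strings is unsafe for untrusted input. Removing the
    # builtins narrows what an expression can reach; it is not a sandbox.
    return all(
        bool(eval(cond, {"__builtins__": {}}, local_names)) for cond in conditions
    )


conditions = ['model.model_type in ["A", "B"]']
for model_type in ("A", "B", "C"):
    params = {"model": {"model_type": model_type}}
    print(model_type, dependency_is_active(conditions, params))
```

With this, ./data/preprocessing/ would count as a dependency of predict for models A and B but not for C. A real implementation would probably want a restricted expression parser rather than eval.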
