Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More flexible dvc.yaml parameterisation #6107

Closed
aguschin opened this issue Jun 3, 2021 · 13 comments · Fixed by #7907
Closed

More flexible dvc.yaml parameterisation #6107

aguschin opened this issue Jun 3, 2021 · 13 comments · Fixed by #7907
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion

Comments

@aguschin
Copy link
Contributor

aguschin commented Jun 3, 2021

Context

One of DVC usage scenarios we should consider is the following:

  1. User has a big and complex real-life repo and workflows around it (one example is CI/CD)
  2. User wants to apply DVC without modifications to his current scripts, because that will trigger code changes in subsequent workflows, which the user don’t want to do right now, but rather later, when DVC will be integrated and will prove it is useful for the user.

One example for this is the YOLOv5 repo. The github repo has only testing in CI, but for sure we can have a private fork and multiple CI/CD workflows which train, test, export and deploy models.

Problem

In this case user could have scripts which run like this:

python train.py --img 128 --batch 16 --weights weights/${{ matrix.model }}.pt --cfg models/${{ matrix.model }}.yaml --epochs 1 --device $di

The scripts could have a lot of args (this train.py has about 20 of them, all handled by Python argparse module). If I don’t want to modify my scripts yet, I’d need to duplicate all arguments in two places: in params.yaml and in dvc.yaml, thus having three slightly different copies of the arguments.

Indeed, this is a bad practice that creates a technical debt of maintaining three copies of the same and shouldn’t be used for too long, but as a user I would prefer to migrate to DVC step by step, not breaking anything in current workflows in the process. And if I’m not sure I want to migrate and just want to give DVC a try I would want to keep all my current scripts unchanged to remove DVC later if I won’t be planning on using it.

Suggestion

Expand args dictionary and lists from params in dvc.yaml to allow user to avoid duplicating every param in dvc.yaml:

# params.yaml
train-params:
  param1: 1
  param2: 2
  param3: [value1, value2]


# dvc.yaml
stages:
  train:
    cmd: python train.py ${train-params}

which gets translated to

python train.py --param1 1 --param2 2 --param3 value1 value2

Few questions we should also consider:

1. Is there a use case when a user wants to expand a nested dict?

Example:

# params.yaml
train-params:
  param1:
    nested-param: value

# dvc.yaml
stages:
  train:
    cmd: python train.py ${train-params}
2. There are optional args like `--resume` which could be supplied or not. We could support this like following:
# dvc.yaml
stages:
  train:
    deps:
      - train.py
    cmd: python train.py --resume ${resume}

In case we want to run just “python train.py”

$ dvc exp run -S resume= 
> python train.py

# params.yaml
resume:

In case we want to run “python train.py --resume path-to-weights.pt”:

$ dvc exp run -S resume=path-to-weights.pt
> python train.py --resume path-to-weights.pt

# params.yaml
resume: path-to-weights.pt

Right now the first use case is not supported at all:

$ dvc exp run -S param=  
ERROR: Must provide a value for parameter 'param'

Though params.yaml could be like

# params.yaml
param:

And experiment could be run, but “None” value will be used

$ dvc exp run
> python train.py None

That said, the last modification could break backwards compatibility if anyone uses the later approach with None substitution, but it looks more like a bug, then a solution, and we can replace this behaviour in the future.

3. Are there other examples of CLI parameters that scripts can take, but dvc.yaml / `dvc exp run` doesn't support now?

Would be great to gather them is well.

My perspective here is limited to python and bash scripts, so I can miss something important. I'm going to gather more examples and possible issues and add more information about them. Meanwhile, it would be great to hear what others think.

@aguschin aguschin added the discussion requires active participation to reach a conclusion label Jun 3, 2021
@aguschin
Copy link
Contributor Author

aguschin commented Jun 3, 2021

@skshetry, I tag you because AFAIK you work with yaml in DVC a lot :)

@skshetry
Copy link
Member

skshetry commented Jun 3, 2021

  1. We have discussed this before. It's very opinionated to assume it to be --flag=value. What if some user wants it in json to pass to jq for example? So, it's hard to make this decision for the users.

  2. Could you please create a separate issue for this? It's an experiments related issue, stemming from the checks here:

    raise InvalidArgumentError(
    f"Must provide a value for parameter '{param_name}'"
    )

Note that the set-params are not the same as parametrized variables. They just happen to use params.yaml as a default.

And experiment could be run, but “None” value will be used

What should the correct behavior be? What you are specifying there is a null value as per yaml/json terms. There can be questions about how to represent it though.
The unambiguous thing to do is to set empty-strings ''.

  1. There were a few discussions on supporting lists on cmd. But we did not implement it because there might be different representations of the list that the user might need (comma-separated, space-separated, python's list repr, etc), and I did not want to make cmd be a special case (you can use parametrization anywhere). 🙂

@skshetry skshetry added the A: templating Related to the templating feature label Jun 3, 2021
@dberenbaum
Copy link
Contributor

@aguschin Have you tried to search through past issues and discussions for context?

@skshetry Are there past issues or discussions that you could provide for context? I can try to find some, but I thought you might be more familiar.

For example, AFAIK, the current templating was an intentional choice to add some limited flexibility without the complexity of a full templating language or python API. Similarly, I think there are past discussions about how to automatically detect dependencies (like parameters) and avoid duplication.

@shcheklein
Copy link
Member

Interesting point about passing params to a script.

We have discussed this before. It's very opinionated to assume it to be --flag=value. What if some user wants it in json to pass to jq for example? So, it's hard to make this decision for the users.

I think we still should till try to analyze how common for DVC users it would be (vs json to jq which is quite artificial compared to YOLO5 repo). It might turn out that but giving this option (not making it default) we can simplify a lot of ML scenarios.

@daavoo
Copy link
Contributor

daavoo commented Jun 25, 2021

I might be missing some context (although I have been reading other related issues) but I think that maybe this specific issue could be addressed within dvclive instead of (or in addition to) trying to find a solution as part of dvc params.

To elaborate a bit:

The problem seems to be a good candidate to be solved with a python API and dvclive is a separate library focused on providing Python APIs so it doesn't have the "restrictions" of being language agnostic, like it is the case for dvc.

In addition, it looks like the issue could be quite common in the training stages of a ML Scenario while also being, at the low level, different depending on the ML repository/framework used.

In this case Yolov5 uses argparse but other ML repositories/frameworks could use other options or even custom configuration files like it is the case for detectron2 or mmdetection, just to give examples of frameworks focused on object detection models.

So, given that dvclive is intended to eventually further expand it's integrations with ML frameworks/repositories, maybe this dvc params<>training args link could be solved as part of each individual dvclive<>ML framework/repository integration (i.e. in this case a dvclive<>yolov5)? Does this make sense?

@daavoo
Copy link
Contributor

daavoo commented Jul 16, 2021

Thread on discord related (to some extent) :

https://discordapp.com/channels/485586884165107732/563406153334128681/865257349555027969

@dberenbaum
Copy link
Contributor

dberenbaum commented Jan 6, 2022

The full command (including command line arguments) are already part of the stage dependencies. Maybe we could look into parsing and showing the arguments (or at least the full command) as part of dvc exp show? These could potentially be seen as either implicit parameters or another type of dependency.

Edit: To further explain, if we had better support for showing the command line arguments as dependencies, it wouldn't be necessary to also list these same values in the params file.

@dberenbaum
Copy link
Contributor

Another idea: any templated values in the command (anything using ${} syntax) could be automatically included as parameters. This at least avoids having to repeat values in cmd and params in dvc.yaml.

@skshetry
Copy link
Member

skshetry commented Jan 7, 2022

@dberenbaum, templated values are automatically included in params show/diff/exp show results.

# custom-params.yaml
epochs: 100
seed: 20170428

# dvc.yaml
vars:
- custom-params.yaml
stages:
  stage1:
    cmd: ./script.py --epochs ${epochs}
$ dvc exp show --md --no-pager
| Experiment   | Created   | epochs   |
|--------------|-----------|----------|
| workspace    | -         | 100      |
| master       | 11:00 AM  | 100      |

$ python -c 'from dvc.repo import Repo; print(Repo().params.show())'
{'': {'data': defaultdict(<class 'dict'>, {'custom-params.yaml': {'data': {'epochs': 100}}})}}

@QuantumPlumber
Copy link

If I can proffer another potential solution, if DVC could specify a format for not only the dvc.yaml but also yaml files for the -d and -p arguments for dvc run. That way, quite complex dependencies and parameters could be created with utility functions by the helper classes/scripts, and re-used for individual stages, without having to create extremely verbose dvc.yaml files.

@Rusteam
Copy link

Rusteam commented Feb 11, 2022

I think it also should support a simple array like this:
(Similar to func(*[a,b,c,d]) in python or [...array, new_time, new_item2] in js.

When we run foreach .. do ... for items in an array inside dvc.yaml , we might want to join results of each do into one. It would be nice to be able to unroll the array instead of adding each item separately.

Now I do this:

vars:
  - datasets:
     - one
     - two
     - three

stages:
  prep:
    foreach: ${datasets}
    do:
      cmd: python main.py prep ${item}
      outs:
       - data/processed/{item}
  join:
    cmd: python main.py merge ${datasets[0]} ${datasets[1]} ${datasets[2]}
    deps:
      - data/processed/${datasets[0]}
      - data/processed/${datasets[1]}
      - data/processed/${datasets[2]}

Obviously this leads to errors when we forget to update join stage after updating datasets variable.

We can have an option to unroll an array like this:

join:
    cmd: python main.py merge ${...datasets}
    deps:
      -  data/processed/

@dberenbaum
Copy link
Contributor

Thanks, @Rusteam! Why do you change deps between your two examples? Is your data dependency the entire data/processed/ directory or only the specified dataset paths? Your request may also be related to #7151 if you need to specify each dataset dependency independently.

@Rusteam
Copy link

Rusteam commented Feb 12, 2022

well, it should only be the specified paths but I don't see how can I do it without listing each of them separately (unless i create another dir data/processed/dataset) and put all three there.

daavoo added a commit that referenced this issue Jun 16, 2022
Allow to use dictionaries as values for template interpolation but only inside the `cmd` key.

Given the following params.yaml and dvc.yaml:

```yaml
# params.yaml
dict:
  foo: foo
  bar: bar
```

```yaml
# dvc.yaml
stages:
  stage1:
    cmd: python script.py ${dict}
```

The dictionary will be unpacked with the following syntax:

```yaml
# dvc.yaml
stages:
  stage1:
    cmd: python script.py --foo foo --bar bar
```

Closes #6107
@daavoo daavoo linked a pull request Jun 16, 2022 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: templating Related to the templating feature discussion requires active participation to reach a conclusion
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants