More flexible dvc.yaml parameterisation #6107
Comments
@skshetry, I tag you because AFAIK you work with yaml in DVC a lot :)
Note that the set-params are not the same as parametrized variables. They just happen to use
What should the correct behavior be? What you are specifying there is a null value as per yaml/json terms. There can be questions about how to represent it though.
@aguschin Have you tried to search through past issues and discussions for context? @skshetry Are there past issues or discussions that you could provide for context? I can try to find some, but I thought you might be more familiar. For example, AFAIK, the current templating was an intentional choice to add some limited flexibility without the complexity of a full templating language or Python API. Similarly, I think there are past discussions about how to automatically detect dependencies (like parameters) and avoid duplication.
Interesting point about passing params to a script.
I think we should still try to analyze how common for DVC users it would be (vs
I might be missing some context (although I have been reading other related issues) but I think that maybe this specific issue could be addressed within To elaborate a bit: The problem seems to be a good candidate to be solved with a Python API and In addition, it looks like the issue could be quite common in the training stages of an ML scenario while also being, at the low level, different depending on the ML repository/framework used. In this case So, given that
Related thread on Discord (to some extent): https://discordapp.com/channels/485586884165107732/563406153334128681/865257349555027969
The full command (including command line arguments) is already part of the stage dependencies. Maybe we could look into parsing and showing the arguments (or at least the full command) as part of Edit: To further explain, if we had better support for showing the command line arguments as dependencies, it wouldn't be necessary to also list these same values in the params file.
Another idea: any templated values in the command (anything using
@dberenbaum, templated values are automatically included in
$ dvc exp show --md --no-pager
| Experiment | Created | epochs |
|--------------|-----------|----------|
| workspace | - | 100 |
| master | 11:00 AM | 100 |
$ python -c 'from dvc.repo import Repo; print(Repo().params.show())'
{'': {'data': defaultdict(<class 'dict'>, {'custom-params.yaml': {'data': {'epochs': 100}}})}}
If I can proffer another potential solution, if DVC could specify a format for not only the dvc.yaml but also yaml files for the
I think it also should support a simple array like this:
Well, it should only be the specified paths, but I don't see how I can do it without listing each of them separately (unless I create another dir
Allow to use dictionaries as values for template interpolation but only inside the `cmd` key.
Given the following params.yaml and dvc.yaml:
```yaml
# params.yaml
dict:
  foo: foo
  bar: bar
```
```yaml
# dvc.yaml
stages:
  stage1:
    cmd: python script.py ${dict}
```
The dictionary will be unpacked with the following syntax:
```yaml
# dvc.yaml
stages:
  stage1:
    cmd: python script.py --foo foo --bar bar
```
Closes #6107
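The unpacking described in this commit message can be illustrated with a small standalone sketch. This is not DVC's actual implementation: the function names `unpack_dict_for_cmd` and `render_cmd` are hypothetical, the placeholder regex only handles top-level `${name}` keys, and quoting values with `shlex.quote` is an assumption made here for safety.

```python
import re
import shlex

# Matches simple placeholders such as ${dict}; nested keys are not handled.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def unpack_dict_for_cmd(values):
    """Turn {'foo': 'foo', 'bar': 'bar'} into '--foo foo --bar bar'."""
    parts = []
    for key, value in values.items():
        parts.append(f"--{key}")
        # Assumption: values are shell-quoted so spaces survive intact.
        parts.append(shlex.quote(str(value)))
    return " ".join(parts)


def render_cmd(template, params):
    """Substitute ${name} in a cmd string, unpacking dict values as flags."""

    def substitute(match):
        value = params[match.group(1)]
        if isinstance(value, dict):
            return unpack_dict_for_cmd(value)
        return shlex.quote(str(value))

    return _PLACEHOLDER.sub(substitute, template)


params = {"dict": {"foo": "foo", "bar": "bar"}}
print(render_cmd("python script.py ${dict}", params))
# python script.py --foo foo --bar bar
```

Restricting the unpacking to the `cmd` key, as the commit message states, keeps dictionary values from being silently stringified in other fields such as `deps` or `outs`.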
Context
One of the DVC usage scenarios we should consider is the following:
One example of this is the YOLOv5 repo. The GitHub repo only runs tests in CI, but we could certainly have a private fork with multiple CI/CD workflows that train, test, export and deploy models.
Problem
In this case the user could have scripts that run like this:
The scripts could have a lot of args (this train.py has about 20 of them, all handled by the Python argparse module). If I don’t want to modify my scripts yet, I’d need to duplicate all arguments in two places, params.yaml and dvc.yaml, thus having three slightly different copies of the arguments.
Indeed, this is a bad practice that creates the technical debt of maintaining three copies of the same thing and shouldn’t be used for too long, but as a user I would prefer to migrate to DVC step by step, without breaking anything in current workflows in the process. And if I’m not sure I want to migrate and just want to give DVC a try, I would want to keep all my current scripts unchanged so that I can remove DVC later if I decide not to use it.
Suggestion
Expand args dictionaries and lists from params in dvc.yaml so that the user can avoid duplicating every param in dvc.yaml:
which gets translated to
A few questions we should also consider:
1. Is there a use case when a user wants to expand a nested dict?
Example:
2. There are optional args like `--resume` which may or may not be supplied. We could support this as follows:
In case we want to run just “python train.py”
In case we want to run “python train.py --resume path-to-weights.pt”:
Right now the first use case is not supported at all:
Though params.yaml could be like
And the experiment could be run, but the “None” value will be used
That said, the last modification could break backwards compatibility if anyone uses the latter approach with None substitution, but it looks more like a bug than a solution, and we can replace this behaviour in the future.
3. Are there other examples of CLI parameters that scripts can take, but dvc.yaml / `dvc exp run` doesn't support now?
It would be great to gather them as well.
My perspective here is limited to Python and bash scripts, so I may be missing something important. I'm going to gather more examples and possible issues and add more information about them. Meanwhile, it would be great to hear what others think.
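To make question 2 concrete, here is a minimal sketch of one possible rule for optional args: a param whose value is null (`None` once the YAML is loaded) is dropped from the command instead of being rendered as the string "None". The helper name `args_from_params` is hypothetical and this is not how DVC currently behaves; it only illustrates the proposal.

```python
def args_from_params(params):
    """Build a list of CLI arguments, skipping params whose value is None."""
    args = []
    for key, value in params.items():
        if value is None:
            # Assumption: a null value means "do not pass this flag at all".
            continue
        args.extend([f"--{key}", str(value)])
    return args


# No weights to resume from: the flag disappears entirely.
print(["python", "train.py", *args_from_params({"resume": None})])
# ['python', 'train.py']

# Weights provided: the flag and its value are both passed.
print(["python", "train.py", *args_from_params({"resume": "path-to-weights.pt"})])
# ['python', 'train.py', '--resume', 'path-to-weights.pt']
```

Under this rule the same dvc.yaml could serve both the plain run and the `--resume` run, at the cost of changing what a null value means for anyone who relies on the current "None" substitution.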