Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plots: flexible dvc.yaml spec #7086

Closed
dberenbaum opened this issue Dec 3, 2021 · 15 comments · Fixed by #7477
Closed

plots: flexible dvc.yaml spec #7086

dberenbaum opened this issue Dec 3, 2021 · 15 comments · Fixed by #7477
Assignees
Labels
A: plots Related to the plots

Comments

@dberenbaum
Copy link
Contributor

YAML doesn't permit dup keys also even if we change that to arrays we will have the issue of duplication and discrepancy, i.e. one template under one key and another one under the second key. So plot names should be unique string, whichever you write in target I suppose and datafile should go as key. We may still use a key if datafile is absent, this will simplify a very common one file case. So:

plots:
- train_vs_val:
    x: epoch
    y: loss
    data: [train_loss.csv, val_loss.csv]


- val_f1.csv:
    x: epoch
    y: [f1_class_0, f1_class_1]
    y_label: f1
    # data is absent, using key value: val_f1.csv

# Separately plot since we have TWO plots, even though with the same data file
- scores_acc:
    x: epoch
    y: acc
    data: scores.csv
- scores_auc:
    x: epoch
    y: auc
    data: scores.csv

Originally posted by @Suor in #5980 (reply in thread)

@dberenbaum dberenbaum added the A: plots Related to the plots label Dec 3, 2021
@dberenbaum
Copy link
Contributor Author

@pared I think it makes sense to implement in reverse order of how it's listed above.

  1. Multiple plot specs from the same file:
# Separately plot since we have TWO plots, even though with the same data file
- scores_acc:
    x: epoch
    y: acc
    data: scores.csv
- scores_auc:
    x: epoch
    y: auc
    data: scores.csv

Requirements:

  • Have a way to define plots spec separate from a stage output.
  • Add an optional data field to specify file, in which case plot key is used as a title.
  • Update plots commands.
  • Nice to have: dvc plots add command to generate plots.
  1. Multiple series per axis:
- val_f1.csv:
    x: epoch
    y: [f1_class_0, f1_class_1]
    y_label: f1
    # data is absent, using key value: val_f1.csv

Requirements:

  • Enable x, y, x-axis, and y-axis fields to be arrays.
  • Show multiple lines/series in plots.
  • Support diffs of multi-series plots -- most likely as a facet grid.
  1. Multiple files within one plot:
- train_vs_val:
    x: epoch
    y: loss
    data: [train_loss.csv, val_loss.csv]

Requirements:

  • Enable data field to be an array.
  • Show multiple lines/series in plots and support diffs (similar to above).
  • Out of scope: x or y field arrays combined with data array. This can throw an error.

@pared
Copy link
Contributor

pared commented Dec 14, 2021

Question: can we provide parallel coordinates plots config alongside/inside plots spec?

@mattlbeck
Copy link
Contributor

Related: #6316

@dberenbaum
Copy link
Contributor Author

Question: can we provide parallel coordinates plots config alongside/inside plots spec?

IMO let's not worry about it as part of this issue.

@daavoo
Copy link
Contributor

daavoo commented May 9, 2022

Commenting here to not pollute #7477 with things that are not implementation-related.

The following syntax is currently supported in the P.R. and can cover all 3 use cases above:

plots:
  scores_acc:
    x: epoch
    y: 
      scores.csv: acc
  scores_auc:
    x: epoch
    y: 
      scores.csv: auc
val_f1:
    x: epoch
    y: 
      val_f1.csv: acc
      val_f1.csv: loss
    y_label: f1
- train_vs_val:
    x: epoch
    y: 
      train_loss.csv: loss
      val_loss.csv: loss

I'm missing the reasoning behind adding the data field. The above syntax looks more intuitive to me.

cc @dberenbaum @pared

@dberenbaum
Copy link
Contributor Author

dberenbaum commented May 12, 2022

@daavoo I find them about equally intuitive, although getting rid of data obviously is fewer fields to manage.

I can see two distinctions using data:

  1. It impacts the legend and axes (compare the left and right plots below). Should dvc try to find a common axis name and strip common field names (acc) from the legend in this case?

Screen Shot 2022-05-12 at 11 32 50 AM

  1. Should dvc support an x-axis from one file and y-axis from another file? For example:
plots:
  confusion:
    x: actual
    y: predicted
    data:
    - dir/actual.csv
    - dir/preds.csv
    template: confusion

Currently, this will produce two separate confusion matrices, which isn't what I want.

It won't work at all AFAIK without data, but maybe it could be supported like this:

plots:
  confusion:
    x: 
      dir/actual.csv: actual
    y:
      dir/preds.csv: predicted
    template: confusion

Thoughts @pared @daavoo?

@daavoo
Copy link
Contributor

daavoo commented May 13, 2022

although getting rid of data obviously is fewer fields to manage

And less code to maintain.

it impacts the legend and axes

I think we should just remove data and iterate on the UI for legends.

but maybe it could be supported like this:

plots:
confusion:
x:
dir/actual.csv: actual
y:
dir/preds.csv: predicted
template: confusion

I would prefer this new syntax.

Thoughts @pared @daavoo?

I would vote for getting rid of data.

About the x-axis flexibility, I don't really know. How hard would it be if we focus on the explicit {path}:{entry} syntax @pared ?
If it changes too many things, we could leave all this as a future follow-up. There are already a lot of new use cases we are covering with the current state of the P.R.

@pared
Copy link
Contributor

pared commented May 13, 2022

AFAIR it should be not that big of a problem, we already have underlying logic. Just need to modify it a little bit. At least for y. x did not have this logic till now but still, I think we should be able to handle that. Currently I need to kind of search for x and handle situations when the particular column is not in the targeted files. With explicit x that would be easier, I think.

@dberenbaum
Copy link
Contributor Author

So how does this look for updated syntax of the original comment at the top of this issue?

plots:
- train_vs_val:
    x: 
    - train_loss.csv:epoch
    - val_loss.csv:epoch
    y: 
    - train_loss.csv:loss
    - val_loss.csv:loss

- val_f1.csv:
    x:
    - epoch
    - epoch
    y:
    - f1_class_0
    - f1_class_1
    y_label: f1
    # data source is absent, using key value: val_f1.csv

# Separately plot since we have TWO plots, even though with the same data file
- scores_acc:
    x: scores.csv:epoch
    y: scores.csv:acc
- scores_auc:
    x: scores.csv:epoch
    y: scores.csv:auc

It is more verbose, but I like that it's more clear and explicit. My only question would be to what extent to allow shortcuts and try to infer expected behavior. For example:

  • If you specify the data source for only one of x or y, should dvc try to infer it for the other?
  • In the val_f1.csv example, do you actually need to specify the x value twice?

I don't mind requiring everything to be listed explicitly for now since it does make behavior more clear.

@skshetry
Copy link
Member

skshetry commented May 14, 2022

Another suggestion, that clearly separates plot level metadata with props, it's a bit verbose though.

plots:
  train_vs_val:
    title: title
    props:
    - path: train_loss.csv  # or, train_loss.csv:epoch as proposed above
      key: epoch
      axis: x
    - path: val_loss.csv
      key: epoch
      axis: x
      label: f1

@dberenbaum
Copy link
Contributor Author

dberenbaum commented May 16, 2022

@pared and I discussed and came up with the following action points:

  • Remove data field support.
  • Add new issue for x path support.
  • Add new issue for CLI support.
  • Check that syntax is consistent with params.

Edit:

@pared We forgot to add in cleaning up the legend and labels (see the first point in #7086 (comment)).

@dberenbaum
Copy link
Contributor Author

dberenbaum commented May 16, 2022

@pared The example above in #7086 (comment) uses file paths as dict keys under y, which doesn't work in one of the examples:

val_f1:
    x: epoch
    y: 
      val_f1.csv: acc
      val_f1.csv: loss
    y_label: f1

This will raise a duplicate keys error. Should it instead support a list of values here (note that this isn't supported today)?

plots:
  scores_acc:
    x: epoch
    y: 
    - scores.csv: 
      - acc
  scores_auc:
    x: epoch
    y: 
    - scores.csv:
      - auc
val_f1:
    x: epoch
    y: 
    - val_f1.csv:
      - acc
    - val_f1.csv:
      - loss
    y_label: f1
train_vs_val:
    x: epoch
    y: 
    - train_loss.csv:
      - loss
    - val_loss.csv:
      - loss

This would be consistent with how y works without paths since it expects a list if there are multiple y values, and it would be similar to how stage parameters are handled:

stages:
  train:
    cmd: python train.py
    params:
    - config.yaml:
      - alpha
      - beta
    outs:
    - model.h5

@pared
Copy link
Contributor

pared commented May 19, 2022

@dberenbaum yeah, that makes sense, we can also support both, support for dict is in place anyway, so...

@pared
Copy link
Contributor

pared commented May 19, 2022

@skshetry This is definitely very explicit. The question is whether it won't be too much for users to type. I think we should be going for the least amount of work needed for user to input the data. On the other hand that implies a lot of automatic behavior (like infering x data sources knowing y) which might be to hard to grasp.

@pared
Copy link
Contributor

pared commented May 19, 2022

@daavoo
regarding this:

val_f1:
    x: epoch
    y: 
      val_f1.csv: acc
      val_f1.csv: loss
    y_label: f1

It can be achieved with:

val_f1:
    x: epoch
    y: 
      val_f1.csv: [acc, loss]
      y_label: f1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: plots Related to the plots
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants