Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill anchors instead of quoted anchors #28

Closed
WilliamHarvey97 opened this issue Apr 6, 2022 · 14 comments · Fixed by #145
Closed

Fill anchors instead of quoted anchors #28

WilliamHarvey97 opened this issue Apr 6, 2022 · 14 comments · Fixed by #145

Comments

@WilliamHarvey97
Copy link

WilliamHarvey97 commented Apr 6, 2022

In my custom vega-lite template, I tried do to something like this to compute the sum of squared residuals:

           "transform": [
                {
                    "joinaggregate": [
                        {
                            "op": "mean",
                            "field": "lab",
                            "as": "mean_y"
                        }
                    ]
                },
                {
                    "calculate": "pow(datum.<DVC_METRIC_Y> - datum.<DVC_METRIC_X>,2)",
                    "as": "SR"
                },
                {
                    "joinaggregate": [
                        {
                            "op": "sum",
                            "field": "SR",
                            "as": "SSR"
                        }
                    ]
                }
            ]

However, it seems impossible to do this because it looks like dvc-render replace only a quoted anchor like f'"{cls.anchor(name)}"' instead of just the anchor f'{cls.anchor(name)}'

return f'"{cls.anchor(name)}"'

I didn't dig a lot into this, but I was just wondering if this could just be this way so we could use anchors on datum?

Thank you!

@dberenbaum
Copy link
Contributor

@pared Thoughts on this?

@pared
Copy link
Contributor

pared commented Apr 11, 2022

The use case makes perfect sense. However, there is a certain reason why we quote the anchors, and this is because we treat the templates as dictionaries (not only in code, but for example in integrations, like vscode). Having quoted anchors allows us to parse and dump from/to json. We could think about involving some logic regarding the quotes. In the case provided by @WilliamHarvey97 we could detect that anchors are used within a string and it's safe to replace them without quotes.

@dberenbaum
Copy link
Contributor

@pared Maybe we aren't being clear about whether templates are:

  1. JSON documents where anchors are JSON values to be replaced OR
  2. Text documents where anchors are strings to be replaced.

Is VS Code expecting each anchor to be injected as a value to a JSON key or as a string into a text doc? cc @mattseddon

@pared
Copy link
Contributor

pared commented Apr 11, 2022

@dberenbaum the main problem is that templates are both. When replacing the anchors we treat them as text files. Inside DVC and the integrations we treat them as JSON dictionaries, for convenient parsing. Also, as I recall we might be relying on the current behavior in some places (@mattseddon don't you change the colors of the plots somewhere?).

@mattseddon
Copy link
Member

@dberenbaum we are currently holding the template as a string so that we can do a String.replace and coerce the string into JSON like this:

  JSON.parse(
    template.replace('"<DVC_METRIC_DATA>"', JSON.stringify(datapoints))
  ) as TopLevelSpec

@pared we update the colors by overwriting that part of the encoding with:

  encoding: {
    color: {
      legend: { disable: true },
      scale: colorScale
    }
  }

Where colorScale has this type: { domain: string[]; range: string[] }.

I am open to this changing we just need to be careful not to break anything.

Note: the way that our code is currently setup could be improved as we actually convert the initial CLI output into a JSON object and then back to a string, and then back to JSON once we know which datapoints we want to display.

@pared
Copy link
Contributor

pared commented Apr 12, 2022

So I guess we have two ways to go:

  1. Stop treating the templates as json dictionaries - that would require synchronous changes for DVC, Studio and VSCode, makes debugging a bit harder
  2. Add functionality to fill anchors that are not entrenched within quotes, it would require a bit of logic inside template but should be possible to do.

There still exists possibility to treat templates only as dictionaries. In this way, we would remove anchors and would need to provide path within dict and data to fill it with. Mentioning it only for historical accuracy, as it would require users to learn vega to create their own templates.

@daavoo
Copy link
Contributor

daavoo commented Apr 12, 2022

Related #23 ?

@dberenbaum
Copy link
Contributor

There still exists possibility to treat templates only as dictionaries. In this way, we would remove anchors and would need to provide path within dict and data to fill it with. Mentioning it only for historical accuracy, as it would require users to learn vega to create their own templates.

@pared Could you explain more or show an example of what templates would look like without anchors?

Also, it might be fine to not allow use cases like pow(datum.<DVC_METRIC_Y> - datum.<DVC_METRIC_X>,2) as long as we clearly document it. We may need to update the docs anyway to clarify current behavior (are the anchors <DVC_METRIC_X> or "<DVC_METRIC_X>")?

One more thought: it seems like only <DVC_METRIC_DATA> is ever anything other than a string, so maybe it makes sense to do a simple replace without quotes for everything other than that anchor?

@pared
Copy link
Contributor

pared commented Apr 14, 2022

One more thought: it seems like only <DVC_METRIC_DATA> is ever anything other than a string, so maybe it makes sense to do a simple replace without quotes for everything other than that anchor?

That would make sense.

Regarding the no-anchors approach:
Every templates is simply a dict.
So, if we know how vega works, we will know where should we put particular fillers. eg. when inserting data we would need to dovega_dict["data"] = {DVC_PROVIDED_DATA} in the filling code.

The issue gets a bit complicated when talking about different types of plots. For example, in our simplest plot, we can provide configuration for x and y axis and color as a dictionary with proper structure. This gets a bit complicated in more advanced templates, like confusion matrix, where we have to provide x and y for some data transformations that we perform. When creating it, it seemed much easier to ask users to create their own template (with their own data) and in the end just replace their data references with proper anchors.

@pared
Copy link
Contributor

pared commented Apr 22, 2022

BTW @WilliamHarvey97 what kind of plot are you trying to implement? Maybe we could approach it in some other way.

@WilliamHarvey97
Copy link
Author

WilliamHarvey97 commented Apr 23, 2022

I was implementing an actual v.s. predicted plot. I wanted to add MAPE, RMSE, R2 and other useful metrics to show at the top of the plot. We often share our results within my team and I found it useful to include these metrics at the top so we don't need to dig further.

In the end, we chose to version png image. We wanted to use the color of each point to encode something else and in the current implementation of dvc-render, the color is reserved for version encoding. I prefer the side-to-side comparison of png.

@pared
Copy link
Contributor

pared commented Apr 25, 2022

@WilliamHarvey97 Got it. Well, what you are trying to achieve should be the scope of iterative/dvc#7086. Would you mind sharing some example of your use case? I am preparing some backward incompatible changes for dvc 3.0 and knowing more use cases might help me define it better.

@daniel-falk
Copy link
Contributor

This would be a really nice feature to support since a lot of use cases depends on the calculate transform. In my case I wanted to normalize across all the elements in each revision. I ended up doing a "hack" and using the sum operator of the joinaggregate transform grouped over a unique combination of values, this way I could rename a field specified with an anchor to a fixed string which I then could use in the calculate transform. Like this:

  "transform": [
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "<DVC_METRIC_Y>",
        "as": "count"
      }],
      "groupby": ["rev"]
    },
    {
      "joinaggregate": [{
          "op": "sum",
          "field": "<DVC_METRIC_Y>",
          "as": "variable"
      }],
      "groupby": ["<DVC_METRIC_X>", "rev"]
    },
    {
      "calculate": "datum.variable/datum.count",
      "as": "variable_as_percent"
    }
  ],

Here is a working example in vega-lite:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "values": [
      {
        "aggregate": 4208.201290480213,
        "prop": "prop1",
        "rev": "workspace"
      },
      {
        "aggregate": 1841.9859846942586,
        "prop": "prop2",
        "rev": "workspace"
      },
      {
        "aggregate": 1051.8727099035762,
        "prop": "prop3",
        "rev": "workspace"
      },
      {
        "aggregate": 740.6659046177122,
        "prop": "prop4",
        "rev": "workspace"
      },
      {
        "aggregate": 483.1788367367855,
        "prop": "prop5",
        "rev": "workspace"
      },
      {
        "aggregate": 567.9014562535242,
        "prop": "prop6",
        "rev": "workspace"
      },
      {
        "aggregate": 2183.434726669379,
        "prop": "prop7",
        "rev": "workspace"
      },
      {
        "aggregate": 2208.201290480213,
        "prop": "prop1",
        "rev": "HEAD"
      },
      {
        "aggregate": 1841.9859846942586,
        "prop": "prop2",
        "rev": "HEAD"
      },
      {
        "aggregate": 1051.8727099035762,
        "prop": "prop3",
        "rev": "HEAD"
      },
      {
        "aggregate": 740.6659046177122,
        "prop": "prop4",
        "rev": "HEAD"
      },
      {
        "aggregate": 483.1788367367855,
        "prop": "prop5",
        "rev": "HEAD"
      },
      {
        "aggregate": 567.9014562535242,
        "prop": "prop6",
        "rev": "HEAD"
      },
      {
        "aggregate": 2083.434726669379,
        "prop": "prop7",
        "rev": "HEAD"
      }
    ]
  },
  "mark": "line",
  "transform": [
    {
      "joinaggregate": [{"op": "sum", "field": "aggregate", "as": "count"}],
      "groupby": ["rev"]
    },
    {
      "joinaggregate": [{"op": "sum", "field": "aggregate", "as": "variable"}],
      "groupby": ["prop", "rev"]
    },
    {"calculate": "datum.variable/datum.count", "as": "variable_as_percent"}
  ],
  "encoding": {
    "x": {"field": "prop"},
    "y": {"field": "variable_as_percent", "type": "quantitative"},
    "color": {"field": "rev"}
  }
}

@pared
Copy link
Contributor

pared commented May 11, 2022

@daniel-falk
Good idea with the workaround. I will try to look into supporting this once iterative/dvc#7086 is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants