Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable uri-reference format check in jsonsschema #2771

Merged
merged 6 commits into from Jan 4, 2023

Conversation

binste
Copy link
Contributor

@binste binste commented Dec 25, 2022

Fix for #2705 and #2767

Problem

Since 4.13.0, jsonschema does format validation by default. If it is installed with the format extra dependencies, this leads to errors for some of the Vega-Lite $ref values such as #/definitions/ValueDefWithCondition<MarkPropFieldOrDatumDef,(Gradient|string|null)> as they are not proper URIs as characters such as < are not encoded. At one point, Vega-Lite did do URL encoding, see vega/vega-lite#5838, but this was reverted in vega/vega-lite#5869.

Reproduce

Install jsonschema[format]>=4.13 and run

import altair as alt
from vega_datasets import data

source = data.movies.url

pts = alt.selection(type="point", encodings=['x'])

rect = alt.Chart(data.movies.url).mark_rect().encode(
    alt.X('IMDB_Rating:Q', bin=True),
    alt.Y('Rotten_Tomatoes_Rating:Q', bin=True),
    alt.Color('count()',
        scale=alt.Scale(scheme='greenblue'),
        legend=alt.Legend(title='Total Records')
    )
)

circ = rect.mark_point().encode(
    alt.ColorValue('grey'),
    alt.Size('count()',
        legend=alt.Legend(title='Records in Selection')
    )
).transform_filter(
    pts
)

bar = alt.Chart(source).mark_bar().encode(
    x='Major_Genre:N',
    y='count()',
    color=alt.condition(pts, alt.ColorValue("steelblue"), alt.ColorValue("grey"))
).properties(
    width=550,
    height=200
).add_params(pts)

chart = alt.vconcat(
    rect + circ,
    bar
).resolve_legend(
    color="independent",
    size="independent"
)
chart

Proposed solution

As the URIs are already not encoded in the Vega-Lite schema, we do not want to change the URIs. Therefore, we need to disable the format check introduced in jsonschema. This PR introduces a new function validate_jsonschema which serves as a drop-in replacement for all calls to jsonschema.validate. It does two things:

  • Temporarily remove the uri-reference check before calling jsonschema.validate
  • Ensure that the same validator (Draft7Validator) is used everywhere in the Altair codebase. This was not the case before as some calls to jsonschema.validate did not pass a validator class which leads jsonschema to fall back on Draft202012Validator.

Tested with jsonschema versions 3, 4.16.0, and 4.17.3. Don't expect issues for versions in between as the FormatChecker class already had the checker attribute as a dictionary in version 3 https://github.com/python-jsonschema/jsonschema/blob/v3.0.0/jsonschema/_format.py#L34

Not directly related to the issue above but I also cleaned up an old reference to draft 4 along with some unused functionality, see 4b6a2d5

@binste binste force-pushed the adapt_to_new_jsonsschema_version branch from 8f35a96 to 4b6a2d5 Compare December 25, 2022 11:02
@binste
Copy link
Contributor Author

binste commented Dec 25, 2022

To spot similar cases in the future, we could potentially extend the test pipeline to install the [format] extra dependencies for jsonschema as else some of the format checks won't run. What do you think?

@binste binste force-pushed the adapt_to_new_jsonsschema_version branch from 4b6a2d5 to 95cf206 Compare December 25, 2022 11:53
@mattijn
Copy link
Contributor

mattijn commented Dec 27, 2022

Thanks @binste, can you check what happens if you parse the example of vega/vega-lite#5838 (comment) with jsonschema 4.13.0?

import jsonschema

schema = {
    "definitions": {
        "Foo<hello>": {"type": "string"}
    },
    "properties": {
        "x": {"$ref": "#/definitions/Foo<hello>"},
        "y": {"$ref": "#/definitions/Foo%3Chello%3E"}
    }
}

spec = {"x": "x value", "y": "y value"}

jsonschema.validate(spec, schema)

From what I understand it's not the grammar that is wrong, but the parsers that cannot handle special characters. Could you verify this?

@binste
Copy link
Contributor Author

binste commented Dec 28, 2022

Running that example with jsonschema 4.13 gives the following error

ValueError                                Traceback (most recent call last)
File ~/.local/lib/python3.11/site-packages/jsonschema/_format.py:135, in FormatChecker.check(self, instance, format)
    134 try:
--> 135     result = func(instance)
    136 except raises as e:

File ~/.local/lib/python3.11/site-packages/jsonschema/_format.py:360, in is_uri_reference(instance)
    359     return True
--> 360 return rfc3987.parse(instance, rule="URI_reference")

File ~/.local/lib/python3.11/site-packages/rfc3987.py:462, in parse(string, rule)
    461 if not m:
--> 462     raise ValueError('%r is not a valid %r.' % (string, rule))
    463 if REGEX:

ValueError: '#/definitions/Foo<hello>' is not a valid 'URI_reference'.

The above exception was the direct cause of the following exception:

SchemaError                               Traceback (most recent call last)
Cell In[1], line 15
      3 schema = {
      4     "definitions": {
      5         "Foo<hello>": {"type": "string"}
...
Failed validating 'format' in metaschema['allOf'][1]['properties']['properties']['additionalProperties']['$dynamicRef']['allOf'][0]['properties']['$ref']:
    {'format': 'uri-reference', 'type': 'string'}

On schema['properties']['x']['$ref']:
    '#/definitions/Foo<hello>'

The "URI_reference" format check which fails here is the one that would be disabled with this PR.

In my limited understanding, URIs (as they are used as values for "$ref") should be percent-encoded according to RFC3986, e.g. "<" becomes "%3C".
I don't know why this is not done in the Vega-Lite schema, maybe some downstream tools don't work well with this (there is a mention of this in vega/vega-lite#5838) or it's simply to improve the readability of the schema for users.
In any case, as schemas which are valid for Vega-Lite are not valid for the new URI-reference format check in jsonschema, I think we need to disable it.
vega/vega-lite#5838 was about the code generation for Altair not working when Vega-Lite briefly introduced percent-encoded URIs as the URIs had to be decoded to get valid Python identifiers in Altair.

@mattijn
Copy link
Contributor

mattijn commented Dec 28, 2022

Thank you for the clear comment @binste.

When reading section 2.2 of the linked rfc-standard above, I tend to agree with you that < is not in the reserved character set and therefor should be encoded.

Since this is about the vega-lite scheme, @domoritz, could you reflect on this as well?

@Julian
Copy link

Julian commented Jan 1, 2023

Hi! (Sorry to hear something broke here).

Just to clarify after seeing this PR cross-linked:

Since 4.13.0, jsonschema does format validation by default.

This isn't the case! Though I assume that's just imprecision -- but the format validation default (in 4.17.0 I presume you mean) is the same as before, it's only that validating schemas now catches cases where schemas are invalid purely from things represented by format validation in the metaschema. I.e. previously, some invalid schemas were let through, now they're not, as you're noticing (and this indeed will hopefully get even more strict in the future, as there are plenty of schemas entirely valid under the metaschema but which are still invalid according to the specification).

Validating instances (i.e. data) is exactly the same, and changing that definitely wouldn't happen without deprecation for backwards compatibility purposes (heck it wouldn't happen at all, since it's not compliant behavior under the spec, but yeah).

I haven't looked in depth at how you use jsonschema, but if you aren't already, I'd highly recommend using jsonschema.DraftWhateverValidator, over jsonschema.validate, and the former allows you to fully control all the behavior you'd like here.

@mattijn
Copy link
Contributor

mattijn commented Jan 1, 2023

Thanks @Julian for your detailed comment. Appreciated!

@Julian
Copy link

Julian commented Jan 1, 2023

Pleasure of course, and if indeed it'd help to have a closer look at any point at the changes feel free to ping me, I love altair :)

@binste
Copy link
Contributor Author

binste commented Jan 2, 2023

Thanks @Julian for your input and in general for your work on python-jsonschema! Indeed, that was just an imprecision on my side, I was refering to the validation of the schema.

Actually, I'd really appreciate it if you could take a look at the proposed solution, especially as I'm not too familiar with your library so there might be a more future-proof way to do this. To make it easier I stripped out the relevant parts into the code examples below so no need to go through the PR.

For the previously mentioned reasons the schema below has a wrong uri-reference due to "<" and ">":

import jsonschema

schema = {
    "definitions": {
        "Foo<hello>": {"type": "string"}
    },
    "properties": {
        "x": {"$ref": "#/definitions/Foo<hello>"},
    },
}
spec = {"x": "x value"}

jsonschema.validate(spec, schema, cls=jsonschema.Draft7Validator)

And so it gives the following error:

...
ValueError: '#/definitions/Foo<hello>' is not a valid 'URI_reference'.
...

Btw, I think it's great that this is introduced in jsonschema. As you mentioned in another thread, the bug is in the schema and not in the jsonschema package.

The proposed solution is to temporarily disable the uri-reference format check:

import jsonschema

schema = {
    "definitions": {
        "Foo<hello>": {"type": "string"}
    },
    "properties": {
        "x": {"$ref": "#/definitions/Foo<hello>"},
    },
}
spec = {"x": "x value"}

validator_cls = jsonschema.Draft7Validator
removed_format_checkers = []
try:
    # In older versions of jsonschema this attribute did not yet exist
    # and we do not need to disable any format checkers
    if hasattr(validator_cls, "FORMAT_CHECKER"):
        for format_name in ["uri-reference"]:
            try:
                checker = validator_cls.FORMAT_CHECKER.checkers.pop(format_name)
                removed_format_checkers.append((format_name, checker))
            except KeyError:
                # Format checks are only set by jsonschema if it can import
                # the relevant dependencies
                continue
    jsonschema.validate(spec, schema, cls=validator_cls)
finally:
    # Restore the original set of checkers as the jsonschema package
    # might also be used by other packages
    for format_name, checker in removed_format_checkers:
        validator_cls.FORMAT_CHECKER.checkers[format_name] = checker

Does this make sense to you or do you think there is an advantage in using the validator class directly so we can pass the format checks to check_schema?

import copy

import jsonschema
from jsonschema import exceptions

schema = {
    "definitions": {
        "Foo<hello>": {"type": "string"}
    },
    "properties": {
        "x": {"$ref": "#/definitions/Foo<hello>"},
    },
}
spec = {"x": "x value"}

validator_cls = jsonschema.Draft7Validator
format_checker = copy.deepcopy(validator_cls.FORMAT_CHECKER)
format_checker.checkers.pop("uri-reference", None)

validator_cls.check_schema(schema, format_checker=format_checker)
validator = validator_cls(schema)
error = exceptions.best_match(validator.iter_errors(spec))
if error is not None:
    raise error

@Julian
Copy link

Julian commented Jan 2, 2023

If you want to treat the schema as valid (essentially get the same behavior you had before), the simplest is likely just:

validator = jsonschema.Draft7Validator(schema)

which doesn't validate your schema, it assumes you're saying it's valid.

(either with or without format checking enabled for validating instances, if you want it, use jsonschema.Draft7Validator(schema, format_checker=jsonschema.Draft7Validator.FORMAT_CHECKER)).

and then using validator.validate(spec) wherever you'd like to validate (or using best_match if you prefer prioritizing errors)

But then you don't need to muck with changing anything and essentially should be able to go on as-is until the schema is fixed.

If you want to guard against other unexpected bugs in the schema though then you may want to call .check_schema yourself as well, but can ignore this error if you're not interested in it, something like:

errors =  jsonschema.Draft7Validator(
    jsonschema.Draft7Validator.META_SCHEMA,
    format_checker=jsonschema.Draft7Validator.FORMAT_CHECKER,
).iter_errors(schema)
ones_we_care_about = [error for error in errors if error.validator_value != "uri-reference"]
if ones_we_carout:
  the_schema_is_even_more_invalid()

But if the schema is static (i.e. if it's like a vega-lite schema they never change) there's no need to call .check_schema to validate it every time you import altair or whatever, you can just tell jsonschema you know it's valid (or here want to treat it as such) by passing it directly into Draft7Validator.

Does that help? (And thanks for the kind words!)

@binste binste force-pushed the adapt_to_new_jsonsschema_version branch from 31afb91 to 9c3b964 Compare January 2, 2023 17:07
@binste
Copy link
Contributor Author

binste commented Jan 2, 2023

Thank you! This was very helpful, especially the differentiation between jsonschema.validate and using a validator class directly and also the input that we might not need to validate the schema at all. Indeed, I don't see a need to do this every time as it comes from vega-lite and is not provided by a user but this should certainly be confirmed by one of the maintainers.

I pushed a simplified version of the fix which uses the validator class instead of jsonschema.validate. The schema is therefore assumed to be valid. Actual errors in the chart specification are still raised as expected. This can be tested with the code below.

import altair as alt
import pandas as pd

source = pd.DataFrame({
    'a': ['A', 'B'],
    'b': [28, 55]
})

chart = alt.Chart(source).mark_bar().encode(
    x=alt.X("a", unknown=2),
    y='b'
)
chart

Raises:

...
Additional properties are not allowed ('unknown' was unexpected)
...

@mattijn
Copy link
Contributor

mattijn commented Jan 2, 2023

Since this this issue was about the vega-lite-schema I thought this issue would be moved eventually to the Vega-lite repo, and maybe it still will, but after this fruitful discussion this PR becomes a nice improvement to the core of Altair.

I see that this PR will use the Draft 7 version of the JSON schema specification (Draft7Validator). Do we know which validator version Vega-Lite is using for their schema? Would it make sense to use the same validator version? The only reference I could find is this line which mentions Draft 6.

The schema is included within Altair so it make sense to validate the vega-lite schema during development. Then we can catch these issues before it is shipped. Maybe we can set up a test for this within Github Action.

While typing I remember the comment earlier on in this thread from @binste:

To spot similar cases in the future, we could potentially extend the test pipeline [..]

Once Altair is released I also see no need to validate the vega-lite-schema itself every time when using Altair. Sure it would be necessary to test the generated JSON by Altair against the schema, but that are two different things indeed.

@Julian
Copy link

Julian commented Jan 2, 2023

Sorry, I'm being a bit lazy by not digging into this myself and providing breadcrumbs instead, but hopefully they're helpful -- as I say I love altair so if this needs more attention on my part I'm happy to help, but seems y'all are getting there. Some more breadcrumbs then :D --

Do we know which validator version Vega-Lite is using for their schema? Would it make sense to use the same validator version?

Here really "ideally" 2 things should happen:

  • vega should specify the $schema property in the schema if they aren't already, which is a "dialect identifier", it basically is a URI that identifies which version of the JSON Schema spec the schema is written for -- an example of one is https://json-schema.org/draft-07/schema which is the dialect ID for draft 7 of the spec
  • if/when that's the case, then you can use jsonschema.validators.validator_for to get the right validator class given the schema -- so instead of the example I wrote, you'd use Validator = jsonschema.validators.validator_for(vega_schema), which will return you Draft7Validator if they use draft 7, but if/when they upgrade to draft2020 say, then you'd get Draft202012Validator instead and things will work properly still.

EDIT: Being slightly less lazy, indeed vega already specifies $schema, so you're good with using that API instead.

The schema is included within Altair so it make sense to validate the vega-lite schema during development. Then we can catch these issues before it is shipped. Maybe we can set up a test for this within Github Action.

Yes! This is the thing I usually recommend -- if you have a static schema which you think is supposed to be valid, validate it in CI, then you'll know if something's wrong, but otherwise the library proceeds as normal.

If you want such a thing you can run in GitHub actions, check-jsonschema is a lovely thing (by @sirosen) which even has pre-commit hooks if you use pre-commit (and if not you can of course do so just manually by running the CLI obviously).

@sirosen
Copy link

sirosen commented Jan 2, 2023

I haven't read the full thread here, but if you wanted a minimal pre-commit config to get started, I believe that this would do the trick

# .pre-commit-config.yaml , in the repo root
# using check-metaschema:
# https://check-jsonschema.readthedocs.io/en/latest/precommit_usage.html#check-metaschema
repos:
  - repo: https://github.com/python-jsonschema/check-jsonschema
    rev: 0.19.2
    hooks:
      - id: check-metaschema
        files: ^altair/vega/v5/schema/vega-schema.json$

Or, if you didn't want to get up and running with pre-commit, you could equally well run this:

pipx install check-jsonschema  # or 'pip install' in CI
check-jsonschema --check-metaschema altair/vega/v5/schema/vega-schema.json

And I'm always happy to help folks out on the check-jsonschema issue tracker! 😄

@mattijn
Copy link
Contributor

mattijn commented Jan 4, 2023

Thanks @sirosen and @Julian! I've opened #2800 and #2801 to track the suggestions you have given.

@binste do you like to have a look if you can resolve the conflicts in this branch before we can merge this? Otherwise I will have a try myself to resolve these.

@binste
Copy link
Contributor Author

binste commented Jan 4, 2023

Great inputs, thank you all!

@mattijn Resolved the conflicts. Ready to be merged from my side.

@mattijn
Copy link
Contributor

mattijn commented Jan 4, 2023

All checks passed. Merging. Thanks again everyone!

@mattijn mattijn merged commit b1774e6 into vega:master Jan 4, 2023
binste added a commit to binste/altair that referenced this pull request Jan 7, 2023
…which did not yet make format checker accessible on validator class
binste added a commit to binste/altair that referenced this pull request Jan 7, 2023
…which did not yet make format checker accessible on validator class
mattijn pushed a commit that referenced this pull request Jan 7, 2023
* Use  property to dynamically determine jsonschema validator

* Fix regression introduced in #2771 for older jsonschema versions which did not yet make format checker accessible on validator class

* Add test
mattijn pushed a commit that referenced this pull request Jan 20, 2023
* DOC: remove unused section

* Disable uri-reference format check in jsonsschema (#2771)

* Disable uri-reference format check. Consistently use same validator across codebase

* Remove validation in SchemaInfo as not used anywhere and it referenced the wrong jsonschema draft

* Add compatibility for older jsonschema versions

* Improve comments

* Simplify validate_jsonschema

* Replace `iteritems` with `items` due to pandas deprecation (#2683)

* Add changelog entry.

* Bump version.

* Run black and flake8.

* Pin selenium in CI.

Co-authored-by: Jake VanderPlas <jakevdp@google.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Joel Ostblom <joelostblom@users.noreply.github.com>
mattijn pushed a commit to mattijn/altair that referenced this pull request Jan 21, 2023
* DOC: remove unused section

* Disable uri-reference format check in jsonsschema (vega#2771)

* Disable uri-reference format check. Consistently use same validator across codebase

* Remove validation in SchemaInfo as not used anywhere and it referenced the wrong jsonschema draft

* Add compatibility for older jsonschema versions

* Improve comments

* Simplify validate_jsonschema

* Replace `iteritems` with `items` due to pandas deprecation (vega#2683)

* Add changelog entry.

* Bump version.

* Run black and flake8.

* Pin selenium in CI.

Co-authored-by: Jake VanderPlas <jakevdp@google.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Joel Ostblom <joelostblom@users.noreply.github.com>
mattijn added a commit that referenced this pull request Jan 21, 2023
* include an altairfuturewarning

* deprecate vega 5 wrappers and render function

* deprecate vegalite 3 wrappers and render function

* use AltairDeprecationWarning

* fix typo in v5 warning

* remove mentioning alternative for vega wrappers

* Backport bug fixes for a 4.2.1 release (#2827)

* DOC: remove unused section

* Disable uri-reference format check in jsonsschema (#2771)

* Disable uri-reference format check. Consistently use same validator across codebase

* Remove validation in SchemaInfo as not used anywhere and it referenced the wrong jsonschema draft

* Add compatibility for older jsonschema versions

* Improve comments

* Simplify validate_jsonschema

* Replace `iteritems` with `items` due to pandas deprecation (#2683)

* Add changelog entry.

* Bump version.

* Run black and flake8.

* Pin selenium in CI.

Co-authored-by: Jake VanderPlas <jakevdp@google.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Joel Ostblom <joelostblom@users.noreply.github.com>

* include note in releases change log

Co-authored-by: Jan Tilly <jan.tilly@quantco.com>
Co-authored-by: Jake VanderPlas <jakevdp@google.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Joel Ostblom <joelostblom@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants