Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace $ref & $dynamicRef support / RefResolver with the new referencing library #1049

Merged
merged 33 commits into from Mar 14, 2023

Conversation

Julian
Copy link
Member

@Julian Julian commented Feb 21, 2023

This PR represents a major overhaul, a month or two of work's worth, of the referencing behavior via the $ref and $dynamicRef keywords.

It makes referencing resolution fully compliant across all drafts (yay!) and hopefully makes the user-facing interface for configuring reference resolution a lot easier to understand.

All of the above is by introduction of a new (external) dependency, the referencing library. referencing is going to be in beta for at least a release or two of this library, and then will become 1.0.

Users of referencing behavior are highly encouraged to provide feedback, I could definitely use it. As I say, these interfaces may change slightly (if anything on the referencing side, I think the interface here in jsonschema is pretty intentionally narrow).

But some old use cases are now possible, and it's very likely that a whole lot of $ref-related bugs are now closable. I'll be going through all the bugs which are related and making notes / closing them with reference to this PR as part of the release (which will be 4.18.0, though likely after at least one beta release, 4.18.0b1).

As you might notice from the version number, RefResolver, though now fully deprecated, should be free from any functional changes, as this PR is intended to be fully backwards compatible. Fingers crossed that that's the case, though I'm bracing to see if someone has mucked with RefResolver's internals in a drastic enough way to make that not be the case.

New documentation is also written (previewable below) on a JSON Schema Referencing page.

Performance-wise, some initial benchmarks indicate that despite the increased level of compliance with the spec, performance should be on par, and in some cases better than before. The benchmark suite is anything but comprehensive though, so performance feedback is also welcome, we'll do what we can to address any issues uncovered there.

This PR will likely get merged shortly after a few final checks (mainly it was the documentation-writing, so now we're getting quite close). If you're someone with feedback, please feel free to leave it on the PR. Perhaps @sirosen will ping you specifically if you don't mind, as I have on my list to see how easy it is to plug referencing into check-jsonschema as a good test of whether the new interface is easy enough to use -- I'm happy to try to do so myself or otherwise if you try it I'd love feedback on how easy you find doing so.


📚 Documentation preview 📚: https://python-jsonschema--1049.org.readthedocs.build/en/1049/

Referencing happens to not support it at the minute.

It's close to EOL, so rather than adding it, I suspect
it's droppable.
Still will be tweaked as the referencing library's public API changes.
Internal uses of it will be removed, replaced with referencing's
resolution APIs, though RefResolver will continue to function during
its deprecation period.
This will make it easier to swap over to referencing's
Resolver (while preserving backwards compat for anyone who
passes a RefResolver to a Validator).

At some point in the future this method may become public
(which will make it easier for external dialects to resolve
references) but let's keep it private for a bit until it's
clear that the interface is stable -- a future draft might
do crazy things like have adjacent properties to $ref affect
the resolution behavior, which would make this method need to
take more than just the $ref value.
It will imminently be replaced by referencing.Registry-based
resolution.
Just avoids a deprecation warning when we switch over.
Makes way for our newer resolver to live in the shorter name.

(This should have no effect on the public API, where we have
Validator.resolver returning this attribute after emitting a deprecation
warning.)

Also bumps the minimum attrs version we depend on, as we need the alias
functionality.
Passes all the remaining referencing tests across all drafts, hooray!

Makes Validators take a referencing.Registry argument which users should
use to customize preloaded schemas, or to configure remote reference
retrieval.

This fully obsoletes jsonschema.RefResolver, which has already been
deprecated in a previous commit. Users should move to instead loading
schemas into referencing.Registry objects.

See the referencing documentation at https://referencing.rtfd.io/ for
details (with more jsonschema-specific information to be added shortly).

Note that the interface for resolving references on a Validator is not
yet public (and hidden behind _resolver and _validate_reference
attributes). One or both of these are likely to become public after some
period of stabilization.

Feedback is of course welcome!
<19.3 can have performance problems on 3.11 where there aren't
wheels (and where I think it's falling back to using the pure-
Python implementation even on CPython). Here it's ~2x slower.
@sirosen
Copy link
Member

sirosen commented Feb 22, 2023

If you're someone with feedback, please feel free to leave it on the PR. Perhaps @sirosen will ping you specifically if you don't mind, as I have on my list to see how easy it is to plug referencing into check-jsonschema as a good test of whether the new interface is easy enough to use -- I'm happy to try to do so myself or otherwise if you try it I'd love feedback on how easy you find doing so.

If you're willing to wait a couple of days, I'd like to take a crack at doing this tomorrow.
(Looking at my schedule for today, I'm not sure I can fit it in.)

The new doc is very promising, and it should slot in as a replacement pretty easily based on what I'm seeing.

I'm not certain yet, but this may help with a few different use cases. I know one thing that people wanted was a way to disable network callouts, which was awkward to handle before.

@Julian
Copy link
Member Author

Julian commented Feb 22, 2023

Definitely willing to wait! Thanks for the eyes / brain.

@Julian
Copy link
Member Author

Julian commented Feb 24, 2023

Thanks! That all looks like great feedback. I'll probably try to take most of today (and maybe even tomorrow and Sunday) away from a computer to relax a bit, so it may be a few days before I read + respond, let alone address 'em with changes. But super appreciated.

@Julian
Copy link
Member Author

Julian commented Feb 24, 2023

OK against my better judgement I actually did a tiny bit of work this afternoon :D -- will respond (or address with docs) the rest of your comments (both of y'all, thanks again), but few quick notes.

I may have a file on disk with an $id containing its canonical URI. I'm not sure I have this right, but it seems to me like the best thing to construct when both are available and using referencing:

You actually don't need to do this at all! Most of the point of a Registry is to encapsulate this kind of specified behavior -- in particular, schemas saying "my URI is the thing in $id" is specified behavior of the JSON Schema specs, so Registry will discover that automatically. It's only when your schema doesn't declare what its URI(s) are that you need to externally tell Registry where you got the resource from.

But specifically, if I have a resource that I want to refer to via bar://baz, whose canonical URI is foo://example.json (the thing in $id in draft 2020) this is designed to work just fine with no extra work:

from referencing import Registry, Resource
resource = Resource.from_contents({"$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "foo://example.json"})
registry = Registry().with_resource(uri="bar://baz", resource=resource)

print(registry.get_or_retrieve("foo://example.json").value)
# -> Resource(contents={'$schema': 'https://json-schema.org/draft/2020-12/schema', '$id': 'foo://example.json'}, _specification=<Specification name='draft2020-12'>)

print(registry.get_or_retrieve("bar://baz").value)
# -> same

(I'm showing it via the Registry API but you can try it with jsonschema as well).

The "real question" to me was something like what @jpmckinney mentioned, where it's annoying to have to do .with_resource(resource.id(), resource) to add a resource where you only want the URI that comes from $id -- there are some minor annoying reasons I didn't elaborate on why I think it's a bit tricky to just add that without it being confusing, which I can go into further if anyone cares to discuss. But the case where you have external URI + internal one is "automatic"!

It feels like I ought to be able to slot in a new class to replace RefResolutionError.

I'll address the basic idea later, I've waffled a bit over the last ~15 years on how much I like having common base classes for exceptions so perhaps! But ignoring that for a second, the idea is/was that jsonschema-specific code (so validators in jsonschema) should only really need to deal with Unresolvable, which I think should be the only exception visible from running Validator.validate / Validator.iter_errors. You can only get the other 2 if you use referencing-specific APIs (where NoSuchResource and Unretrievable is what you get from Registry APIs, which are really designed for people writing JSON Schema tooling) and Unresolvable is the "Equivalent" of RefResolutionError itself. But if you've managed to see the other exceptions bubble up I'd be curious to hear/see it, maybe I made a mistake there.

I feel like this case has gotten slightly harder, but I'm not 100% sure since it has never been totally smooth.

I think there's definitely room to add a helper (probably in referencing I guess, since having a whole other package seems clunky) which basically is like filesystem_retriever(root_dir) -> Callable[...] that handles doing this correctly and should be pretty easy to write once. Just in case, you've seen this section from the doc page I assume, if there's anything that can be added there to make this clearer (whether or not we have the helper function) let me know.

Of course if you have suggestions for how to make these clearer they're also welcome!

And yeah will get to responding to the rest after the weekend.

@Julian
Copy link
Member Author

Julian commented Feb 27, 2023

I've added a high-level porting table, some more initial discussion, and stressed that registries are immutable -- the doc preview should update -- will get to the rest in the next day or two!

@Julian
Copy link
Member Author

Julian commented Feb 27, 2023

I'm not sure I'm fully convinced of the utility of adding such a thing yet, but I also pushed a branch with the helper I alluded to in:

I think there's definitely room to add a helper [...] which basically is like filesystem_retriever(root_dir) -> Callable[...] that handles doing this correctly

Feedback is welcome on that too (as I say I'm not sure the complexity is worthwhile, but it's an option).

@sirosen
Copy link
Member

sirosen commented Feb 27, 2023

I considered originally suggesting that referencing should offer a couple of builtin retrievers to imitate the remote and local filesys resolution from RefResolver, but then it started to look to me like a benefit of the rewrite that the behavior is not builtin. One of the reasons that this API improves extensibility is that you have to supply a retriever, so, as a user, it's always clear where and how to change behavior (e.g. to support yaml).

On the flipside, a lot of people are going to want this behavior, so it seems silly not to provide it somehow...

What about splitting the difference and offering the function as an example? (You could even keep the tests, and put it all together as examples/retreivers.py + examples/test_retrievers.py?)

I'll take a look at the new doc after my workday; thanks for doing another pass!

@Julian
Copy link
Member Author

Julian commented Feb 27, 2023

then it started to look to me like a benefit of the rewrite that the behavior is not builtin. One of the reasons that this API improves extensibility is that you have to supply a retriever, so, as a user, it's always clear where and how to change behavior (e.g. to support yaml).

Yeah spot on, this is exactly what I was trying to say initially for why I didn't include this -- it's a quite "straightforward" interface to define retrieve, and of course someone can create some fancy external package of course if they want to (e.g. if someone wants registries that do autodetection via python-magic, I'll never add that, but some light wrapper package is easy to put out there).

What about splitting the difference and offering the function as an example? (You could even keep the tests, and put it all together as examples/retreivers.py + examples/test_retrievers.py?)

Could work.

I'll take a look at the new doc after my workday; thanks for doing another pass!

Great, thanks!

And... after all that, I also couldn't resist being a bit over the top -- so there's a shortcut now for registry.with_resource(uri=resource.id(), resource=resource) for the case where you have a schema with $id key (cc @jpmckinney). It's... resource @ registry :) -- like I said earlier, I wasn't too keen to add a fourth method called with_*, when you can see it's already confusing for early adopters to figure out which ones to use. But this seems perhaps like the sort of thing that someone might find if they're looking for it but otherwise not confuse them if they're not. Maybe. Dunno. (Of course feedback on that too is welcome -- the nice thing is that the reason it's attractive to add such a thing is you want an error if resource.id() doesn't exist, but with_resource doesn't raise one.).

@sirosen
Copy link
Member

sirosen commented Feb 28, 2023

I just did another read and I like the results. 👍


[Regarding $id + a distinct base URI] You actually don't need to do this at all!

I just rewrote my code to remove the unnecessary with_resource call and condition and it looks much cleaner and nicer; thanks for the pointer on this!

I don't think any changes are needed on with_resource, and using infix @ is cool.

It's hard to tell how this could be refined or improved on this part. Probably a significant portion of users will never use these APIs directly at all.
In fact, I think this whole topic falls under the description you gave:

the sort of thing that someone might find if they're looking for it but otherwise not confuse them if they're not.


I've waffled a bit over the last ~15 years on how much I like having common base classes for exceptions so perhaps! ... But if you've managed to see the other exceptions bubble up I'd be curious to hear/see it, maybe I made a mistake there.

To start with, let me dispel any concern: I didn't see the other exceptions raised. I was just updating error handling and it wasn't clear what errors could be raise during resolution. Also, note that the examples include NoSuchResource errors, which I think is currently the only hint as to what to expect.

I think I was a bit confused about why the error classes don't inherit from a common base, more than I think that having a shared based would be a good thing. It wasn't 100% clear to me what the meanings of the errors is.

Rather than worrying about any functional changes to referencing, the key thing to address is how users can know how to write good error handlers. 🙂


A couple of other thoughts:

  • the referencing implementation seems to be significantly slower in my tests, but I haven't benchmarked profiled it yet
  • the thoughts earlier about strictness on dialects might have implications for custom metaschemas?

@Julian
Copy link
Member Author

Julian commented Feb 28, 2023

the referencing implementation seems to be significantly slower in my tests, but I haven't benchmarked profiled it yet

It's a bit tricky, I'm still working on this. Some microbenchmarks are faster (even significantly -- e.g. validator creation is ~2x faster in microbenchmarks). But others indeed are a bit slower. Even though my (jsonschema's) test suite is ~2x slower itself, more targeted microbenchmarks show things aren't that bad. But yeah still working on this.

At least from my own profiling so far, it's actually pyrsistent that's being slow :/ -- simply inserting keys into mappings shows up heavily on profile. I've put in some optimizations already, but it's to the point I'm considering actually writing bindings to another persistent data structure library, we shall see.

(Of course the old implementation isn't fully compliant, so at some point even if things are slower obviously we'd still prefer the new one -- but yeah there's definitely still room to optimize, I'm fairly confident it's possible to make things faster even than the existing main branch... Of course if you do anything there let me know.)

And yeah still not fully typing out a response to the rest, will get back to it -- the dialect thing is tricky. For general URIs, http and https URIs are of course totally distinct -- someone may decide to have two completely different schemas at the two different schemes. Unlikely sure, but totally allowed. Needs more thought on what to do to make it easier -- I still personally recommend never typing this value yourself if you're doing it into Python, use Draft202012Validator.META_SCHEMA["$id"], or copy paste it. But I agree it's a general issue which needs more thought.

We're not a general class, so we know what fields we need
ahead of time.

This seems to give ~15% speedup on Validator evolution, which
happens often as part of walking up and down schemas.
@Julian
Copy link
Member Author

Julian commented Mar 13, 2023

I realize I didn't get back to leaving a response on the dialect URI-related comment (which I think is the last one I didn't address but let me know if I missed another one!) --

But two things:

Suppose I have a schema where $schema is not a valid dialect name. This seems most frequently to happen with the # missing or with http/https being mixed up. I've made these sorts of mistakes, and users less experienced than myself are even more prone to it. Even once you've run into it, it can be hard to spot the malformed string.

Half of this was a bug in jsonschema (one that's now fixed in referencing) -- specifically, URIs with empty fragments are supposed to be equivalent when used as JSON Schema $ids (see here if you care). So now if you write http://foo/# you indeed should get the same thing as http://foo/.

I think that leaves HTTP vs HTTPS -- I think for just that use case a helper probably isn't very useful since it's more about adding resources than retrieving them, but I'm not against some sort of helper at some point if there's more variants. Technically speaking these are different URIs, but if someone chooses to use the helper function I guess it's fine. But yeah maybe not for now, especially given the empty fragment case is handled?

Let me know if you have any other feedback, otherwise I'll likely be merging this in the next day or two, especially given I've basically written enough of a first pass on my rust replacement for pyrsistent to demonstrate it's faster (still not as fast as it could be, but faster!)

@Julian
Copy link
Member Author

Julian commented Mar 14, 2023

Going to merge this I think! But if anyone has further comments feel free to open an issue if needed. A (beta) release will come out shortly!

@Julian Julian merged commit df1501c into main Mar 14, 2023
@Julian Julian deleted the referencing branch March 14, 2023 18:49
@sirosen
Copy link
Member

sirosen commented Mar 20, 2023

Let me know if you have any other feedback, otherwise I'll likely be merging this in the next day or two, especially given I've basically written enough of a first pass on my rust replacement for pyrsistent to demonstrate it's faster (still not as fast as it could be, but faster!)

Very exciting! I've been keeping one eye on your Rust work as I'm able. It's motivating me to get off my fanny and try learning Rust again. 😉

From my perspective, all is well with this change. I even found some schemas in SchemaStore which couldn't be loaded with RefResolver but which referencing handles correctly.

I realize I didn't get back to leaving a response on the dialect URI-related comment (which I think is the last one I didn't address but let me know if I missed another one!) --

I think that was the only open topic in this thread. Everything else seems to be resolved as we close this out.

specifically, URIs with empty fragments are supposed to be equivalent when used as JSON Schema $ids (see here if you care). So now if you write http://foo/# you indeed should get the same thing as http://foo/.

Makes sense; I wasn't aware of that detail of the spec. That may also explain why schemas are more often missing the trailing # than any other "error".

I think that leaves HTTP vs HTTPS -- I think for just that use case a helper probably isn't very useful since it's more about adding resources than retrieving them, but I'm not against some sort of helper at some point if there's more variants. Technically speaking these are different URIs, but if someone chooses to use the helper function I guess it's fine. But yeah maybe not for now, especially given the empty fragment case is handled?

Yeah, I am in agreement on this. I don't think the HTTP/HTTTPS URIs should be magically synonymized, since they are different.

It's an entirely different matter if a frontend like check-jsonschema wants to check for known HTTP/HTTPS URLs and emit errors like

There is no known metaschema with the ID "https://json-schema.org/draft-07/schema#".
Did you mean "http://json-schema.org/draft-07/schema#"?

and I'd even be open to the idea of including such behaviors in jsonschema, though I'm not sure how valuable it would be.


The only related thing that I'm still wondering about is how a user should declare and use a custom metaschema moving forward. I don't have any evidence of anyone doing this, but I used to have a clearer notion of how one would define a validator class and register it into jsonschema'. (I had never done this, but it was clearly possible.) It now seems like one would need to inform both jsonschema and referencing about a new $schema value? But I admit that I haven't been able to make much time for FOSS work lately, so I haven't tried poking at what's possible in the latest version of referencing.

@Julian
Copy link
Member Author

Julian commented Mar 20, 2023

Very exciting! I've been keeping one eye on your Rust work as I'm able. It's motivating me to get off my fanny and try learning Rust again. 😉

Hah nice. Well, I'll say the extent I've "learned" it is "I can write a bunch of & signs and when it compiles it seems to work :D". But it's definitely fun, come join :)

From my perspective, all is well with this change. I even found some schemas in SchemaStore which couldn't be loaded with RefResolver but which referencing handles correctly.

Nice! Yeah my guess is (thankfully? unsurprisingly?) it's probably a pretty compliant implementation -- I'm sure there's a bug or two hiding which will get uncovered and hopefully easily fixed, but clearly the primary concern at the minute is now getting the jsonschema-specific/backwards-compatible APIs "right enough" and in place.

Yeah, I am in agreement on this. I don't think the HTTP/HTTTPS URIs should be magically synonymized, since they are different.

👍🏽

The only related thing that I'm still wondering about is how a user should declare and use a custom metaschema moving forward.

Good catch :) I was waiting to see if someone would notice this. You're.. the second! See #1061. Needless to say though I thankfully hadn't forgotten about this, though we indeed don't have a test case that "reminds" me of it, but I remembered it anyhow somehow -- so this will happen before a real release (and there'll be another alpha or beta which contains support for it to ensure it works) -- until then there's a todo here reminding me and a ticket in referencing which is python-jsonschema/referencing#28 (and to a lesser extent python-jsonschema/referencing#24).

The basic idea I think is that every time you call jsonschema.validators.create, jsonschema will maintain some registry of custom metaschemas and then we add an argument to Resource.from_contents which is that mapping of dialects (i.e. we parameterize the ability to pass this specifications registry). There's probably a detail or two to think through but I'm reasonably confident this is easily solvable.

Thanks again for the comments, keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants