[WIP] new config system, 1.2 tagset support #700

Draft
wants to merge 8 commits into
base: main

Conversation

nitzmahone
Member

@nitzmahone nitzmahone commented Feb 15, 2023

Here's a quick and dirty crack at a more broadly-encompassing dynamic config system like we've talked about before...

  • builds on @perlpunk's excellent work getting 1.2 support into PyYAML by:
    • adding a TagSet type with some pre-defined instances for common 1.2 schemas
    • allowing users to create their own tagsets
  • leaves existing class hierarchy intact (no changes to instancing behavior)
  • introduces new Config mixins that:
    • create dynamic Loader/Dumper subclasses to isolate custom configuration, so there is no need to manually subclass or monkeypatch/twiddle config in global base classes. Note that this explicitly does not provide a way to make truly global config changes; I think that's pretty sane behavior for a library. The user is required to hang onto their customized Loader/Dumper types and explicitly pass them to yaml.load()/yaml.dump() et al. to get the customized behavior, and the customized subclasses are GC'd once dropped on the floor.
    • allow complete inherit/add/replace of Loader/Dumper tagsets with any combo of builtin or user-defined
    • support config override of all existing Dumper config options (and arbitrary kwargs to Loader and Dumper to support any new config options).
  • provides new FastestBaseLoader and FastestBaseDumper helper classes backed by libyaml if available, and the pure-Python version if not
  • uses modern Python type-hinting for self-documentation, type/arg validation, and IDE auto-completion
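
As a rough illustration of the "dynamic subclass" idea in the list above, here is a minimal sketch. It is not the code from this PR: `ConfigMixin`, `DemoDumper`, and the `_config` attribute are hypothetical names, and the only assumption is that a `config()` classmethod can build a fresh subclass with `type()` instead of mutating the base class.

```python
from typing import Any


class ConfigMixin:
    """Hypothetical stand-in for the PR's config mixins (not the actual implementation)."""

    # Class-level settings; each configured subclass carries its own merged copy.
    _config: dict[str, Any] = {}

    @classmethod
    def config(cls, **overrides: Any) -> type:
        # Merge the parent's settings with the overrides, then build a new
        # subclass with type() so the base class is never mutated.
        merged = {**cls._config, **overrides}
        return type(f"Configured{cls.__name__}", (cls,), {"_config": merged})


class DemoDumper(ConfigMixin):
    """Toy dumper: only the config plumbing is of interest here."""

    _config = {"indent": 2, "explicit_start": False}


WideDumper = DemoDumper.config(indent=4)

print(WideDumper._config)                  # merged settings on the new subclass
print(DemoDumper._config)                  # base class settings are unchanged
print(issubclass(WideDumper, DemoDumper))  # the result is still part of the hierarchy
```

Because the configured type is an ordinary class object, it is collected like any other object once the caller drops the last reference to it.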

We can also wrap up the existing individual types that make up the tagsets so that users can pick and choose, and combine partial schemas at will without having to redefine everything. Once that's all done, we should actually be able to completely redefine all the existing currently unrelated Loader and Dumper subclasses using this, and they'll actually be related in the class hierarchy instead of just sharing mixins.

A few examples of what's in and working:

json tagset on fastest available loader

import yaml
from yaml.tagset import json

# I want the fastest possible loader using only the 1.2 json tagset...
json_loader = yaml.FastestBaseLoader.config(tagset=json)
print(f'custom loader type is: {json_loader}')
print(f"result was {yaml.load('hi: mom', Loader=json_loader)}")

minimal custom tagset

import yaml
from yaml.tagset import TagSet

# I want the fastest possible loader using a custom minimal tagset
my_fallback_tagset = TagSet(name='my_fallback',
                            constructors={},
                            resolvers=[],
                            representers={})

yaml_to_load = """
    not_a_bool: true
    not_an_int: 123
"""

fallback_loader = yaml.FastestBaseLoader.config(tagset=my_fallback_tagset)
print(f'custom loader type is: {fallback_loader}')
print(f"result was {yaml.load(yaml_to_load, Loader=fallback_loader)}")

override dumper behavior so it doesn't have to be specified every call to dump

import yaml
from yaml.tagset import core

data = {
    'key': {
        'subkey1': 'subvalue1'
    }
}

# override dumper formatting on fastest available dumper
pedantic_documents_with_single_quotes_dumper = yaml.FastestBaseDumper.config(
    tagset=core,
    explicit_start=True,
    explicit_end=True,
    default_style="'",
    indent=2,
    canonical=True,
)

print(yaml.dump(data, Dumper=pedantic_documents_with_single_quotes_dumper))

Still TODO:

  • implement partial tagsets for Python types
  • wrap existing type definitions in a new dataclass
  • replace all existing Loader/Dumper subclasses with dynamically-created ones (eating our own dogfood)
  • fire deprecation warnings if direct modification of base classes is detected
  • tests
  • docs

nitzmahone and others added 8 commits September 22, 2021 09:52
so that other classes inheriting from it can use them

* Move methods from SafeConstructor to BaseConstructor
* Move methods from SafeRepresenter to BaseRepresenter
More and more YAML libraries are implementing YAML 1.2, either new ones
simply starting with 1.2 or older ones adding support for it.

While the syntax also changed in YAML 1.2, this pull request is about the
schema changes.

For example, in 1.1, Y, yes, NO, on etc. are resolved as booleans.

This sounds convenient, but also means that all these 22 different strings must
be quoted if they are not meant as booleans. A very common obstacle is the
country code for Norway, NO ("Norway Problem").

In YAML 1.2 this was improved by reducing the list of boolean representations.

Other types have been improved as well. The 1.1 regular expression for float
allows . and ._ as floats, although there isn't a single digit in these strings.

While the 1.2 Core Schema, the recommended default for 1.2, still allows a few
variations (true, True and TRUE, etc.), the 1.2 JSON Schema is there to match
JSON behaviour regarding types, so it allows only true and false.

Note that this implementation of the YAML JSON Schema might not be exactly like
the spec defines it (all plain scalars not resolving to numbers, null or
booleans would be an error).
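
To make the boolean differences above concrete, here is a small standalone sketch that does not use PyYAML at all. The patterns are transcribed from the YAML 1.1 bool type definition and the YAML 1.2 Core and JSON schemas; `BOOL_PATTERNS` and `resolves_as_bool` are illustrative names, not part of this pull request.

```python
import re

# Boolean spellings per schema. The 1.1 list has 22 entries; the 1.2 Core
# schema keeps 6 of them and the 1.2 JSON schema keeps only 2.
BOOL_PATTERNS = {
    "1.1": re.compile(
        r"y|Y|yes|Yes|YES|n|N|no|No|NO"
        r"|true|True|TRUE|false|False|FALSE"
        r"|on|On|ON|off|Off|OFF"
    ),
    "1.2 core": re.compile(r"true|True|TRUE|false|False|FALSE"),
    "1.2 json": re.compile(r"true|false"),
}


def resolves_as_bool(scalar: str, schema: str) -> bool:
    """Return True if a plain scalar would be resolved as a boolean."""
    return BOOL_PATTERNS[schema].fullmatch(scalar) is not None


for scalar in ("NO", "yes", "True", "true"):
    row = {schema: resolves_as_bool(scalar, schema) for schema in BOOL_PATTERNS}
    print(f"{scalar!r:>7}: {row}")
```

With these patterns, the country code NO is a boolean only under the 1.1 rules, which is the "Norway Problem" described above.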

Short usage example:

    class MyCoreLoader(yaml.BaseLoader): pass
    class MyCoreDumper(yaml.CommonDumper): pass
    MyCoreLoader.init_tags('core')
    MyCoreDumper.init_tags('core')
    data = yaml.load(input, Loader=MyCoreLoader)
    output = yaml.dump(data, Dumper=MyCoreDumper)

Detailed example code to play with:

    import yaml

    class MyCoreLoader(yaml.BaseLoader): pass
    MyCoreLoader.init_tags('core')

    class MyJSONLoader(yaml.BaseLoader): pass
    MyJSONLoader.init_tags('json')

    class MyCoreDumper(yaml.CommonDumper): pass
    MyCoreDumper.init_tags('core')

    class MyJSONDumper(yaml.CommonDumper): pass
    MyJSONDumper.init_tags('json')

    input = """
    - TRUE
    - yes
    - ~
    - true
    #- .inf
    #- 23
    #- #empty
    #- !!str #empty
    #- 010
    #- 0o10
    #- 0b100
    #- 0x20
    #- -0x20
    #- 1_000
    #- 3:14
    #- 0011
    #- +0
    #- 0001.23
    #- !!str +0.3e3
    #- +0.3e3
    #- &x foo
    #- *x
    #- 1e27
    #- 1x+27
    """

    print('--------------------------------------------- BaseLoader')
    data = yaml.load(input, Loader=yaml.BaseLoader)
    print(data)
    print('--------------------------------------------- SafeLoader')
    data = yaml.load(input, Loader=yaml.SafeLoader)
    print(data)
    print('--------------------------------------------- CoreLoader')
    data = yaml.load(input, Loader=MyCoreLoader)
    print(data)
    print('--------------------------------------------- JSONLoader')
    data = yaml.load(input, Loader=MyJSONLoader)
    print(data)

    print('--------------------------------------------- SafeDumper')
    out = yaml.dump(data, Dumper=yaml.SafeDumper)
    print(out)
    print('--------------------------------------------- MyCoreDumper')
    out = yaml.dump(data, Dumper=MyCoreDumper)
    print(out)
    print('--------------------------------------------- MyJSONDumper')
    out = yaml.dump(data, Dumper=MyJSONDumper)
    print(out)
This way people can play with it, and we don't promise this wrapper will stay
around forever, and newly created classes CommonDumper/CommonRepresenter aren't
exposed.

    MyCoreLoader = yaml.experimental_12_Core_loader()
    data = yaml.load(input, Loader=MyCoreLoader)

    MyCoreDumper = yaml.experimental_12_Core_dumper()
    out = yaml.dump(data, Dumper=MyCoreDumper)
* Loader/Dumper config mixins to create dynamic types and configure them at instantiation with generated partials
* New `FastestBaseLoader`/`FastestBaseDumper` base classes to auto-select the C-backed implementation if available

# preserve wrapped config defaults for values where we didn't get a default
# FIXME: share this code with the one in __init__.dump_all (and implement on others)
dumper_init_kwargs = dict(
Member Author


this would probably be better done with inspect.signature(); then it's 100% dynamic off whatever init args the current class accepts
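
A minimal sketch of what that could look like, assuming nothing about the PR's internals: `DemoDumper` is a hypothetical stand-in whose parameter names merely echo a few dump options, and `init_defaults` and `filter_kwargs` are illustrative helpers.

```python
import inspect


class DemoDumper:
    """Hypothetical stand-in for a Dumper class; not PyYAML's real signature."""

    def __init__(self, stream, default_style=None, indent=None, canonical=None):
        self.stream = stream


def init_defaults(cls: type) -> dict:
    """Collect every __init__ parameter of cls that declares a default value."""
    params = inspect.signature(cls.__init__).parameters
    return {
        name: param.default
        for name, param in params.items()
        if param.default is not inspect.Parameter.empty
    }


def filter_kwargs(cls: type, **kwargs) -> dict:
    """Keep only the keyword arguments that cls.__init__ actually accepts."""
    accepted = inspect.signature(cls.__init__).parameters
    return {key: value for key, value in kwargs.items() if key in accepted}


print(init_defaults(DemoDumper))
print(filter_kwargs(DemoDumper, indent=4, bogus=True))
```

The upside is that the set of accepted options follows whatever the current class declares, so a hand-maintained kwargs list cannot drift out of sync with it.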

@ingydotnet
Member

This is a great start, Matt!

I really like the approach overall of generating a configured subclass around the existing architecture.
We get custom configured subclasses (which we can instantiate into objects if we want).
No existing code breaks.
With configuration we'll be able to support all the things I'm on the fence about supporting by just making them config options.

The new .config(...) method returns a new generated subclass. A more precise method name would be .subclass_using_config(...).
I really like the idea of calling this method .new(...). My second favorite would be .subclass(...).

While we can add this method to the Loader and Dumper classes, I think that yaml.new(...config...) should return a YamlClass class containing both a generated Loader class and a generated Dumper class. Then you can cleanly do:

import yaml
myaml = yaml.new(tagset=yaml.tagset.json, indent=3)
text = 'hi: mom'
data = myaml.load(text)
text = myaml.dump(data)

I think that's a super clean top level way of using the new ideas you've added.

myaml.load(text)
# is a lot simpler and more intuitive than using the old:
yaml.load(text, Loader=myaml.loader)

Also, the .new() method would default to .new(base=FastestBase), which is the same as .new(base_loader=FastestBaseLoader, base_dumper=FastestBaseDumper). This way, when we add libfyaml support soon, people can do fyaml = yaml.new(base=FYaml). Note: fyaml is fast like libyaml but also passes 100% of the test suite and is very actively maintained.
PyYAML and libyaml only pass ~82% of the suite, but they share a lot of the same failures, which is to say they are quite bugwards compatible with each other. So I think the FastestBase... or whatever we want to call it should just select between those 2, and newer (even if better) backends should be explicitly requested.

I assume you can do this:

ayaml = yaml.new(...config-stuff...)
byaml = ayaml.new(...more-config-stuff...)
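
One way the chaining could behave, sketched with a hypothetical `YamlConfig` class rather than anything that exists in PyYAML or in this PR: each `.new()` call layers its overrides on top of the parent's settings and returns a fresh object, leaving the parent untouched.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class YamlConfig:
    """Hypothetical settings bundle standing in for the proposed yaml.new() result."""

    options: dict[str, Any] = field(default_factory=dict)

    def new(self, **overrides: Any) -> "YamlConfig":
        # Derive a child configuration; the parent's options are copied, not shared.
        return YamlConfig({**self.options, **overrides})


ayaml = YamlConfig().new(tagset="json", indent=3)
byaml = ayaml.new(indent=2, explicit_start=True)

print(ayaml.options)  # parent keeps indent=3
print(byaml.options)  # child inherits tagset, overrides indent, adds explicit_start
```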

All this is to say, I really like where you have gone so far, at least as I understand it. I'd just like to see the common usage idioms be even cleaner. I'll try to write up a file of all the possible usages so we can discuss them among the release team.

Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

None yet

4 participants