Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable PyYAML special tag in the Kedro config loader #1011

Closed
datajoely opened this issue Nov 1, 2021 · 5 comments
Closed

Enable PyYAML special tag in the Kedro config loader #1011

datajoely opened this issue Nov 1, 2021 · 5 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@datajoely
Copy link
Contributor

datajoely commented Nov 1, 2021

Description

PyYAML supports a large range of non-standard YAML magic functions which allow the user to pass in special tags that extend the standard YAML functionality beyond what is allowed out of the box + cross platform. Examples include !!tuple for casting a list to a python tuple or event complex tags such as !!python/object:module.cls for a class instance.

Users in the past have asked for tuple and object support so it is something that would be helpful.

Possible Implementation

Kedro uses anyconfig to parse configuration files and we can coerce it into parsing this special PyYAML syntax like follows:

from yaml import UnsafeLoader
import anyconfig
anyconfig.loads("my_object: !!python/tuple [0.3, 0.4]", ac_parser="yaml", Loader=UnsafeLoader)
>>> {'my_object': (0.3, 0.4)}

To enable this it would be nice to put a toggle in the TemplatedConfigLoader constructor with a boolean argument like allow_unsafe and an extra check at this point to enable this special loading mechanism if the ConfigLoader is parsing a file ending with .yml or .yaml:
https://github.com/quantumblacklabs/kedro/blob/be3022ecf9ea18f6a448ef9d2c57bd8ad089c769/kedro/config/config.py#L156

Possible Alternatives

Keep as it is, there may be some unintended security risks enabling this functionality.

@datajoely datajoely added the Issue: Feature Request New feature or improvement to existing feature label Nov 1, 2021
@datajoely
Copy link
Contributor Author

For reference: https://pyyaml.org/wiki/PyYAMLDocumentation

@antonymilne
Copy link
Contributor

I like this idea. My general opinion on config is that we should let users do whatever they want with it and not be constrained by the strict rules of yaml. And since this functionality is already in pyyaml, great. I think we should either support it natively in kedro or provide documentation on how users can write their own config loader to enable it. The only thing I don't like is that it is specific to yaml (as opposed to the other config types we support), but in reality does anyone use those other config types anyway?

Implementation

I would suggest three different things:

  1. make this as general as possible by allowing to pass whatever arguments they want to anyconfig.load rather than using an allow_unsafe boolean that fixes certain arguments
  2. maybe put this in a whole new SomethingConfigLoader rather than in TemplatedConfigLoader ❓ Not sure
  3. maybe use FullLoader rather than UnsafeLoader ❓ Not sure

Looking at develop, _get_config_from_patterns has an argument that is ac_template (set to False by default, and True for TemplatedConfigLoader). This is eventually passed into anyconfig.loads in _load_config_file as
anyconfig.load(yml, ac_template=ac_template). We could put all these arguments (ac_template, ac_parser, Loader) into some ac_load_args dictionary instead. This is tidier and makes it general rather than yaml-specific (e.g. in theory you might want to pass the decoder argument if you're using toml?).

We would then either expose ac_load_args through SomethingConfigLoader.__init__ (hence can be set in settings.py) or just hardcode them into SomethingConfigLoader itself and set a flag like you suggest.


Side note: FullLoader vs UnsafeLoader.
If we do this then I'd like to understand a bit better exactly what the difference is between these. The docs suggest that FullLoader is safer, but it doesn't seem to work for tuples even though it really looks from the source code and docs like it should. I actually don't understand why this doesn't work out of the box anyway, since anyconfig.load ends up calling yaml.load rather than yaml.safe_load anyway 🤔 Is it something to do with the ac_dict argument? Relevant bits of code are in https://github.com/yaml/pyyaml/blob/master/lib/yaml/__init__.py and https://github.com/yaml/pyyaml/blob/master/lib/yaml/constructor.py. Also see yaml/pyyaml#321. So a few things we should understand better here.

@stale
Copy link

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 9, 2022
@antonymilne antonymilne removed the stale label Jan 10, 2022
Galileo-Galilei pushed a commit to Galileo-Galilei/kedro that referenced this issue Feb 19, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Apr 4, 2022
@filpano
Copy link

filpano commented Nov 17, 2022

For anyone else stumbling on this when trying to add custom filters in the dask.ParquetDataSet or pandas.ParquetDataSet: here is a workaround which was already mentioned in a Q&A discussion.

@merelcht
Copy link
Member

I'm going to close this issue, because we intend to move away completely from anyconfig and instead use omegaconf for handling more advanced merging and complex config.

@merelcht merelcht closed this as not planned Won't fix, can't repro, duplicate, stale Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

4 participants