Rework the safety related API code #189

ingydotnet · 2018-06-28T22:02:29Z

The main change is that 'Danger' has been renamed to 'Python' and that
the default dump() and dump_all() functions use the 'Python' schema
to be able to dump any Python data structure.

NOTE: In YAML, 'Schema' is used to mean all the semantics and rules of
what a YAML document means and how it is processed.

The load() and load_all() functions continue to use the Safe schema.

The dump() and load() sugar functions should be similar in that they
both do the must useful and safe operations.

There are top level functions for each schema (Safe and Python) and
those functions should be used when feeding data from one system to the
other and expecting the same semantics (schema):

safe_dump safe_dump_all
safe_load safe_load_all
python_dump python_dump_all
python_load python_load_all

When we have a schema language for YAML, the generic methods with be:

yaml.dump(node, Schema='foo.schema')
yaml.load(yaml, Schema='foo.schema')

A loader class like SafeLoader is a loader with a hardcoded schema.
Right now pyyaml has 2 schemas:

Python - serialize any python data
Safe - only serialize in a way that won't trigger code

'Danger' was used in response to a situation where people were caught
unaware that something bad could happen in a seemingly normal, default
situation. Now we've fixed the default to be safe, and Safe is an OK
name for a schema, but Danger really is not. It's not the purpose of the
schema to be dangerous. The purpose is to serialize Python data
structures.

The danger_ API functions can be removed because they have only been
released for a couple days and they aren't documented anywhere.

This also fixes a bug in that safe_load() and load() were aliases. They
shouldn't be, because load() accepts a Loader kwarg, and safe_load()
should not. ie safe_load(yaml, Loader=PythonLoader) shouldn't be
allowed.

The main change is that 'Danger' has been renamed to 'Python' and that the default `dump()` and `dump_all()` functions use the 'Python' schema to be able to dump any Python data structure. NOTE: In YAML, 'Schema' is used to mean all the semantics and rules of what a YAML document means and how it is processed. The `load()` and `load_all()` functions continue to use the Safe schema. The dump() and load() sugar functions should be similar in that they both do the must useful and safe operations. There are top level functions for each schema (Safe and Python) and those functions should be used when feeding data from one system to the other and expecting the same semantics (schema): * safe_dump safe_dump_all * safe_load safe_load_all * python_dump python_dump_all * python_load python_load_all When we have a schema language for YAML, the generic methods with be: * yaml.dump(node, Schema='foo.schema') * yaml.load(yaml, Schema='foo.schema') A loader class like SafeLoader is a loader with a hardcoded schema. Right now pyyaml has 2 schemas: * Python - serialize any python data * Safe - only serialize in a way that won't trigger code 'Danger' was used in response to a situation where people were caught unaware that something bad could happen in a seemingly normal, default situation. Now we've fixed the default to be safe, and Safe is an OK name for a schema, but Danger really is not. It's not the purpose of the schema to be dangerous. The purpose is to serialize Python data structures. The danger_ API functions can be removed because they have only been released for a couple days and they aren't documented anywhere. ---- This also fixes a bug in that safe_load() and load() were aliases. They shouldn't be, because load() accepts a Loader kwarg, and safe_load() should not. ie safe_load(yaml, Loader=PythonLoader) shouldn't be allowed.

alex · 2018-06-28T22:04:10Z

I do not believe the name python_load adequately conveys that using it on untrusted input is the moral equivalent of calling eval on it.

…

On Thu, Jun 28, 2018 at 6:02 PM Ingy döt Net ***@***.***> wrote: The main change is that 'Danger' has been renamed to 'Python' and that the default dump() and dump_all() functions use the 'Python' schema to be able to dump any Python data structure. NOTE: In YAML, 'Schema' is used to mean all the semantics and rules of what a YAML document means and how it is processed. The load() and load_all() functions continue to use the Safe schema. The dump() and load() sugar functions should be similar in that they both do the must useful and safe operations. There are top level functions for each schema (Safe and Python) and those functions should be used when feeding data from one system to the other and expecting the same semantics (schema): - safe_dump safe_dump_all - safe_load safe_load_all - python_dump python_dump_all - python_load python_load_all When we have a schema language for YAML, the generic methods with be: - yaml.dump(node, Schema='foo.schema') - yaml.load(yaml, Schema='foo.schema') A loader class like SafeLoader is a loader with a hardcoded schema. Right now pyyaml has 2 schemas: - Python - serialize any python data - Safe - only serialize in a way that won't trigger code 'Danger' was used in response to a situation where people were caught unaware that something bad could happen in a seemingly normal, default situation. Now we've fixed the default to be safe, and Safe is an OK name for a schema, but Danger really is not. It's not the purpose of the schema to be dangerous. The purpose is to serialize Python data structures. The danger_ API functions can be removed because they have only been released for a couple days and they aren't documented anywhere. ------------------------------ This also fixes a bug in that safe_load() and load() were aliases. They shouldn't be, because load() accepts a Loader kwarg, and safe_load() should not. ie safe_load(yaml, Loader=PythonLoader) shouldn't be allowed. ------------------------------ You can view, comment on, or merge this pull request online at: #189 Commit Summary - Rework the safety related API code File Changes - *M* lib/yaml/__init__.py <https://github.com/yaml/pyyaml/pull/189/files#diff-0> (89) - *M* lib/yaml/cyaml.py <https://github.com/yaml/pyyaml/pull/189/files#diff-1> (20) - *M* lib/yaml/dumper.py <https://github.com/yaml/pyyaml/pull/189/files#diff-2> (10) - *M* lib/yaml/loader.py <https://github.com/yaml/pyyaml/pull/189/files#diff-3> (10) - *M* lib3/yaml/__init__.py <https://github.com/yaml/pyyaml/pull/189/files#diff-4> (89) - *M* lib3/yaml/cyaml.py <https://github.com/yaml/pyyaml/pull/189/files#diff-5> (20) - *M* lib3/yaml/dumper.py <https://github.com/yaml/pyyaml/pull/189/files#diff-6> (10) - *M* lib3/yaml/loader.py <https://github.com/yaml/pyyaml/pull/189/files#diff-7> (10) - *M* tests/lib/test_constructor.py <https://github.com/yaml/pyyaml/pull/189/files#diff-8> (4) - *M* tests/lib/test_recursive.py <https://github.com/yaml/pyyaml/pull/189/files#diff-9> (6) - *M* tests/lib3/test_constructor.py <https://github.com/yaml/pyyaml/pull/189/files#diff-10> (4) - *M* tests/lib3/test_recursive.py <https://github.com/yaml/pyyaml/pull/189/files#diff-11> (6) Patch Links: - https://github.com/yaml/pyyaml/pull/189.patch - https://github.com/yaml/pyyaml/pull/189.diff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#189>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAADBA-sJigXK-qUwOPe9jhov91EC0s1ks5uBVJ2gaJpZM4U8JP-> .

-- "I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire) "The people's good is the highest law." -- Cicero GPG Key fingerprint: D1B3 ADC0 E023 8CA6

ingydotnet · 2018-06-28T22:04:31Z

This relates to #187

ingydotnet · 2018-06-28T22:13:16Z

@alex I think this is an issue to be taken care of in the docs.

Adding a 'danger_' prefix to things doesn't get across the purpose of the function. The purpose is not to do something dangerous, it's to load python data. Which btw is only potentially dangerous (if you are using untrusted input).

The 'safe_' prefix, otoh does indicate the purpose. To load data that matches the safe schema and thus is believed not to trigger code using untrusted data.

perlpunk

looks good to me. I also suggested python_load. I think it's a good name because it enables !!python/... tags.

ingydotnet · 2018-06-28T23:04:23Z

@perlpunk interesting mnemonic 👍

Kirill gave me the pyyaml.org content, so we can update the docs there soon. He also suggested putting it all on https://readthedocs.org/ which I think is a great idea.

glyph · 2018-06-29T00:10:14Z

I was very excited to see the renaming to danger_ because this makes pyyaml uncharacteristically safe; far too many yaml libraries are the source of distressing remote code execution vulnerabilities.

danger_ is a good name for these functions. Please don't change the name.

The danger is multi-faceted:

Serialization of arbitrary Python objects with no restrictions makes it much easier to accidentally serialize data that was not meant to be present in the serialization. This increases the risk of catastrophically huge serialized payloads, unintended sensitive information being saved, etc. So it's dangerous in that you might serialize too much.
(De)-serialization of arbitrary objects with no explicit schema guarantees or type information makes it likely that you might rehydrate an object in a state that was valid in a previous version of your library, but is now invalid. Combined with the first point, it's likely that you accidentally serialize objects that you had no intent of enforcing a compatibility contract on, but are now committed to maintaining forever because they show up in your customer's saved state files. So it's dangerous because it may massively increase your maintenance and compatibility burden over time, and it's dangerous because it may load objects that cause your program to break in hard-to-debug ways.
Expressing this in terms of saving and loading "python objects" makes many things superficially easier. It makes the operations sound safe, unless you are the sort of programmer who has previously put your own eye out with a ZODB; but in that case, you'll know to just never use these functions, but this terminology actively misleads newcomers. By default, people should think of the "python schema" here as extremely scary and dangerous, and only feel safe calling these APIs after they have FULLY considered the implications. So this naming is dangerous because it puts the burden of understanding on the audience least prepared to do that understanding.

In my considered opinion, there are exactly 2 times when it is valid to serialize arbitrary objects in this way:

when you are in a prototyping environment and you know you are going to throw away all your saved data after a few iterations and commit to something more strictly enforced, and
when you are communicating data between two processes that are part of the same system, where one has launched the other, and you know for sure that they are running the exact same version of the code.

These are both fairly niche applications, neither of which should be the primary use of pyyaml, which is to produce and consume vaguely human-readable configuration documents. So the fact that danger_ naming will likely make this type of usage considerably more obscure is very much a good thing.

ingydotnet · 2018-06-29T01:09:41Z

@glyph,

There are a lot of dangerous operations in computing but I've never seen the prefix danger_ applied to any of them. The eval function is considered dangerous in any language but we don't call it danger_eval.

Any operation that you don't understand is potentially dangerous. The solution is to help people understand what is dangerous (and why) and what is not. We can do that by explaining things well in the docs.

I might be OK with adding a danger_load alias for the python_load call which is really the same as yaml.load(text, Loader=PythonLoader). So then you can use it if that's your style.

The other option is to get rid of both (danger_ and python_) shortcuts, since yaml.load() has already been made safe now. If people want to load unsafe things they could just be explicit with yaml.load(text, Loader=PythonLoader). (Note: we can't get rid of safe_load as it already exists in 3.12)

Regarding the danger_dump function, it's pretty useless and misleading because dumping is not known to be dangerous, and the default behavior of yaml.dump(...) should be to dump whatever it's given.

Regarding the naming of the Loader/Dumper classes, these really need to be named after the schemas they are enforcing.

Overall I think naming things 'danger' is both unprecedented and it's going to feel silly to be forced to use it after you understand how things work. I think it's a short-sighted solution, just as thinking of the YAML Data Language as a configuration language is a short-sighted point of view.

sigmavirus24 · 2018-06-29T01:40:35Z

I don't really have much time, but let me provide some things that are separate from the discussion of naming.

Let's say I have something I want to serialize with PyYAML. Let's say I've sanitized most of it but accidentally missed a random object and now I use yaml.dump to serialize it to YAML. I've now, unknowingly, caused it to include the !!python/. Unsuspectingly, I now go and try to load it with yaml.load and I can't. Further, if I had used danger_load my application may have broken in production.

I don't particularly care about the danger_ prefix or not (although I prefer it) but I think everything should be safe by default. I also agree with @glyph and @alex unsurprisingly that the python_ prefix is not quite helpful. Speaking as someone working on a product that had problems with serializing and deserializing python objects earlier this year that caused significant production downtime, I think the current API sufficiently protects users, which is what library designers need to be concerned about. Documenting one's way out of an unsafe paper bag is a disservice to users.

ingydotnet · 2018-06-29T02:14:45Z

@sigmavirus24 perhaps the best thing for now is to not pollute the import namespace with new functions that are not great. Since we can get what we want with explicit calls let's just use those for now.

People who want to serialize outside the Safe schema can do so without the sugar. If they don't like it, we will get real world feedback instead of subjective opinions.

Note: The safe_ functions have been in PyYAML since the start. Making things safe by default kind of defeats the purpose of having them there but they'll need to stay.

ingydotnet · 2018-06-29T04:34:37Z

@sigmavirus24 after thinking more on this I believe it would be wise to revert the #74 merge and let it sit out the 4.2 release. There is a ton going on already and this commit is the most contentious and breaks backward compatibility for almost all usage.

I am sorry that I wasn't paying attention when this got merged last August, but then again something of this magnitude shouldn't have gotten in without my signoff. I'm quite sure I would have had very similar reactions back then.

I agree that this is issue is very important, but I think it deserves a release of its own, and should not be lumped in with an already heavily overloaded 4.2 release.

glyph · 2018-06-29T04:41:21Z

There are a lot of dangerous operations in computing but I've never seen the prefix danger_ applied to any of them. The eval function is considered dangerous in any language but we don't call it danger_eval.

Yes, you're right, as an industry we have often put the public at risk and treated users' security somewhat cavalierly. It's understandable; software is a new discipline, few of us have any formal training, formal training has huge gaps and frequently falls behind; so we all have a lot to learn.
The term "hazardous materials" as popularized by https://cryptography.io/en/latest/hazmat/primitives/ would be an acceptable substitute for "danger".

glyph · 2018-06-29T04:46:28Z

... this commit is the most contentious and breaks backward compatibility for almost all usage ...

If you're going to reverse course on this, I think this CVE needs to be updated to indicate that subsequent versions are still vulnerable? http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-18342

ingydotnet · 2018-06-29T05:54:01Z

@glyph there is nothing inherently dangerous about the yaml.load function (or any language's eval function). If I'm using either on trusted data there's no reason to be forced to write code that looks like yaml.danger_load(trusted_yaml). It's annoying and confusing.

re the CVE, technically the course is already reversed as 4.1 was retracted from PyPI earlier today.
I find the CVE to be a bit ironic because safe_load has been a part of PyYAML since the 3.01 release, over 12 years ago.

cdent · 2018-06-29T08:35:46Z

From my standpoint, I think the danger* terms are potentially useful for sugar for the exposed interface that people commonly use as they signal "dragons!". safe_load, python_load and danger_load could all be facades over various calls to load with different Loaders (some the same). Under the hood, however, I think naming the loaders and schema "Danger" is something akin to over the top because it is sort of inflammatory and non-descriptive.

I think this option

The other option is to get rid of both (danger_ and python_) shortcuts, since yaml.load() has already been made safe now. If people want to load unsafe things they could just be explicit with yaml.load(text, Loader=PythonLoader)

is also a good one. The critical aspect of it is that you're doing something specific to overcome safe-by-default. This might also be reasonable for dump but I'm not really sure because dumping is not something I do a lot of. The scenario of confusion that @sigmavirus24 is pretty compelling.

With regard to where to go from here: safe-by-default seems to me like the keystone feature of a big new PyYAML 4, so releasing something without it doesn't seem like the way to go.

(And to be self-serving, but the sugar aspect of things is less critical to me than getting a clear and consistent definition of what "safe" will mean: #187.)

perlpunk · 2018-06-29T14:12:44Z

Just throwing in some thoughts.

If the majority thinks there has to be a shortcut for *load() that is not safe, I would call it unsafe_load for example, instead of danger_load.

I'm coming from perl and did not know what kinds of exploits are possible with yaml.load.
In the various perl YAML modules, exploits are not trivial to create and only work if they fit to the code that is doing the load (means that the module the exploit targets has to be already loaded). No modules are loaded (you can create an instance of a class without loading that module in perl). There is no eval or anything involved unless you enable a specific option to load literal perl code.

In pyyaml, the situation seems to be different, and it's very easy to inject any code. So I'm in favor of dropping python_load and danger_load to avoid having easy shortcuts.

I still don't like the name DangerLoader, but I can see why you think PythonLoader may sound too harmless. How about something like PythonCodeLoader or PythonEvalLoader?

sigmavirus24 · 2018-06-30T13:06:43Z

@perlpunk The old load behaviour allows for Remote Code Execution, see #193 (comment)

perlpunk · 2019-03-14T10:42:55Z

obsolete, see #257

perlpunk self-requested a review June 28, 2018 22:55

perlpunk approved these changes Jun 28, 2018

View reviewed changes

ingydotnet mentioned this pull request Jun 29, 2018

PyYAML 4.2 Release Plan #193

Closed

ingydotnet mentioned this pull request Jul 1, 2018

Reverting https://github.com/yaml/pyyaml/pull/74 #194

Closed

ingydotnet added this to To Do in 4.2 Release Planning Jul 2, 2018

ingydotnet moved this from To Do to In Progress in 4.2 Release Planning Jul 2, 2018

ingydotnet moved this from In Progress to Done in 4.2 Release Planning Jul 2, 2018

ingydotnet removed this from Done in 4.2 Release Planning Jul 2, 2018

acaprari mentioned this pull request Jul 23, 2018

Add pyyaml < 4 for CVE-2017-18342 pyupio/safety-db#2271

Closed

This was referenced Sep 4, 2018

Use 'yaml.safe_load' in 'load_yaml_from_docstring' marshmallow-code/apispec#278

Closed

Use yaml.safe_load() instead of load() marshmallow-code/apispec#281

Merged

perlpunk closed this Mar 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework the safety related API code #189

Rework the safety related API code #189

ingydotnet commented Jun 28, 2018

alex commented Jun 28, 2018 via email

ingydotnet commented Jun 28, 2018

ingydotnet commented Jun 28, 2018

perlpunk left a comment

ingydotnet commented Jun 28, 2018 •

edited

glyph commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

sigmavirus24 commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

glyph commented Jun 29, 2018

glyph commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

cdent commented Jun 29, 2018

perlpunk commented Jun 29, 2018

sigmavirus24 commented Jun 30, 2018

perlpunk commented Mar 14, 2019

Rework the safety related API code #189

Rework the safety related API code #189

Conversation

ingydotnet commented Jun 28, 2018

alex commented Jun 28, 2018 via email

ingydotnet commented Jun 28, 2018

ingydotnet commented Jun 28, 2018

perlpunk left a comment

Choose a reason for hiding this comment

ingydotnet commented Jun 28, 2018 • edited

glyph commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

sigmavirus24 commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

glyph commented Jun 29, 2018

glyph commented Jun 29, 2018

ingydotnet commented Jun 29, 2018

cdent commented Jun 29, 2018

perlpunk commented Jun 29, 2018

sigmavirus24 commented Jun 30, 2018

perlpunk commented Mar 14, 2019

ingydotnet commented Jun 28, 2018 •

edited