How to dump the formats Fold (>) or Block (|) using the pyyaml #693

Open
krishna3008 opened this issue Jan 17, 2023 · 2 comments

Comments

@krishna3008

Hi,

I have a YAML document that uses the fold (>) indicator, and I want to dump it with the same indicator (i.e., keep the folded block scalar) using the library. Is this possible?

E.g.:
groups: >-
  {
  '{{PREFIX }}-sys':[
  'read',
  'write',
  'annotate',
  'distribute'
  ]
  }

Also a format such as:
PREFIX: "{{ project_config.PROJECT_ACRONYM }}-
{{ project_config.PROJECT_CUSTOMER }}-
{{ project_config.PROJECT_ID }}"

Is it possible to dump the YAML with this formatting intact, or are there any arguments that can be added to get this format?

@MajorDallas

There's a work-around described in https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
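
For reference, a minimal sketch of that kind of work-around, assuming you are willing to mark the strings yourself before dumping. The names `LiteralStr`, `FoldedStr` and `StyledDumper` are made up for this example and are not part of PyYAML; only `add_representer` and `represent_scalar` are library calls.

```python
import yaml


class LiteralStr(str):
    """Hypothetical marker type: dump this string in literal style (|)."""


class FoldedStr(str):
    """Hypothetical marker type: dump this string in folded style (>)."""


class StyledDumper(yaml.SafeDumper):
    """Dumper subclass so the stock SafeDumper is left untouched."""


def represent_literal(dumper, data):
    # Ask the emitter for literal block style on this scalar.
    return dumper.represent_scalar("tag:yaml.org,2002:str", str(data), style="|")


def represent_folded(dumper, data):
    # Ask the emitter for folded block style on this scalar.
    return dumper.represent_scalar("tag:yaml.org,2002:str", str(data), style=">")


StyledDumper.add_representer(LiteralStr, represent_literal)
StyledDumper.add_representer(FoldedStr, represent_folded)

doc = {
    "script": LiteralStr("echo one\necho two\n"),
    "note": FoldedStr("a long sentence\n"),
}
text = yaml.dump(doc, Dumper=StyledDumper, sort_keys=False)
print(text)
```

The requested style is a hint: the emitter can still fall back to a quoted style when the content cannot be written as a block scalar (for example, when a line ends in a space).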

However: the fact that this (barely documented) work-around is necessary to begin with is currently my biggest complaint with PyYAML and YAML generally: it is seemingly not possible to parse a document into data, then serialize that data into YAML and get an identical document as output.
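
A small sketch of what I mean, using only `yaml.safe_load` and `yaml.safe_dump` with default arguments (the sample document is invented for illustration): the data survives the round trip, the text does not.

```python
import yaml

# A document whose only scalar value uses literal block style.
source = "foo: |\n  some multiline\n  value\n"

data = yaml.safe_load(source)
output = yaml.safe_dump(data)
print(output)

# The loaded data is the same on both ends...
same_data = yaml.safe_load(output) == data
# ...but the dumped text is not the text that was loaded.
same_text = output == source
print(same_data, same_text)
```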

Now, that's probably little more than an annoyance in the majority of cases... but ultimately it's the application that must use the string content of a scalar, and messing with the quoting could mess with the application. E.g., an unintentionally added chomp operator could strip a newline that might have been important (though admittedly most applications should be resilient to this, how could we forget mousepad bug 4824 that was open for 10 years?). Or, changing > to | would result in the output file being visually reflowed, possibly even triggering warnings from linters if, e.g., the new style results in lines longer than the set column width; going from | to >, you end up with double the number of lines in the document for each affected scalar. God forbid you have these documents in source control and have to review those diffs.

Kinda getting into yaml/yaml-spec#289 on this, wishing for a round-trip in which no modifications are made to produce "string-equal" representations on both ends. Personally I consider "semantic equivalence" to be determined by the consumer of the document, and the only way for PyYAML to ensure semantic equivalence for all consumers is to guarantee string equivalence between source and output in a no-change round-trip. That means (at least) retaining scalar style.

Ideally this would also extend so that if a scalar is modified by the application but its style is not, the same style is used in the output document, e.g.:

*** 1,3 ****
  foo: |
!   some multiline
!   value that will change
--- 1,2 ----
  foo: |
!   now it fits on one line, but we still use "|" style

Unfortunately, retaining the original quote style through a round trip won't be possible as long as the data is being loaded into a regular dict. It simply doesn't allow for that kind of metadata without some ugly hack like making every scalar a tuple like {"some_scalar": ("|", "the actual string"), "another_one": ("'", "the other string")}.

Probably the best solution would be to ditch dict and build a new type that implements MutableMapping and keeps all the Node objects created during parsing. You would access arbitrary keys through __getitem__ just as now, so existing code wouldn't break (except maybe static type checks expecting dict), but what it actually does is look up the Node object at the given path and return its value. During the dump process, instead of taking the value and making guesses about what style to use in the absence of a default_style argument, a ScalarNode object can actually be consulted for what style it was loaded with.
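
A rough sketch of that idea, under heavy assumptions: `NodeMapping` is a hypothetical name, it only handles a flat mapping with scalar keys and string values, and it returns the raw node value rather than a constructed Python object. It relies on `yaml.compose` to get the node tree and `yaml.serialize` to emit it, both of which carry the `style` attribute of each `ScalarNode`.

```python
from collections.abc import MutableMapping

import yaml


class NodeMapping(MutableMapping):
    """Hypothetical dict-like view over a composed MappingNode.

    Only handles a flat mapping with scalar keys and string values.
    """

    def __init__(self, node):
        self.node = node

    def _index(self, key):
        # MappingNode.value is a list of (key_node, value_node) pairs.
        for i, (key_node, _) in enumerate(self.node.value):
            if key_node.value == key:
                return i
        raise KeyError(key)

    def __getitem__(self, key):
        return self.node.value[self._index(key)][1].value

    def __setitem__(self, key, value):
        try:
            # Existing key: mutate the ScalarNode in place so its style survives.
            self.node.value[self._index(key)][1].value = value
        except KeyError:
            # New key: no recorded style, so the emitter picks one.
            tag = "tag:yaml.org,2002:str"
            self.node.value.append(
                (yaml.ScalarNode(tag, key), yaml.ScalarNode(tag, value))
            )

    def __delitem__(self, key):
        del self.node.value[self._index(key)]

    def __iter__(self):
        return (key_node.value for key_node, _ in self.node.value)

    def __len__(self):
        return len(self.node.value)


source = "foo: |\n  some multiline\n  value that will change\n"
doc = NodeMapping(yaml.compose(source, Loader=yaml.SafeLoader))
doc["foo"] = 'now it fits on one line, but we still use "|" style\n'
output = yaml.serialize(doc.node, Dumper=yaml.SafeDumper)
print(output)
```

This does not give string equality in general (comments, key spacing and the original folding of ">" scalars are still lost); it only shows that the style recorded on the node can be consulted at dump time.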

Some drawbacks:

  • A Node requires more RAM than its value alone.
  • A ScalarNode (subclass?) will have to store both the "raw" value as it appeared in the document before applying any style as well as the styled string. Otherwise, an unmodified ">" scalar would lose all folded newlines on output and no longer be string-equivalent. This almost doubles (or more) the memory footprint for ScalarNode.
  • A "Document" class like this would also be significantly larger and heavier than a dict.
  • Having to do an if isinstance(node, Node) for every node is going to slow down the dump process. This becomes important if dumping from an object where nodes were added but not as Node objects. This could maybe be avoided with a careful implementation of __setitem__ or other descriptors?
  • Even if you can avoid the conditional, it still adds time for an additional attribute lookup on every scalar node. This is probably insignificant for the vast majority of use-cases, and I would think that projects where speed is so important that attribute lookup time is significant are probably not using YAML or Python to begin with.

Those drawbacks can be (at least initially) addressed by making this opt-in as new Loader and Dumper subclasses that implement this behavior but are not used by default in dump et al. Then users can decide if they accept the trade-offs and optimizations could be added later. Technically any programmer like me that wants this behavior could implement such subclasses in their own projects, but imo it's not unreasonable for users to expect a round-trip to not introduce any changes to their document and this should have been the default behavior from the very beginning. At the very least, this should be an available option out of the box.

@nitzmahone
Member

There's no great way to do this today with stock PyYAML, but it's also not terribly difficult to do some lightweight customizations to smuggle parser details like this either on expando/subclass attrs of the various native representations, or in a weakref'd lookup table that relies on object identity (which can then be consulted by an associated dumper attempting to re-create state). It's also one of the ways I've hacked in comment roundtrip support before (no, I'm not shipping it 😆 ).
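
A minimal sketch of the subclass-attribute variant, assuming only string scalars need their style kept. `StyledStr`, `StyleLoader` and `StyleDumper` are illustrative names, not anything PyYAML ships; the library calls used are `add_constructor`, `add_representer`, `construct_scalar` and `represent_scalar`.

```python
import yaml


class StyledStr(str):
    """Hypothetical str subclass that remembers the scalar style it was loaded with."""

    style = None


class StyleLoader(yaml.SafeLoader):
    """Loader subclass so the stock SafeLoader is left untouched."""


class StyleDumper(yaml.SafeDumper):
    """Dumper subclass that consults StyledStr.style."""


def construct_styled_str(loader, node):
    # Smuggle the parser's style ('|', '>', '"', "'" or None) onto the value.
    value = StyledStr(loader.construct_scalar(node))
    value.style = node.style
    return value


def represent_styled_str(dumper, data):
    # Re-use the recorded style instead of letting the emitter guess.
    return dumper.represent_scalar(
        "tag:yaml.org,2002:str", str(data), style=data.style
    )


StyleLoader.add_constructor("tag:yaml.org,2002:str", construct_styled_str)
StyleDumper.add_representer(StyledStr, represent_styled_str)

source = 'groups: >-\n  some folded\n  text\nname: "quoted"\n'
data = yaml.load(source, Loader=StyleLoader)
output = yaml.dump(data, Dumper=StyleDumper, sort_keys=False)
print(output)
```

Only the style indicator comes back. The original line breaks inside the folded scalar are gone after loading, so the output is re-folded by the emitter, and any string the application replaces with a plain `str` loses its recorded style.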

As you've alluded, ultimately, there are several spec gray-areas (and not-so-gray areas) that kinda prevent doing this stuff generically, as well as some limitations of Python (and other YAML impl/deserialization languages) that make parts of it tricky to do. It's not impossible, and a lot of projects have managed to hack it in for their own needs, but I also wouldn't hold my breath for a generic solution anytime soon...
