Parser agnostic i18n post transform #12238

n-peugnet · 2024-04-07T18:40:48Z

This is a proof of concept to fix #8852, based on the idea I described in #8852 (comment).

Feature or Bugfix

Bugfix

Purpose

By adding a parse_inline() function to the Parser, we can get rid of all the RST specific hacks that the i18n post transform contained:

The title hack is not needed anymore, since we only parse inline elements.
The literal block hack has been moved to RSTParser.parse_inline, leaving the post_transform parser agnostic.
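To make the idea concrete, here is a minimal sketch of what a parser-agnostic hook could look like. It does not use Sphinx or docutils: `Node`, `ToyRSTParser`, `ToyMarkdownParser` and `translate` are hypothetical names made up for illustration, and the real `parse_inline()` signature is still open for discussion.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Tiny stand-in for a docutils node (hypothetical, illustration only)."""
    tag: str
    text: str = ''
    children: list[Node] = field(default_factory=list)


class Parser:
    """Base parser: subclasses opt in to inline parsing."""

    def parse_inline(self, inputstring: str) -> Node:
        raise NotImplementedError


def _split_spans(text: str, delim: str) -> Node:
    # Chunks at odd indexes sit between a pair of delimiters.
    para = Node('paragraph')
    for i, chunk in enumerate(text.split(delim)):
        if chunk:
            para.children.append(Node('literal' if i % 2 else 'text', chunk))
    return para


class ToyRSTParser(Parser):
    """Recognises ``literal`` spans, rST style."""

    def parse_inline(self, inputstring: str) -> Node:
        return _split_spans(inputstring, '``')


class ToyMarkdownParser(Parser):
    """Recognises `literal` spans, Markdown style."""

    def parse_inline(self, inputstring: str) -> Node:
        return _split_spans(inputstring, '`')


def translate(parser: Parser, msgstr: str) -> Node:
    """Parser-agnostic step: no markup-specific logic lives here."""
    return parser.parse_inline(msgstr)
```

With this shape, the translation step only ever calls `parse_inline()` and all markup knowledge stays in the parser subclass.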

I also successfully implemented this function in MyST-Parser, which I think is the second most used Sphinx parser: executablebooks/MyST-Parser@master...n-peugnet:MyST-Parser:add-parse-inline

Detail

I am not yet sure about the API and I am absolutely open to discussions about this.
In the process of removing RST specific code, I had to stop emitting RST code for images (now they won't be parsed anyway with parse_inline()), so I chose to instead only emit the url as the message to be translated.
Using diff -r, I compared the output of the French translation of Sphinx's docs: the result is identical except for the autodoc and Python module parts. To reproduce this check, I recommend running git reset master to keep the same commit hash (otherwise a lot of pages differ) and commenting out the unimplemented parse_inline() method of the Parser class. This keeps the same inventory as the master branch.
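For anyone who prefers to script this comparison, a rough Python equivalent of the diff -r check could look like the sketch below. The helper name `differing_files` is hypothetical. Note that `filecmp.dircmp` uses a shallow comparison by default, so it is a quicker but less strict check than diff -r.

```python
import filecmp
from pathlib import Path


def differing_files(left: str, right: str) -> list[str]:
    """Return relative paths that differ or exist on one side only."""
    found: list[str] = []

    def walk(cmp: filecmp.dircmp, prefix: Path) -> None:
        # diff_files: present on both sides but different;
        # left_only / right_only: missing on the other side;
        # funny_files: could not be compared.
        names = cmp.diff_files + cmp.left_only + cmp.right_only + cmp.funny_files
        for name in names:
            found.append(str(prefix / name))
        # Recurse into directories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(left, right), Path())
    return sorted(found)
```

Calling it on the two build output directories (for example the HTML built from master and from this branch) returns the list of pages to inspect by hand.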

Relates

From my testing, it fixes at least executablebooks/MyST-Parser#852, and could potentially fix all the issues linked in #8852 (I haven't checked yet).

Fixes #8852
Fixes executablebooks/MyST-Parser#444
Fixes executablebooks/MyST-Parser#852
Fixes #12287


chrisjsewell · 2024-04-07T21:30:10Z

sphinx/parsers.py

+        self.statemachine = states.RSTStateMachine(
+            state_classes=self.state_classes,
+            initial_state='Text',
+            debug=document.reporter.debug_flag,
+        )
+
+        inputlines = StringList([inputstring], document.current_source)
+
+        self.decorate(inputlines)
+        self.statemachine.run(inputlines, document, inliner=self.inliner)
+        self.finish_parse()
+        if has_literal:
+            p = document[0]
+            assert isinstance(p, nodes.paragraph)
+            p += nodes.Text(':')
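The literal-block handling around this snippet can be summarised with a small standalone sketch. How `has_literal` is computed is not visible in this excerpt, so the `endswith('::')` test below is an assumption. The helper names are hypothetical, and plain strings stand in for docutils nodes.

```python
from __future__ import annotations


def split_literal_marker(text: str) -> tuple[str, bool]:
    """Strip a trailing '::' so the parser never sees a literal-block marker."""
    # Assumption: the marker is detected with a plain suffix check.
    has_literal = text.endswith('::')
    if has_literal:
        text = text[:-2]
    return text, has_literal


def restore_literal_marker(children: list[str], has_literal: bool) -> list[str]:
    """Add back the single ':' that rST would have rendered for 'text::'."""
    if has_literal:
        children = [*children, ':']
    return children
```

The point of the round trip is that the parser only ever sees text without the marker, so it has no reason to warn about a missing literal block.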


This should all use self.inliner, so that only inline syntax is parsed: https://github.com/live-clones/docutils/blob/d50e1676a87f5a495f1a5a0f447e8da9317e1195/docutils/docutils/parsers/rst/states.py#L614

Well this was my initial idea, but it was a lot more complicated to implement. I don't remember if I managed to make it work in the end, but if I did, then it had the same result as using Text as the initial_state (the :: without literal block were still causing issues), as I did on line 77

sphinx/sphinx/parsers.py

Line 77 in 5994ca5

initial_state='Text',

Ok yes I indeed managed to make it work, but it looked like this:

diff --git a/sphinx/parsers.py b/sphinx/parsers.py
index 09ee7e8ff..a99c6d498 100644
--- a/sphinx/parsers.py
+++ b/sphinx/parsers.py
@@ -7,7 +7,7 @@ from typing import TYPE_CHECKING
 import docutils.parsers
 import docutils.parsers.rst
 from docutils import nodes
-from docutils.parsers.rst import states
+from docutils.parsers.rst import states, languages
 from docutils.statemachine import StringList
 from docutils.transforms.universal import SmartQuotes
 
@@ -71,18 +71,26 @@ class RSTParser(docutils.parsers.rst.Parser, Parser):
         if has_literal:
             inputstring = inputstring[:-2]
 
-        self.setup_parse(inputstring, document)  # type: ignore[arg-type]
-        self.statemachine = states.RSTStateMachine(
-            state_classes=self.state_classes,
-            initial_state='Text',
-            debug=document.reporter.debug_flag,
-        )
-
-        inputlines = StringList([inputstring], document.current_source)
-
-        self.decorate(inputlines)
-        self.statemachine.run(inputlines, document, inliner=self.inliner)
-        self.finish_parse()
+        language = languages.get_language(
+            document.settings.language_code, document.reporter)
+        if self.inliner is None:
+            inliner = states.Inliner()
+        else:
+            inliner = self.inliner
+        inliner.init_customizations(document.settings)
+        memo = states.Struct(document=document,
+                             reporter=document.reporter,
+                             language=language,
+                             title_styles=[],
+                             section_level=0,
+                             section_bubble_up_kludge=False,
+                             inliner=inliner)
+        memo.reporter.get_source_and_line = lambda x: (document.source, x)
+        textnodes, _ = inliner.parse(inputstring, 1, memo, document)
+        p = nodes.paragraph(inputstring, '', *textnodes)
+        p.source = document.source
+        p.line = 1
+        document.append(p)
         if has_literal:
             p = document[0]
             assert isinstance(p, nodes.paragraph)

then it had the same result as using Text

Text also parses definition lists and section titles; it's definitely much cleaner to use the proper inline parsing, even if docutils does not make this as easy 😒

textnodes, _ = inliner.parse(inputstring, 1, memo, document)

Here you should also have parse_inline take the actual line number and pass it through. Also, the _ holds system_message nodes, which should be appended to the paragraph rather than discarded.
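As a rough illustration of this suggestion, the sketch below threads a caller-supplied line number through the inline parse and keeps the returned messages instead of dropping them. It is a toy model only: `toy_inline` is a hypothetical stand-in that merely mimics a parse returning a `(nodes, messages)` pair, and `Node` is not a docutils class.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Tiny stand-in for a docutils node (hypothetical, illustration only)."""
    tag: str
    text: str = ''
    line: int | None = None
    children: list[Node] = field(default_factory=list)


def toy_inline(text: str, lineno: int) -> tuple[list[Node], list[Node]]:
    """Hypothetical inliner returning (nodes, messages)."""
    parts = text.split('*')
    if len(parts) % 2 == 0:
        # An odd number of '*' means one emphasis marker is never closed.
        message = Node('system_message', 'unbalanced emphasis marker', lineno)
        return [Node('text', text, lineno)], [message]
    nodes = [Node('emphasis' if i % 2 else 'text', chunk, lineno)
             for i, chunk in enumerate(parts) if chunk]
    return nodes, []


def parse_inline(text: str, lineno: int = 1) -> Node:
    """Build one paragraph, keeping the real line number and the messages."""
    children, messages = toy_inline(text, lineno)
    para = Node('paragraph', line=lineno)
    para.children.extend(children)
    para.children.extend(messages)  # keep diagnostics instead of discarding them
    return para
```

In this model the caller decides which line to report, so the function is not tied to input that starts at line 1 of a source file.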

But since the message is translated in .po files, there is no real source line number to provide.

Ok, but here you are proposing to add a "generic" RSTParser.parse_inline method, so it needs to handle more than just this special use case, i.e. you can't rely on the line being at the top of the source file

Got it, I'll try to add it, but I'm not sure how to test it.

Well.. if this proposal is to be accepted, then really it needs to have proper "generic" tests, not just specific to the i18n use case, as obviously it could be used for other use cases

Ok I get it. Sorry I was too focused on the problem I was trying to solve, but you are right.

Regardless of the implementation details, what do you think about this proposal? As this is more of a proof of concept than a finished product.

Yeh I mean, coming from the MyST perspective, in principle I am certainly in favor of removing "rST hard-coded" aspects of the code base 😄 (the other big problematic aspect of sphinx for this is executablebooks/MyST-Parser#228)

But indeed, it is quite a "core" addition to sphinx, more broad reaching than just this use case,
so I would obviously want to be very careful (and have good agreement from other maintainers) before merging anything

Of course, I marked it as "draft" to make it clearer that it is not ready yet.

n-peugnet added 6 commits April 7, 2024 17:25

WIP: New i18n logic based on inline_parse function

0ab14a0

Allows to not rely on strange hacks that are RST dependant. There is still an issue With the warning of missing literal block

Avoid 'Literal block expected; none found.' warnings

2682f8b

Update gettext builder output in tests

38c004b

parse_inline's input string is always a string

9a2510a

Only output a single paragraph in RSTParser's parse_inline

3c97944

So we trim the literal suffix to avoid warnings and we add it back at the end

Image nodes are handled separately, and literal can simply be manuall…

66b14d3

…y patched

n-peugnet mentioned this pull request Apr 7, 2024

Allow non-RST parsers to substitute the Locale transform #8852

Open

Fix ruff lints + more useful comment

5994ca5

n-peugnet mentioned this pull request Apr 7, 2024

Warning "local id not found in doc" in translated docs since MyST-parser 0.19.0 executablebooks/MyST-Parser#844

Open

chrisjsewell reviewed Apr 7, 2024

n-peugnet added 2 commits April 8, 2024 22:39

Fix last ruff error

0794348

Remove unused type:ignore annotation

358c611

n-peugnet marked this pull request as draft April 8, 2024 20:59

Simplify parse_inline literal block handling for rST

94b9b9d

This was referenced Apr 14, 2024

Add more i18n tests #12277

Merged

Translated parsed-literals are incorrectly rendered #12287

Open

n-peugnet added 4 commits April 15, 2024 23:25

Fix rendering for parsed-literals

17455d6

Fixes sphinx-doc#12287

Merge remote-tracking branch 'origin/master' into new-inline-parse-i1…

c10b0ec

…8n-logic

Regression test for parsed literals translation

0fc3f28

See <sphinx-doc#12287>

Regression test for strange markup

fe7675f

See <sphinx-doc#12277 (comment)>

chrisjsewell mentioned this pull request May 7, 2024

Allow access to the parser during the read phase #12361

Open
