Fix encoding of LatinRules.xdy (now UTF-8) #10655

alcrene · 2022-07-10T11:58:26Z

Feature or Bugfix

Bugfix

Purpose

At the moment, sphinx/texinputs/LatinRules.xdy has the following encoding
(as reported by file LatinRules.xdy):

Non-ISO extended-ASCII text, with LF, NEL line terminators

This has at least two consequences:

More basic text readers display the file incorrectly. This includes
the file's own raw view on GitHub.
Certain pipelines may automatically convert this to UTF-8 (ostensibly
a good thing), which causes headaches with diff and version control
(500 spurious line changes every time the document is rebuilt).

This commit simply re-encodes the file to UTF-8. (Compare raw view after change.)

jfbu · 2022-07-12T07:57:31Z

This file is for xindy. On my TeXLive installation, support files for xindy are in texmf-dist/xindy. Of particular relevance here are the files named utf8.xdy that one finds in each texmf-dist/xindy/modules/lang/<language>/. You can check that also via CTAN.

As an example here is a screen-shot of part of the utf8.xdy for Bulgarian, when I asked Emacs to load the file assuming it is utf-8:

You see it maps Cyrillic letters to some non-ascii one-bytes. It seems to be a hard-coded thing in xindy that this is how it will know how to sort indices for languages not-limited to US-ascii.

What is LatinRules.xdy for? It is an addition by Sphinx. It takes <xindy>/modules/lang/general/utf8.xdy, which is xindy's scheme to account for all Latin languages in one go, and replaces the single bytes with choices that do not collide with those used for Bulgarian and other Cyrillic languages.

Here is a screenshot from part of the <xindy>/modules/lang/general/utf8.xdy from xindy distribution:

And similar one from LatinRules.xdy

The objective is that Cyrillic language documents also indexing Latin words will have a correctly sorted index.

What your PR does will most likely completely break that. It is in the nature of this file to be structured the way it is. I don't know if Lisp has some equivalent to Python's '\x80'. If it had we could probably replace the file with one using it and looking sane as a UTF-8 text file.
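To illustrate why re-encoding breaks such a file, here is a minimal Python sketch (Python rather than Lisp, purely for illustration). It assumes the converter treats the input as Latin-1, which is only one plausible guess at what a pipeline might do. The byte value 0xB6 and the helper names are hypothetical and not copied from LatinRules.xdy.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def reencode_as_utf8(data: bytes, assumed_encoding: str = "latin-1") -> bytes:
    """Transcode data to UTF-8; the source encoding is a guess (assumption)."""
    return data.decode(assumed_encoding).encode("utf-8")


# Hypothetical quoted string holding one raw high byte (0xB6 chosen for illustration).
original = b'"\xb6"'
converted = reencode_as_utf8(original)

print(is_valid_utf8(original))         # the lone high byte is not valid UTF-8
print(is_valid_utf8(converted))        # the transcoded text is valid UTF-8
print(len(original), len(converted))   # the single byte became a two-byte sequence
```

After conversion the string no longer contains the single byte that the sorting rules rely on, which is the breakage described above.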

I do not quite understand whether your problem is with the Sphinx sources or with the presence of LatinRules.xdy in the LaTeX build directory of your project.

In the latter case, and if you don't use a Cyrillic language, you can put this in your conf.py

latex_additional_files = ['LatinRules.xdy']

and create an empty LatinRules.xdy in the source of your project.
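A minimal sketch of this workaround follows. Only latex_additional_files and the file name come from the suggestion above; the helper create_empty_override and the way it is called are assumptions made for illustration.

```python
from pathlib import Path

# In conf.py: ask Sphinx to copy this project-local file into the LaTeX build directory.
latex_additional_files = ['LatinRules.xdy']


def create_empty_override(source_dir: str) -> Path:
    """Create an empty LatinRules.xdy in the project source directory.

    Hypothetical helper, not part of Sphinx; creating the file by hand
    works just as well.
    """
    path = Path(source_dir) / latex_additional_files[0]
    path.write_bytes(b"")
    return path
```

Calling create_empty_override('.') once from the directory containing conf.py would be enough; the empty file can then be committed with the project sources.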

The exact list of languages is found at

sphinx/sphinx/builders/latex/__init__.py

Lines 92 to 94 in 9112cfe

    
    XINDY_CYRILLIC_SCRIPTS = [
        'be', 'bg', 'mk', 'mn', 'ru', 'sr', 'sh', 'uk',
    ]

One could imagine that Sphinx would not copy LatinRules.xdy to the LaTeX build directory unless the project language is one of the above. Currently it always copies the file, and it generates the Makefile/make.bat from a template that reacts to whether the language is one of the above.
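The idea of copying the file only when needed could be sketched as follows. This is not Sphinx's actual code: the function names are hypothetical, and matching on the first two letters of the language code is an assumption of this sketch.

```python
XINDY_CYRILLIC_SCRIPTS = ['be', 'bg', 'mk', 'mn', 'ru', 'sr', 'sh', 'uk']


def needs_latin_rules(language: str) -> bool:
    """Hypothetical check: True if the project language is in the Cyrillic list.

    Comparing only the first two letters (so 'ru_RU' counts as 'ru') is an
    assumption of this sketch.
    """
    return language[:2] in XINDY_CYRILLIC_SCRIPTS


def support_files_to_copy(language: str, all_files: list) -> list:
    """Return the support files to copy, skipping LatinRules.xdy when unused."""
    if needs_latin_rules(language):
        return list(all_files)
    return [name for name in all_files if name != 'LatinRules.xdy']
```

For example, support_files_to_copy('en', names) would drop LatinRules.xdy from names, while support_files_to_copy('bg', names) would keep it.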

Perhaps there is a way to add some marker to let file consider this to be a binary file? Again I am not knowledgeable enough.

jfbu

The proposed change will break the purpose of the file. See detailed comment above.

alcrene · 2022-07-12T08:49:48Z

I see, thanks for explaining. Now the comments at the top of the file make a lot more sense.
My problem is just with the presence of LatinRules.xdy in the build directory, so your solution would work.

Would you accept a PR that adds something like the following to the comment block at the top ?

;; Note that this file uses non-standard ASCII encoding. This is intentional and required for its intended purpose.

Your other suggested solutions sound like good ideas to me but are outside my expertise.

jfbu · 2022-07-12T13:18:39Z

Hi, not tested yet, but this gives me some hope:

You can specify characters by their hexadecimal character codes. A hexadecimal escape sequence consists of a backslash, ‘x’, and the hexadecimal character code. Thus, ‘?\x41’ is the character A, ‘?\x1’ is the character C-a, and ?\xe0 is the character à (a with grave accent). You can use any number of hex digits, so you can represent any character code in this way. You can specify characters by their character code in octal. An octal escape sequence consists of a backslash followed by up to three octal digits; thus, ‘?\101’ for the character A, ‘?\001’ for the character C-a, and ?\002 for the character C-b. Only characters up to octal code 777 can be specified this way.

from https://www.gnu.org/software/emacs/manual/html_node/elisp/General-Escape-Syntax.html Sorry for my everlasting Lisp ignorance. It may be that I can convert the byte with (octal) code 266 into simply ?\266 notation and everyone will be happy. Untested yet.

jfbu · 2022-07-12T16:03:18Z

sadly it seems I have been confusing ELISP with Common Lisp

jfbu · 2022-07-13T07:32:47Z

code-char (http://clhs.lisp.se/Body/f_code_c.htm) seems to be our friend! (Sorry for the piece-by-piece messages, but I currently have only little chunks of time. So far I have only tested, in a Common Lisp test file, that (code-char 128.) did what I was hoping it would do according to the doc. I have not yet tested whether it can be put as a replacement of "<byte>" in the file defining the rules; if it can, our problem is solved.)

jfbu · 2022-07-13T16:23:21Z

Would you accept a PR that adds something like the following to the comment block at the top ?
;; Note that this file uses non-standard ASCII encoding. This is intentional and required for its intended purpose.

@alcrene I would gladly merge something like the above, please hard-wrap at about 70 characters. Perhaps something like this

diff --git a/sphinx/texinputs/LatinRules.xdy b/sphinx/texinputs/LatinRules.xdy
index 99f14a2ee..8b6727779 100644
--- a/sphinx/texinputs/LatinRules.xdy
+++ b/sphinx/texinputs/LatinRules.xdy
@@ -1,6 +1,10 @@
-;; style file for xindy
+;; Common Lisp style file for xindy
 ;; filename: LatinRules.xdy
 ;;
+;; Please note that this is a data file using deliberately some
+;; strings with a single non-ascii bytes; this is intentional and
+;; follows the usage observed in similar xindy support files.
+;;
 ;; It is based upon xindy's files lang/general/utf8.xdy and
 ;; lang/general/utf8-lang.xdy which implement
 ;; "a general sorting order for Western European languages"

Thanks!

About

code-char (http://clhs.lisp.se/Body/f_code_c.htm) seems to be our friend!

I may have been too optimistic, as code-char is implementation-dependent in Common Lisp: it depends on the built-in character encoding. (code-char 182.), my attempt to represent the byte with unsigned code 0o266, produced Unicode ¶ as far as I could tell. (I failed to master Common Lisp specifics in three 5-minute crash courses...)
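The distinction between a character code and a raw byte can be seen in a small Python sketch. Python is used here only as an illustration; it says nothing about how any particular Common Lisp implementation behaves. A character code of 182 names the Unicode character ¶, and the bytes written out for that character depend on the output encoding.

```python
code = 0o266          # octal 266 is decimal 182
char = chr(code)      # the Unicode character with that code point

# The same character becomes different byte sequences under different encodings.
as_latin1 = char.encode("latin-1")   # a single byte
as_utf8 = char.encode("utf-8")       # a two-byte sequence

print(code, char)
print(as_latin1, as_utf8)
```

Only the Latin-1 form matches the single byte the rules file needs, so a code-to-character function alone does not guarantee the right byte ends up in the string.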

jfbu · 2022-07-13T16:24:34Z

+;; strings with a single non-ascii bytes; this is intentional and

byte not bytes or drop the a before... one sees why it is best I don't do it myself ;-)

alcrene · 2022-07-14T08:04:00Z

Sounds good; I will do this. Looking into this a bit on my side, I found that one can tell git to treat certain files as binary, by adding a .gitattributes file with

*.xdy -text -merge

I believe this may have avoided the auto-conversion that landed me here, but in any case seems appropriate.

I would test this when I have a moment in the next few days, then update the PR with both the text change above and a new .gitattributes.

jfbu · 2022-07-14T09:14:46Z

I believe this may have avoided the auto-conversion that landed me here, but in any case seems appropriate.

Could you tell what went wrong in your git set-up? I have never noticed problems in the past on my side. Are you on Windows?

alcrene · 2022-07-14T16:09:41Z

No, on Linux, and locally it's fine.
I'm experimenting to see if I can build locally using Sphinx, apply some automated clean-up, then push to Overleaf. It's Overleaf which automatically converts to UTF-8.
The idea is that this would put the text on a platform my collaborators are comfortable with, and changes they make would show up as merge conflicts which I can resolve in the source. For now this is just a low priority experiment; if I can get it to work halfway sensibly, then I might prioritize it more.

In order to fulfill its function, LatinRules.xdy must use single, non-standard byte characters (neither ASCII, nor multi-byte UTF-8). To someone encountering the file without knowing its purpose (e.g. due to a post-processing step raising a warning for the unrecognized encoding) this is likely surprising, and may seem like a holdover from a time when Unicode wasn't as universally supported. The added comment should make clear that the file must stay as it is, and in particular that it must not be "standardized" to UTF-8.

alcrene · 2022-07-17T19:09:07Z

Done; I've rebased and pushed the suggested text with minor changes.

Adding a .gitattributes unfortunately didn't improve anything on the Overleaf side – since it doesn't seem to solve any real-world problem, I left it out.

jfbu

Thanks for the contribution! Can you add a space, since sentences end with two spaces in the existing comments?

(sorry for nitpicking ;-) btw I notice that some sentences in pre-existing comments sometimes lack a full stop)

sphinx/texinputs/LatinRules.xdy

alcrene · 2022-07-18T12:20:52Z

No problem with nitpicking ;-).
I updated with the suggested changes.

jfbu

LGTM, thanks!

jfbu · 2022-07-18T16:39:08Z

Thanks a lot @alcrene for your contribution... and patience! I will merge after tests complete.

jfbu reviewed Jul 12, 2022

tk0miya added the builder:latex and type:proposal labels Jul 16, 2022

alcrene force-pushed the fix-LatinRules-encoding branch from b4a5ccb to 184c00e July 17, 2022 19:02

jfbu approved these changes Jul 18, 2022

jfbu requested changes Jul 18, 2022

alcrene force-pushed the fix-LatinRules-encoding branch from 905c8d3 to dadf2d4 July 18, 2022 12:17

Fix punctuation in comments

37267fe

alcrene force-pushed the fix-LatinRules-encoding branch from dadf2d4 to 37267fe July 18, 2022 12:18

jfbu approved these changes Jul 18, 2022

jfbu merged commit ee13a0b into sphinx-doc:5.x Jul 18, 2022

jfbu added a commit that referenced this pull request Jul 18, 2022

Update CHANGES for PR #10655

a340427

alcrene deleted the fix-LatinRules-encoding branch July 18, 2022 20:27

github-actions bot locked as resolved and limited conversation to collaborators Aug 18, 2022
