Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding of LatinRules.xdy (now UTF-8) #10655

Merged
merged 2 commits into from Jul 18, 2022

Conversation

alcrene
Copy link
Contributor

@alcrene alcrene commented Jul 10, 2022

Feature or Bugfix

  • Bugfix

Purpose

At the moment, sphinx/texinputs/LatinRules.xdy is encoded as
(as reported by file LatinRules.xdy):

Non-ISO extended-ASCII text, with LF, NEL line terminators

This has at least two consequences:

  1. More basic text readers display the file incorrectly. This includes
    the file's own raw view on GitHub.
  2. Certain pipelines may automatically convert this to UTF-8 (ostensibly
    a good thing), which causes headaches with diff and version control
    (500 spurious line changes every time the document is rebuilt).

This commit simply re-encodes the file to UTF-8. (Compare raw view after change.)

@jfbu
Copy link
Contributor

jfbu commented Jul 12, 2022

This file is for xindy. On my TeXLive installation, support files for xindy are in texmf-dist/xindy. Of particular relevance here are the files named utf8.xdy that one finds in each texmf-dist/xindy/modules/lang/<language>/. You can check that also via CTAN.

As an example here is a screen-shot of part of the utf8.xdy for Bulgarian, when I asked Emacs to load the file assuming it is utf-8:
Capture d’écran 2022-07-12 à 09 24 23

You see it maps Cyrillic letters to some non-ascii one-bytes. It seems to be a hard-coded thing in xindy that this is how it will know how to sort indices for languages not-limited to US-ascii.

What is LatinRules.xdy for? It is an addition by Sphinx which takes the utf8.xdy from <xindy>/modules/lang/general/utf8.xdy which is a scheme of xindy to account for all Latin languages in one-go and replace the one-bytes by choices not colliding with those used for Bulgarian and other Cyrillic languages.

Here is a screenshot from part of the <xindy>/modules/lang/general/utf8.xdy from xindy distribution:
Capture d’écran 2022-07-12 à 09 21 42

And similar one from LatinRules.xdy
Capture d’écran 2022-07-12 à 09 22 45

The objective is that Cyrillic language documents also indexing Latin words will have a correctly sorted index.

What your PR does will most likely completely break that. It is in the nature of this file to be structured the way it is. I don't know if Lisp has some equivalent to Python's '\x80'. If it had we could probably replace the file with one using it and looking sane as a UTF-8 text file.

I do not quite understand if your problem is with the Sphinx sources or with the presence of LatinRules.xdy in LaTeX build repertory of your project.

In the later case and if you don't use a Cyrillic language you can put this in your conf.py

latex_additional_files = ['LatinRules.xdy']

and create an empty LatinRules.xdy in the source of your project.

The exact list of languages is found at

XINDY_CYRILLIC_SCRIPTS = [
'be', 'bg', 'mk', 'mn', 'ru', 'sr', 'sh', 'uk',
]

One could imagine that Sphinx would not copy over LatinRules.xdy to LaTeX build directory except if the project language is one of the above. Currently it does and updates produced Makefile/make.bat according to a template which will react to whether language is one of the above.

Perhaps there is a way to add some marker to let file consider this to be a binary file? Again I am not knowledgeable enough.

Copy link
Contributor

@jfbu jfbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change will break the purpose of the file. See detailed comment above.

@alcrene
Copy link
Contributor Author

alcrene commented Jul 12, 2022

I see, thanks for explaining. Now the comments at the top of the file make a lot more sense.
My problem is just with the presence of LatinRules.xdy in the build directory, so your solution would work.

Would you accept a PR that adds something like the following to the comment block at the top ?

;; Note that this file uses non-standard ASCII encoding. This is intentional and required for its intended purpose.

Your other suggested solutions sound like good ideas to me but are outside my expertise.

@jfbu
Copy link
Contributor

jfbu commented Jul 12, 2022 via email

@jfbu
Copy link
Contributor

jfbu commented Jul 12, 2022 via email

@jfbu
Copy link
Contributor

jfbu commented Jul 13, 2022 via email

@jfbu
Copy link
Contributor

jfbu commented Jul 13, 2022

Would you accept a PR that adds something like the following to the comment block at the top ?

;; Note that this file uses non-standard ASCII encoding. This is intentional and required for its intended purpose.

@alcrene I would gladly merge something like the above, please hard-wrap at about 70 characters. Perhaps something like this

diff --git a/sphinx/texinputs/LatinRules.xdy b/sphinx/texinputs/LatinRules.xdy
index 99f14a2ee..8b6727779 100644
--- a/sphinx/texinputs/LatinRules.xdy
+++ b/sphinx/texinputs/LatinRules.xdy
@@ -1,6 +1,10 @@
-;; style file for xindy
+;; Common Lisp style file for xindy
 ;; filename: LatinRules.xdy
 ;;
+;; Please note that this is a data file using deliberately some
+;; strings with a single non-ascii bytes; this is intentional and
+;; follows the usage observed in similar xindy support files.
+;;
 ;; It is based upon xindy's files lang/general/utf8.xdy and
 ;; lang/general/utf8-lang.xdy which implement
 ;; "a general sorting order for Western European languages"

Thanks!

About

code-char http://clhs.lisp.se/Body/f_code_c.htm http://clhs.lisp.se/Body/f_code_c.htm seems to be our friend !

I may have been too optimistic as code-char is implementation dependent in Common Lisp as it depends on built-in character encoding. (code-char 182.) (my attempt to represent byte with unsigned code 0o266 produced Unicode as far as I could tell; I failled into mastering Common Lisp specifics in 3 times 5 minutes crash courses...)

@jfbu
Copy link
Contributor

jfbu commented Jul 13, 2022

+;; strings with a single non-ascii bytes; this is intentional and

byte not bytes or drop the a before... one sees why it is best I don't do it myself ;-)

@alcrene
Copy link
Contributor Author

alcrene commented Jul 14, 2022

Sounds good; I will do this. Looking into this a bit on my side, I found that one can tell git to treat certain files as binary, by adding a .gitattributes file with

*.xdy -text -merge

I believe this may have avoided the auto-conversion that landed me here, but in any case seems appropriate.

I would test this when I have a moment in the next few days, then update the PR with both the text change above and a new .gitattributes.

@jfbu
Copy link
Contributor

jfbu commented Jul 14, 2022

I believe this may have avoided the auto-conversion that landed me here, but in any case seems appropriate.

Could you tell what went wrong in your git set-up? I have never noticed problems in the past on my side. Are you on Windows?

@alcrene
Copy link
Contributor Author

alcrene commented Jul 14, 2022

No, on Linux, and locally it's fine.
I'm experimenting to see if I can build locally using Sphinx, apply some automated clean-up, then push to Overleaf. It's Overleaf which automatically converts to UTF-8.
The idea is that this would put the text on a platform my collaborators are comfortable with, and changes they make would show up as merge conflicts which I can resolve in the source. For now this is just a low priority experiment; if I can get it to work halfway sensibly, then I might prioritize it more.

@tk0miya tk0miya added builder:latex type:proposal a feature suggestion labels Jul 16, 2022
In order to fulfill its function, LatinRules.xdy must use single,
non-standard byte characters (neither ASCII, nor multi-byte UTF-8).
To someone encountering the file without knowing its purpose (e.g.
due a post-processing raising a warning for the unrecognized encoding)
this is likely surprising, and may seem like a holdover from a time where
Unicode wasn't as universally supported.

The added comment should make clear that the file must stay as it is,
and in particular that it must not be "standardized" to UTF-8.
@alcrene
Copy link
Contributor Author

alcrene commented Jul 17, 2022

Done; I've rebased and pushed the suggested text with minor changes.

Adding a .gitattributes unfortunately didn't improve anything on the Overleaf side – since it doesn't seem to solve any real-world problem, I left it out.

Copy link
Contributor

@jfbu jfbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contribution! Can you add a space as a sentence is ended by two spaces in the existing comments?

(sorry for nitpicking ;-) btw I notice that some sentences n pre-existing comments sometimes lack a full stop)

sphinx/texinputs/LatinRules.xdy Outdated Show resolved Hide resolved
@alcrene
Copy link
Contributor Author

alcrene commented Jul 18, 2022

No problem with nitpicking ;-).
I updated with the suggested changes.

Copy link
Contributor

@jfbu jfbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jfbu
Copy link
Contributor

jfbu commented Jul 18, 2022

Thanks a lot @alcrene for your contribution... and patience! I will merge after tests complete.

@jfbu jfbu merged commit ee13a0b into sphinx-doc:5.x Jul 18, 2022
jfbu added a commit that referenced this pull request Jul 18, 2022
@alcrene alcrene deleted the fix-LatinRules-encoding branch July 18, 2022 20:27
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
builder:latex type:proposal a feature suggestion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants