Fix encoding of LatinRules.xdy (now UTF-8) #10655
This file is for xindy. On my TeX Live installation, support files for
As an example here is a screenshot of part of the
You see it maps Cyrillic letters to some non-ASCII single bytes. It seems to be hard-coded in xindy that this is how it knows how to sort indices for languages not limited to US-ASCII.
What is LatinRules.xdy for? It is an addition by Sphinx which takes the
Here is a screenshot from part of the
And a similar one from LatinRules.xdy
The objective is that Cyrillic-language documents that also index Latin words will have a correctly sorted index. What your PR does will most likely break that completely. It is in the nature of this file to be structured the way it is. I don't know if Lisp has some equivalent to Python's
I do not quite understand whether your problem is with the Sphinx sources or with the presence of
In the latter case, and if you don't use a Cyrillic language, you can put this in your
and create an empty
The exact list of languages is found at sphinx/sphinx/builders/latex/__init__.py Lines 92 to 94 in 9112cfe
One could imagine that Sphinx would not copy LatinRules.xdy over to the LaTeX build directory unless the project language is one of the above. Currently it does copy it, and it updates the produced Makefile/make.bat according to a template which reacts to whether the language is one of the above. Perhaps there is a way to add some marker to let |
The proposed change will break the purpose of the file. See detailed comment above.
I see, thanks for explaining. Now the comments at the top of the file make a lot more sense. Would you accept a PR that adds something like the following to the comment block at the top?
Your other suggested solutions sound like good ideas to me but are outside my expertise. |
Hi,
Not tested yet, but this gives me some hope:
You can specify characters by their hexadecimal character codes. A hexadecimal escape sequence consists of a backslash, ‘x’, and the hexadecimal character code. Thus, ‘?\x41’ is the character A, ‘?\x1’ is the character C-a, and ?\xe0 is the character à (a with grave accent). You can use any number of hex digits, so you can represent any character code in this way.
You can specify characters by their character code in octal. An octal escape sequence consists of a backslash followed by up to three octal digits; thus, ‘?\101’ for the character A, ‘?\001’ for the character C-a, and ?\002 for the character C-b. Only characters up to octal code 777 can be specified this way.
from https://www.gnu.org/software/emacs/manual/html_node/elisp/General-Escape-Syntax.html
Sorry for my everlasting Lisp ignorance. It may be that I can convert byte with code (octal)-266 into simply ?\266 notation and everyone will be happy.
Untested yet
|
Sadly, it seems I have been confusing Elisp with Common Lisp.
|
code-char (http://clhs.lisp.se/Body/f_code_c.htm) seems to be our friend! (Sorry for the piecemeal messages, but I currently have only small chunks of time. So far I have only tested, in a Common Lisp test file, that (code-char 128.) does what I hoped it would according to the documentation; I have not yet tested whether it can replace "<byte>" in the file defining the rules. If it can, our problem is solved.)
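To make the byte values in the messages above concrete, here is a minimal Python sketch (not xindy or Common Lisp code; the variable names are illustrative only). It shows that octal 266 is the single byte 0xB6, that this byte is not valid UTF-8 on its own, and that re-encoding it as UTF-8 (assuming a Latin-1 interpretation, which is an assumption of this sketch) turns it into two bytes.

```python
# Octal 266 written as one raw byte, the kind of byte discussed above.
raw = bytes([0o266])
assert raw == b"\xb6"

# Taken alone, 0xB6 is a UTF-8 continuation byte, so decoding it fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: interpret the byte as Latin-1, then re-encode as UTF-8.
# The result is two bytes, so the single-byte form is lost.
reencoded = raw.decode("latin-1").encode("utf-8")

print(is_valid_utf8, len(reencoded))
```

This is why a blanket conversion of such a file to UTF-8 changes its content rather than just its label.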
|
@alcrene I would gladly merge something like the above; please hard-wrap at about 70 characters. Perhaps something like this:
diff --git a/sphinx/texinputs/LatinRules.xdy b/sphinx/texinputs/LatinRules.xdy
index 99f14a2ee..8b6727779 100644
--- a/sphinx/texinputs/LatinRules.xdy
+++ b/sphinx/texinputs/LatinRules.xdy
@@ -1,6 +1,10 @@
-;; style file for xindy
+;; Common Lisp style file for xindy
;; filename: LatinRules.xdy
;;
+;; Please note that this is a data file using deliberately some
+;; strings with a single non-ascii bytes; this is intentional and
+;; follows the usage observed in similar xindy support files.
+;;
;; It is based upon xindy's files lang/general/utf8.xdy and
;; lang/general/utf8-lang.xdy which implement
;; "a general sorting order for Western European languages"
Thanks! About
I may have been too optimistic as |
byte not bytes or drop the |
Sounds good; I will do this. Looking into this a bit on my side, I found that one can tell git to treat certain files as binary, by adding a
I believe this may have avoided the auto-conversion that landed me here, but in any case seems appropriate. I would test this when I have a moment in the next few days, then update the PR with both the text change above and a new |
Could you tell what went wrong in your git set-up? I have never noticed problems in the past on my side. Are you on Windows? |
No, on Linux, and locally it's fine. |
In order to fulfill its function, LatinRules.xdy must use non-standard single-byte characters (neither ASCII nor multi-byte UTF-8). To someone encountering the file without knowing its purpose (e.g. because a post-processing step raises a warning about the unrecognized encoding), this is likely surprising, and may seem like a holdover from a time when Unicode wasn't as universally supported. The added comment should make clear that the file must stay as it is, and in particular that it must not be "standardized" to UTF-8.
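The distinction drawn above (ASCII, multi-byte UTF-8, or neither) can be checked mechanically. The following is a minimal Python sketch, assuming a hypothetical helper named classify_bytes; the sample byte strings are made up for illustration and are not the real contents of LatinRules.xdy.

```python
def classify_bytes(data: bytes) -> str:
    """Hypothetical helper: return 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Contains bytes that are neither ASCII nor valid UTF-8,
        # e.g. isolated high bytes.
        return "other"


# Made-up samples, not taken from the actual file.
plain = b'"a"'
single_high_byte = b'"\xb6"'
as_utf8 = '"\u00b6"'.encode("utf-8")

for sample in (plain, single_high_byte, as_utf8):
    print(sample, classify_bytes(sample))
```

A tool that sees the "other" case and "fixes" it by converting to UTF-8 would produce the third form, which is exactly the change the added comment is meant to discourage.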
Done; I've rebased and pushed the suggested text with minor changes. Adding a |
Thanks for the contribution! Could you add a space? Sentences end with two spaces in the existing comments.
(Sorry for nitpicking ;-) By the way, I notice that some sentences in the pre-existing comments lack a full stop.)
No problem with nitpicking ;-). |
LGTM, thanks!
Thanks a lot @alcrene for your contribution... and patience! I will merge after tests complete. |
Feature or Bugfix
Purpose
At the moment, sphinx/texinputs/LatinRules.xdy is encoded as
(as reported by file LatinRules.xdy):
This has at least two consequences:
the file's own raw view on GitHub.
a good thing), which causes headaches with diff and version control
(500 spurious line changes every time the document is rebuilt).
This commit simply re-encodes the file to UTF-8. (Compare raw view after change.)