[ptexenc] Inconsistent error message #80

Closed
aminophen opened this issue May 29, 2019 · 10 comments

@aminophen
Member

See texjporg/platex#84

@aminophen
Member Author

the second "ſ" is converted to "^^c5^^bf".

This is reasonable enough, as "ſ" is the byte sequence "0xC5 0xBF" in UTF-8.

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.


However, I don't understand

! Package inputenc Error: Unicode character 顛 (U+C4CF)
(inputenc)                not set up for use with LaTeX.

why inputenc shows "U+C4CF".
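
The two conversions described above can be checked outside TeX. The following is a minimal Python sketch, not anything pTeX or ptexenc actually runs; the variable names are illustrative only. It shows that the UTF-8 bytes of "ſ" are 0xC5 0xBF and that the same two bytes, read as a single EUC-JP character, are "顛". It does not explain where "U+C4CF" comes from.

```python
# U+017F LATIN SMALL LETTER LONG S, the character used in \message{ſ}
long_s = "\u017f"

# Its UTF-8 encoding is the two bytes 0xC5 0xBF.
utf8_bytes = long_s.encode("utf-8")

# TeX-style ^^ notation for those bytes, as reported for the second "ſ".
caret_form = "".join(f"^^{b:02x}" for b in utf8_bytes)

# Reading the same two bytes as one EUC-JP character gives the kanji
# reported for the first "ſ".
as_euc_jp = utf8_bytes.decode("euc_jp")

print(utf8_bytes.hex(), caret_form, as_euc_jp)
```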

@aminophen
Member Author

When I read the comment from @JulienPalard, especially

e-pTeX 3.14159265-p3.8.1-180901-2.6 (utf8.euc) (TeX Live 2019/dev/Debian)
kpathsea version 6.3.1/dev
ptexenc version 1.3.7/dev
(from Debian Buster)
So, what did I get wrong?

It works with:
e-pTeX 3.14159265-p3.7.1-161114-2.6 (utf8.euc) (TeX Live 2017/Debian)
kpathsea version 6.2.3
ptexenc version 1.3.5
from Ubuntu bionic though.

my initial thought was that pTeX's behavior had changed due to #34; however, that turned out to be irrelevant. My guess is that @JulienPalard thought "it worked with TeX Live 2017" because LaTeX ignored the UTF-8 input instead of throwing an error. (FYI, \usepackage[utf8]{inputenc} became the default processing only in TL2018, according to latex3/latex2e#24.)


However, that does not answer my question: why does inputenc show "U+C4CF"?

@JulienPalard

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable (I may not understand your sentence properly, though). I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so the LaTeX is automatically generated by Sphinx.

@aminophen
Member Author

aminophen commented May 29, 2019

I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so the LaTeX is automatically generated by Sphinx.

In practice, you can try uplatex instead of platex: upLaTeX (upTeX) supports Unicode characters natively and is more compatible with the inputenc package. By design, pLaTeX (pTeX) has only limited support for Latin characters.

@aminophen
Member Author

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable

What I meant by "reasonable" was this: read charitably, such a conversion could be considered by design, because the origin of "顛" is easy to guess from the behavior ("0xC5BF" in EUC-JP). Of course, I'm not sure whether this is actually intended.


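(From here on, the comment was originally written in Japanese.)

The open questions have come down to the following two.

  • (1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?
  • (2) Where does the "U+C4CF" in the error "Unicode character 顛 (U+C4CF)", raised when the inputenc package is used, come from?

As for (1), it is not the same phenomenon as the following one, which was fixed in pTeX 3.1.4, but it smells somewhat similar:

o Fixed: when a character code entered in ^^ notation corresponds to the
first byte of a kanji, it was wrongly combined with the next character into a kanji.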

@h-kitagawa
Member

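(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?

I checked which arguments the print and print_kanji functions, which are used to stringify (and output) tokens, were called with. The result was: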

\message{ſ} % --> <c5><bf>顛
\message{顛} % --> [c5bf]顛

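So in both cases print_char(0xc5); print_char(0xbf); is called.

In pTeX, even when a value of 0x80 or above is passed to print_char, it is output as-is rather than in the ^^c5 form (so that Japanese characters can be output). I wish we could cleanly distinguish a "print_char called to output a Japanese character" from any other call.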
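
To illustrate why the two cases cannot be told apart, here is a toy Python model. It is only a sketch under stated assumptions: ByteOnlyOutput and emit are hypothetical names, not pTeX or ptexenc code, and the model assumes the output layer sees nothing but a byte stream that it treats as EUC-JP before converting it for a UTF-8 terminal.

```python
class ByteOnlyOutput:
    """Hypothetical output layer that sees only a stream of bytes."""

    def __init__(self):
        self.pending = bytearray()

    def print_char(self, byte):
        # Bytes >= 0x80 are passed through unchanged, as described above.
        self.pending.append(byte)

    def flush(self):
        # Assumption: the stream is treated as EUC-JP and converted for a
        # UTF-8 terminal, so two adjacent high bytes become one kanji.
        text = self.pending.decode("euc_jp", errors="replace")
        self.pending.clear()
        return text


def emit(data):
    """Send each byte through print_char, then return what is displayed."""
    out = ByteOnlyOutput()
    for byte in data:
        out.print_char(byte)
    return out.flush()


from_long_s = emit("ſ".encode("utf-8"))   # bytes 0xC5 0xBF from UTF-8 source
from_kanji = emit("顛".encode("euc_jp"))  # bytes 0xC5 0xBF from a real kanji

print(from_long_s == from_kanji)
```

In this model both calls produce the same displayed text, because the information about where the bytes came from is lost by the time they reach the output layer.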

@t-tk
Collaborator

t-tk commented May 30, 2019

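I have not verified this, but the behavior is probably as follows:

  • The input 0xC5 0xBF (ſ in UTF-8) is converted to ^^c5^^bf by ptexenc.
  • Normally, in body text, ^^c5^^bf is converted to LATIN SMALL LETTER LONG S by \usepackage[utf8]{inputenc}.
  • Inside \message{}, it is output in the form print_char(0xc5); print_char(0xbf);, but ptexenc converts it from EUC 0xC5BF (顛) to UTF-8 顛 on output.

One off-the-cuff solution:
For an 8-bit byte sequence, call print_char(0xc5); print_char(0xbf); and skip the EUC→UTF-8 conversion in ptexenc. For an EUC Japanese character, on the other hand, call print_kchar(0xc5bf); and perform the EUC→UTF-8 conversion in ptexenc.

Another option:
Use some kind of flag to distinguish a "print_char called for an 8-bit byte sequence" from a "print_char called to output a Japanese character", and use it to control whether ptexenc performs the EUC→UTF-8 conversion.

The former is the more principled approach but would probably require more changes. The latter would probably require fewer changes but may be a cheap hack. I wonder whether it would work.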
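
The first proposal can be sketched as a toy Python model. This is only an illustration under assumptions, not the change that was eventually made: TaggedOutput is a hypothetical class, and the model assumes that bytes passed to print_char reach a UTF-8 terminal unchanged while print_kchar converts a two-byte EUC-JP code to UTF-8.

```python
class TaggedOutput:
    """Hypothetical output layer with separate entry points."""

    def __init__(self):
        self.terminal = bytearray()  # bytes sent to a UTF-8 terminal

    def print_char(self, byte):
        # 8-bit byte: written as-is, with no EUC-JP to UTF-8 conversion.
        self.terminal.append(byte)

    def print_kchar(self, code):
        # Two-byte EUC-JP character: converted to UTF-8 on output.
        kanji = code.to_bytes(2, "big").decode("euc_jp")
        self.terminal.extend(kanji.encode("utf-8"))


out = TaggedOutput()
out.print_char(0xC5)     # the two bytes of UTF-8 "ſ"
out.print_char(0xBF)
out.print_kchar(0xC5BF)  # the kanji, as a single EUC-JP code

shown = out.terminal.decode("utf-8")
print(shown)
```

Because the caller chooses the entry point, the same byte values 0xC5 0xBF are displayed as "ſ" in one case and as "顛" in the other. The flag-based alternative would carry the same information in a flag rather than in a separate function.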

@JulienPalard

@aminophen Thanks for recommending uplatex; I had not heard of it before.

Is it possible to use it with Sphinx? I don't see it in the enum of latex_engine, and if I try it anyway, I get an error:

! LaTeX Error: This file needs format `pLaTeX2e'
               but this is `LaTeX2e'.

@aminophen
Member Author

@JulienPalard I've never used Sphinx, but it seems uplatex is not currently supported, according to sphinx-doc/sphinx#4186. You can join the discussion there, and you may get some information on how to add uplatex support.

@aminophen
Member Author

See #81; hopefully this is fixed in r61692.
