[ptexenc] Inconsistent error message #80

aminophen · 2019-05-29T11:34:41Z

aminophen · 2019-05-29T11:49:13Z

the second "ſ" is converted to "^^c5^^bf".

This is reasonable enough, as "ſ" is "0xC5 0xBF" in UTF-8 byte sequence.

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

However, I don't understand

! Package inputenc Error: Unicode character 顛 (U+C4CF)
(inputenc)                not set up for use with LaTeX.

why inputenc shows "U+C4CF".
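
The two conversions described above can be reproduced outside TeX. The following is a minimal Python sketch, not pTeX or ptexenc code; `caret_notation` is a hypothetical helper written only for this illustration. It also prints the actual Unicode code point of the kanji so that it can be compared with the value in the error message.

```python
# The long s as it would appear in a UTF-8 source file.
long_s = "\u017f"  # LATIN SMALL LETTER LONG S
utf8_bytes = long_s.encode("utf-8")


def caret_notation(data: bytes) -> str:
    """Render bytes >= 0x80 in TeX-style ^^xx notation (hypothetical helper)."""
    return "".join(f"^^{b:02x}" if b >= 0x80 else chr(b) for b in data)


# Reading the same two bytes as a single EUC-JP character yields a kanji.
as_euc = utf8_bytes.decode("euc_jp")

print(utf8_bytes.hex())            # c5bf
print(caret_notation(utf8_bytes))  # ^^c5^^bf
print(as_euc, f"U+{ord(as_euc):04X}")
```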

aminophen · 2019-05-29T12:51:29Z

When I read the comment from @JulienPalard, especially

e-pTeX 3.14159265-p3.8.1-180901-2.6 (utf8.euc) (TeX Live 2019/dev/Debian)
kpathsea version 6.3.1/dev
ptexenc version 1.3.7/dev
(from Debian Buster)
So, what did I get wrong?

It works with:
e-pTeX 3.14159265-p3.7.1-161114-2.6 (utf8.euc) (TeX Live 2017/Debian)
kpathsea version 6.2.3
ptexenc version 1.3.5
from Ubuntu bionic though.

my initial thought was that pTeX's behavior had changed because of #34; however, that turned out to be irrelevant. My guess is that @JulienPalard thought "it worked with TeX Live 2017" because LaTeX ignored the UTF-8 input instead of throwing an error. (FYI, the default processing of \usepackage[utf8]{inputenc} started only in TL2018, according to latex3/latex2e#24.)

However, that does not answer my question: why does inputenc show "U+C4CF"?

JulienPalard · 2019-05-29T15:47:03Z

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable (I may not understand your sentence properly, though). I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so it's LaTeX automatically generated by Sphinx.

aminophen · 2019-05-29T22:36:56Z

I'm working on a document written in UTF-8 having both CJK characters AND ſ (LATIN SMALL LETTER LONG S) used as an example, along with a kelvin sign and some others.

For reference, it's the PDF version of https://docs.python.org/ja/3/howto/regex.html#compilation-flags, so it's LaTeX automatically generated by Sphinx.

In practice, you can try uplatex instead of platex; upLaTeX (upTeX) supports Unicode characters natively and is more compatible with the inputenc package. By design, pLaTeX (pTeX) has limited support for Latin characters.

aminophen · 2019-05-30T09:16:20Z

The first "ſ" is converted to "顛"

This is also reasonable enough, as "顛" comes from "0xC5BF" of EUC-JP.

I'm not sure how it's reasonable

What I meant by "reasonable" was the following: read charitably, such a conversion could be called intentional, because the origin of "顛" is easy to guess from the behavior ("0xC5BF" in EUC-JP). Of course, I'm not sure this is actually intended.

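From here on, the discussion was originally written in Japanese.

Two questions remain.

(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?
(2) Where does the "U+C4CF" in the inputenc error "Unicode character 顛 (U+C4CF)" come from?

Question (1) is not the same phenomenon as the one fixed in pTeX 3.1.4,

o Fixed: when a character code entered in ^^ notation was the first byte of a kanji,
pTeX tried to combine it with the following character into a kanji.

but it smells somewhat similar.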

h-kitagawa · 2019-05-30T12:15:04Z

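(1) Why was the "ſ" that \message{ſ} should print to the terminal (the UTF-8 byte sequence "0xC5 0xBF" in the source) converted to the kanji "顛" ("0xC5BF" in EUC-JP)?

I checked which arguments the print and print_kanji functions, which are used to stringify (and output) tokens, were called with. The result was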

\message{ſ} % --> <c5><bf>顛
\message{顛} % --> [c5bf]顛

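so in both cases print_char(0xc5); print_char(0xbf); ends up being called.

In pTeX, print_char outputs values of 0x80 and above as they are, without converting them to the ^^c5 form (so that Japanese characters can be output). I wish there were a clean way to tell whether a given print_char call was made for Japanese character output or not.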
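
The observation above can be illustrated with a toy trace. This is a minimal Python sketch, not pTeX code: only the names print_char and print_kanji come from this thread, while `print_bytes` and the call log are assumptions made for the illustration. It shows that once both paths are reduced to print_char calls, nothing in the calls themselves tells the two cases apart.

```python
calls = []  # log of (function name, argument) pairs


def print_char(byte: int) -> None:
    """Toy print_char: only records that it was called."""
    calls.append(("print_char", byte))


def print_bytes(data: bytes) -> None:
    """Hypothetical stand-in for printing 8-bit character tokens one by one."""
    for b in data:
        print_char(b)


def print_kanji(euc_code: int) -> None:
    """Toy print_kanji: splits a two-byte EUC code into two print_char calls."""
    print_char(euc_code >> 8)
    print_char(euc_code & 0xFF)


print_bytes(bytes([0xC5, 0xBF]))  # models \message{ſ}
trace_long_s = calls.copy()
calls.clear()

print_kanji(0xC5BF)               # models \message{顛}
trace_kanji = calls.copy()

print(trace_long_s == trace_kanji)  # True
```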

t-tk · 2019-05-30T13:32:08Z

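I have not verified this, but the behavior is probably as follows:

The input 0xC5 0xBF (ſ in UTF-8) is converted to ^^c5^^bf by ptexenc.
Normally, in body text, \usepackage[utf8]{inputenc} converts ^^c5^^bf to LATIN SMALL LETTER LONG S.
Inside \message{}, the bytes are output as print_char(0xc5); print_char(0xbf);, but ptexenc treats them as EUC 0xC5BF (顛), converts that to 顛 in UTF-8, and outputs it.

One off-the-cuff solution:
For an 8-bit byte sequence, call print_char(0xc5); print_char(0xbf); and skip the EUC-to-UTF-8 conversion in ptexenc. For Japanese text in EUC, call print_kchar(0xc5bf); and perform the EUC-to-UTF-8 conversion in ptexenc.

Another option:
Use some kind of flag to distinguish "print_char called for an 8-bit byte sequence" from "print_char called for Japanese character output", and use it to control whether ptexenc performs the EUC-to-UTF-8 conversion.

The former is the more orthodox approach but would likely require more changes. The latter would likely require fewer changes but may be a cheap hack. I wonder whether it will work.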
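
The two ideas above can be sketched in a toy model. This is a minimal Python illustration under stated assumptions, not the actual pTeX or ptexenc implementation: the `for_kanji` flag, the queue, and `flush` are hypothetical, and rendering unconverted 8-bit bytes as ^^xx is just one possible choice for the terminal. In this sketch print_kchar is a separate entry point (first idea) that is implemented by setting the flag (second idea).

```python
from typing import List, Tuple

# Each queued item is (byte, for_kanji). The flag is the hypothetical marker
# that separates "byte of a Japanese character" from "plain 8-bit byte".
Queue = List[Tuple[int, bool]]


def print_char(queue: Queue, byte: int, for_kanji: bool = False) -> None:
    """Queue one byte, remembering why it was printed."""
    queue.append((byte, for_kanji))


def print_kchar(queue: Queue, euc_code: int) -> None:
    """Queue a two-byte EUC character with the flag set on both bytes."""
    print_char(queue, euc_code >> 8, for_kanji=True)
    print_char(queue, euc_code & 0xFF, for_kanji=True)


def flush(queue: Queue) -> str:
    """Convert EUC to Unicode only for bytes that were flagged as kanji."""
    out = []
    i = 0
    while i < len(queue):
        byte, for_kanji = queue[i]
        if for_kanji:
            low, _ = queue[i + 1]
            out.append(bytes([byte, low]).decode("euc_jp"))
            i += 2
        elif byte >= 0x80:
            out.append(f"^^{byte:02x}")  # assumed rendering of raw 8-bit bytes
            i += 1
        else:
            out.append(chr(byte))
            i += 1
    return "".join(out)


q: Queue = []
print_char(q, 0xC5)     # 8-bit bytes that came from the UTF-8 "ſ"
print_char(q, 0xBF)
print_kchar(q, 0xC5BF)  # a real kanji token
result = flush(q)
print(result)
```

With the flag in place, the first two bytes stay as ^^c5^^bf and only the flagged pair is decoded as EUC-JP, so the two cases no longer collapse into the same output.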

JulienPalard · 2019-05-30T21:28:44Z

@aminophen Thanks for the recommendation to use uplatex; I had not heard of it before.

Is it possible to use it with Sphinx? I don't see it in the enum of latex_engine, and if I try it anyway I get an error:

! LaTeX Error: This file needs format `pLaTeX2e'
               but this is `LaTeX2e'.

aminophen · 2019-05-30T21:39:26Z

@JulienPalard I've never used Sphinx, but it seems uplatex is not currently supported, according to sphinx-doc/sphinx#4186. You can join the discussion there, and you may get some information on how to add uplatex.

aminophen · 2022-01-23T06:33:06Z

See #81; this is hopefully fixed in r61692.

aminophen mentioned this issue May 29, 2019

Inconsistent error message texjporg/platex#84

Closed

aminophen added the question label May 29, 2019

h-kitagawa mentioned this issue Jun 8, 2019

Distinguishing byte sequences from Japanese character tokens #81

Closed

aminophen closed this as completed Jan 23, 2022
