Markdown: improve whitespace handling in CJK (#11597)

prettier · Oct 19, 2022 · ebb8e13
1 parent 92a6af2

Showing 12 changed files with 892 additions and 115 deletions.
diff --git a/changelog_unreleased/markdown/11597.md b/changelog_unreleased/markdown/11597.md
@@ -0,0 +1,104 @@
+#### [HIGHLIGHT] Improve handling of whitespace for Chinese, Japanese, and Korean (#11597 by @tats-u)
+
+##### Stop inserting spaces between Chinese or Japanese and Western characters
+
+Previously, Prettier would insert spaces between Chinese or Japanese and Western characters (letters and digits). While some people prefer this style, it isn’t standard, and is in fact contrary to official guidelines. Please see [here](https://github.com/tats-u/prettier-plugin-md-nocjsp#why-this-plugin-is-needed) for more details. We decided it’s not Prettier’s job to enforce a particular style in this case, so spaces aren’t inserted anymore, while existing ones are preserved. If you need a tool for enforcing spacing style, consider [textlint-ja](https://github.com/textlint-ja/textlint-rule-preset-ja-spacing/tree/master/packages/textlint-rule-ja-space-between-half-and-full-width) or [lint-md](https://github.com/lint-md/lint-md) (rules `space-round-alphabet` and `space-round-number`).
+
+The tricky part of this change was handling ambiguous line breaks between Chinese or Japanese and Western characters. When Prettier unwraps text, it needs to decide whether such a line break should simply be removed or replaced with a space. To do that, Prettier examines the surrounding text and infers the preferred style.
+
+<!-- prettier-ignore -->
+```markdown
+<!-- Input -->
+漢字
+Alphabetsひらがな12345カタカナ67890
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890
+
+<!-- Prettier stable -->
+漢字 Alphabets ひらがな 12345 カタカナ 67890
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890
+
+<!-- Prettier main -->
+漢字Alphabetsひらがな12345カタカナ67890
+
+漢字 Alphabets ひらがな 12345 カタカナ 67890
+```
+
+##### Comply with line breaking rules in Chinese and Japanese
+
+There are rules that prohibit certain characters from appearing at the beginning or the end of a line in [Chinese](https://www.w3.org/TR/clreq/#prohibition_rules_for_line_start_end) and [Japanese](https://www.w3.org/TR/jlreq/#characters_not_starting_a_line). E.g., full stop characters `。`, `．`, and `.` shouldn’t start a line, whereas `（` shouldn’t end a line. Prettier now follows these rules when it wraps text, that is, when `proseWrap` is set to `always`.
+
+<!-- prettier-ignore -->
+```markdown
+<!-- Input -->
+HTCPCPのエラー418は、ティーポットにコーヒーを淹（い）れさせようとしたときに返されるステータスコードだ。
+
+<!-- Prettier stable with --prose-wrap always --print-width 8 -->
+HTCPCP の
+エラー
+418 は、
+ティーポ
+ットにコ
+ーヒーを
+淹（い）
+れさせよ
+うとした
+ときに返
+されるス
+テータス
+コードだ
+。
+
+<!-- Prettier main with the same options -->
+HTCPCPの
+エラー
+418は、
+ティー
+ポットに
+コーヒー
+を淹
+（い）れ
+させよう
+としたと
+きに返さ
+れるス
+テータス
+コード
+だ。
+```
+
+##### Do not break lines inside Korean words
+
+Korean uses spaces to divide words, and an inappropriate division may change the meaning of a sentence:
+
+- `노래를 못해요.`: I’m not good at singing.
+- `노래를 못 해요.`: I can’t sing (for some reason).
+
+Previously, when `proseWrap` was set to `always`, successive Hangul characters could get split by a line break, which could later be converted to a space when the document is edited and reformatted. This doesn’t happen anymore. Korean text is now wrapped like English.
+
+<!-- prettier-ignore -->
+```markdown
+<!-- Input -->
+노래를 못해요.
+
+<!-- Prettier stable with --prose-wrap always --print-width 9 -->
+노래를 못
+해요.
+
+<!-- Prettier stable, subsequent reformat with --prose-wrap always --print-width 80 -->
+노래를 못 해요.
+
+<!-- Prettier main with --prose-wrap always --print-width 9 -->
+노래를
+못해요.
+
+<!-- Prettier main, subsequent reformat with --prose-wrap always --print-width 80 -->
+노래를 못해요.
+```
+
+A line break between Hangul and non-Hangul letters and digits is converted to a space when Prettier unwraps the text. Consider this example:
+
+> 3분 기다려 주지.
+
+In this sentence, if you break the line between “3” and “분”, a space will be inserted there when the text gets unwrapped.
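The unwrapping behaviour described in the changelog entry above can be sketched roughly as follows. This is not Prettier's implementation: `joinAcrossLineBreak`, `classify`, and the character classes are hypothetical names invented for illustration. The changelog does not say how the preferred style is inferred, so the majority count over the surrounding text is an assumption, as is the rule that a break between two Chinese or Japanese characters is simply removed.

```javascript
// Rough character classes. Assumption: Prettier's real tables are more
// complete; these Unicode script tests are only illustrative.
const CJ_CLASS = String.raw`[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]`;
const WESTERN_CLASS = "[A-Za-z0-9]";

const HANGUL = /\p{Script=Hangul}/u;
const CJ = new RegExp(CJ_CLASS, "u");
const WESTERN = new RegExp(WESTERN_CLASS);

// Builds a regex matching every CJ/Western boundary whose gap is exactly `gap`.
const boundary = (gap) =>
  new RegExp(
    `(?<=${CJ_CLASS})${gap}(?=${WESTERN_CLASS})|(?<=${WESTERN_CLASS})${gap}(?=${CJ_CLASS})`,
    "gu"
  );
const SPACED = boundary(" "); // e.g. "漢字 Alphabets"
const TIGHT = boundary(""); // e.g. "漢字Alphabets"

const count = (re, text) => [...text.matchAll(re)].length;

function classify(ch = "") {
  if (HANGUL.test(ch)) return "hangul";
  if (CJ.test(ch)) return "cj";
  if (WESTERN.test(ch)) return "western";
  return "other";
}

// Joins two lines that were separated by a single line break.
// `context` is the text used to infer the CJ/Western spacing style
// (hypothetical parameter; defaults to the two lines themselves).
function joinAcrossLineBreak(before, after, context = `${before}\n${after}`) {
  const left = classify(Array.from(before).pop());
  const right = classify(Array.from(after)[0]);

  // Korean separates words with spaces, so a break next to Hangul becomes a space.
  if (left === "hangul" || right === "hangul") return `${before} ${after}`;

  // Assumption: a break between two Chinese/Japanese characters is removed.
  if (left === "cj" && right === "cj") return before + after;

  const mixed =
    (left === "cj" && right === "western") ||
    (left === "western" && right === "cj");
  if (mixed) {
    // Assumed heuristic: keep a space only if the context mostly uses spaces.
    const sep = count(SPACED, context) > count(TIGHT, context) ? " " : "";
    return before + sep + after;
  }

  // Everything else is treated like Western text in this sketch.
  return `${before} ${after}`;
}

console.log(joinAcrossLineBreak("3", "분 기다려 주지.")); // "3 분 기다려 주지."
console.log(joinAcrossLineBreak("漢字", "Alphabetsひらがな12345カタカナ67890"));
// "漢字Alphabetsひらがな12345カタカナ67890"
```

On a tie, including the case where the context has no Chinese or Japanese/Western boundaries at all, the sketch inserts no space. That default follows the changelog's statement that spaces are no longer inserted.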
diff --git a/cspell.json b/cspell.json
@@ -319,6 +319,7 @@
         "templating",
         "tempy",
         "testname",
+        "textlint",
         "tldr",
         "Tomasek",
         "toplevel",

diff --git a/src/language-markdown/print-preprocess.js b/src/language-markdown/print-preprocess.js
@@ -8,7 +8,7 @@ function preprocess(ast, options) {
   ast = mergeContinuousTexts(ast);
   ast = transformIndentedCodeblockAndMarkItsParentList(ast, options);
   ast = markAlignedList(ast, options);
-  ast = splitTextIntoSentences(ast, options);
+  ast = splitTextIntoSentences(ast);
   return ast;
 }
 
@@ -63,7 +63,7 @@ function mergeContinuousTexts(ast) {
   );
 }
 
-function splitTextIntoSentences(ast, options) {
+function splitTextIntoSentences(ast) {
   return mapAst(ast, (node, index, [parentNode]) => {
     if (node.type !== "text") {
       return node;
@@ -83,7 +83,7 @@ function splitTextIntoSentences(ast, options) {
     return {
       type: "sentence",
       position: node.position,
-      children: splitText(value, options),
+      children: splitText(value),
     };
   });
 }