Markdown: improve whitespace handling in CJK (#11597)
tats-u committed Oct 19, 2022
1 parent 92a6af2 commit ebb8e13
Showing 12 changed files with 892 additions and 115 deletions.
104 changes: 104 additions & 0 deletions changelog_unreleased/markdown/11597.md
@@ -0,0 +1,104 @@
#### [HIGHLIGHT] Improve handling of whitespace for Chinese, Japanese, and Korean (#11597 by @tats-u)

##### Stop inserting spaces between Chinese or Japanese and Western characters

Previously, Prettier would insert spaces between Chinese or Japanese and Western characters (letters and digits). While some people prefer this style, it isn’t standard, and is in fact contrary to official guidelines. Please see [here](https://github.com/tats-u/prettier-plugin-md-nocjsp#why-this-plugin-is-needed) for more details. We decided it’s not Prettier’s job to enforce a particular style in this case, so spaces aren’t inserted anymore, while existing ones are preserved. If you need a tool for enforcing spacing style, consider [textlint-ja](https://github.com/textlint-ja/textlint-rule-preset-ja-spacing/tree/master/packages/textlint-rule-ja-space-between-half-and-full-width) or [lint-md](https://github.com/lint-md/lint-md) (rules `space-round-alphabet` and `space-round-number`).

The tricky part of this change was handling ambiguous line breaks between Chinese or Japanese and Western characters. When Prettier unwraps text, it needs to decide whether such a line break should simply be removed or replaced with a space. To do so, Prettier examines the surrounding text and infers the preferred style.

<!-- prettier-ignore -->
```markdown
<!-- Input -->
漢字
Alphabetsひらがな12345カタカナ67890

漢字 Alphabets ひらがな 12345 カタカナ 67890

<!-- Prettier stable -->
漢字 Alphabets ひらがな 12345 カタカナ 67890

漢字 Alphabets ひらがな 12345 カタカナ 67890

<!-- Prettier main -->
漢字Alphabetsひらがな12345カタカナ67890

漢字 Alphabets ひらがな 12345 カタカナ 67890
```
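
The inference step can be pictured with a small sketch. This is not Prettier's actual implementation: the helper names (`isCjWesternPair`, `inferSeparator`, `unwrapAmbiguousBreak`) and the simple majority count are assumptions made for illustration. It counts how often the surrounding text puts a space between CJ and Western characters and joins the broken line the same way.

```javascript
// Hypothetical helpers for illustration; not Prettier's actual implementation.
// Han, Hiragana, and Katakana stand in for "Chinese or Japanese" characters.
const CJ = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;
const WESTERN = /[A-Za-z0-9]/;

function isCjWesternPair(a, b) {
  return (CJ.test(a) && WESTERN.test(b)) || (WESTERN.test(a) && CJ.test(b));
}

// Returns " " if the text mostly separates CJ and Western characters with a
// space, and "" otherwise (including when there is no evidence either way).
function inferSeparator(text) {
  const chars = [...text];
  let spaced = 0;
  let unspaced = 0;
  for (let i = 0; i + 1 < chars.length; i++) {
    if (isCjWesternPair(chars[i], chars[i + 1])) {
      unspaced++;
    } else if (
      chars[i + 1] === " " &&
      i + 2 < chars.length &&
      isCjWesternPair(chars[i], chars[i + 2])
    ) {
      spaced++;
    }
  }
  return spaced > unspaced ? " " : "";
}

// Joins two lines whose break sits between a CJ and a Western character.
function unwrapAmbiguousBreak(before, after) {
  // Keep the "\n" while counting so the break itself is not counted as evidence.
  return before + inferSeparator(before + "\n" + after) + after;
}
```

With the first paragraph of the input above, `unwrapAmbiguousBreak("漢字", "Alphabetsひらがな12345カタカナ67890")` finds only unspaced boundaries and joins the lines without a space. When there is no evidence either way, the sketch defaults to not inserting a space.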

##### Comply with line breaking rules in Chinese and Japanese

There are rules that prohibit certain characters from appearing at the beginning or the end of a line in [Chinese](https://www.w3.org/TR/clreq/#prohibition_rules_for_line_start_end) and [Japanese](https://www.w3.org/TR/jlreq/#characters_not_starting_a_line). E.g., full stop characters `。`, `.`, and `.` shouldn’t start a line whereas `(` shouldn’t end a line. Prettier now follows these rules when it wraps text, that is when `proseWrap` is set to `always`.

<!-- prettier-ignore -->
```markdown
<!-- Input -->
HTCPCPのエラー418は、ティーポットにコーヒーを淹(い)れさせようとしたときに返されるステータスコードだ。

<!-- Prettier stable with --prose-wrap always --print-width 8 -->
HTCPCP の
エラー
418 は、
ティーポ
ットにコ
ーヒーを
淹(い)
れさせよ
うとした
ときに返
されるス
テータス
コードだ
。

<!-- Prettier main with the same options -->
HTCPCPの
エラー
418は、
ティー
ポットに
コーヒー
を淹
(い)れ
させよう
としたと
きに返さ
れるス
テータス
コード
だ。
```
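
A minimal sketch of this kind of rule follows. It rests on simplifying assumptions: the character sets are tiny samples rather than the full lists from the linked requirements, width is measured in characters rather than display columns, and the names `canBreakBetween` and `wrapWithProhibitions` are hypothetical. It is not how Prettier implements wrapping, and its output is not meant to match the example above.

```javascript
// Small illustrative samples; the real rule sets are much larger.
const NO_LINE_START = new Set(["。", "、", ")", "」"]);
const NO_LINE_END = new Set(["(", "「"]);

// A break is allowed unless it would strand a prohibited character.
function canBreakBetween(prev, next) {
  return !NO_LINE_START.has(next) && !NO_LINE_END.has(prev);
}

// Greedy wrapping: take as many characters as fit, then back off until the
// break is allowed. Assumes text without spaces, as in Chinese or Japanese.
function wrapWithProhibitions(text, maxChars) {
  const chars = [...text];
  const lines = [];
  let start = 0;
  while (chars.length - start > maxChars) {
    let end = start + maxChars;
    // Always keep at least one character per line so the loop makes progress,
    // even if that means accepting a prohibited break in a degenerate case.
    while (end > start + 1 && !canBreakBetween(chars[end - 1], chars[end])) {
      end--;
    }
    lines.push(chars.slice(start, end).join(""));
    start = end;
  }
  lines.push(chars.slice(start).join(""));
  return lines;
}

console.log(wrapWithProhibitions("これは(テスト)です。", 4).join("\n"));
```

In this sketch the first line stops after `これは` instead of `これは(`, because a break right after `(` is not allowed, and the final `。` stays attached to `す` rather than starting a line of its own.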

##### Do not break lines inside Korean words

Korean uses spaces to divide words, and an inappropriate division may change the meaning of a sentence:

- `노래를 못해요.`: I’m not good at singing.
- `노래를 못 해요.`: I can’t sing (for some reason).

Previously, when `proseWrap` was set to `always`, successive Hangul characters could get split by a line break, which could later be converted to a space when the document is edited and reformatted. This doesn’t happen anymore. Korean text is now wrapped like English.

<!-- prettier-ignore -->
```markdown
<!-- Input -->
노래를 못해요.

<!-- Prettier stable with --prose-wrap always --print-width 9 -->
노래를 못
해요.

<!-- Prettier stable, subsequent reformat with --prose-wrap always --print-width 80 -->
노래를 못 해요.

<!-- Prettier main with --prose-wrap always --print-width 9 -->
노래를
못해요.

<!-- Prettier main, subsequent reformat with --prose-wrap always --print-width 80 -->
노래를 못해요.
```
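
"Wrapped like English" can be sketched as breaking only at existing spaces. The functions below are hypothetical and simplified: `displayWidth` assumes every Hangul character occupies two columns and everything else one, which is only a rough stand-in for a real width calculation.

```javascript
// Rough width estimate: Hangul counts as two columns, everything else as one.
function displayWidth(text) {
  let width = 0;
  for (const ch of text) {
    width += /\p{Script=Hangul}/u.test(ch) ? 2 : 1;
  }
  return width;
}

// Breaks only at whitespace, so a run of Hangul characters is never split.
// A single word wider than printWidth is left on its own line.
function wrapAtSpaces(text, printWidth) {
  const lines = [];
  let current = "";
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current === "" ? word : current + " " + word;
    if (current !== "" && displayWidth(candidate) > printWidth) {
      lines.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") {
    lines.push(current);
  }
  return lines.join("\n");
}
```

Under these assumptions, `wrapAtSpaces("노래를 못해요.", 9)` breaks at the space, and rewrapping that result with a width of 80 restores the original sentence without adding a space inside `못해요`.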

A line break between Hangul and non-Hangul letters and digits is converted to a space when Prettier unwraps the text. Consider this example:

> 3분 기다려 주지.

In this sentence, if you break the line between “3” and “분”, a space will be inserted there when the text gets unwrapped.
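
The unwrapping behavior described here can be approximated as follows. This is a sketch under assumptions, not Prettier's code: `unwrapAroundHangul` is a hypothetical name, and "letters and digits" is approximated with the Unicode `L` and `N` general categories.

```javascript
const HANGUL = /\p{Script=Hangul}/u;
const LETTER_OR_DIGIT = /[\p{L}\p{N}]/u;

// Replaces a line break with a space when at least one neighbor is Hangul and
// both neighbors are letters or digits. Other line breaks are left untouched.
// Hangul characters are letters too, so a break between two Hangul characters
// also becomes a space, as it would between two English words.
function unwrapAroundHangul(text) {
  return text.replace(/(?<=(.))\n(?=(.))/gu, (lineBreak, prev, next) => {
    const touchesHangul = HANGUL.test(prev) || HANGUL.test(next);
    const bothWordChars =
      LETTER_OR_DIGIT.test(prev) && LETTER_OR_DIGIT.test(next);
    return touchesHangul && bothWordChars ? " " : lineBreak;
  });
}

console.log(unwrapAroundHangul("3\n분 기다려 주지."));
```

Applied to the sentence above with a break after “3”, the sketch produces `3 분 기다려 주지.`, that is, with a space that the original did not have.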
1 change: 1 addition & 0 deletions cspell.json
@@ -319,6 +319,7 @@
     "templating",
     "tempy",
     "testname",
+    "textlint",
     "tldr",
     "Tomasek",
     "toplevel",
6 changes: 3 additions & 3 deletions src/language-markdown/print-preprocess.js
@@ -8,7 +8,7 @@ function preprocess(ast, options) {
   ast = mergeContinuousTexts(ast);
   ast = transformIndentedCodeblockAndMarkItsParentList(ast, options);
   ast = markAlignedList(ast, options);
-  ast = splitTextIntoSentences(ast, options);
+  ast = splitTextIntoSentences(ast);
   return ast;
 }

@@ -63,7 +63,7 @@ function mergeContinuousTexts(ast) {
   );
 }

-function splitTextIntoSentences(ast, options) {
+function splitTextIntoSentences(ast) {
   return mapAst(ast, (node, index, [parentNode]) => {
     if (node.type !== "text") {
       return node;
@@ -83,7 +83,7 @@ function splitTextIntoSentences(ast, options) {
     return {
       type: "sentence",
       position: node.position,
-      children: splitText(value, options),
+      children: splitText(value),
     };
   });
 }