Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673

aphillips · 2024-02-19T18:42:15Z

This partially addresses #661 by allowing the LRM character in message whitespace. This is whitespace **_outside_** pattern `text`. Tools can use this to help ensure that messages are formatted visually in a way consistent with LTR presentation of a message.

aphillips · 2024-02-19T18:52:29Z

@eggrobin For review

eggrobin · 2024-02-19T19:07:37Z

spec/syntax.md

+This definition of _whitespace_ implements 
+[UTR#31 Rule 3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
+It is a profile of R3a-1 in that specification because only the
+whitespace characters listed are permitted as whitespace.


UAX31-R3a-2 says you need to

> define that profile with a precise specification of the characters that are added to or removed from the set of code points defined by the Pattern_White_Space property, and of any changes to the criteria under which a character or sequence of characters is interpreted as an end of line, as ignorable format controls, or as horizontal space.

The point is that the reader of such a conformance statement sees the difference from the default, which are the things that may need special attention when interoperating with an implementation based on the defaults.

So, if I am reading this right:

Form feed, next line, line separator, paragraph separator, and right-to-left mark are removed from the set of white space characters;

ideographic space is added to the set of white space characters;

the sole ignorable format control (LRM) is allowed only in contexts UAX31-I1 and UAX31-I3 (not UAX31-I2), where UAX31-I1 is further restricted to the beginning of a sequence of horizontal spaces, and UAX31-I3 is further restricted to only the end and not the beginning of a line.
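The character-set part of that reading can be checked mechanically. The following is only a sketch: `PATTERN_WHITE_SPACE` is the set of code points with the Pattern_White_Space property, `MF2_WHITESPACE` is read off the `s` production proposed in this PR, and both names (as well as `profile_delta`) are illustrative, not part of any spec.

```python
# Code points with the Pattern_White_Space property (Unicode PropList.txt).
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}

# Assumption: characters accepted by the `s` production in this PR,
# i.e. SP, HTAB, CR, LF, U+3000, plus the optional leading LRM (U+200E).
MF2_WHITESPACE = {0x0020, 0x0009, 0x000D, 0x000A, 0x3000, 0x200E}


def profile_delta(default, profile):
    """Return (removed, added) code points relative to the default set."""
    return sorted(default - profile), sorted(profile - default)


removed, added = profile_delta(PATTERN_WHITE_SPACE, MF2_WHITESPACE)
print("removed:", [f"U+{cp:04X}" for cp in removed])
print("added:  ", [f"U+{cp:04X}" for cp in added])
```

Under these assumptions the computed "removed" set also contains U+000B LINE TABULATION, which the list above does not name, so a conformance statement built this way would need to mention it as well.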

> further restricted to only the end and not the beginning of a line

Also it is disallowed on a blank line, I think?

Thanks @eggrobin. Technically, we don't have lines. Message whitespace can be normalized to a single space (in cases where whitespace is required) or to nothing (in cases where the whitespace is optional).

Your reading is correct. Note that UAX31-I2 is not permitted because that would break our sigil-identifier syntax. In all of the other cases in our syntax, there is required whitespace.

Note that quoted literals or pattern text can contain bidi controls that might cause Trojan source effects unless/until we address placeholder isolation.

I can correct the description to list the differences.

> Note that quoted literals or pattern text can contain bidi controls that might cause Trojan source effects unless/until we address placeholder isolation.

Mostly that is something to be dealt with at a higher level (in editors and tooling); the main thing languages need to do is to treat the right things in the right way so that standardized tooling can deal with the issue, see UTS55.

> Note that UAX31-I2 is not permitted because that would break our sigil-identifier syntax.

I don’t really understand what you mean here; UAX31-I2 does not say you should allow $⟨LRM⟩identifier, it says that you should allow it if you allow $ identifier.

The reason why you do not have UAX31-I2 is that you do not allow an LRM wherever you have [s], so for instance in

message-format-wg/spec/message.abnf

Lines 35 to 36 in ae712d7

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone

/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close

you do not allow {⟨LRM⟩#whatever⟨LRM⟩}, only {⟨LRM⟩ #whatever⟨LRM⟩ }, right?

> you do not allow {⟨LRM⟩#whatever⟨LRM⟩}, only {⟨LRM⟩ #whatever⟨LRM⟩ }, right?

Note that this probably throws a wrench into the UTS55 conversion to plain text algorithm.

Yes. See my response to @gibson042's comment.

eggrobin · 2024-02-19T19:59:08Z

spec/syntax.md

+ and the following character is not included in ignorable format controls:
+`U+200F RIGHT-TO-LEFT MARK`. 


I had suggested more things; they are now obsolete.

```
the following character is not interpreted as whitespace (in particular, it is not treated as an ignorable format control): `U+200F RIGHT-TO-LEFT MARK`;
the ignorable format control `U+200E LEFT-TO-RIGHT MARK` is only allowed in contexts UAX31-I1 and UAX31-I3, further restricted to the beginning of a nonempty sequence of horizontal spaces and line terminators.
```

> Technically, we don't have lines

You don’t, but the Unicode Standard does :-)

eggrobin · 2024-02-19T20:06:43Z

spec/syntax.md

+This definition of _whitespace_ implements 
+[UTR#31 Rule 3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
+It is a profile of R3a-1 in that specification because only the
+whitespace characters listed are permitted as whitespace.


gibson042 · 2024-02-19T20:13:19Z

spec/message.abnf

-; Whitespace
-s = 1*( SP / HTAB / CR / LF / %x3000 )
+; Whitespace, optionally prepended with LRM
+s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 )


This means that LRM is not valid after a line terminator or when not followed by a whitespace character, e.g. before the quoted patterns in a message like

.match {$count :number} one {{You have {$count} notification.}} * {{You have {$count} notifications.}}

or

.match {$count :number} one {{You have {$count} notification.}} * {{You have {$count} notifications.}}

Is that intentional?
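The behavior described in this comment can be reproduced with a small sketch. It transcribes the proposed `s` production into a Python regular expression; the names `S` and `is_s` are hypothetical helpers for illustration, not part of the spec or of any implementation.

```python
import re

LRM = "\u200E"

# s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 )
S = re.compile(r"\u200E?[ \t\r\n\u3000]+")


def is_s(text):
    """True if the whole string is a single match of the `s` production."""
    return S.fullmatch(text) is not None


print(is_s(" "))          # plain whitespace
print(is_s(LRM + " "))    # LRM leading a whitespace run
print(is_s("\n" + LRM))   # LRM after a line terminator
print(is_s(LRM))          # LRM with no whitespace following it
```

With this transcription the first two inputs match and the last two do not, which is the restriction being asked about.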

No, not really.

Probably what needs to happen here is a distinguishing of optional and non-optional whitespace. Everywhere we have [s] should use a production that can be just LRM (or RLM, fwiw) or nothing, e.g.:

; optional whitespace
owsp = *( SP / HTAB / CR / LF / %x3000 / %x200E / %x200F )

And everywhere that requires positive whitespace (i.e. just s) permit controls to either side:

; required whitespace
wsp = [ (%x200E / %x200F) ] 1*( SP / HTAB / CR / LF / %x3000 ) [ (%x200E / %x200F) ]
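As a rough illustration of how the two suggested productions differ, here is a sketch that transcribes them into Python regular expressions. `OWSP`, `WSP`, and the two helper functions are hypothetical names chosen for this example; the transcription follows the ABNF as written in this comment, not any final grammar.

```python
import re

LRM, RLM = "\u200E", "\u200F"

# owsp = *( SP / HTAB / CR / LF / %x3000 / %x200E / %x200F )
OWSP = re.compile(r"[ \t\r\n\u3000\u200E\u200F]*")

# wsp = [ LRM / RLM ] 1*( SP / HTAB / CR / LF / %x3000 ) [ LRM / RLM ]
WSP = re.compile(r"[\u200E\u200F]?[ \t\r\n\u3000]+[\u200E\u200F]?")


def is_owsp(text):
    """True if the whole string is valid optional whitespace."""
    return OWSP.fullmatch(text) is not None


def is_wsp(text):
    """True if the whole string is valid required whitespace."""
    return WSP.fullmatch(text) is not None


# Optional whitespace may be empty or consist of a bare mark.
print(is_owsp(""), is_owsp(LRM))
# Required whitespace needs at least one real whitespace character,
# but a mark may sit on either side of the run.
print(is_wsp("\n" + LRM), is_wsp(RLM + " " + LRM), is_wsp(LRM))
```

In this sketch a bare LRM is accepted wherever whitespace is optional, and a mark after a line terminator is accepted where whitespace is required, while a mark on its own still does not count as required whitespace.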

I think that would be an improvement (and then you can drop the convoluted explanation from the conformance statement, and the profile limits itself to changing the sets of characters).

I did the hard change. We now need to get WG approval.

eggrobin

The conformance statement seems good now.
There may be a discussion to be had as to whether the profile should be made smaller (in particular in the line break area discrepancies could lead to spoofing concerns), but this is probably not urgent.

In any case this clear conformance statement should facilitate CLDR-TC’s job when it comes to review. At least I should hope it will help the chair, who is my co-editor on UAX #31 :-)

macchiati · 2024-02-20T17:16:10Z

spec/syntax.md

+
+Tools SHOULD generate `U+200E LEFT-TO-RIGHT MARK` or `U+200F RIGHT-TO-LEFT MARK` 
+characters where permitted by the syntax before or following _identifiers_,
+_unquoted literals_, or _option_ values that use right-to-left characters 


I don't see why there is the restriction on unquoted literals. Shouldn't it be any literal? That is, anywhere an unquoted literal can appear, a quoted one can. So it seems like either both (aka just 'literal') or neither can appear.

I could match on:

X y ⎨⎨$count⎬⎬

Where X is a RTL character, or on

⎸X⎸ y ⎨⎨$count⎬⎬

In both cases the result is jumbled (disregard the direction of the fake braces; the tool just reorders).

⎬ ⎬ y ⎨ ⎨ $ c o u n t X

⎬ ⎬ y ⎨ ⎨ $ c o u n t ⎸ X ⎸

Now, in this case I could put LRMs before the first ⎸ and after the second, because both positions allow whitespace. Are there circumstances around literals where it makes a difference in the insertability of LRM/RLM because of the quoting?

Other than that, the changes look good to me. Note that if there are any issues in the WG about this, we refrain from these changes until after the v45 release, just leaving a note that we're looking at the bidi ordering issues...

You're right. The key thing I think is to remind tool writers not to quote the mark onto the value.

> Note that if there are any issues in the WG about this, we refrain from these changes until after the v45 release, just leaving a note that we're looking at the bidi ordering issues...

The changes are to the syntax and I think important enough to merit doing the change now--the better to stabilize the syntax. It does represent a relaxation of what is allowed in free whitespace. I would like to avoid having a lot of Tech Preview implementations reject bidi-friendlier messages in the fall.

OTOH, it does represent a departure from how we set up the s production.

Right, for a LTR reading, you want to put LRM around any 'element' that could contain RTL characters. Each literal being matched, literals in option values, etc. So in

{{STUFF {$value option=|JUNK| ...} TO READ}}

You want to insert like:

{{<LRM>STUFF {<LRM>$value option=<LRM>|JUNK|<LRM> ...<LRM>} TO READ<LRM>}}

Of course:

With LRM/RLM you can't get the RTL message parts to reorder around the {$value}, but the order is predictable and far better than the raw message.

in practice you want tooling to do this, not humans.

> {{<LRM>STUFF {<LRM>$value option=<LRM>|JUNK|<LRM> ...<LRM>} TO READ<LRM>}}

You definitely do not want the LRMs in the text part of the message, which is where the ones after the {{ and before the }} are. Instead there should be an LRM following the pattern so that any keys in the next variant aren't reordered:

<LRM>KEY1<LRM> KEY2<LRM> {{STUFF {<LRM>$value option=<LRM>|JUNK|<LRM>...<LRM>} TO READ}} <LRM>key1<LRM> key2<LRM> {{ ... next variant...}}
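Since this is work for tooling, the layout in the example above could be produced by something like the following sketch. `mark_variant` and `visible` are hypothetical helpers; the code only mirrors the placement shown in the example (marks around keys, none inside the pattern) and makes no claim about which positions the final grammar accepts.

```python
LRM = "\u200E"


def mark_variant(keys, pattern):
    """Lay out one variant as <LRM>key1<LRM> key2<LRM> {{pattern}}.

    The pattern text is passed through untouched, so no mark ends up
    inside the translatable content.
    """
    marked_keys = " ".join(key + LRM for key in keys)
    return LRM + marked_keys + " {{" + pattern + "}}"


def visible(text):
    """Replace each LRM with a visible placeholder for inspection."""
    return text.replace(LRM, "<LRM>")


variants = [
    mark_variant(["KEY1", "KEY2"], "STUFF TO READ"),
    mark_variant(["key1", "key2"], "next variant"),
]
print(visible(" ".join(variants)))
```

Joining variants with a space means each pattern is followed by whitespace and then the leading mark of the next variant, which is the position suggested for keeping the next variant's keys from being reordered.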

Agreed that this is a job for tools.

Whoops, yes, immediately outside {{ and }}, not inside.

macchiati · 2024-02-20T18:46:35Z

While a departure, I think it is cleaner than before....

eemeli · 2024-02-21T11:28:23Z

I'm a little puzzled about the explicit choice made here of only allowing RLM and LRM, as opposed to other directional formatting characters. Why is that preferable here to LRI/RLI/FSI/PDI, which we're exclusively using in our default bidi isolation strategy?

I've not spent very long (yet...) with bidi concerns, but one aspect that I'm concerned with is the implementation and understanding of the recommendations added by this change. To me, isolates seem more in line with the shape of MF2 syntax and its nestings of code and message text. As a programmer, they're also somewhat easier to reason about as their effects are more direct and, well, isolated.

I'm also a bit concerned about the effects that our allowance for RTL names and identifiers may have, esp. when they are mixed in with LTR names and identifiers, and our general allowance for newlines in whitespace.

Rather than allowing directional formatting characters in any/all whitespace, could we be much more restrictive about which characters we allow, and where? And also perhaps include some explicit guidance about the preferred rendering order for syntax, such as the LTR order within expressions of operand > function name > options > attributes?

For instance, would it be appropriate to only allow the following?

LRM after a syntax whitespace newline
LRI/RLI/FSI before the quoted-pattern start {{
PDI after the quoted-pattern end }}
LRI after the expression or markup start {
PDI before the expression or markup end }
LRI/RLI/FSI before variable or literal
LRI/RLI/FSI before the sigil (if any) prefixing an identifier
PDI after variable, literal, or identifier

The intent with the above would be to ensure that it's possible to have a valid message for which the "code" portions always have an LTR paragraph direction, while allowing for all user-customizable strings to define their own direction.
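To make the isolate-based alternative concrete, here is a minimal sketch of wrapping a single user-supplied string in an isolate pair. The `isolate` helper and its `direction` parameter are hypothetical; the sketch illustrates the pairing of LRI/RLI/FSI with PDI only, and says nothing about where the syntax would permit these characters.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction="auto"):
    """Wrap text in an isolate so it cannot reorder its surroundings.

    direction is "ltr", "rtl", or "auto" (first-strong detection).
    """
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI


# A literal containing Hebrew letters, isolated with first-strong detection.
literal = "|\u05E9\u05DD|"
print(repr(isolate(literal)))
print(repr(isolate("$count", "ltr")))
```

Every opener is closed by a PDI, which is what makes the effect local to the wrapped string.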

macchiati · 2024-02-21T21:00:04Z

Looks great. Will approve once I'm back at my computer

mihnita

Thank you!

eemeli

Quoting @macchiati from #673 (comment):

> It is quite tricky, and we should not derail the tech preview release for
> this. That's why I (more strongly) urge that we capture this issue in a
> note in the spec for tech preview, and make the fix afterwards.

I am not comfortable landing this in the timeframe required for LDML 45. I think we need to take the time to consider and address this properly, rather than rushing through a solution right now.

mihnita

I am split about this.
Looks OK, but incomplete.
We should also look at isolates, and it is a bit too close to deadline.

aphillips · 2024-02-26T17:21:48Z

@mihnita noted:

> Looks OK, but incomplete.
> We should also look at isolates, and it is a bit too close to deadline.

Can you clarify? What is incomplete? Also, this uses isolates, so your last comment is mysterious to me.

I agree that the deadlines are an issue. My concern here is that (a) we are a Unicode WG and this is a Unicode set of requirements. Even if we don't include it in 45, we need to deal with it and (b) syntax stability is important to me. Permitting bidi controls now in a somewhat (but not entirely!!) loose manner will prevent unnecessary churn later.

Anyway, to discuss in a few minutes in our call 😉

aphillips · 2024-02-26T19:23:03Z

WG will consider in post-45. @aphillips to create a 45-timed PR with a note.

The working group discussed accepting #673 for the tech preview. Because this was a late-breaking change, the group decided to work on incorporating work on bidi and UAX31 conformance in the early post-45 period. I was tasked with creating a PR with a note about bidi for the Tech Preview specifically. This note is adapted from text proposed in #673.

aphillips added 2 commits February 19, 2024 10:42

Allow LRM in whitespace

b387301

Update syntax.md

ae712d7

aphillips requested review from eemeli, gibson042 and stasm February 19, 2024 18:51

aphillips added the syntax (issues related with MF Syntax specification) and Agenda+ labels Feb 19, 2024

eggrobin reviewed Feb 19, 2024

Make 3a-2 definition consistent with requirements

c556ecf

eggrobin reviewed Feb 19, 2024

gibson042 reviewed Feb 19, 2024

aphillips and others added 6 commits February 19, 2024 14:24

Update spec/syntax.md

8558eef

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

Replace [s] with owsp production

d830550

s != wsp

42b7744

We use both separately.

Update syntax.md

af75420

Update message.abnf

6a6dc70

Update spec/syntax.md

db5e97c

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

eggrobin approved these changes Feb 20, 2024

aphillips and others added 2 commits February 20, 2024 06:04

Update spec/syntax.md

3fede7a

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

Update spec/syntax.md

586a4d6

aphillips commented Feb 20, 2024

Update spec/syntax.md

502e1fe

aphillips changed the title ~~Allow LRM in whitespace~~ Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) Feb 20, 2024

macchiati reviewed Feb 20, 2024

Address literals

d7af5eb

Fix converting some s productions to wsp

68f32f8

aphillips requested review from macchiati, eggrobin and gibson042 February 21, 2024 17:29

eggrobin approved these changes Feb 21, 2024

macchiati requested changes Feb 21, 2024

Address @macchiati's comment

2532206

aphillips requested a review from macchiati February 21, 2024 19:18

Update spec/syntax.md

f985860

Co-authored-by: Robin Leroy <egg.robin.leroy@gmail.com>

macchiati reviewed Feb 21, 2024

aphillips and others added 2 commits February 21, 2024 11:56

Update spec/syntax.md

52c5c86

Co-authored-by: Mark Davis <mark@unicode.org>

Fix up @machiatti's suggested text

4a6c96c

aphillips requested a review from macchiati February 21, 2024 20:08

macchiati approved these changes Feb 21, 2024

mihnita reviewed Feb 22, 2024

eemeli requested changes Feb 22, 2024

Address @mihnita's suggestion, fix expression brackets

19864c3

mihnita reviewed Feb 26, 2024

Make syntax.md make abnf

27b9e42

aphillips added the LDML46 label (items that must be first for post-tech preview) and removed the Agenda+ label Feb 26, 2024

aphillips mentioned this pull request Feb 26, 2024

Add note about bidi for Tech Preview period #692

Merged

aphillips mentioned this pull request Mar 5, 2024

Further front-door prep #705

Merged

This was referenced Mar 25, 2024

Create design doc for bidi support inside messages #746

Open

[DESIGN] Bidi usability #754

Merged

Merge branch 'main' into aphillips-allow-lrm

00e3796
