Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document: don't allow characters with unicode property Bidi_Class in source files #10197

Merged
merged 1 commit into from Nov 4, 2022

Conversation

danarmak
Copy link
Contributor

@danarmak danarmak commented Oct 21, 2022

Update documentation with changes made in #10017. Previously discussed on scala-contributors.

This is my first PR and there are some things I'm not sure about:

  • The EBNF uses several CharNoXxxOrYyy constructions that are not explicitly defined. Since I couldn't update the definition I had to add ...OrZzz to their names. Zzz is here "BidiFormatting" which makes the names annoyingly long, and you can't keep adding exceptions this way. Is that alright? I considered shorter names but just "bidi" is misleading.
  • I linked to the Unicode standard definition of these codepoints. Is that a good thing? Should I additionally or instead explain why these particular characters are forbidden, linking to the CVE or some other discussion? (As for the Unicode link, I could make it even more explicit by explaining that these are the codepoints whose Bidi_Class has 'explicit strength' but most readers won't benefit from that.)
  • I noted that Scala 12 forbids these characters anywhere in the source file (e.g. comments, identifiers) - is that useful? It's an implementation restriction. Should I mark the PR [nomerge] and edit out that part in the 2.13 version?

@scala-jenkins scala-jenkins added this to the 2.12.18 milestone Oct 21, 2022
@SethTisue SethTisue added the documentation No code change. Only documentation label Oct 21, 2022
@SethTisue SethTisue self-requested a review October 26, 2022 12:00
@SethTisue SethTisue self-assigned this Oct 26, 2022
@SethTisue
Copy link
Member

SethTisue commented Oct 27, 2022

Welcome!

Is that alright? I considered shorter names but just "bidi" is misleading.

What you've done seems to me like the good choice under the circumstances. References to the long names are few. If we need to grow the names again in the future we could always reconsider at that point.

I linked to the Unicode standard definition of these codepoints. Is that a good thing?

Yes

Should I additionally or instead explain why these particular characters are forbidden

I'd say the PR would be mergeable either way. If you do decide to include a link and/or explanatory wording, it can be brief, as the topic is well covered elsewhere. (But note that from the spec, we don't link to particular scala/scala or scala/bug issues or PRs.)

I noted that Scala 12 forbids these characters anywhere in the source file (e.g. comments, identifiers) - is that useful? It's an implementation restriction. Should I mark the PR [nomerge] and edit out that part in the 2.13 version?

Though we do accept PRs against the 2.12 spec, we don't really care about the 2.12 spec anymore. The 2.13 spec is the one we actually care about. And the 2.13 spec should not include historical information about earlier versions; that would be considered clutter.

If you care to take the time to adjust both the 2.12 and 2.13 versions separately, fine, but we would also be perfectly content if you simply targeted the 2.13.x branch only.

@SethTisue SethTisue added the welcome hello new contributor! label Oct 27, 2022
@SethTisue SethTisue removed their assignment Oct 27, 2022
@SethTisue SethTisue marked this pull request as draft October 27, 2022 08:57
@danarmak danarmak changed the base branch from 2.12.x to 2.13.x October 27, 2022 12:07
@danarmak
Copy link
Contributor Author

danarmak commented Oct 27, 2022

I adjusted the PR for 2.13. Please adjust the milestone tag, etc.

@danarmak danarmak force-pushed the correct-string-literal-docs branch 2 times, most recently from 2e8382c to 88ae672 Compare October 27, 2022 12:11
@danarmak danarmak marked this pull request as ready for review October 27, 2022 12:12
@SethTisue SethTisue modified the milestones: 2.12.18, 2.13.11 Oct 27, 2022
spec/13-syntax-summary.md Outdated Show resolved Hide resolved
@@ -25,6 +25,9 @@ classes (Unicode general category given in parentheses):
1. Operator characters. These consist of all printable ASCII characters
(`\u0020` - `\u007E`) that are in none of the sets above, mathematical
symbols (`Sm`) and other symbols (`So`).
1. [Bidirectional explicit formatting](https://www.unicode.org/reports/tr9/#Bidirectional_Character_Types)
characters. The nine characters `\u202a - \u202e` and `\u2066 - \u2069`,
inclusive. These are forbidden from appearing in character and string literals.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not correct, the characters are not allowed at all in source files. In string / character literals, the characters can be included as unicode escapes, but not literally.

In 2.12 (just for the record), the characters are also not allowed at all. The difference is that in unicode escapes (of these forbidden characters) are allowed everywhere (not only in character / string literals) in 2.12.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this wrong; the commit to 2.13 was different from the one to 2.12 and I thought 2.13 was more permissive, but it's the other way around. Fixing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this. It now seems a bit odd that this is in a list prefaced by saying "To construct tokens...". Do you think it should be moved to the first line of the section ("Scala source code consists of Unicode text" [and these characters are allowed])?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think the spec only needs to be changed in the first section. Something like this:

The nine Bidirectional explicit formatting characters \u202a - \u202e and \u2066 - \u2069 (inclusive) are forbidden from appearing in source files. Note that they can be represented using unicode escapes in string and character literals.

Paragraphs or sentences starting with "note" in the spec are clarifications that don't introduce new information. I think that's all that's needed here, no changes are needed in character / string literal sections.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then do you think the other file (13-syntax-summary.md) doesn't need to be changed at all, and adding this note in this file is enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me that would be enough, but maybe others would prefer also a not in the syntax summary file? Things are not very precise already, for example stringElement ::= charNoDoubleQuoteOrNewline | escapeSeq but charNoDoubleQuoteOrNewline is not actually defined in the spec as far as I can tell...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I noted the implicitly-defined xxxOrYyy constructions in the syntax summary (there are several of them) in a previous comment.

FWIW I would prefer a note in the syntax summary; that's where I looked first (and I was surprised that the summary is not an exact subset of the full grammar's productions).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit that makes only that change to both files and does not change any grammar productions. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me that looks good, thank you!

@danarmak danarmak force-pushed the correct-string-literal-docs branch 2 times, most recently from 2abe09c to 4501c90 Compare October 31, 2022 13:11
@danarmak danarmak requested review from lrytz and removed request for SethTisue October 31, 2022 13:13
…source files

Update documentation with changes made in scala#10017
@som-snytt
Copy link
Contributor

som-snytt commented Oct 31, 2022

Thanks, I agree it didn't feel right as a production issue.

I recall that the deceptively simple "Scala source code consists of Unicode text." was subject to long discussions.

BTW, I just noticed that JLS uses AbutNotB style, but always with a definition. Not a great example but UnqualifiedMethodIdentifier := Identifier but not yield. There they didn't actually call it IdentifierNotYield.

@danarmak
Copy link
Contributor Author

I recall that the deceptively simple "Scala source code consists of Unicode text." was subject to long discussions.

Are there any other Unicode codepoints that are not allowed in Scala source code? Or did the discussions agree on allowing all of Unicode, and then this bidi issue came up?

@som-snytt
Copy link
Contributor

That change wasn't so long ago (mid-pandemic): b124a54

I only remember that sjrd stepped in with clarifying wording.

Yes, all of Unicode, but then this security issue came up, as the spirits do on Halloween. 🎃

Copy link
Member

@lrytz lrytz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks you @danarmak

@lrytz lrytz merged commit 6190d9a into scala:2.13.x Nov 4, 2022
@danarmak danarmak deleted the correct-string-literal-docs branch November 6, 2022 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation No code change. Only documentation welcome hello new contributor!
Projects
None yet
5 participants