Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parsing of numbers in Kotlin lexer #1509

Merged
merged 1 commit into from May 12, 2020

Conversation

goodhoko
Copy link
Contributor

  • Adds support for underscore separator.
  • Add support for binary literals.
  • Removes invalid support for lowercase "l" suffix.
  • Adds support for unsigned literals.

@pyrmont
Copy link
Contributor

pyrmont commented Apr 26, 2020

@goodhoko Thanks for the PR! I realise it introduces a bit of duplication, but I thought from a maintainability perspective, it's a little bit easier to parse (as a human) the new rules if they're separated out. This is also more consistent with most other lexers and has the bonus benefit of allowing more targeted tokens.

@goodhoko
Copy link
Contributor Author

@pyrmont Cool! Thanks. I added one more commit that replaces the generi Num rule with more specific Num::Float. This is ready to me merged from my side now. .)

@pyrmont
Copy link
Contributor

pyrmont commented Apr 26, 2020

@goodhoko I think we need to put the Num::Float rule last because the decimal point and exponent are all optional, right? So integers will match that rule if it comes first. Let me know what you think.

@goodhoko
Copy link
Contributor Author

goodhoko commented Apr 26, 2020

@pyrmont Exactly because the . and e parts of the Float and [bB] or [xX] parts of the Bin and Hex are optional the rules needs to come first. If the integer were first, it would 'eat up' the integer part of the float or hex literals breaking the whole number apart.

For example.
1e10 -> ["1", Int], ["e", Text], ["10", Int]

@pyrmont
Copy link
Contributor

pyrmont commented Apr 26, 2020

Whoops—you're completely right. What do you want to do? You can either make a float rule that will ensure there's at least a decimal point or an exponent or we can leave it as-is and just tokenise most integers as floats.

@goodhoko
Copy link
Contributor Author

@pyrmont ah good catch! I didn't notice that the Float regexp would match integers too.

Since correctness is probably not the ultimate goal of Rouge I propose to revert back to b01aa87 and just accept we won't parse reals as floats but rather as general Number (along with some Integers too.)

But pursuing correctness is possible. Here's the correct regexp for reals:

((((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))))[fF])|(([\d][\d_]*[\d]|[\d])[fF]))|(((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d])))))

Constructed recursively from the Grammar. Matches only Doubles or Floats.

To reduce the overwhelming complexity, we could break the regexp into parts (just as on the grammar page) and then use these parts in the rule definitions. For example

      decDigit =  %r"[0-9]"
      decDigitOrSeparator = %r"#{decDigit}|_"
      decDigits = %r"(#{decDigit}(#{decDigitOrSeparator})*#{decDigit})|#{decDigit}"
      doubleExponent = %r"[eE][+-]?(#{decDigits})"
      doubleLiteral = %r"((#{decDigits})?\.#{decDigits}(#{doubleExponent})?)|#{decDigits}#{doubleExponent}"
      floatLiteral = %r"(#{doubleLiteral}[fF])|(#{decDigits}[fF])"
      ...
      rule %r"(#{floatLiteral})|(#{doubleLiteral})", Num::Float
      # and so on for other Num literals

I'd gladly do that, I'm just not sure if this level of granularity is wanted here in Rouge. .)

@pyrmont
Copy link
Contributor

pyrmont commented Apr 26, 2020

The Rust lexer has arguably a decent compromise:

# numbers
dot = /[.][0-9_]+/
exp = /e[-+]?[0-9_]+/
flt = /f32|f64/
rule %r(
[0-9]+
(#{dot} #{exp}? #{flt}?
|#{dot}? #{exp} #{flt}?
|#{dot}? #{exp}? #{flt}
)
)x, Num::Float

We could do that here, too? Otherwise I don't mind going back to b01aa87.

@pyrmont pyrmont self-assigned this Apr 26, 2020
@pyrmont pyrmont added the author-action The PR has been reviewed but action by the author is needed label Apr 26, 2020
@goodhoko
Copy link
Contributor Author

@pyrmont cool. Gimme some time I'll try to rewrite this into something sane. .)

- Adds support for underscore separator.
- Add support for binary literals.
- Removes invalid support for lowercase "l" suffix.
- Adds support for unsigned literals.
- Adds support for dot-starting real literals.
@goodhoko
Copy link
Contributor Author

goodhoko commented May 3, 2020

@pyrmont I reworked this, please check it out.

The rules for numbers are now correct - they reflect the grammar with no shortcomings.
I believe the code complexity is reasonable. Note that Kotlin grammar is quite relaxed compared to Rust. Kotlin allows more forms of number literals which in turn requires longer rules.

I also moved the number rules above the Punctuation rules. Otherwise the dot of the dot-starting number literals (.99) would be parsed as Punctuation.

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels May 3, 2020
@pyrmont pyrmont merged commit 03be225 into rouge-ruby:master May 12, 2020
@pyrmont pyrmont removed the needs-review The PR needs to be reviewed label May 12, 2020
@goodhoko goodhoko deleted the fix.kotlin.numbers branch May 12, 2020 20:21
mattt pushed a commit to NSHipster/rouge that referenced this pull request May 21, 2020
This commit improves the handling of numbers in the Kotlin lexer.
Specifically, it:

1. adds support for underscore separators;
2. adds support for binary literals;
3. removes invalid support for the lowercase "l" suffix; and
4. adds support for unsigned literals.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants