Improve parsing of numbers in Kotlin lexer #1509

goodhoko · 2020-04-25T22:24:09Z

Adds support for underscore separator.
Add support for binary literals.
Removes invalid support for lowercase "l" suffix.
Adds support for unsigned literals.

pyrmont · 2020-04-26T00:25:12Z

@goodhoko Thanks for the PR! I realise it introduces a bit of duplication, but I thought from a maintainability perspective, it's a little bit easier to parse (as a human) the new rules if they're separated out. This is also more consistent with most other lexers and has the bonus benefit of allowing more targeted tokens.

goodhoko · 2020-04-26T08:08:14Z

@pyrmont Cool! Thanks. I added one more commit that replaces the generi Num rule with more specific Num::Float. This is ready to me merged from my side now. .)

pyrmont · 2020-04-26T08:14:59Z

@goodhoko I think we need to put the Num::Float rule last because the decimal point and exponent are all optional, right? So integers will match that rule if it comes first. Let me know what you think.

goodhoko · 2020-04-26T08:48:58Z

@pyrmont Exactly because the . and e parts of the Float and [bB] or [xX] parts of the Bin and Hex are optional the rules needs to come first. If the integer were first, it would 'eat up' the integer part of the float or hex literals breaking the whole number apart.

For example.
1e10 -> ["1", Int], ["e", Text], ["10", Int]

pyrmont · 2020-04-26T09:37:20Z

Whoops—you're completely right. What do you want to do? You can either make a float rule that will ensure there's at least a decimal point or an exponent or we can leave it as-is and just tokenise most integers as floats.

goodhoko · 2020-04-26T10:29:13Z

@pyrmont ah good catch! I didn't notice that the Float regexp would match integers too.

Since correctness is probably not the ultimate goal of Rouge I propose to revert back to b01aa87 and just accept we won't parse reals as floats but rather as general Number (along with some Integers too.)

But pursuing correctness is possible. Here's the correct regexp for reals:

((((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))))[fF])|(([\d][\d_]*[\d]|[\d])[fF]))|(((([\d][\d_]*[\d]|[\d])?\.([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d]))?)|(([\d][\d_]*[\d]|[\d])([eE][+-]?([\d][\d_]*[\d]|[\d])))))

Constructed recursively from the Grammar. Matches only Doubles or Floats.

To reduce the overwhelming complexity, we could break the regexp into parts (just as on the grammar page) and then use these parts in the rule definitions. For example

      decDigit =  %r"[0-9]"
      decDigitOrSeparator = %r"#{decDigit}|_"
      decDigits = %r"(#{decDigit}(#{decDigitOrSeparator})*#{decDigit})|#{decDigit}"
      doubleExponent = %r"[eE][+-]?(#{decDigits})"
      doubleLiteral = %r"((#{decDigits})?\.#{decDigits}(#{doubleExponent})?)|#{decDigits}#{doubleExponent}"
      floatLiteral = %r"(#{doubleLiteral}[fF])|(#{decDigits}[fF])"
      ...
      rule %r"(#{floatLiteral})|(#{doubleLiteral})", Num::Float
      # and so on for other Num literals

I'd gladly do that, I'm just not sure if this level of granularity is wanted here in Rouge. .)

pyrmont · 2020-04-26T10:34:35Z

The Rust lexer has arguably a decent compromise:

rouge/lib/rouge/lexers/rust.rb

Lines 164 to 175 in 2e9c3cd

    
                   # numbers 
        
                   dot = /[.][0-9_]+/ 
        
                   exp = /e[-+]?[0-9_]+/ 
        
                   flt = /f32|f64/ 
        
                   rule %r( 
        
                     [0-9]+ 
        
                     (#{dot}  #{exp}? #{flt}? 
        
                     |#{dot}? #{exp}  #{flt}? 
        
                     |#{dot}? #{exp}? #{flt} 
        
                     ) 
        
                   )x, Num::Float

We could do that here, too? Otherwise I don't mind going back to b01aa87.

goodhoko · 2020-04-27T10:59:34Z

@pyrmont cool. Gimme some time I'll try to rewrite this into something sane. .)

- Adds support for underscore separator. - Add support for binary literals. - Removes invalid support for lowercase "l" suffix. - Adds support for unsigned literals. - Adds support for dot-starting real literals.

goodhoko · 2020-05-03T11:24:13Z

@pyrmont I reworked this, please check it out.

The rules for numbers are now correct - they reflect the grammar with no shortcomings.
I believe the code complexity is reasonable. Note that Kotlin grammar is quite relaxed compared to Rust. Kotlin allows more forms of number literals which in turn requires longer rules.

I also moved the number rules above the Punctuation rules. Otherwise the dot of the dot-starting number literals (.99) would be parsed as Punctuation.

This commit improves the handling of numbers in the Kotlin lexer. Specifically, it: 1. adds support for underscore separators; 2. adds support for binary literals; 3. removes invalid support for the lowercase "l" suffix; and 4. adds support for unsigned literals.

goodhoko force-pushed the fix.kotlin.numbers branch from 8382068 to dfd565d Compare April 25, 2020 22:24

pyrmont self-assigned this Apr 26, 2020

pyrmont added the author-action The PR has been reviewed but action by the author is needed label Apr 26, 2020

Improve parsing of numbers in Kotlin lexer

9ee947c

- Adds support for underscore separator. - Add support for binary literals. - Removes invalid support for lowercase "l" suffix. - Adds support for unsigned literals. - Adds support for dot-starting real literals.

goodhoko force-pushed the fix.kotlin.numbers branch from 4b0cde4 to 9ee947c Compare May 3, 2020 11:13

pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels May 3, 2020

pyrmont merged commit 03be225 into rouge-ruby:master May 12, 2020

pyrmont removed the needs-review The PR needs to be reviewed label May 12, 2020

goodhoko deleted the fix.kotlin.numbers branch May 12, 2020 20:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve parsing of numbers in Kotlin lexer #1509

Improve parsing of numbers in Kotlin lexer #1509

goodhoko commented Apr 25, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020 •

edited

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 27, 2020

goodhoko commented May 3, 2020 •

edited

Improve parsing of numbers in Kotlin lexer #1509

Improve parsing of numbers in Kotlin lexer #1509

Conversation

goodhoko commented Apr 25, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020 • edited

pyrmont commented Apr 26, 2020

goodhoko commented Apr 26, 2020

pyrmont commented Apr 26, 2020

goodhoko commented Apr 27, 2020

goodhoko commented May 3, 2020 • edited

goodhoko commented Apr 26, 2020 •

edited

goodhoko commented May 3, 2020 •

edited