
Performance issues on large inputs #918

Open

prikha opened this issue Mar 24, 2023 · 5 comments

prikha commented Mar 24, 2023

Noticed that parsing time has a worse-than-linear dependency on the input size. To illustrate it, I've generated a few files containing Ruby hashes that grow in size in constant steps.

Results:

[chart: parsing time vs. input size]

Profiling shows that most of the time is spent in the lexer's advance function, which in turn spends two thirds of its time in Buffer#slice, which calls String#encode.

[flamegraph: PARSER]

Unfortunately I have no good ideas on how to solve the performance issue. It looks like fixing it could bring a substantial speedup to a number of projects, including the widely popular rubocop. WDYT?

iliabylich (Collaborator) commented

Do you have non-ASCII characters in your Ruby code? I guess that's the reason why parser does re-encoding for every string token.

If you have only ASCII characters then it's safe to assume that any part of your input (i.e. any range/sub-string) is also valid.
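A cheap way to check that precondition in plain Ruby (this is a general `String` method, not part of parser's API):

```ruby
# String#ascii_only? reports whether every byte of the string is < 0x80,
# i.e. whether any byte range of it is also a valid ASCII sub-string.
"puts 'hello'".ascii_only?  # => true
"héllo".ascii_only?         # => false
```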

However, that's not true for multibyte encodings like UTF-8. The lexer operates on bytes, so when it constructs a token it uses byte offsets to define its location. Such an arbitrary range can be invalid; for example, for a byte string

[char1_byte1, char1_byte2, char2_byte1, char2_byte2]

it's valid to use ranges 0..2, 2..4 and 0..4, but all other non-empty ranges are invalid. This is why re-encoding (with validation under the hood) is necessary for every token, and this is why it's slow.
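A minimal sketch of that failure mode in plain Ruby, using "é" (two bytes in UTF-8, 0xC3 0xA9) so that "éé" matches the four-byte layout above:

```ruby
# "éé" is the 4-byte string [char1_byte1, char1_byte2, char2_byte1, char2_byte2]
bytes = "éé".b  # binary (ASCII-8BIT) copy

whole    = bytes[0, 2].force_encoding('UTF-8')  # bytes 0..2: one whole char
straddle = bytes[1, 2].force_encoding('UTF-8')  # bytes 1..3: straddles two chars

whole.valid_encoding?     # => true
straddle.valid_encoding?  # => false
```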

Could you please check if it also happens if you have only ASCII characters in your input?

prikha (Author) commented Mar 24, 2023

OK, that actually explains it a bit. Basically:

```ruby
code_raw = File.read(path)
puts Benchmark.measure { Parser::CurrentRuby.parse(code_raw) }

code_ascii = code_raw.encode('ASCII', 'UTF-8', undef: :replace)
puts Benchmark.measure { Parser::CurrentRuby.parse(code_ascii) }
```

Giving:

```
raw     33.280141   0.396964  33.677105 ( 33.728819)
ascii   16.948894   0.119137  17.068031 ( 17.089580)
```

Which is a 2x improvement. And indeed it completely changes the flamegraph:

[flamegraph after re-encoding to ASCII]

So my problem actually splits in two:

  • ✅ it's best to encode strings to ASCII (if applicable) before parsing
  • ❔ parser still has a worse-than-linear dependency on input size on my payloads

[chart: parsing time vs. input size with ASCII input]

levidavidmurray commented Apr 6, 2023

FWIW, there was reportedly a significant performance regression between 3.0.2.0 and 3.0.3.2. I don't interface with parser directly, but we use i18n-tasks (which uses parser) to find missing translation keys in our Rails app.

See this open issue from over a year ago: glebm/i18n-tasks#407

> We have a couple of tests using i18n-tasks (https://gist.github.com/manaka/2a6f32b11853f1f31709247ed682403f). After updating from 0.9.35 to 0.9.36 these tests take 73 minutes to finish; before the update they took a minute and a half.
>
> I found out that the parser update (3.0.2.0 → 3.0.3.2) is what broke it; 0.9.35 worked before.
When I downgraded parser from 3.1.3.0 to 3.0.2.0 I saw our I18n::Tasks::MissingKeys times go from 210 seconds to only 10 seconds.

iliabylich (Collaborator) commented

This method could probably be optimised a lot according to your benchmark profile (and IIRC this loop was added in 3.0.2.0).

First, you could try removing the loop locally to see if it improves performance; if so, feel free to send a PR.

glebm commented Apr 7, 2023

The commit (547d731) says it was introduced in v3.0.3.0:

[screenshot of commit 547d731]

A quadratic loop to detect duplicate keys is the likely root cause. Rails translation files can have thousands of keys in the same hash.
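The shape of that fix can be sketched as follows; the helper names are hypothetical illustrations, not the code from the actual patch:

```ruby
require 'set'

# O(n^2): for every key, re-scan all previously seen keys.
def duplicate_keys_quadratic(keys)
  keys.select.with_index { |key, i| keys[0...i].include?(key) }
end

# O(n): track seen keys in a Set. Set#add? returns nil when the
# element was already present, so rejected elements are the duplicates.
def duplicate_keys_with_set(keys)
  seen = Set.new
  keys.reject { |key| seen.add?(key) }
end
```

With thousands of keys in one hash, the difference between these two is exactly the quadratic blow-up observed above.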

dgollahon added a commit to dgollahon/parser that referenced this issue Apr 15, 2023
- Replaces the O(n^2) processing loop with a set. I benchmarked parsing a file with a single 5000-key hash to test this change's performance:

**Before**

```sh
$ hyperfine --warmup 3 'bin/ruby-parse hash_bench.rb'

Benchmark 1: bin/ruby-parse hash_bench.rb
  Time (mean ± σ):      2.678 s ±  0.029 s    [User: 2.613 s, System: 0.061 s]
  Range (min … max):    2.636 s …  2.738 s    10 runs
```

**After**

```sh
$ hyperfine --warmup 3 'bin/ruby-parse hash_bench.rb'

Benchmark 1: bin/ruby-parse hash_bench.rb
  Time (mean ± σ):     421.3 ms ±   4.1 ms    [User: 366.9 ms, System: 49.9 ms]
  Range (min … max):   414.8 ms … 426.1 ms    10 runs
```

- Closes whitequark#918
dgollahon added a commit to dgollahon/parser that referenced this issue Apr 15, 2023 (same commit message as above)
- See whitequark#918
glebm added a commit to glebm/i18n-tasks that referenced this issue Apr 25, 2023
This is the first version which includes the fix (whitequark/parser#924) to a performance issue (whitequark/parser#918) first introduced in 3.0.3.0 (whitequark/parser@547d731)

Fixes #407
4 participants