Significant performance degradation with GoJa JS tracing debug_traceBlockByNumber #25125

Closed
Gilad-Eisenberger opened this issue Jun 19, 2022 · 8 comments · Fixed by #25156

@Gilad-Eisenberger
We're experiencing a significant performance degradation with the latest version, v1.10.19.

Our flows that call debug_traceBlockByNumber with custom JS tracers are putting a significantly higher load on the nodes when executing a single trace. Each trace now uses all cores on the machine while the trace is being produced.
Execution times remain as they were in previous versions, which means traces now consume significantly more resources to produce and parallelize much worse than before.

For example, running a single debug_traceBlockByNumber with the legacy prestateTracer (.js version), the trace takes around 8000-9000ms to complete. During that time it consumes all 56 cores on the node's machine (>80% CPU used).
The built-in tracer on the same block completes in 3000ms (it is expected to be faster), but it does not consume more than one core during processing.
Earlier versions complete the .js trace in the same amount of time, yet likewise do not consume more than one core.

This is a regression from previous versions which, judging by the timing, appears related to the Goja tracer changes, but we cannot confirm the root cause.

Any help in investigating the root cause would be greatly appreciated. A way to limit the concurrency of a single request would also be very helpful.

@Gilad-Eisenberger
Author

Likely related to #23773 & #24934

@karalabe
Member

@s1na Could you please take a peek? Thx

@s1na
Contributor

s1na commented Jun 20, 2022

Hey @Gilad-Eisenberger. Thanks for reporting this. I want to help get to the bottom of this.

Unfortunately I don't have access to a machine with 56 cores. I just tried on one with 4 cores, and both the Duktape and Goja tracers consume all 4 cores when running debug.traceBlockByNumber with the JS prestateTracer (with Goja taking longer, but that is a known regression I'm planning to fix). Other JS tracers seem on par or faster under Goja.

I'd like to know whether there are inefficiencies in the Goja tracers that are highlighted by your custom tracer. Is it something you can share? Have you compared other JS tracers' resource consumption (e.g. callTracerLegacy)? If you check out commit af02e97, you have both engines available (e.g. callTracerLegacy for Goja and callTracerLegacyDuktape for Duktape).
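
Something along these lines should make for a quick comparison against a node built from that commit (a sketch; the endpoint and block number are placeholders, not a ready-made benchmark):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Placeholder endpoint; point this at the node built from af02e97.
	client, err := rpc.Dial("http://localhost:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for _, tracer := range []string{"callTracerLegacy", "callTracerLegacyDuktape"} {
		var result json.RawMessage
		start := time.Now()
		// Placeholder block number.
		err := client.CallContext(context.Background(), &result,
			"debug_traceBlockByNumber", "0xe4e1c0",
			map[string]interface{}{"tracer": tracer})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-24s %v\n", tracer, time.Since(start))
	}
}
```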

@Gilad-Eisenberger
Author

Much appreciated.

I was able to take some time to run benchmarks on different configurations, trying versions 1.10.17 and 1.10.19 with different core counts (8 and 32; it's the same machine, with affinity settings used to limit the process to specific cores).
In both cases the tracers are the legacy call and prestate JS tracers from the git master branch.

For each configuration, I ran an increasing number of parallel requests tracing the same block at the same time, from 1 to 8 concurrent requests.
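
A sketch of that kind of loop (placeholder endpoint, block number, and tracer name; not the exact script I used):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

// traceOnce issues a single debug_traceBlockByNumber request; the block number
// and tracer name below are placeholders.
func traceOnce(client *rpc.Client, tracer string) error {
	var result json.RawMessage
	return client.CallContext(context.Background(), &result,
		"debug_traceBlockByNumber", "0xe4e1c0",
		map[string]interface{}{"tracer": tracer})
}

func main() {
	client, err := rpc.Dial("http://localhost:8545") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Fire 1..8 identical requests concurrently and report the wall-clock
	// time for each batch to finish.
	for parallel := 1; parallel <= 8; parallel++ {
		start := time.Now()
		var wg sync.WaitGroup
		for i := 0; i < parallel; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				if err := traceOnce(client, "prestateTracerLegacy"); err != nil {
					log.Println(err)
				}
			}()
		}
		wg.Wait()
		fmt.Printf("%d parallel: %v\n", parallel, time.Since(start))
	}
}
```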

All times in ms ("Call" = legacy JS call tracer, "Prestate" = legacy JS prestate tracer):

| # parallel | Call 1.10.17, 8 core | Call 1.10.17, 32 core | Call 1.10.19, 8 core | Call 1.10.19, 32 core | Prestate 1.10.17, 8 core | Prestate 1.10.17, 32 core | Prestate 1.10.19, 8 core | Prestate 1.10.19, 32 core |
|---|---|---|---|---|---|---|---|---|
| 1 | 5651.841 | 5790.074 | 5320.909 | 3877.548 | 5830.081 | 6003.797 | 14168.277 | 7656.352 |
| 2 | 6917.180 | 6612.210 | 6481.169 | 7457.299 | 6728.201 | 5796.719 | 27332.711 | 16156.832 |
| 3 | 7791.324 | 6183.286 | 9582.127 | 5034.659 | 8984.426 | 11535.039 | 34059.716 | 15572.685 |
| 4 | 9766.498 | 6977.169 | 8638.260 | 4989.417 | 14653.078 | 12270.590 | Error | 26252.477 |
| 5 | 11432.268 | 7476.753 | 9979.610 | 5785.164 | 12155.815 | 14200.428 | | 26400.611 |
| 6 | 12256.071 | 7213.083 | 11683.305 | 6576.304 | 14198.681 | 11765.026 | | Error |
| 7 | 15136.179 | 8411.119 | 13068.131 | 6699.006 | 16897.552 | 16520.059 | | |
| 8 | 16679.434 | 8623.590 | 14021.331 | 7748.165 | 21567.975 | 11851.511 | | |

It looks like with the callTracer everything is fine in both versions; 1.10.19 even provides a slight performance improvement. With the prestate tracer, 1.10.19 is much slower and scales much worse as concurrency increases. This is also the only configuration I've seen fail.

Also, while all configurations seem to use multiple cores, 1.10.19 on prestateTracer seems to max out all cores much sooner.

I've tried modifying the tracer in an attempt to improve performance, but have not made any significant progress.

Is this the type of performance difference you're expecting with this tracer on Goja?

Additionally, there seems to be a discrepancy between total CPU use and performance. This applies to all configurations: having and using more cores doesn't produce a proportionally faster result. I'm not sure how to investigate this part. I would naively expect tracing a block to be single-threaded in nature, yet it seems to utilize many cores?

Any thoughts would be appreciated.

@karalabe
Member

I would naively expect the tracing of a block to be single threaded in nature, but it seems to utilize many cores?

Interjecting just to respond to this observation. When running JavaScript tracers, a trace run will usually take a lot more time than just executing the same transaction from Go. So if you're tracing a block (or a chain segment), what we do is run ahead with Go's tx execution, creating starting points for the .js tracers, and then many threads pick up these starting points and run the slow tracers on top.

Sina made some optimizations that - AFAIK - allow us to only look at certain events (or maybe even only certain opcodes for tracers), which might bring them close to Go's speed. But as for why the prestate tracer might be slower, that's odd, as it would still need to run mostly the same code, just perhaps stash away the prestate data somewhere. Perhaps Goja's internal data storage/maps are not performing well and thus the many prestate inserts are causing a slowdown?
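
A rough sketch of that scheduling pattern (illustrative only, not the actual eth/tracers code):

```go
// The main goroutine runs ahead executing transactions natively in Go,
// producing a "starting point" per transaction, while a pool of workers
// re-executes each transaction with the slow JS tracer attached.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// startingPoint stands in for the pre-transaction state snapshot that the
// fast Go execution produces for each transaction.
type startingPoint struct {
	txIndex int
}

func main() {
	const txCount = 16
	jobs := make(chan startingPoint)
	var wg sync.WaitGroup

	// Worker pool: each worker replays one transaction at a time with the
	// (slow) JS tracer. With GOMAXPROCS workers, a single trace request can
	// occupy every core on the machine.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for sp := range jobs {
				fmt.Printf("tracing tx %d with JS tracer\n", sp.txIndex)
			}
		}()
	}

	// Producer: fast, tracer-less Go execution that just advances the state
	// and hands each starting point to the workers.
	for i := 0; i < txCount; i++ {
		jobs <- startingPoint{txIndex: i}
	}
	close(jobs)
	wg.Wait()
}
```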

@Gilad-Eisenberger
Author

That's helpful information. That explains the multi-core use I was not expecting.

Removing the toHex() call from the prestate tracer's code seems to have the most significant impact. From my understanding, the code buffer crosses the Go-JS boundary three times (once as a buffer to the JS side, then back to Go as a buffer, then finally once more to JS as a string). I couldn't find an efficient way to convert the original Uint8Array to hex since Buffer does not seem to be available, but perhaps that could be an effective workaround.
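
For illustration, a toHex-style helper that does the hex encoding on the Go side in a single hop might look roughly like this on a Goja VM (a sketch, not geth's actual tracer environment):

```go
package main

import (
	"fmt"
	"log"

	"github.com/dop251/goja"
	"github.com/ethereum/go-ethereum/common/hexutil"
)

func main() {
	vm := goja.New()

	// A toHex-style helper implemented in Go: the bytes are hex-encoded on
	// the Go side, so the buffer crosses the Go-JS boundary only once.
	err := vm.Set("toHex", func(v goja.Value) string {
		if ab, ok := v.Export().(goja.ArrayBuffer); ok {
			return hexutil.Encode(ab.Bytes())
		}
		return ""
	})
	if err != nil {
		log.Fatal(err)
	}

	// Hand a Go byte slice to JS as an ArrayBuffer and hex-encode it there.
	vm.Set("code", vm.NewArrayBuffer([]byte{0xde, 0xad, 0xbe, 0xef}))

	v, err := vm.RunString(`toHex(code)`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(v.String()) // 0xdeadbeef
}
```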

Our main concern currently is the upcoming Gray Glacier fork, which will force us to upgrade all nodes. Thanks for all the help so far.

@s1na
Contributor

s1na commented Jun 23, 2022

Pointing to toHex was a good hint. The problem is, as I guessed, in the conversion of Go []byte to JS Uint8Array and back. I have a small benchmark comparing these conversions for Duktape vs Goja, and the difference is stark.

Goja exposes no explicit API to do this, so we've had to make do with hacky approaches. I modified our conversion functions slightly and performance now seems on par with Duktape on a limited dataset. See #25156. I still have to test it more, both for correctness and speed, but I would appreciate it if you could try it with your own tracer and see how it fares.
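
The gist of the direction is to go through an ArrayBuffer rather than converting element by element, roughly like this (a simplified sketch of one way to do it, not necessarily the exact code in the PR):

```go
package main

import (
	"fmt"
	"log"

	"github.com/dop251/goja"
)

func main() {
	vm := goja.New()

	// Go -> JS: wrap the slice in an ArrayBuffer (no per-element conversion)
	// and build a Uint8Array view over it with the JS constructor.
	data := []byte{0xca, 0xfe, 0xba, 0xbe}
	arr, err := vm.New(vm.Get("Uint8Array"), vm.ToValue(vm.NewArrayBuffer(data)))
	if err != nil {
		log.Fatal(err)
	}
	vm.Set("input", arr)

	// JS -> Go: grab the typed array's backing ArrayBuffer and read its
	// bytes back out on the Go side.
	v, err := vm.RunString(`input.buffer`)
	if err != nil {
		log.Fatal(err)
	}
	if buf, ok := v.Export().(goja.ArrayBuffer); ok {
		fmt.Printf("%x\n", buf.Bytes()) // cafebabe
	}
}
```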

@Gilad-Eisenberger
Author

Thanks for the effort. I was able to verify the behavior on a cloud-drive-based node, and it looks good, definitely better than before. Thanks again for the assistance and for giving this focus.
