Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(profiling): Tune the sample profile generation code for performance #1694

Conversation

Zylphrex
Copy link
Member

We noticed that generating the sample format at the end of a profile can get rather slow and this aims to improve what we can here with minimal changes. A few things we took advantage of to accomplish this:

  • Turning the extracted stack into a tuple so it is hashable so it can be used as a dictionary key. This let's us check if the stack is indexed first, and skip indexing the frames again. This is especially effective in profiles where it's blocking on a network request for example, since there will be many identical stacks.
  • Using the hash of the stack as the dictionary key. Hashing the entire stack can be an expensive operation since a stack can have up to 128 frames. Using it as a dictionary key means it needs to be rehashed each time. To avoid this, we pre-hash the stack and use the hash as a dictionary key which is more efficient.
  • Convert numbers to strings ahead of time if we know have to. Values like the tid and elapsed since start ns needs to be sent as a string. However, many samples share the same value for it, and we're doing the conversion each time. Instead, we convert them to a string upfront and reuse it as needed in order to minimize unnecessary calculations.

@Zylphrex
Copy link
Member Author

Zylphrex commented Oct 19, 2022

Some benchmarks I collected comparing the before and afters using a few arbitrary profiles.

samples: 439		stacks: 4326
old	avg of 100 runs	   21.9256ms
new	avg of 100 runs	    3.5280ms
samples: 507		stacks: 4227
old	avg of 100 runs	   18.8779ms
new	avg of 100 runs	    4.2562ms
samples: 1688		stacks: 15167
old	avg of 100 runs	   61.5474ms
new	avg of 100 runs	   14.6774ms
samples: 418		stacks: 3347
old	avg of 100 runs	   14.5421ms
new	avg of 100 runs	    2.5589ms
samples: 513		stacks: 2568
old	avg of 100 runs	   14.5687ms
new	avg of 100 runs	    2.9569ms
samples: 379		stacks: 758
old	avg of 100 runs	    5.7061ms
new	avg of 100 runs	    1.5271ms

From this testing, this looks to be consistently faster than what it used to be, and I have no reason to believe it's slower in any case.

@antonpirker
Copy link
Member

Side note: The "daphne" errors above in the Django tests was fixed in master, to if you update your branches they should be gone.

Base automatically changed from txiao/tests/add-tests-for-thread-schedulers to master October 20, 2022 12:56
@Zylphrex Zylphrex requested review from a team, sl0thentr0py and antonpirker October 20, 2022 15:42
We noticed that generating the sample format at the end of a profile can get
rather slow and this aims to improve what we can here with minimal changes. A
few things we took advantage of to accomplish this:

- Turning the extracted stack into a tuple so it is hashable so it can be used
  as a dictionary key. This let's us check if the stack is indexed first, and
  skip indexing the frames again. This is especially effective in profiles where
  it's blocking on a network request for example, since there will be many
  identical stacks.
- Using the hash of the stack as the dictionary key. Hashing the entire stack
  can be an expensive operation since a stack can have up to 128 frames. Using
  it as a dictionary key means it needs to be rehashed each time. To avoid this,
  we pre-hash the stack and use the hash as a dictionary key which is more
  efficient.
- Convert numbers to strings ahead of time if we know have to. Values like the
  tid and elapsed since start ns needs to be sent as a string. However, many
  samples share the same value for it, and we're doing the conversion each time.
  Instead, we convert them to a string upfront and reuse it as needed in order
  to minimize unnecessary calculations.
@Zylphrex Zylphrex force-pushed the txiao/perf/tune-the-sample-profile-generation-code-for-performance branch from fba25ec to 3b316c2 Compare October 20, 2022 19:51
Copy link
Member

@sl0thentr0py sl0thentr0py left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice numbers 🎉

@Zylphrex Zylphrex merged commit a8fdcb0 into master Oct 21, 2022
@Zylphrex Zylphrex deleted the txiao/perf/tune-the-sample-profile-generation-code-for-performance branch October 21, 2022 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants