feat: symdb custom binary format #3138

Open
wants to merge 40 commits into
base: main

kolesnikovae
Collaborator

@kolesnikovae kolesnikovae commented Mar 27, 2024

Resolves #2926

The change eliminates the use of parquet tables from symdb. This significantly improves read selectivity for symbolic information and, more importantly, enables fetching symbolic information directly from blocks in the object storage without the need to keep parquet files open in memory (in ingesters and store-gateways).

Note that the new format is not enabled by default, for backward compatibility (a feature flag, of sorts). After more intensive internal testing, it will be enabled by default.


Compression

The new encoding reduces the size on disk by up to 30%.

Using the current encoding and block layout:

drwxr-xr-x  8 kolesnikovae  staff       256 .
drwxr-xr-x  8 kolesnikovae  staff       256 ..
-rw-r--r--  1 kolesnikovae  staff  14019192 functions.parquet
-rw-r--r--  1 kolesnikovae  staff     17900 index.symdb
-rw-r--r--  1 kolesnikovae  staff  31114499 locations.parquet
-rw-r--r--  1 kolesnikovae  staff      8186 mappings.parquet
-rw-r--r--  1 kolesnikovae  staff  66516066 stacktraces.symdb
-rw-r--r--  1 kolesnikovae  staff  10345291 strings.parquet

Encoded in the new format:

-rw-r--r--  1 kolesnikovae  staff  88851848 symbols.symdb
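
For this particular block, the reduction can be checked directly from the two listings. The sketch below only sums the byte sizes shown above and compares them with the size of `symbols.symdb`; the helper names `sum` and `reductionPercent` are made up for the example.

```go
package main

import "fmt"

// sum adds up file sizes given in bytes.
func sum(sizes ...int64) int64 {
	var total int64
	for _, s := range sizes {
		total += s
	}
	return total
}

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after int64) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	// Sizes in bytes, copied from the listings above.
	before := sum(
		14019192, // functions.parquet
		17900,    // index.symdb
		31114499, // locations.parquet
		8186,     // mappings.parquet
		66516066, // stacktraces.symdb
		10345291, // strings.parquet
	)
	const after int64 = 88851848 // symbols.symdb

	fmt.Printf("before=%d after=%d reduction=%.1f%%\n",
		before, after, reductionPercent(before, after))
	// prints: before=122021134 after=88851848 reduction=27.2%
}
```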

Performance

Although the change does not directly target query latency, benchmarks show a roughly 10-20% reduction in the overall query duration for SelectMergeByStacktraces:

Single service:

goos: darwin
goarch: arm64
pkg: github.com/grafana/pyroscope/pkg/phlaredb
                          │   before    │               after                │
                          │   sec/op    │   sec/op     vs base               │
_SelectMergeByStacktraces   526.4m ± 2%   474.9m ± 3%  -9.79% (p=0.000 n=10)

                          │    before    │                after                 │
                          │     B/op     │     B/op      vs base                │
_SelectMergeByStacktraces   288.7Mi ± 5%   225.2Mi ± 3%  -22.00% (p=0.000 n=10)

                          │   before    │               after                │
                          │  allocs/op  │  allocs/op   vs base               │
_SelectMergeByStacktraces   2.278M ± 0%   2.155M ± 0%  -5.40% (p=0.000 n=10)

The whole block ({}):

goos: darwin
goarch: arm64
pkg: github.com/grafana/pyroscope/pkg/phlaredb
                          │   before    │               after                │
                          │   sec/op    │   sec/op    vs base                │
_SelectMergeByStacktraces   5.544 ± 16%   4.471 ± 5%  -19.35% (p=0.000 n=10)

                          │    before    │                after                 │
                          │     B/op     │     B/op      vs base                │
_SelectMergeByStacktraces   5.815Gi ± 1%   2.878Gi ± 1%  -50.50% (p=0.000 n=10)

                          │   before    │               after                │
                          │  allocs/op  │  allocs/op   vs base               │
_SelectMergeByStacktraces   27.31M ± 0%   25.40M ± 0%  -7.02% (p=0.000 n=10)

The test dataset comprises real-life data from one of the internal deployments – 1GB collected over one hour.

@kolesnikovae kolesnikovae changed the title from "feat: symdb strings encoding" to "feat: symdb custom binary format" Mar 27, 2024
@knylander-grafana
Contributor

Do we need to update the docs for this feature?

@kolesnikovae
Collaborator Author

@knylander-grafana, I'll update the https://grafana.com/docs/pyroscope/latest/reference-pyroscope-architecture/block-format page – just a couple of lines, nothing very important

@kolesnikovae kolesnikovae marked this pull request as ready for review April 29, 2024 05:56
@kolesnikovae kolesnikovae requested review from a team as code owners April 29, 2024 05:56
@knylander-grafana knylander-grafana added the type/docs Improvements for doc docs. Used by Docs team for project management label May 14, 2024
Contributor

@knylander-grafana knylander-grafana left a comment

Thank you for updating the reference architecture.

@kolesnikovae kolesnikovae added storage Low level storage matters performance If there's anything we have to be really good at it's this labels May 15, 2024
Development

Successfully merging this pull request may close these issues.

Symbolic information binary format
2 participants