
Symbolic information binary format #2926

Open
kolesnikovae opened this issue Jan 16, 2024 · 1 comment · May be fixed by #3138

kolesnikovae (Collaborator) commented Jan 16, 2024

Pyroscope stores symbolic information such as locations, functions, mappings, and strings in column-major order, in parquet format. We define the schema dynamically and have hand-written construct/deconstruct procedures for each of the models. While this gives us a simple and convenient way to manage and maintain the storage schema, the approach has its own disadvantages:

  1. We always read all of the model fields/columns, and read/write buffers are allocated for each column, which causes excessive IO and resource usage (see the sketch after this list).
  2. Decoding is fairly expensive (~5-7% of the query CPU time).
  3. Read amplification, because a partition can span parquet column chunk page boundaries.
  4. Despite the small payload size, fetching the partitions is often responsible for tail latencies. The impact is even more pronounced on downsampled/aggregated data.
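
To make points 1 and 2 concrete, here is a simplified sketch of what a hand-written deconstruct step amounts to in a column-major layout. The `Function` model and its fields are hypothetical, chosen only for illustration; they do not match the actual schema code:

```go
// Hypothetical column-major deconstruct step, for illustration only
// (the model and field names do not match the actual Pyroscope schema code).
package sketch

type Function struct {
	ID         uint64
	NameID     uint32
	FilenameID uint32
	StartLine  uint32
}

// deconstruct splits rows into one slice per column. Each column is then
// written (and later read) through its own parquet buffer, so a query pays
// for every column even when it needs only a single field.
func deconstruct(rows []Function) (ids []uint64, names, files, lines []uint32) {
	for _, f := range rows {
		ids = append(ids, f.ID)
		names = append(names, f.NameID)
		files = append(files, f.FilenameID)
		lines = append(lines, f.StartLine)
	}
	return ids, names, files, lines
}
```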

In the screenshot below you can see that a parquetTableRange.fetch call lasted for 3 seconds with no good reason; it was probably blocked by the async page reader that is shared with the profile table reader:

[Screenshot: trace view showing the 3-second parquetTableRange.fetch call]

I propose developing a custom binary format with low-level encoders and decoders for the data models. The data should be organised in row-major order. I expect this to effectively remove symbolic data retrieval from the list of query latency factors.
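
As a rough sketch of the proposed direction, the same kind of record could be laid out row by row, so that a single sequential read returns complete rows without per-column buffers. The `Function` model and the uvarint encoding below are assumptions for illustration, not the actual format:

```go
// Row-major sketch: all fields of a record are stored contiguously.
// The Function model and the varint encoding are illustrative
// assumptions, not the final format.
package sketch

import (
	"bufio"
	"encoding/binary"
	"io"
)

type Function struct {
	ID         uint64
	NameID     uint32
	FilenameID uint32
	StartLine  uint32
}

// WriteFunction appends one record, field by field, as uvarints.
func WriteFunction(w *bufio.Writer, f Function) error {
	var buf [binary.MaxVarintLen64]byte
	for _, v := range []uint64{f.ID, uint64(f.NameID), uint64(f.FilenameID), uint64(f.StartLine)} {
		n := binary.PutUvarint(buf[:], v)
		if _, err := w.Write(buf[:n]); err != nil {
			return err
		}
	}
	return nil
}

// ReadFunction decodes one record written by WriteFunction.
func ReadFunction(r io.ByteReader) (Function, error) {
	var vals [4]uint64
	for i := range vals {
		v, err := binary.ReadUvarint(r)
		if err != nil {
			return Function{}, err
		}
		vals[i] = v
	}
	return Function{
		ID:         vals[0],
		NameID:     uint32(vals[1]),
		FilenameID: uint32(vals[2]),
		StartLine:  uint32(vals[3]),
	}, nil
}
```

With a layout along these lines, a partition can be decoded in one sequential pass with a single small read buffer, instead of allocating buffers per column.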

kolesnikovae self-assigned this on Jan 16, 2024
kolesnikovae added the storage and performance labels on Jan 16, 2024
cyriltovena (Contributor) commented:

Definitely agree with aiming to reduce IO for symbols, but I think it's not just parquet: it seems stacktraces.symdb is also causing tail latency.

kolesnikovae linked a pull request (#3138) on Apr 29, 2024 that will close this issue