
Treat docs with recovery_source as deletes in merges #107979

Draft · wants to merge 1 commit into main

Conversation

@dnhatn (Member) commented Apr 27, 2024

Currently, we do not account for the number of documents whose recovery_source is ready to be dropped when selecting merge specifications. As a result, a single large segment containing recovery_source documents can be considered fully merged even though Elasticsearch should trigger a merge to remove them.

With this PR, the merge policy treats documents whose recovery_source is ready to be dropped as deletions when determining merge specifications. Essentially, documents with recovery_source are now treated as soft-deleted by the Elasticsearch merge policy.

We will need a follow-up to trigger merges when the retention leases advance far enough to drop soft deletes and recovery_source.
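
As a rough illustration of the idea, here is a minimal sketch, not the code in this PR: a wrapping merge policy that reports live documents still carrying a recovery source as extra deletes in numDeletesToMerge. The class name, the `_recovery_source` numeric doc-values lookup, and the weighting knob are assumptions for the example.

```java
import java.io.IOException;

import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.IOSupplier;

/**
 * Illustrative sketch only: wraps another merge policy and reports live documents
 * that still carry a recovery source as pending deletes, so segments holding them
 * look like merge candidates to the delegate policy.
 */
public class RecoverySourceAwareMergePolicy extends FilterMergePolicy {

    private static final String RECOVERY_SOURCE_FIELD = "_recovery_source"; // assumed field name
    private final double recoverySourceDeleteWeight;

    public RecoverySourceAwareMergePolicy(MergePolicy in, double recoverySourceDeleteWeight) {
        super(in);
        this.recoverySourceDeleteWeight = recoverySourceDeleteWeight;
    }

    @Override
    public int numDeletesToMerge(SegmentCommitInfo info, int delCount,
                                 IOSupplier<CodecReader> readerSupplier) throws IOException {
        int baseDeletes = super.numDeletesToMerge(info, delCount, readerSupplier);
        CodecReader reader = readerSupplier.get();
        NumericDocValues recoverySource = reader.getNumericDocValues(RECOVERY_SOURCE_FIELD);
        if (recoverySource == null) {
            return baseDeletes; // nothing left to prune in this segment
        }
        Bits liveDocs = reader.getLiveDocs();
        int withRecoverySource = 0;
        int doc;
        while ((doc = recoverySource.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            if (liveDocs == null || liveDocs.get(doc)) {
                withRecoverySource++; // live doc whose recovery source a merge could drop
            }
        }
        int weighted = (int) (withRecoverySource * recoverySourceDeleteWeight);
        // Never report more deletes than the segment has documents.
        return Math.min(info.info.maxDoc(), baseDeletes + weighted);
    }
}
```

Reporting these documents as deletes makes the wrapped policy see segments that still carry recovery_source as dirty, so they become candidates for natural merges instead of only being cleaned up by force-merges.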

@dnhatn (Member, Author) commented Apr 28, 2024

I benchmarked this change with the tsdb track and noticed that we triggered merges too eagerly. I'll investigate further. Two things I'm pondering: (1) treating a document with recovery_source as half a delete, and (2) accounting for these documents only in force-merges. (A hypothetical sketch of option (1) follows the results table below.)

|                                                        Metric |  Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:| -----:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |       |   320.796       |   315.2         |    -5.59605 |    min |   -1.74% |
|             Min cumulative indexing time across primary shard |       |   320.796       |   315.2         |    -5.59605 |    min |   -1.74% |
|          Median cumulative indexing time across primary shard |       |   320.796       |   315.2         |    -5.59605 |    min |   -1.74% |
|             Max cumulative indexing time across primary shard |       |   320.796       |   315.2         |    -5.59605 |    min |   -1.74% |
|           Cumulative indexing throttle time of primary shards |       |     0           |     0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |       |     0           |     0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |       |     0           |     0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |       |     0           |     0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |       |    93.2255      |    75.7022      |   -17.5233  |    min |  -18.80% |
|                      Cumulative merge count of primary shards |       |    34           |    33           |    -1       |        |   -2.94% |
|                Min cumulative merge time across primary shard |       |    93.2255      |    75.7022      |   -17.5233  |    min |  -18.80% |
|             Median cumulative merge time across primary shard |       |    93.2255      |    75.7022      |   -17.5233  |    min |  -18.80% |
|                Max cumulative merge time across primary shard |       |    93.2255      |    75.7022      |   -17.5233  |    min |  -18.80% |
|              Cumulative merge throttle time of primary shards |       |     0.26115     |     0.0804167   |    -0.18073 |    min |  -69.21% |
|       Min cumulative merge throttle time across primary shard |       |     0.26115     |     0.0804167   |    -0.18073 |    min |  -69.21% |
|    Median cumulative merge throttle time across primary shard |       |     0.26115     |     0.0804167   |    -0.18073 |    min |  -69.21% |
|       Max cumulative merge throttle time across primary shard |       |     0.26115     |     0.0804167   |    -0.18073 |    min |  -69.21% |
|                     Cumulative refresh time of primary shards |       |     2.08598     |     2.55522     |     0.46923 |    min |  +22.49% |
|                    Cumulative refresh count of primary shards |       |    85           |    84           |    -1       |        |   -1.18% |
|              Min cumulative refresh time across primary shard |       |     2.08598     |     2.55522     |     0.46923 |    min |  +22.49% |
|           Median cumulative refresh time across primary shard |       |     2.08598     |     2.55522     |     0.46923 |    min |  +22.49% |
|              Max cumulative refresh time across primary shard |       |     2.08598     |     2.55522     |     0.46923 |    min |  +22.49% |
|                       Cumulative flush time of primary shards |       |    11.9147      |    12.6221      |     0.7074  |    min |   +5.94% |
|                      Cumulative flush count of primary shards |       |    66           |    66           |     0       |        |    0.00% |
|                Min cumulative flush time across primary shard |       |    11.9147      |    12.6221      |     0.7074  |    min |   +5.94% |
|             Median cumulative flush time across primary shard |       |    11.9147      |    12.6221      |     0.7074  |    min |   +5.94% |
|                Max cumulative flush time across primary shard |       |    11.9147      |    12.6221      |     0.7074  |    min |   +5.94% |
|                                       Total Young Gen GC time |       |    50.419       |    50.474       |     0.055   |      s |   +0.11% |
|                                      Total Young Gen GC count |       |   761           |   770           |     9       |        |   +1.18% |
|                                         Total Old Gen GC time |       |     0           |     0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |       |     0           |     0           |     0       |        |    0.00% |
|                                                    Store size |       |     4.68128     |     5.45548     |     0.7742  |     GB |  +16.54% |
|                                                 Translog size |       |     5.12227e-08 |     5.12227e-08 |     0       |     GB |    0.00% |
|                                        Heap used for segments |       |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |       |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |       |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |       |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |       |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |       |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |       |     7           |    35           |    28       |        | +400.00% |
|                                   Total Ingest Pipeline count |       |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |       |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |       |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput | index | 37462.5         | 37638.5         |   175.953   | docs/s |   +0.47% |
|                                               Mean Throughput | index | 39465.3         | 38934.6         |  -530.772   | docs/s |   -1.34% |
|                                             Median Throughput | index | 39041           | 38770.3         |  -270.731   | docs/s |   -0.69% |
|                                                Max Throughput | index | 43571           | 41817.6         | -1753.48    | docs/s |   -4.02% |
|                                       50th percentile latency | index |   864.524       |   824.804       |   -39.72    |     ms |   -4.59% |
|                                       90th percentile latency | index |  1155.46        |  1123.54        |   -31.922   |     ms |   -2.76% |
|                                       99th percentile latency | index |  7010.53        |  7033.46        |    22.9289  |     ms |   +0.33% |
|                                     99.9th percentile latency | index | 10877.9         | 11066           |   188.057   |     ms |   +1.73% |
|                                    99.99th percentile latency | index | 14409.2         | 14533.4         |   124.257   |     ms |   +0.86% |
|                                      100th percentile latency | index | 15428.8         | 18159.1         |  2730.33    |     ms |  +17.70% |
|                                  50th percentile service time | index |   864.524       |   824.804       |   -39.72    |     ms |   -4.59% |
|                                  90th percentile service time | index |  1155.46        |  1123.54        |   -31.922   |     ms |   -2.76% |
|                                  99th percentile service time | index |  7010.53        |  7033.46        |    22.9289  |     ms |   +0.33% |
|                                99.9th percentile service time | index | 10877.9         | 11066           |   188.057   |     ms |   +1.73% |
|                               99.99th percentile service time | index | 14409.2         | 14533.4         |   124.257   |     ms |   +0.86% |
|                                 100th percentile service time | index | 15428.8         | 18159.1         |  2730.33    |     ms |  +17.70% |
|                                                    error rate | index |     0           |     0           |     0       |      % |    0.00% |
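
For illustration only, option (1) maps onto the weighting knob in the hypothetical sketch from the PR description; none of this was part of the benchmark above.

```java
// Hypothetical usage of the RecoverySourceAwareMergePolicy sketch shown earlier.
// Each retained recovery_source document counted as a full delete (merges eagerly):
MergePolicy eager = new RecoverySourceAwareMergePolicy(new TieredMergePolicy(), 1.0);
// Option (1): counted as half a delete, so merges are triggered less aggressively:
MergePolicy relaxed = new RecoverySourceAwareMergePolicy(new TieredMergePolicy(), 0.5);
```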

@dnhatn (Member, Author) commented Apr 28, 2024

There is a bug in Lucene merges. I've opened: apache/lucene#13324

@dnhatn (Member, Author) commented Apr 29, 2024

Adrien kindly discussed this with me offline. Overriding numDeletesToMerge can be risky; we prefer to override the isMerged() method instead.

The review comment below is attached to this hunk of the change:

```java
@Override
public int numDeletesToMerge(SegmentCommitInfo info, int delCount, IOSupplier<CodecReader> readerSupplier) throws IOException {
    // ... body of the override under review ...
}
```
@s1monw (Contributor) commented:

I think it's dangerous to piggyback on the number of deletes here. We should rather think about having a notion of how much space a merge would free, and use that upstream in the merge policy to decide what to merge. I want to avoid making promises here about how many documents will be deleted that turn out not to hold, leaving the new segment unexpectedly larger than planned.

@dnhatn (Member, Author) replied:

Thanks @s1monw. That's a great suggestion. I'll implement it.
