Watch starvation can cause OOMs #16839

Open
serathius opened this issue Oct 27, 2023 · 12 comments

Comments

@serathius
Member

serathius commented Oct 27, 2023

What would you like to be added?

Please read #16839 (comment) for up-to-date information about the problem.

@serathius
Member Author

serathius commented Oct 27, 2023

Profiles I promised. They are less extreme than the ones we saw originally, since they are more recent and we reduced the write throughput in the meantime. Still, they show the problem:

32% of allocations done by Range to get PrevKey for Watch:

[allocation profile screenshot]

57% of CPU used on GC:

[CPU profile screenshot]

@serathius serathius changed the title Enabling watch PrevKey causes high allocations Watch PrevKey causes high allocations Oct 27, 2023
@serathius
Member Author

serathius commented Oct 30, 2023

One clarification: the access to prevKey in

_, created, ver, err := tw.s.kvindex.Get(key, rev)

doesn't touch bbolt, only the in-memory index.

This means that prevKey is not necessarily accessed in the apply loop (it is only when the put has p.IgnoreValue || p.IgnoreLease || p.PrevKv). This leaves us with a tradeoff:

  • Access prevKey in the apply loop and risk that no watch requests prevKey.
  • Access prevKey outside the apply loop and pay for copying the buffer.

The tradeoff is not clear; we need proper scalability testing to verify performance in different scenarios.
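To make the tradeoff concrete, here is a minimal sketch of the two options; the types and helper names are illustrative stand-ins, not etcd's actual mvcc API:

```go
package main

// Hypothetical types standing in for etcd's mvcc internals; names are
// illustrative only, not the real API.
type Event struct {
	Key, Value, PrevValue []byte
}

type store interface {
	// Blocking read against the shared buffer; only safe inside the apply loop.
	blockingRead(key []byte, rev int64) ([]byte, error)
	// Concurrent read; safe outside the apply loop, but copies the txn buffer.
	concurrentRead(key []byte, rev int64) ([]byte, error)
}

// Option 1: resolve prevKV inside the apply loop. Cheap read, but the work is
// wasted if no watcher on this key asked for prevKV.
func applyPut(s store, ev *Event, rev int64) error {
	prev, err := s.blockingRead(ev.Key, rev-1)
	if err != nil {
		return err
	}
	ev.PrevValue = prev
	return nil
}

// Option 2: resolve prevKV lazily when notifying a watcher that asked for it.
// Avoids useless reads, but each call pays for copying the read buffer.
func notifyWatcher(s store, ev *Event, rev int64, wantsPrevKV bool) error {
	if !wantsPrevKV {
		return nil
	}
	prev, err := s.concurrentRead(ev.Key, rev-1)
	if err != nil {
		return err
	}
	ev.PrevValue = prev
	return nil
}

func main() {}
```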

@chaochn47
Member

chaochn47 commented Mar 7, 2024

xref. #17529 (comment).

A read Txn in ConcurrentReadTxMode can take as long as half a second, and it executes single-threaded (a single goroutine) in each serverWatchStream.sendLoop. This heavily impacts per-serverWatchStream throughput.

This reminds me of #12896, which switches to mvcc.SharedBufReadTxMode when the Txn contains a write operation, to avoid the extra overhead of copying the buffer. That is a potential direction for improvement in the future.
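As a toy illustration (not etcd code) of the difference: ConcurrentReadTxMode pays for a copy of the write buffer on every read txn, while SharedBufReadTxMode avoids the copy at the cost of holding the lock and serializing with writes:

```go
package main

import "sync"

// txBuffer is a stand-in for the backend write buffer; the real structure in
// etcd is more involved, this only models the copy-vs-share choice.
type txBuffer struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// concurrentRead mimics ConcurrentReadTxMode: take a full copy of the buffer
// so the read can proceed without holding the lock. Cost: one allocation of
// the whole buffer per read txn (per watcher, in the prevKV case).
func (b *txBuffer) concurrentRead(key string) []byte {
	b.mu.RLock()
	snapshot := make(map[string][]byte, len(b.data))
	for k, v := range b.data {
		snapshot[k] = v
	}
	b.mu.RUnlock()
	return snapshot[key]
}

// sharedRead mimics SharedBufReadTxMode: no copy, but the read holds the lock
// and therefore contends with concurrent writes.
func (b *txBuffer) sharedRead(key string) []byte {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.data[key]
}

func main() {
	b := &txBuffer{data: map[string][]byte{"foo": []byte("bar")}}
	_ = b.concurrentRead("foo")
	_ = b.sharedRead("foo")
}
```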

@chaochn47
Member

The bigger issue with prevKV is that for a single write we copy the event for each matching watcher, and for each watcher we copy the entire transaction buffer just to get prevKV, so the copy is duplicated 100 times (assuming there are 100 watchers).

Hence it is not efficient IMHO.

@ahrtr
Member

ahrtr commented Mar 8, 2024

  • Reading a previous version for watch is executed outside of the apply loop, meaning we can no longer use a blocking read, but must use ConcurrentReadTxMode, which copies the transaction buffer. This is very costly in allocations.

Indeed there are two issues:

  • For the extra-costly ConcurrentReadTxMode read, can we mitigate it by using SharedBufReadTxMode?
  • For the "executed outside of the apply loop" issue, it makes the whole watch response processing no longer atomic. If a compaction happens in between, we may not be able to read the prevKV. It might not be a serious problem.
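For the second point, a hedged sketch of one way the non-atomic read could degrade gracefully (the error and helper names are illustrative, not a concrete proposal): if a compaction races with the watch-side read, deliver the event without PrevKv instead of failing it:

```go
package main

import "errors"

// ErrCompacted stands in for etcd's mvcc compaction error.
var ErrCompacted = errors.New("mvcc: required revision has been compacted")

type reader interface {
	rangeAt(key []byte, rev int64) ([]byte, error)
}

// fillPrevKV tries to resolve prevKV outside the apply loop. If a compaction
// raced with us, the previous revision is simply unavailable; the event is
// still delivered, just without PrevKv.
func fillPrevKV(r reader, key []byte, modRev int64) ([]byte, error) {
	prev, err := r.rangeAt(key, modRev-1)
	if errors.Is(err, ErrCompacted) {
		return nil, nil // compacted away; deliver the event without prevKV
	}
	return prev, err
}

func main() {}
```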

@wojtek-t
Contributor

/cc

@serathius
Member Author

serathius commented Mar 11, 2024

Deleted the comment above because it was totally inaccurate.

tl;dr: The issue is not related to prevKV, but to overall watch efficiency during watch starvation. We need to implement #17563 to fix the etcd memory issue.

The main reason that enabling prevKV had such an impact is that it doubles the event size, putting twice as much strain on the response stream. In my testing I was close to the response throughput limit (~1.6GB/s) even without prevKV, so enabling it pushed etcd over the edge and caused the memory blowup. I repeated the tests, comparing watch streams based on response throughput (measured using the etcd_network_client_grpc_sent_bytes_total metric).

In the repeated tests I found that prevKV is not at fault at all; the issue is caused by the inefficient behavior of syncWatchers when the watch stream is clogged. It repeatedly tries to sync watchers every 100ms, rereading all the events but effectively syncing only a small subset of watchers. In my testing with 100 watchers, I observed that the sync loop touched each revision 25 times, meaning we allocate and deallocate the etcd watch window 25 times. I also found that, for 100 watchers, reading prevKV 100 times per event is less costly than syncWatchers rereading the watch history 25 times.

| prevKV | object size | response throughput [MB/s] | etcd memory [GB] | 90%ile watch event latency [s] | 99%ile watch event latency [s] | 99.9%ile watch event latency [s] | 90%ile put latency [s] | 99%ile put latency [s] | 99.9%ile put latency [s] |
|---|---|---|---|---|---|---|---|---|---|
| TRUE | 50KB | 1003 | 0.389 | 0.116 | 0.57 | 0.843 | 0.001 | 0.003 | 0.01 |
| FALSE | 100KB | 1065 | 0.5637 | 0.099 | 0.495 | 0.6629 | 0.001 | 0.004 | 0.012 |
| TRUE | 100KB | 1591 | 4 | 6.56 | 10.16 | 12.78 | 0.002 | 0.007 | 0.014 |
| FALSE | 200KB | 1670 | 6.1 | 5.5 | 8.34 | 9.96 | 0.003 | 0.011 | 0.023 |

The table above shows two sets of tests: one with response throughput below the limit (~1 GB/s) and one at or above the limit (~1.6 GB/s). I adjusted the object size between enabled and disabled prevKV to get similar response throughput.

Conclusions: Same as before, the changes by @chaochn47 will not address the issue. There is no need to improve prevKV (at least for now); to address the etcd memory bloat we need to improve the reuse of event history, as in #17563.
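As a back-of-the-envelope check on the 25x figure above, a tiny simulation under assumed numbers (the per-round drain rate is a guess, not a measurement): if each 100ms round rereads the whole backlog but only a few watchers actually catch up, the reread count is roughly the watcher count divided by the per-round drain.

```go
package main

import "fmt"

func main() {
	const (
		watchers       = 100 // unsynced watchers sharing a congested stream
		syncedPerRound = 4   // watchers that actually drain per 100ms round (assumed)
	)
	// Each round rereads the whole backlog [minRev, curRev] once, but only
	// syncedPerRound watchers leave the unsynced set, so the backlog is reread
	// until every watcher has caught up.
	rounds := 0
	for unsynced := watchers; unsynced > 0; unsynced -= syncedPerRound {
		rounds++
	}
	fmt.Printf("each revision reread ~%d times for %d watchers\n", rounds, watchers)
	// With these assumptions: ~25 rereads, matching the 25x observed above.
}
```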

@serathius serathius changed the title Watch PrevKey causes high allocations Watch starvation can cause OOMs Mar 11, 2024
@serathius
Member Author

I have one more idea: reduce the frequency of syncWatchers to allow more batching.

@serathius
Member Author

serathius commented Mar 12, 2024

Amazing: just reducing the syncWatchers frequency improves watch latency and memory usage.

| sync duration [s] | response throughput [MB/s] | etcd memory [GB] | 50%ile watch event latency [s] | 90%ile watch event latency [s] | 99%ile watch event latency [s] | 99.9%ile watch event latency [s] |
|---|---|---|---|---|---|---|
| 0.1 (current) | 1620 | 6.6 | 2.93 | 6.34 | 9.64 | 14.04 |
| 0.5 | 1671 | 3.2 | 2.83 | 5.46 | 8.4 | 11.04 |
| 1 (proposed) | 1736 | 2.1 | 2.56 | 4.29 | 6.18 | 7.78 |
| 2 | 1728 | 1.7 | 3.28 | 5.19 | 6.53 | 7.44 |
| 5 | 1204 | 2.2 | 6.16 | 9.38 | 13.42 | 16.43 |

It looks like the improvements peak around 1s: watch response throughput improves by 4%, memory allocations drop 4-5x (base memory for puts is 850MB), and latency drops by 30-45%. This already looks like a great improvement.

A proper implementation of reusing events will be complicated, making a backport questionable, so I propose a simple backportable solution to reduce the impact of slow watchers.

Proposal: Reduce the syncLoop period from 0.1s to 1s

I want to confirm how the original value was chosen.

Note: there is no risk of syncLoop not catching up with unsynced watchers, as it already includes a shortcut for when it cannot keep up:

```go
func (s *watchableStore) syncWatchersLoop() {
	defer s.wg.Done()

	// waitDuration is the sync period discussed above; currently 100ms.
	waitDuration := 100 * time.Millisecond
	delayTicker := time.NewTicker(waitDuration)
	defer delayTicker.Stop()

	for {
		s.mu.RLock()
		st := time.Now()
		lastUnsyncedWatchers := s.unsynced.size()
		s.mu.RUnlock()

		unsyncedWatchers := 0
		if lastUnsyncedWatchers > 0 {
			unsyncedWatchers = s.syncWatchers()
		}
		syncDuration := time.Since(st)

		delayTicker.Reset(waitDuration)
		// more work pending?
		if unsyncedWatchers != 0 && lastUnsyncedWatchers > unsyncedWatchers {
			// be fair to other store operations by yielding time taken
			delayTicker.Reset(syncDuration)
		}

		select {
		case <-delayTicker.C:
		case <-s.stopc:
			return
		}
	}
}
```

@serathius
Member Author

serathius commented Mar 12, 2024

Validated that 1s performs better than 0.1s for different numbers of streams:

| object size | streams | sync duration [s] | response throughput [MB/s] | etcd memory [GB] | 50%ile watch event latency [s] | 90%ile watch event latency [s] | 99%ile watch event latency [s] | 99.9%ile watch event latency [s] |
|---|---|---|---|---|---|---|---|---|
| 1MB | 20 | 0.1 | 1579 | 14.9 | 3.47 | 6.32 | 8.22 | 9.4 |
| 1MB | 20 | 1 | 1634 | 9 | 3.45 | 5.63 | 7.98 | 9.34 |
| 30KB | 1000 | 0.1 | 2006 | 5.5 | 4.26 | 12.74 | 18.81 | 22.69 |
| 30KB | 1000 | 1 | 2257 | 1.4 | 4.07 | 11.76 | 17.8 | 21.87 |
| 3KB | 10000 | 0.1 | 2061 | 1.8 | 4.78 | 12.82 | 19.33 | 23.68 |
| 3KB | 10000 | 1 | 2154 | 1.1 | 4.29 | 11.74 | 18.01 | 22.36 |

I tested all sensible numbers of watchers, from the minimum (the smallest count for which the object size stays below the <1MB max write size limit) up to 10,000, which is 20 times the limit of watchers per sync. The object size was adjusted to ensure the watch stream is congested.

@serathius
Member Author

serathius commented Mar 12, 2024

The syncWatchers period has been set to 100ms since the first implementation of watch in #3507. I didn't find any discussion pointing to why 100ms was picked.

This convinces me that we can change it to improve handling of watch stream starvation.

@ahrtr
Member

ahrtr commented Mar 15, 2024

Thanks for the performance test results.

Proposal: Reduce the syncLoop period from 0.1s to 1s

I think it's more prudent to introduce a config item instead of hard coding the sync loop interval/frequency.
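As a sketch, something along these lines would keep today's default while making it tunable; the field and flag names here are purely illustrative, not an existing etcd option:

```go
package main

import (
	"flag"
	"time"
)

// Hypothetical server configuration field; etcd does not expose such a flag today.
type ServerConfig struct {
	// WatcherSyncInterval is how often the unsynced watcher group is retried.
	WatcherSyncInterval time.Duration
}

func main() {
	cfg := ServerConfig{}
	flag.DurationVar(&cfg.WatcherSyncInterval, "experimental-watcher-sync-interval",
		100*time.Millisecond, "interval between rounds of syncing unsynced watchers")
	flag.Parse()
	_ = cfg // would be plumbed down to watchableStore's syncWatchersLoop
}
```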

Since the history never changes (of course it can be compacted), the idea of reusing events (#17563) should be OK.

We should also consider limiting the maximum data/events to read. Currently it always reads all the data/events in the range [minRev, curRev]. Obviously this may consume huge amounts of memory if there is a huge number of events in the range.

revs, vs := tx.UnsafeRange(schema.Key, minBytes, maxBytes, 0)
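One possible shape for that, as a rough sketch with mocked types (the chunk size, helper names, and loop are assumptions, not the actual syncWatchers code): pass a non-zero limit to the range read and advance minRev between chunks, instead of materializing the whole [minRev, curRev] range in one call.

```go
package main

// Sketch only, with mocked types: the idea is to pass a non-zero limit and
// advance minRev between chunks, instead of reading the whole range at once
// as syncWatchers does today (limit=0).

type kv struct {
	rev   int64
	value []byte
}

// readTx mocks the backend read transaction; rangeLimited stands in for
// tx.UnsafeRange(schema.Key, minBytes, maxBytes, limit).
type readTx interface {
	rangeLimited(minRev, maxRev, limit int64) []kv
}

const maxEventsPerSync = 1000 // assumed cap, to bound memory per sync round

func syncInChunks(tx readTx, minRev, curRev int64, dispatch func([]kv)) {
	for minRev <= curRev {
		chunk := tx.rangeLimited(minRev, curRev+1, maxEventsPerSync)
		if len(chunk) == 0 {
			return
		}
		dispatch(chunk)                      // convert to events, notify watchers
		minRev = chunk[len(chunk)-1].rev + 1 // resume right after the last chunk
	}
}

func main() {}
```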
