chore: Remove the Sink impl from Fanout #11206
Conversation
✔️ Deploy Preview for vector-project canceled.
🔨 Explore the source changes: 497a94d
🔍 Inspect the deploy log: https://app.netlify.com/sites/vector-project/deploys/6203f393053bb0000839f720
Soak Test Results
Baseline: 6b777be

Explanation: A soak test is an integrated performance test for Vector in a repeatable rig, with varying configurations for Vector. What follows is a statistical summary of a brief Vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether Vector's performance is changed, and to what degree, by a pull request. Where appropriate, units are scaled per-core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their throughput performance between the baseline and comparison SHAs with 90.0% confidence, OR that have been detected as newly erratic. Negative values mean that the baseline is faster, positive that the comparison is. Results that do not exhibit more than a ±8.87% change in mean throughput are discarded. An experiment is erratic if its coefficient of variation is greater than 0.3. The abbreviated table will be omitted if no interesting changes are observed.

Changes in throughput with confidence ≥ 90.00%:
This seems like a step in the right direction design-wise, it just came with a couple small steps back perf-wise 😄. There are a couple of things I'd be curious to see combined with this:

- Not allocating a new `FuturesUnordered` for every event, either by having one as a struct member or moving from a `send(Event)` API to a `send_all(Stream<Event>)` API. It seems like we're going to add a few allocations by using `FuturesUnordered`, but we should be able to amortize the cost (a rough sketch of this follows the list).
- Pushing batches with each inner `send` call rather than single events. If we give each sink more to chew on with each iteration, that should give us a better chance to be ready to feed more as soon as they can accept it.
- Getting rid of the extra event clone when there's only one sink.
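A minimal sketch of that first point, with `FanoutSketch`, `enqueue`, and `drive` as hypothetical names standing in for whatever the real Fanout types would be; the only idea shown is that the `FuturesUnordered` lives on the struct and is reused rather than allocated per event:

```rust
use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, StreamExt};

struct FanoutSketch {
    // One long-lived set of in-flight sends, instead of a fresh
    // FuturesUnordered allocated for every event.
    jobs: FuturesUnordered<BoxFuture<'static, Result<(), ()>>>,
}

impl FanoutSketch {
    fn new() -> Self {
        Self { jobs: FuturesUnordered::new() }
    }

    // Queue one send future per downstream sink for an event (or batch)...
    fn enqueue(&mut self, sends: Vec<BoxFuture<'static, Result<(), ()>>>) {
        for fut in sends {
            self.jobs.push(fut);
        }
    }

    // ...and await the in-flight sends as they complete, without tearing
    // the set down.
    async fn drive(&mut self) {
        while let Some(_res) = self.jobs.next().await {
            // Per-sink success/failure handling would go here.
        }
    }
}
```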
Soak Test Results
Baseline: 5f81641
No interesting changes with confidence ≥ 90.00%.
Soak Test Results
Baseline: 0ba0d97
Changes in throughput with confidence ≥ 90.00%:
d26acd5 to b167744
Soak Test Results
Baseline: 1eab953
No interesting changes with confidence ≥ 90.00%.
Soak Test Results
Baseline: 1c0c15c
Changes in throughput with confidence ≥ 90.00%:
Hmm, well, the soak results are okay-ish. Presumably if we lifted the
More concerning is a panic that's new to this work but does not fail the tests. Make your working directory
This behavior doesn't exist on master but I'm not overly comfortable with a panic happening quietly in the background.
Our grok crate is unusual. We use regex in certain places like date parsing, and onig when dealing with the grok patterns themselves. Work on #11206 strongly suggests that if sinks are allowed to run less subject to the slow-receiver issue, then `apply_date_filter` is a CPU hotspot. As suggested by @tobz, we enable the 'perf' flags of the regex crate, which features heavily in the implementation of that function. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit removes the use of lazy_static! in favor of OnceCell's Lazy. They are functionally similar although OnceCell's implementation has better characteristics in contended situations. This change was motivated by examination of the profiling results from #11206 where a regex to do with date parsing was seen before a CPU quiet period. It's unknown if this change will materially impact that, however we should no longer use lazy_static and so the time seemed ripe. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
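A small before/after sketch of the swap that commit describes; the date regex shown here is illustrative only, not the actual pattern used for date parsing:

```rust
use once_cell::sync::Lazy;
use regex::Regex;

// Previously, with the lazy_static macro:
//
//     lazy_static! {
//         static ref DATE_RE: Regex =
//             Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap();
//     }

// With once_cell, the same thing is a plain static and a closure:
static DATE_RE: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"\d{4}-\d{2}-\d{2}").unwrap());

fn is_date(s: &str) -> bool {
    DATE_RE.is_match(s)
}
```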
I think it's fine to adjust those tests to match any change in behavior here that's actually beneficial. We do want to have a limited number of events we'll pull through in the case of backpressure, but as long as that number is finite and not unreasonably large, that seems sufficient.
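One way to keep that number bounded, as a hypothetical illustration rather than this PR's code, is the futures crate's `ready_chunks`, which pulls at most a fixed number of already-ready items per iteration:

```rust
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Cap on how many ready events get pulled through per iteration.
    const CAP: usize = 128;

    let events = stream::iter(0u64..1_000);
    let mut batches = events.ready_chunks(CAP);

    while let Some(batch) = batches.next().await {
        // Each bounded batch would be handed to the fanout here; a blocked
        // downstream only ever has a finite batch in flight.
        assert!(batch.len() <= CAP);
    }
}
```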
It looks like this must be the sink built here that's getting polled after completion. I initially thought it might be related to #11009 where we removed a couple of
* Remove explicit lazy_static in project

This commit removes the use of lazy_static! in favor of OnceCell's Lazy. They are functionally similar although OnceCell's implementation has better characteristics in contended situations. This change was motivated by examination of the profiling results from #11206 where a regex to do with date parsing was seen before a CPU quiet period. It's unknown if this change will materially impact that, however we should no longer use lazy_static and so the time seemed ripe.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* unused ding in vrl/stdlib

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
6a6765c to 1a45458
Alright, after further discussion between myself, @lukesteensen, and @tobz, we're unconcerned with the panic, as it's seemingly a timing issue made more likely by this change, and we'd like to make larger structural changes to remove
lib/vector-core/src/fanout.rs
Outdated
control_channel: control_rx,
};

(fanout, control_tx)
}

// TODO make a real error
I did notice that all of these are `Box<dyn Sink<Event, Error = ()>>`, so I'm not sure how real of an error we can make, but we should probably resolve the TODO one way or another.
Shoot, completely forgot about this comment.
}
}
while let Some(res) = jobs.next().await { |
I'm wondering if we need to `select` over this and the control channel here. For example, if a sink is blocked and there's a config reload to remove it, we'd want to drop the sink asap rather than wait (potentially forever) for a successful `send` here.
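A rough sketch of what selecting over both might look like, with `drive_sends`, `ControlMessage`, and the channel and future types as hypothetical stand-ins for the real fanout internals:

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::future::Future;
use tokio::sync::mpsc;

// Hypothetical stand-in for the real control message type.
enum ControlMessage {
    Remove(String),
}

// Drive the in-flight sends while also watching the control channel, so a
// Remove does not have to wait behind a blocked send.
async fn drive_sends<F>(
    mut jobs: FuturesUnordered<F>,
    control_rx: &mut mpsc::UnboundedReceiver<ControlMessage>,
) where
    F: Future<Output = Result<(), ()>>,
{
    while !jobs.is_empty() {
        tokio::select! {
            Some(res) = jobs.next() => {
                if res.is_err() {
                    // Mark the offending sink as faulty and drop it.
                }
            }
            Some(ControlMessage::Remove(_name)) = control_rx.recv() => {
                // Ideally the removed sink's in-flight future would be
                // cancelled here, but the futures in `jobs` carry no sink
                // name -- exactly the difficulty discussed further down.
            }
        }
    }
}
```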
That is a good point, actually.
Oh wait, actually, I think this is okay? The reload happens while we're here in this loop. The sink feed/flush fails, that error bubbles up here, and the sink is marked as faulty and nuked. At least, my understanding was that when a component shuts down, writes/flushes into it fail appropriately. Is that incorrect?
IIRC, normal shutdowns essentially involve disconnecting the component's input and letting the stream it's reading end naturally. The only place that we really thread a shutdown signal is into sources. So from `Fanout`'s perspective, it's going to see a `Remove` message come through before the component itself is actually stopped.
Interesting, hmm. So, by the time we hit `send_all`, part of that stream has been read -- whatever was ready to go up to a certain size -- and delivered here. We hit this loop, the caller has stopped reading from the stream by now, and the sink is still alive to process the future against it. On the next call to `send_all` the control messages are polled and the stopped sink is removed.

I think that logic is accurate and we don't need to poll the control messages here?
Hmm. Does that break this whole patch? Right now we can't associate a Future stuffed into `FuturesUnordered` with a name, which is what the `ControlMessage` communicates. In order for selecting on the control channel to work, we have to be able to manipulate the vector of sinks and ignore/cancel any future associated with the specific `GenericEventSink`. In your example above, same scenario, but by the time point 6 comes around we poll the control channel and know which named sink is removed, but we can't get at the future to cancel it.

I suppose we could, instead of using a Future, use a JoinHandle and effectively implement the upcoming tokio `JoinSet`, but with reference-by-name capability too; that seems like a non-trivial leap in complexity here.
I was just trying to think through the same thing. It feels like there should be some way to make the send task cancellable (joining with a per-sink `tokio::sync::Notify` that we trigger when a sink is being removed, or similar), but it's definitely a significant bump in complexity.
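A rough sketch of that per-sink cancellation idea, where `cancellable_send` and the `removed` handle are hypothetical names, not anything in the crate:

```rust
use std::future::Future;
use std::sync::Arc;
use tokio::sync::Notify;

// Race a sink's send future against its per-sink Notify. Returning None
// means the sink was removed and the blocked send was abandoned.
async fn cancellable_send<F>(send: F, removed: Arc<Notify>) -> Option<F::Output>
where
    F: Future,
{
    tokio::select! {
        res = send => Some(res),
        _ = removed.notified() => None,
    }
}
```

The reload path would hold the matching `Arc<Notify>` for each named sink and call `notify_one()` when removing it, which is where the extra complexity creeps in.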
This makes me wonder, actually, if we aren't overloading what the `Fanout` does. So, I can imagine a `Fanout` whose sole job is to process incoming `Vec<Event>` instances, and not control messages, working just fine, relatively simply. We stick the processing of control messages one level up and create brand new `Fanout` instances on reload, newly wired: shut down the old `Fanout`, plug in the stream to the new `Fanout`, and let it go.
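Roughly, the "dumb" fanout being imagined here might look like the following sketch, where `StaticFanout` and the `Event` stand-in are hypothetical, not the crate's actual types:

```rust
use futures::{Sink, SinkExt};

// Hypothetical stand-in for the real Event type in vector-core.
#[derive(Clone)]
struct Event(String);

// Knows nothing about control messages; it just pushes each batch into
// every sink it owns. Reload would be handled one level up by wiring a
// brand new instance into the stream.
struct StaticFanout {
    sinks: Vec<Box<dyn Sink<Event, Error = ()> + Unpin + Send>>,
}

impl StaticFanout {
    async fn send_batch(&mut self, events: Vec<Event>) -> Result<(), ()> {
        for sink in &mut self.sinks {
            for event in &events {
                sink.send(event.clone()).await?;
            }
        }
        Ok(())
    }
}
```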
That is an interesting idea. I still think you have some form of the same issue, where you want to be able to abort writes to a specific sink when it is removed. If `Fanout` held a fixed set of sinks and was replaced wholesale on reload, you wouldn't really have a way to know what it was blocking on when you went to replace it. You could have it expose that data somehow, but that feels like it starts to lose the nice layering and simplicity.
True. What I was thinking was that when the `Fanout` flips over into shutdown mode, it is responsible for attempting to poll every outstanding `Future` and cancelling those that aren't ready. But now that I think about it, that's dicey, because any `Event` we've pushed in has the potential to be lost, and I can think of a scenario where reloads happen rapidly enough that there's no actual `Event` transmission.
Soak Test Results
Baseline: a39a7a0
No interesting changes with confidence ≥ 90.00%.
8ac2b1c to 14af57d
14af57d to 4f810a2
Not much substantial to say about this approach, just a couple of nitpicks.
This commit is kind of a peer to #11197, in pursuit of #10144. The basic idea here is we want to get rid of the lock-step behavior that a strict Sink requires for fanning out, because we know that we have issues with slow receivers. This commit is not nearly so intrusive as #11197 but does allow for incremental progress among the cohort of downstream sinks.

I still have some tests commented out. The error propagation story here is yet to be worked on.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Co-authored-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
4f810a2 to 4b1e68a
Soak Test Results
Baseline: 468d525
No interesting changes with confidence ≥ 90.00%.
Co-authored-by: Bruce Guenter <bruce.guenter@datadoghq.com>
This commit is kind of a peer to #11197, in pursuit of #10144. The basic idea
here is we want to get rid of the lock-step behavior that a strict Sink requires
for fanning out, because we know that we have issues with slow receivers. This
commit is not nearly so intrusive as #11197 but does allow for incremental
progress among the cohort of downstream sinks.
I still have some tests commented out. The error propagation story here is yet
to be worked on.
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>