Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated checkpoints added when using retry #2774

Closed
osi opened this issue Sep 17, 2021 · 7 comments · Fixed by #2780
Closed

Repeated checkpoints added when using retry #2774

osi opened this issue Sep 17, 2021 · 7 comments · Fixed by #2780
Labels
type/enhancement A general enhancement
Milestone

Comments

@osi
Copy link
Contributor

osi commented Sep 17, 2021

Expected Behavior

There should be a single checkpoint

Actual Behavior

There are multiple checkpoints.

Steps to Reproduce

@Test
void reproCase() {
        StepVerifier.create(Flux.error(() -> new RuntimeException("rawr"))
                                .cache()
                                .checkpoint("nom")
                                .retryWhen(Retry.max(3))
                    )
                    .expectErrorSatisfies(t -> assertThat(Arrays.asList(t.getCause().getSuppressed()))
                        .hasSize(1)
                        .first(THROWABLE)
                        .hasMessage("\n" +
                                    "Error has been observed at the following site(s):\n" +
                                    "\t|_ checkpoint ⇢ nom\n" +
                                    "Stack trace:"))
                    .verify(Duration.ofSeconds(10));
}

It is generating multiple checkpoints. Only a single one is expected

	|_ checkpoint ⇢ nom
	|_____ checkpoint ⇢ nom
	|_________ checkpoint ⇢ nom
	|_____________ checkpoint ⇢ nom

Your Environment

  • Reactor version(s) used: 3.4.9

I observed this in our full application where the checkpointed stream is ultimately a r2dbc-postgresql notification stream prefixed by a query using mergeSequential. There was a problem connecting to the DB, and that was being retried. The same behavior was observed, and this is a distillation of that. As I don't have a full reproduction with our entire stack, my hope is that fixing this also fixes what we observed.

@reactorbot reactorbot added the ❓need-triage This issue needs triage, hasn't been looked at by a team member yet label Sep 17, 2021
@smaldini
Copy link
Contributor

In the FluxAssembly implementation, the operator is checking for the presence of checkpoint exception tuple. Problem is that i keeps being incremented and contains will always fail the check. I can't remember the rationale behind the increment but if the checkpoint is exactly identical (same parent, description) there shouldnt be a need for increment as part of the tuple. Without i, the tuple would match the correct hashcode.

@simonbasle simonbasle added status/need-decision This needs a decision from the team status/need-investigation This needs more in-depth investigation and removed ❓need-triage This issue needs triage, hasn't been looked at by a team member yet labels Sep 20, 2021
@simonbasle
Copy link
Member

simonbasle commented Sep 21, 2021

without much tests nor comments nor git commit comments to provide more context, I have trouble understanding the rationale behind this i either :(

removing it doesn't seem to fail any test.

On a side note, the additional entries to the backtrace are only there because the "rawr" exception is effectively passed to checkpoint's onError multiple time due to the cache().
It would have been equivalent to do Flux.onError(new RuntimeException("rawr")).

simonbasle added a commit that referenced this issue Sep 21, 2021
This commit removes the `i` counter that was effectively guaranteeing
duplicated entries in the backtrace (eg. with a checkpoint) in the case
where the triggering exception was reused. The only differentiator was
then the indentation (driven by `i`).

This is especially visible when using a `checkpoint(String)` operator
along a `retry`. Provided the same exception instance is emitted on each
retry loop, `checkpoint` will add to the backtrace on each loop.

This commit removes the `i` from the tracking of existing backtraces and
thus ensures that for the same parent subscriber, prefix and description
there will be only one entry in the backtrace.

Fixes #2774.
@simonbasle
Copy link
Member

actually, this one test fails:

public void testMultiReceiver() throws Exception {

are we losing any information by removing these extra-indented "FluxPublish -> at ..." entries ?

@simonbasle simonbasle linked a pull request Sep 22, 2021 that will close this issue
1 task
@simonbasle simonbasle added type/enhancement A general enhancement and removed status/need-investigation This needs more in-depth investigation labels Sep 24, 2021
@simonbasle simonbasle added this to the 3.3.21.RELEASE milestone Sep 24, 2021
@simonbasle
Copy link
Member

simonbasle commented Sep 24, 2021

I have a change that better cater to multiple paths in the Error has been observed at the following site(s) section, but I wonder how disruptive it it to change the representation of the tracebacks 🤔

To the best of my knowledge I don't know of a tool that would parse the traceback, but you never know.

Any insight to provide @osi @smaldini ?

(let's also cc @bsideup who worked on the agent and on tracebacks in the past, he might also have insights)

@simonbasle
Copy link
Member

to clarify, from #2780 the current proposal looks like this:

Assembly trace from producer [reactor.core.publisher.MonoDefer] :
	reactor.core.publisher.Mono.defer(Mono.java:213)
	reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:747)
Error has been observed at the following site(s):
	*____Mono.defer ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:747)
	*_____Flux.then ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:747)
	|_    Mono.flux ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:749)
	|_ Flux.publish ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:749) (observed 2 times)
	|_     Flux.map ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:751)
	*_____Flux.then ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:747)
	|_    Mono.flux ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:749)
	|_ Flux.publish ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:749) (observed 2 times)
	|_  Flux.filter ⇢ at reactor.core.publisher.HooksTest.testMultiReceiver(HooksTest.java:761)

@osi
Copy link
Contributor Author

osi commented Sep 24, 2021

@simonbasle I'm not parsing tracebacks, and don't have awareness of any tooling around it.

The new form looks great!!

@simonbasle simonbasle modified the milestones: 3.3.21.RELEASE, 3.4.11 Sep 27, 2021
@simonbasle simonbasle removed the status/need-decision This needs a decision from the team label Sep 27, 2021
@simonbasle
Copy link
Member

this will be re-targeted to 3.4.x only

simonbasle added a commit that referenced this issue Oct 1, 2021
This commit reworks how lines are added to the traceback second section,
which displays callsites where an exception has been propagated. It
rework the presentation of that section to cater for more complex cases.

As a benefit, it fixes a situation where an unbounded number of retries
over a singleton exception + a checkpoint leads to infinitely growing
traceback.

It also polishes HooksTraceTest to result in less spammy logs,
and updates the reference guide documentation to reflect the new style
(as well as fixing some outdated content).

The change especially helps when the exception is instantiated at
publisher level but is propagated to several subscribe paths (eg.
multiple chains assembled from the same source, or a single chain that
resubscribe like with a retry operator).

Previously, lines where effectively guaranteed to be added because the
stored Tuple4 included a `i` variable that was always incremented.

Now these traceback lines are stored in a `Map`, and effort is made to
detect "multicasting" of the error. Lines that appear more than once are
counted and this is reflected in the traceback line after the callsite
(` (observed x times)`).

An effort is made to group operators that belong to distinct chains
together under a "root", which get repeated and is denoted by a `*_`
prefix and `_` left-padding.

Instead of:
```
 |_      rootOp -> callsite1
 |_ chainOneOp1 -> callsite2
 |_ chainOneOp2 -> callsite3
 |_ chainTwoOp1 -> callsite4
```

We now get:
```
 |*______rootOp -> callsite1 (observed 2 times)
 |_ chainOneOp1 -> callsite2
 |_ chainOneOp2 -> callsite3
 |*______rootOp -> callsite1 (observed 2 times)
 |_ chainTwoOp1 -> callsite4
```

This makes it clearer that the error propagates through `rootOp`, which
is used to construct two distinct chains (and subscriptions to both
chains end up seeing the same instance of the error).

Fixes #2774.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement A general enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants