Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: retry and retry_async support streaming rpcs #495

Merged
merged 220 commits into from
Dec 12, 2023
Merged

Conversation

daniel-sanche
Copy link
Contributor

@daniel-sanche daniel-sanche commented Apr 7, 2023

From #485:

Server streaming libraries return an iterable that can asynchronously yield data from the backend over time. Some of our libraries need to provide a wrapper around the raw stream to do some local processing, before passing the data to the user.

It would be useful to wrap this whole pipeline in a retry decorator, so that if the stream breaks mid-way through, we can recover and continue yielding data through our generator as if nothing happened.

Unfortunately, the current implementation returns the result of the target function directly, so generators will not yield values and exceptions through the retry block

This PR addresses the issue by adding retry_target_generator functions to both the async and sync retry modules, which yield through the target rather than call it directly. Generator mode can be enabled using the is_generator argument on the decorator.

Fixes #485

@parthea
Copy link
Collaborator

parthea commented Dec 7, 2023

Can the breaking change be avoided by deprecating functionality instead of changing it?

@daniel-sanche
Copy link
Contributor Author

Can the breaking change be avoided by deprecating functionality instead of changing it?

Hmm yeah I guess it could, if we want to keep two versions of the AsyncRetry class hanging around for a while

As I mentioned above though, IMO this doesn't have to be a breaking change. The only breaking part is changing an undocumented behaviour, which would only manifest as a bug when used with our gapic libraries, which have their own timeout logic.

But I'll let you make that call. Let me know which option you prefer

@vchudnov-g
Copy link
Contributor

I would avoid having two copies of the Async Retry. As to whether we should flag this as a breaking change.... I want to say it's just a bug fix to the behavior but does not change the surface, so maybe we can call it non-breaking..... Not a 100% confident about this yet.

assert retry_._deadline == 4
assert retry_._on_error is _some_function
@pytest.mark.asyncio
async def test_retry_streaming_target_bad_sleep_generator():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is still in the unary_asyn test file.


with pytest.raises(exceptions.RetryError) as exc_info:
await retry_async.retry_target(target, predicate, range(10), deadline=10)
await retry_async.retry_target(target, predicate, range(10), **timout_kwarg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
await retry_async.retry_target(target, predicate, range(10), **timout_kwarg)
await retry_async.retry_target(target, predicate, range(10), **timeout_kwarg)


timeout_val = 10
# support "deadline" as an alias for "timeout"
timout_kwarg = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timout_kwarg = (
timeout_kwarg = (

Copy link
Contributor

@vchudnov-g vchudnov-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looking at this again to try to settle the breaking change determination. I have two additional code comments in the files.

On the breaking change determination:

  1. this PR makes streaming retries more reliable, so any code users previously had to manually enforce retries on streams will now be needed less or not at all, but it won't be broken by this change. Correct?
  2. The other change os that exceptions are propagated through the retry block, so users may now be seeing exceptions that they did not see before. Correct?

Item 1 by itself would suggest marking this non-breaking, though it may be good to call attention to the change so users can simplify parts of their code that may no longer be needed.
Item 2 seems like a breaking behavioral change that may surprise users.

I think we should flag this as a breaking change, and make it clear at the top of the PR description/merge-commit message that the only expected behavioral changes that may require action on the user's side are the new exceptions they may now receive from failing retries.

Thoughts? @daniel-sanche @parthea

"exponential_sleep_generator",
"if_exception_type",
"if_transient_error",
"_build_retry_error",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're exposing this, should we remove the leading private underscore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, removed

[List[Exception], RetryFailureReason, Optional[float]],
Tuple[Exception, Optional[Exception]],
] = _build_retry_error,
init_args: _P.args = (),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the the sync and async versions of retry_target_stream have init_args and init_kwargs but the unary retry_targets do not? Shouldn't we be consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to explain that in the comments here

Basically, every time you retry a unary rpc, it will use the same arguments, so it makes sense to lock them in to a partial before calling retry_target

With streaming rpcs, retries are more complicated, because we'd like to be able to recover the stream at the point it broke, instead of from the beginning. But the way to do that is different for each rpc. It's out of scope for this PR, but I can imagine at some point wanting to add a ResumptionStrategy class, and provide some kind of call-back that allows you to modify requests between attempts. For that, we need to keep the method and it's call args separate when starting retry_target_stream

We could change retry_target to work the same way to be consistent, but that would definitely be a breaking change. And it's unnecessary, because unary retries don't have the same problems as streams

TL;DR: Using a partial for retry_target_stream would paint us into a corner, so I'm trying to make things flexible for future improvements

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense. I wonder whether we should a brief note to that effect in the documentation for the two parameters ("these two parameters are provided in case the user needs to change the arguments in subsequent retry attempts"). But we can defer that for now.

@daniel-sanche
Copy link
Contributor Author

  1. this PR makes streaming retries more reliable, so any code users previously had to manually enforce retries on streams will now be needed less or not at all, but it won't be broken by this change. Correct?
  2. The other change os that exceptions are propagated through the retry block, so users may now be seeing exceptions that they did not see before. Correct?

I don't believe anything should change for users for either of these points. This PR is adding the new RetryStreaming and RetryStreamingAsync classes to api_core, but nothing will change downstream until we make changes in the downstream gapic_generator to use them.

gapic_generator currently uses the unary retries.Retry object for all rpcs, and that won't change after this PR is merged. Once this functionality is in place, the next step would be looking into bringing it to gapic (but that will actually be a whole separate project on its own, because streaming rpcs will require different semi-custom resumption strategies. I'm not currently planning on taking this on myself, but it ties in with work Leah Cole has been doing in Node)

The main motivation of this change is making the retry streaming functionality available for veneer developers, until we figure out how to automate it for streaming rpcs in gapic libraries


The part that I think could be considered breaking is

  1. AsyncRetry used to use asyncio.wait_for to cancel the attempt at the deadline, but that caused race-condition issues, and wasn't consistent with how the other retries worked. I removed that as part of the refactoring we did to share logic between all retry classes

Our options are:

a. mark this as a breaking change
b. consider it a non-breaking change, since it is an 1) undocuemnt behaviour change to 2) fix a bug 3) that we don't have reason to believe would impact the workflow of many users
c. undo some of the recent refactoring to make async retries work the same as before, and address the race condition (and potential breaking change issues) in a different PR, so we can merge this one now

I'm fine with any option, just let me know which you both prefer

@vchudnov-g
Copy link
Contributor

OK, I see.

  • On point 1, we agree.
  • On point 2: for unaries, we were propagating the exceptions before, so no change there. And for streaming: these are new retry classes, so they can't break anyone
  • On point 3: The behavior changes, but the exceptions that are raised out of the new retry code are exceptions like RetryError that the user should already be handling, and the current user code should be ready to handle either type of exception mentioned in the race condition in async retries timeout race condition #528 . In other words, existing user code will not be broken, even if the particular exceptions that happen to be raised under a given set of circumstances would be different with or without the changes in this PR.

If this is an accurate summary, then I think calling this PR non-breaking is fine, which would result in this feat becoming a new minor release. @parthea WDYT? Would you like to do a pre-release for this minor release? I think we don't need to, but I'm totally open to being extra-cautious. FWIW, the existing tests, though refactored, all pass with this new code.

Separately, I'll file an issue to have gapic-generator-python use the new retry streaming classes you implemented.

@vchudnov-g
Copy link
Contributor

One point of clarification: the current GAPICs do use AsyncRetry (link)

@daniel-sanche daniel-sanche changed the title feat!: retry and retry_async support streaming rpcs feat: retry and retry_async support streaming rpcs Dec 11, 2023
AsyncGenerator["_Y", None], target_iterator
).athrow(
sys.exception() # type: ignore
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like we could just use sys.exc_info()[1] to be compatible with all versions?

sys.exception() looks cleaner, but I don't think it's worth the extra noise for the next few years. (And there's a performance consideration to the version check too, although I don't think athrow should be called often)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a great idea. Let me try that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks. That worked. Merging.

@vchudnov-g vchudnov-g removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Dec 12, 2023
@vchudnov-g vchudnov-g merged commit 17ff5f1 into main Dec 12, 2023
28 checks passed
@vchudnov-g vchudnov-g deleted the retry_generators branch December 12, 2023 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kokoro:force-run Add this label to force Kokoro to re-run the tests. size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry decorator should support generator functions
5 participants