Add better error handling to ai binding #2103

Merged

1 commit merged into cloudflare:main from improve-ai-errors-2 on May 13, 2024
Conversation

G4brym
Member

@G4brym G4brym commented May 9, 2024

No description provided.

@G4brym G4brym requested review from a team as code owners May 9, 2024 15:55
@G4brym G4brym requested review from byule and fhanau May 9, 2024 15:55
Collaborator

@fhanau fhanau left a comment

The changes make sense to me. Since the provided error includes the name and description fields, is message less relevant here?
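For reference, a sketch of the error payload shape under discussion, inferred from the fields read in the _parseError snippet later in this PR (the exact upstream JSON schema is an assumption, not an official definition):

```ts
// Assumed shape of the upstream error payload, based on the fields the new
// _parseError code reads (internalCode, description, name); not an official schema.
interface AiError {
  internalCode: number;
  description: string;
  name: string;
}
```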

@G4brym G4brym force-pushed the improve-ai-errors-2 branch 5 times, most recently from 75baabe to 7b72794 on May 10, 2024 16:48

this.lastRequestId = res.headers.get("cf-ai-req-id");

if (inputs['stream']) {
  if (!res.ok) {
-   throw new InferenceUpstreamError(await res.text());
+   throw await this._parseError(res)
Contributor

Has backwards compat been considered here? Are there any possible cases where people would previously be expecting a specific error string (to retry, log, etc.) and this might now change?

  }

  return res.body;

} else {
  if (!res.ok || !res.body) {
-   throw new InferenceUpstreamError(await res.text());
+   throw await this._parseError(res)
Contributor

Has backwards compat been considered here? Are there any possible cases where people would previously be expecting a specific error string (to retry, log, etc.) and this might now change?

Member Author

It depends on how the customer is matching the strings. The main texts didn't change; for example, if you are checking for a match with err.message.includes('timeout'), this will continue to work.
Since this adds new information to the errors rather than removing any, customers shouldn't be affected.
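To illustrate the point (a minimal sketch; the model name and call shape are illustrative and not taken from this PR):

```ts
// Substring checks keep working because the original text is still present
// inside the new, richer error message.
try {
  await env.AI.run("@cf/meta/llama-2-7b-chat-int8", { prompt: "Hello" });
} catch (err) {
  if (err instanceof Error && err.message.includes("timeout")) {
    // Matches both the old "Request timeout" style message and any new
    // message that merely adds context around it.
  }
}
```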

Contributor

@Cherry Cherry May 13, 2024

That sounds like a breaking change then. If people are checking for a specific full error string (via === or similar) and that changes, their application could behave differently after this update.

Are you certain no one is relying on this?

Contributor

+1 on what James said, although the DO docs at least include these errors, e.g. here: https://developers.cloudflare.com/durable-objects/observability/troubleshooting/#durable-object-is-overloaded. In general I'd still consider altering the error message/type/parts a breaking change; for DOs I'm sure I'm not the only one testing with a switch or ===.

@celso celso May 13, 2024

I don't think this is a breaking change. Workers AI errors have been largely undocumented and flaky so far; developers should definitely not be matching strings, and we can't support that. This PR is part of a larger effort to fix errors end to end; one of the goals is to document the errors and create error codes, which developers can then match against.
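As a purely hypothetical sketch of where that effort could lead (nothing here is a committed API; the `code` property and its values are assumptions for illustration only):

```ts
// Hypothetical future: match on a documented, stable error code instead of
// the full message text. `code` is an assumed property, not today's binding.
async function runModel(ai: { run(model: string, inputs: object): Promise<unknown> }, model: string, inputs: object) {
  try {
    return await ai.run(model, inputs);
  } catch (err: any) {
    switch (err.code) {
      case 3007: // request timeout — plausibly worth one retry
      case 3034: // unknown internal error — plausibly worth one retry
        return await ai.run(model, inputs);
      default:
        throw err;
    }
  }
}
```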

Member

@Cherry TBH, none of the errors you list seem to me like things that an application could reasonably respond to in any way other than report an error to the user. Maybe they could influence the app's decision on whether to retry, but I find it pretty hard to say which ones would be retryable based on the text... Personally I would probably just retry all of them or nothing depending on how common errors are...

Member Author

All the errors you listed above are errors not caused by the user, and in that case customers probably aren't matching strings against them, because they will probably want to retry the request regardless of the error message.

For example, and let's be reasonable, why would a customer match exactly this string: err.message === 'ERROR 3034: Unknown internal error'?

Contributor

I would argue that ERROR 3007: Request timeout is a pretty good indication to retry, and ERROR 3015 Error with the model repository also seems to be one that's particularly transient and clears up after a retry. Obviously, none of this is documented, and therefore mostly anecdotal, but if folks have experimented with the API and hit these things, it's possible they're catching specific errors they have seen occur more than others.

Or, my code might specifically check for ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large, and then give the user a nicer message about what they uploaded.

> For example, and let's be reasonable, why would a customer match exactly this string: err.message === 'ERROR 3034: Unknown internal error'?

In a reasonable scenario, I don't think they would catch that specific error alone if they understand its cause. But they might do something like this:

const errorsToRetry = [
	"ERROR 3025: Unknown internal error",
	"ERROR 3034: Unknown internal error",
	"ERROR 3040: Unknown internal error",
	"could not route request to AI model",
]

if (errorsToRetry.includes(err.message)) {
	// Retry the request
}

after observing these as transient errors when calling the Workers AI API, wanting to limit the errors their own users see. Of course, you could argue they should retry all errors, but if they've observed ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large, or other non-transient errors, they might specifically be handling this case to report a problem to their own users.

I don't think this is an unlikely scenario if anyone has been using a product for a few months and responding to different errors they've observed in their own logging. But, I trust that you all are going to be most familiar with what your customers are doing on your API and will make the right call here.

Member

Retryability in particular is an interesting case.

Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.

Of course, that's not the case today. But I would say that any retry logic like this is obviously a best-effort thing anyway. Breaking someone's retry heuristic doesn't really break their app. It might increase their rate of serving errors to end users slightly. I'd say if the only way that people are inspecting errors is to decide whether to retry, there's still room for some judgement call about how important it is to keep that logic working. (In making this judgment I would factor in how common errors are today and how commonly retries actually work.)
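As a concrete example of the kind of best-effort heuristic being discussed (a sketch only; the attempt count and backoff are arbitrary, and nothing here distinguishes retryable from non-retryable errors):

```ts
// Best-effort retry wrapper: retries every failure a bounded number of times,
// since without documented error codes there is no reliable retryability signal.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Simple exponential backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** i));
    }
  }
  throw lastErr;
}
```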

Contributor

For reference, errors are relatively common with Workers AI: https://grafana.chaika.me/public-dashboards/82c159675de54d62b42e60e5238e28c5?orgId=1. In fact, examples from Cloudflare's own devrel often include retry logic because of how common they are.

> Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.

Agreed, this would definitely be the most ideal scenario.

Outdated review threads on src/cloudflare/internal/d1-api.ts, src/cloudflare/internal/sockets.d.ts, and src/cloudflare/internal/vectorize.d.ts were resolved.
  const content = (await res.json()) as AiError
  return new InferenceUpstreamError(`${content.internalCode}: ${content.description}`, content.name);
} catch {
  return new InferenceUpstreamError(await res.text());
Contributor

Hey 👋,

@Cherry sent me this PR, and while browsing through I noticed that this wouldn't actually work. If the JSON parsing fails, such as on an empty response body, the body would already be consumed and this would throw again.

Member Author

Thanks for pointing that out; I've fixed the issue and added a unit test to cover it.
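For reference, a minimal sketch of one way to avoid consuming the body twice (an assumption about the shape of the fix, not necessarily what landed in this PR):

```ts
async function parseError(res: Response): Promise<InferenceUpstreamError> {
  const raw = await res.text(); // consume the body exactly once
  try {
    const content = JSON.parse(raw) as AiError;
    return new InferenceUpstreamError(
      `${content.internalCode}: ${content.description}`,
      content.name
    );
  } catch {
    // Fall back to the raw text (possibly empty) when it isn't valid JSON.
    return new InferenceUpstreamError(raw);
  }
}
```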

@G4brym G4brym force-pushed the improve-ai-errors-2 branch 2 times, most recently from f63842f to f58baa5 on May 10, 2024 17:17
@edevil edevil merged commit c9c089c into cloudflare:main May 13, 2024
10 checks passed