Add better error handling to ai binding #2103
Conversation
The changes make sense to me. Since the provided error includes the `name` and `description` fields, is `message` less relevant here?
Force-pushed from 75baabe to 7b72794
```diff
  this.lastRequestId = res.headers.get("cf-ai-req-id");

  if (inputs['stream']) {
    if (!res.ok) {
-     throw new InferenceUpstreamError(await res.text());
+     throw await this._parseError(res)
```
Has backwards compat been considered here? Are there any possible cases where people would previously be expecting a specific error string (to retry, log, etc.) and this might now change?
```diff
    }
    return res.body;
  } else {
    if (!res.ok || !res.body) {
-     throw new InferenceUpstreamError(await res.text());
+     throw await this._parseError(res)
```
It depends on how the customer is matching the strings. The main texts didn't change; for example, a check like `err.message.includes('timeout')` will continue to work. Since this change adds new information to the errors rather than removing any, customers shouldn't be affected.
That sounds like a breaking change then. If people are checking for a specific full-text error string (via `===` or similar) and that changes, their application could behave in different ways after this update. Are you certain no one is relying on this?
+1 on what James said, although DO docs at least include these errors, e.g. here: https://developers.cloudflare.com/durable-objects/observability/troubleshooting/#durable-object-is-overloaded. In general I'd still consider altering the error message/type/parts a breaking change, as for DOs I'm sure I'm not the only one testing with a `switch` or `===`.
I don't think this is a breaking change. Workers AI errors were largely undocumented and flaky so far, so developers should definitely not be matching strings; we can't support that. This PR is part of a larger effort to fix errors end to end; one of the goals is to document the errors and create error codes, which developers can then match against.
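One possible shape for such structured errors (purely illustrative — the class and field names below are hypothetical, not anything this PR adds) would expose a stable, documented code alongside the human-readable message, so applications match on the code rather than the unstable text:

```typescript
// Hypothetical error class with a stable, documented numeric code.
class AiBindingError extends Error {
  constructor(public code: number, message: string) {
    super(`${code}: ${message}`);
  }
}

// Applications can then branch on the code instead of the message text,
// e.g. to decide whether an error is worth retrying.
function isRetryable(err: unknown): boolean {
  return err instanceof AiBindingError && [3007, 3034].includes(err.code);
}
```

With this shape, `isRetryable(new AiBindingError(3007, "Request timeout"))` is true, and the message text can change freely without breaking callers.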
@Cherry TBH, none of the errors you list seem to me like things that an application could reasonably respond to in any way other than report an error to the user. Maybe they could influence the app's decision on whether to retry, but I find it pretty hard to say which ones would be retryable based on the text... Personally I would probably just retry all of them or nothing depending on how common errors are...
All the errors you listed above are errors not caused by the user, and in that case customers probably aren't matching strings against them, because they will probably want to retry the request regardless of the error message. For example, and let's be reasonable, why would a customer match exactly the string `err.message === 'ERROR 3034: Unknown internal error'`?
I would argue that `ERROR 3007: Request timeout` is a pretty good indication to retry, and `ERROR 3015: Error with the model repository` also seems to be one that's particularly transient and clears up after a retry. Obviously, none of this is documented, and therefore mostly anecdotal, but if folks have experimented with the API and hit these things, it's possible they're catching specific errors they have seen occur more than others.

Or, my code might specifically check for `ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large`, and then give the user a nicer message about what they uploaded.
> For example, and lets be reasonable, why would a customer match exactly this string `err.message === 'ERROR 3034: Unknown internal error'`?

In a reasonable scenario, I don't think they would catch that specific error alone if they understand its cause. But they might do something like this:

```ts
const errorsToRetry = [
  "ERROR 3025: Unknown internal error",
  "ERROR 3034: Unknown internal error",
  "ERROR 3040: Unknown internal error",
  "could not route request to AI model",
];

if (errorsToRetry.includes(err.message)) {
  // Retry the request
}
```

after observing these as transient errors when calling the Workers AI API, wanting to limit the errors their own users see. Of course you could argue they should retry all errors, but if they've observed `ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large`, or other non-transient errors, they might specifically be handling that case to report a problem to their own users.
I don't think this is an unlikely scenario if anyone has been using a product for a few months and responding to different errors they've observed in their own logging. But, I trust that you all are going to be most familiar with what your customers are doing on your API and will make the right call here.
Retryability in particular is an interesting case.
Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.
Of course, that's not the case today. But I would say that any retry logic like this is obviously a best-effort thing anyway. Breaking someone's retry heuristic doesn't really break their app. It might increase their rate of serving errors to end users slightly. I'd say if the only way that people are inspecting errors is to decide whether to retry, there's still room for some judgement call about how important it is to keep that logic working. (In making this judgment I would factor in how common errors are today and how commonly retries actually work.)
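A best-effort retry heuristic of the kind being discussed might look like this sketch (the helper name, attempt count, and delay are arbitrary choices for illustration; whether an error is retryable is delegated to the caller's predicate):

```typescript
// Generic best-effort retry wrapper: re-invokes a failing async call up to
// `attempts` times with a fixed delay, rethrowing the last error on exhaustion.
async function withRetries<T>(
  fn: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  attempts = 3,
  delayMs = 250
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Non-retryable errors surface immediately to the caller.
      if (!shouldRetry(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastErr;
}
```

The point made above still applies: this is a heuristic, and breaking the predicate only degrades the retry rate rather than breaking the app outright.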
For reference, errors are relatively common with Workers AI: https://grafana.chaika.me/public-dashboards/82c159675de54d62b42e60e5238e28c5?orgId=1. Examples from Cloudflare's own devrel often include retry logic, in fact, given how common errors are.

> Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.

Agreed, this would definitely be the most ideal scenario.
src/cloudflare/internal/ai-api.ts (Outdated)
```ts
  try {
    const content = (await res.json()) as AiError;
    return new InferenceUpstreamError(`${content.internalCode}: ${content.description}`, content.name);
  } catch {
    return new InferenceUpstreamError(await res.text());
  }
```
Hey 👋, @Cherry sent me this PR, and while browsing through I noticed that this wouldn't actually work. Even if the JSON parsing fails, such as on an empty response body, the body would already be consumed, and this would throw again.
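One common way to avoid the double-read (a sketch under assumed types, not necessarily the fix the PR actually landed) is to consume the body exactly once as text and then attempt `JSON.parse` on that string, so the fallback path never touches the one-shot stream again:

```typescript
// AiError and InferenceUpstreamError are stand-ins for the real types in
// ai-api.ts; their shapes here are assumptions for the sake of the sketch.
interface AiError {
  internalCode: number;
  description: string;
  name: string;
}

class InferenceUpstreamError extends Error {
  constructor(message: string, public errorName?: string) {
    super(message);
  }
}

async function parseError(res: Response): Promise<InferenceUpstreamError> {
  // Read the body exactly once; Response bodies are one-shot streams.
  const text = await res.text();
  try {
    const content = JSON.parse(text) as AiError;
    return new InferenceUpstreamError(
      `${content.internalCode}: ${content.description}`,
      content.name
    );
  } catch {
    // JSON.parse failed (e.g. empty or non-JSON body); the raw text is
    // still usable because we never read from res a second time.
    return new InferenceUpstreamError(text);
  }
}
```

(`Response.clone()` before the `json()` attempt is another way to get the same effect.)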
Thanks for pointing that out, I've fixed the issue and added a unit test to cover it.
Force-pushed from f63842f to f58baa5
Force-pushed from f58baa5 to 2a249c0