Add better error handling to ai binding #2103

Merged

1 commit merged into cloudflare:main from improve-ai-errors-2 on May 13, 2024
Conversation

G4brym
Member

@G4brym G4brym commented May 9, 2024

No description provided.

@G4brym G4brym requested review from a team as code owners May 9, 2024 15:55
@G4brym G4brym requested review from byule and fhanau May 9, 2024 15:55
Collaborator

@fhanau fhanau left a comment

The changes make sense to me. Since the provided error includes the name and description fields, is message less relevant here?
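For reference, a sketch of the error payload shape under discussion, inferred from the fields read in the _parseError snippet later in this PR (the exact upstream JSON schema is an assumption, not an official definition):

```ts
// Assumed shape of the upstream error payload, based on the fields the new
// _parseError code reads (internalCode, description, name); not an official schema.
interface AiError {
  internalCode: number;
  description: string;
  name: string;
}
```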

@G4brym G4brym force-pushed the improve-ai-errors-2 branch 5 times, most recently from 75baabe to 7b72794 on May 10, 2024 16:48

this.lastRequestId = res.headers.get("cf-ai-req-id");

if (inputs['stream']) {
  if (!res.ok) {
-   throw new InferenceUpstreamError(await res.text());
+   throw await this._parseError(res)
Contributor

Has backwards compat been considered here? Are there any possible cases where people would previously be expecting a specific error string (to retry, log, etc.) and this might now change?

  }

  return res.body;

} else {
  if (!res.ok || !res.body) {
-   throw new InferenceUpstreamError(await res.text());
+   throw await this._parseError(res)
Contributor

Has backwards compat been considered here? Are there any possible cases where people would previously be expecting a specific error string (to retry, log, etc.) and this might now change?

Member Author

It depends on how the customer is matching the strings. The main texts didn't change; for example, if you are checking for a match with err.message.includes('timeout'), this will continue to work.
Since this adds new information to the errors rather than removing any, customers shouldn't be affected.
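To illustrate the point (a minimal sketch; the model name and call shape are illustrative and not taken from this PR):

```ts
// Substring checks keep working because the original text is still present
// inside the new, richer error message.
try {
  await env.AI.run("@cf/meta/llama-2-7b-chat-int8", { prompt: "Hello" });
} catch (err) {
  if (err instanceof Error && err.message.includes("timeout")) {
    // Matches both the old "Request timeout" style message and any new
    // message that merely adds context around it.
  }
}
```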

Contributor

@Cherry Cherry May 13, 2024

That sounds like a breaking change then. If people are checking for a specific full error string (via === or similar) and that changes, their application could behave differently after this update.

Are you certain no one is relying on this?

Contributor

+1 on what James said, although the DO docs at least include these errors, e.g. here: https://developers.cloudflare.com/durable-objects/observability/troubleshooting/#durable-object-is-overloaded. In general I'd still consider altering the error message/type/parts a breaking change; for DOs I'm sure I'm not the only one testing with a switch or ===.

@celso celso May 13, 2024

I don't think this is a breaking change. Workers AI errors have been largely undocumented and flaky so far; developers should definitely not be matching strings, and we can't support that. This PR is part of a larger effort to fix errors end to end; one of the goals is to document the errors and create error codes, which developers can then match against.
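As a purely hypothetical sketch of where that effort could lead (nothing here is a committed API; the `code` property and its values are assumptions for illustration only):

```ts
// Hypothetical future: match on a documented, stable error code instead of
// the full message text. `code` is an assumed property, not today's binding.
async function runModel(ai: { run(model: string, inputs: object): Promise<unknown> }, model: string, inputs: object) {
  try {
    return await ai.run(model, inputs);
  } catch (err: any) {
    switch (err.code) {
      case 3007: // request timeout — plausibly worth one retry
      case 3034: // unknown internal error — plausibly worth one retry
        return await ai.run(model, inputs);
      default:
        throw err;
    }
  }
}
```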

Member

@Cherry TBH, none of the errors you list seem to me like things that an application could reasonably respond to in any way other than report an error to the user. Maybe they could influence the app's decision on whether to retry, but I find it pretty hard to say which ones would be retryable based on the text... Personally I would probably just retry all of them or nothing depending on how common errors are...

Member Author

All the errors you listed above are errors not caused by the user, and in that case customers probably aren't matching strings against them, because they will probably want to retry the request regardless of the error message.

For example, and let's be reasonable, why would a customer match exactly this string: err.message === 'ERROR 3034: Unknown internal error'?

Contributor

I would argue that ERROR 3007: Request timeout is a pretty good indication to retry, and ERROR 3015 Error with the model repository also seems to be one that's particularly transient and clears up after a retry. Obviously, none of this is documented, and therefore mostly anecdotal, but if folks have experimented with the API and hit these things, it's possible they're catching specific errors they have seen occur more than others.

Or, my code might specifically check for ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large, and then give the user a nicer message about what they uploaded.

> For example, and let's be reasonable, why would a customer match exactly this string: err.message === 'ERROR 3034: Unknown internal error'?

In a reasonable scenario, I don't think they would catch that specific error alone if they understand its cause. But they might do something like this:

const errorsToRetry = [
	"ERROR 3025: Unknown internal error",
	"ERROR 3034: Unknown internal error",
	"ERROR 3040: Unknown internal error",
	"could not route request to AI model",
]

if (errorsToRetry.includes(err.message)) {
	// Retry the request
}

after observing these as transient errors when calling the Workers AI API, wanting to limit the errors their own users see. Of course, you could argue they should retry all errors, but if they've observed ERROR 3010: Invalid or incomplete input for the model: failed to decode JSON: Request is too large, or other non-transient errors, they might specifically be handling this case to report a problem to their own users.

I don't think this is an unlikely scenario if anyone has been using a product for a few months and responding to different errors they've observed in their own logging. But, I trust that you all are going to be most familiar with what your customers are doing on your API and will make the right call here.

Member

Retryability in particular is an interesting case.

Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.

Of course, that's not the case today. But I would say that any retry logic like this is obviously a best-effort thing anyway. Breaking someone's retry heuristic doesn't really break their app. It might increase their rate of serving errors to end users slightly. I'd say if the only way that people are inspecting errors is to decide whether to retry, there's still room for some judgement call about how important it is to keep that logic working. (In making this judgment I would factor in how common errors are today and how commonly retries actually work.)
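As a concrete example of the kind of best-effort heuristic being discussed (a sketch only; the attempt count and backoff are arbitrary, and nothing here distinguishes retryable from non-retryable errors):

```ts
// Best-effort retry wrapper: retries every failure a bounded number of times,
// since without documented error codes there is no reliable retryability signal.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Simple exponential backoff between attempts.
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** i));
    }
  }
  throw lastErr;
}
```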

Contributor

For reference, errors are relatively common with Workers AI: https://grafana.chaika.me/public-dashboards/82c159675de54d62b42e60e5238e28c5?orgId=1. In fact, examples from Cloudflare's own devrel often include retry logic because of how common they are.

> Frankly, we should design our APIs to retry automatically if it's safe and effective to do so. If we do that, then apps can assume any API error is not retryable.

Agreed, this would definitely be the most ideal scenario.

Outdated review threads on src/cloudflare/internal/d1-api.ts, src/cloudflare/internal/sockets.d.ts, and src/cloudflare/internal/vectorize.d.ts were resolved.
  const content = (await res.json()) as AiError
  return new InferenceUpstreamError(`${content.internalCode}: ${content.description}`, content.name);
} catch {
  return new InferenceUpstreamError(await res.text());
Contributor

Hey 👋,

@Cherry sent me this PR, and while browsing through I noticed that this wouldn't actually work. If the JSON parsing fails, such as on an empty response body, the body would already be consumed and this would throw again.

Member Author

Thanks for pointing that out; I've fixed the issue and added a unit test to cover it.
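For reference, a minimal sketch of one way to avoid consuming the body twice (an assumption about the shape of the fix, not necessarily what landed in this PR):

```ts
async function parseError(res: Response): Promise<InferenceUpstreamError> {
  const raw = await res.text(); // consume the body exactly once
  try {
    const content = JSON.parse(raw) as AiError;
    return new InferenceUpstreamError(
      `${content.internalCode}: ${content.description}`,
      content.name
    );
  } catch {
    // Fall back to the raw text (possibly empty) when it isn't valid JSON.
    return new InferenceUpstreamError(raw);
  }
}
```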

@G4brym G4brym force-pushed the improve-ai-errors-2 branch 2 times, most recently from f63842f to f58baa5 on May 10, 2024 17:17
@edevil edevil merged commit c9c089c into cloudflare:main May 13, 2024
10 checks passed