Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP Client spans - Span Status for 4xx errors - Change from MUST to SHOULD #1003

Open
tj-cisco opened this issue May 3, 2024 · 18 comments
Open
Assignees
Labels
bug Something isn't working triage:needs-triage

Comments

@tj-cisco
Copy link

tj-cisco commented May 3, 2024

Area(s)

No response

Is your change request related to a problem? Please describe.

Currently the convention states that all 4xx errors in a client span MUST have their status set to ERROR. This introduces high telemetry span failure rates in REST APIs that use 4xx codes to represent things that are not failures in the context of telemetry. For example in a CRUD API a 404 response to a GET for a resource is a perfectly normal response code that is not indicative of an error but communicates to the client that the record they are looking for does not exist and and should create it if they require it to exist. Another common use is 429 to enforce rate limiting, this may or may not be considered an error depending on the API consumer implementation, it should be left to the API consumer to generate an error span if they consider rate limiting an error.

Describe the solution you'd like

Change MUST to SHOULD in the following sentence for client spans:

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.

Resulting in:

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and SHOULD be set to Error in case of SpanKind.CLIENT.

Describe alternatives you've considered

No response

Additional context

No response

@gregkalapos
Copy link
Member

I think SemConv should be driven by the underlying spec, in this case by the RFC that defines the status codes, which says that 4xx are client side errors:

Client Error 4xx
The 4xx (Client Error) class of status code indicates that the client
seems to have erred.

That is the only thing that globally applies to all HTTP requests. The use case that you raise (e.g. CRUD API not considering a 404 as an error) is I'd say a higher level interpretation of the status code - from the HTTP libraries point of view it's still an error, but the higher level application layer doesn't interpret it as an error, because it can properly handle it. But from HTTP's point of view it's still an error.

I think this would be better served by filtering in the backend - if you list errors, and your application doesn't interpret 404s as errors, then you could just filter those out from the list of errors. The backend is a place where you have the option to defines which specific 4xx client side HTTP spans are not errors from your application's point of view, and then slice and dice traces accordingly. I don't the SemConv can accommodate this.

@tj-cisco
Copy link
Author

tj-cisco commented May 6, 2024

I think using should vs must and having a default of error for implementations is a good starting point. However 4xx responses are incredibly gray when it comes to 'is that an error', and a lot of this comes from the use context. In the perspective of a rest client the api/client/developer may defined the error status as:

  • 401 - Unknown (no status, subject to next request?)
  • 403 - Error
  • 404 - OK
  • 405 - Error
  • 408 - Error
  • 429 - Unknown, potentially an error if it's happening at a high rate

However in the context of a ui/browser everything except 401 would be considered an error.

This request is to allow client span generators to add additional context to the span.

Below is the Elastic APM interpretation of client spans and their default status, they allow for interpretation of a 4xx response not being an error.

For HTTP spans (from the client perspective), the span's outcome should be set to "success" if the status code is lower than 400 and to "failure" otherwise.

https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-http.md#outcome

@gregkalapos
Copy link
Member

Below is the Elastic APM interpretation of client spans and their default status, they allow for interpretation of a 4xx response not being an error.

For HTTP spans (from the client perspective), the span's outcome should be set to "success" if the status code is lower than 400 and to "failure" otherwise.

https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-http.md#outcome

That spec is used by all Elastic APM Agents, and basically all set client spans with HTTP 4xx to failure - that should is actually a must 🙃 . What I know for sure is that there is no Elastic APM agent that would accommodate the custom interpretation of 4xx status codes - all 4xx are treated the same.

This request is to allow client span generators to add additional context to the span.

Yeah, I understand the intention - I just feel it'd make SemConv less useful and most instrumentations won't benefit. But that's just my opinion - let's see what others think.

Are there any widely used libraries where you could have a custom interpretation (like the list you had above) of the different 4xx status codes?

@lmolkova lmolkova added bug Something isn't working and removed enhancement New feature or request labels May 8, 2024
@lmolkova
Copy link
Contributor

lmolkova commented May 8, 2024

Regardless of the default, it won't work for some applications. 404 is an error in some cases and not an error in others, sometimes depending on the caller context.

But I agree that a MUST is not right. Specialized or opinionated instrumentations should be able to use different classification of errors and users are encouraged to override status in the processors (UPDATE but currently impossible #1003 (comment) 🤦‍♀️).

I'd support SHOULD and would consider it as a bug fix. It would also not change the behavior of any OTel HTTP instrumentations.

@tj-cisco would you be able to send a PR?

@jjatria
Copy link

jjatria commented May 8, 2024

I've been following this conversation since this started as a feature request on the Perl OTel instrumentation I maintain. At the risk of sounding off-topic, I have a question about the expected consequences of this change.

@lmolkova says that a change like the one proposed here would "not change the behaviour of any OTel HTTP instrumentation", and I agree: even with a change like this one, I don't think I'd have accepted a change like the one that originally lead to this conversation, at least not without some additional clarification on the spec.

But this made me think twice:

users are encouraged to override status in the processors

I like this idea, but I'm not sure how the specification supports this. Unless I'm reading it incorrectly, any change in status would happen in the processor's onEnd method, but that is called "after a span is ended". And as per the documentation for the span's IsRecording method, "after a Span is ended, it SHOULD become non-recording", at which point "all [the data provided by methods like setStatus] is discarded right away".

So the processor can change the status, but the span will not be recording at that point, so that change will be discarded. Or did I get this wrong? I'd be happy to move this to a different thread if it makes things easier to manage.

@jsuereth
Copy link
Contributor

jsuereth commented May 8, 2024

My take here -

We can give users flexibility in the Status Code -> Span Status mapping.

However, at a macro level -

  1. Server-side spans should only default failure on 5xx
  2. Client-side spans can default to failure on 5xx/4xx, but can allow configuration to denote certain 4xx as not errors (so users can control this behavior).
  3. Clients that have retry / correction on spans would NOT mark an outer/internal span as failed automatically if a client-side span fails. Interpreting a trace, if you see a failed client span owned by a non-failed internal or service span, you can infer the client recovered from whatever failure it received (like rate limiting, etc.). I think it's still helpful to know if you're seeing an inordinate amount of rate-limited client calls, e.g.

That infers the following:

  • Server spans should NEVER be marked as failure on 4xx - this is legit behavior from illegitimate traffic.
  • Vendors / Users of client-side spans should look at the overall trace to understand overall failure vs. a single span.

Overall I'm ok moving from MUST to SHOULD, but I still think what we have now is reasonable.

@austinlparker
Copy link
Member

As long as we're not changing the behavior of existing emitted telemetry, then this seems fine as a bug fix.

@lmolkova
Copy link
Contributor

lmolkova commented May 8, 2024

I like this idea, but I'm not sure how the specification supports this. Unless I'm reading it incorrectly, any change in status would happen in the processor's onEnd method, but that is called "after a span is ended". And as per the documentation for the span's IsRecording method, "after a Span is ended, it SHOULD become non-recording", at which point "all [the data provided by methods like setStatus] is discarded right away".

@jjatria oh snap, you're absolutely right and there is no way to change span status in the processor. I think the best thread to discuss it in is open-telemetry/opentelemetry-specification#1089

@lmolkova
Copy link
Contributor

lmolkova commented May 8, 2024

Clients that have retry / correction on spans would NOT mark an outer/internal span as failed automatically if a client-side span fails. Interpreting a trace, if you see a failed client span owned by a non-failed internal or service span, you can infer the client recovered from whatever failure it received (like rate limiting, etc.). I think it's still helpful to know if you're seeing an inordinate amount of rate-limited client calls, e.g.

@jsuereth it still seems to be problematic and we get A LOT of negative user feedback about this Azure/azure-sdk-for-net#29132

if users do something like createIfNotExist they don't want to see 404 as an error. It still confuses them and automated tooling that blindly shows error rates.

The perfect solution would be that client libraries/application would be able to dynamically decide if inner HTTP span status is an error, but we're really far from it.

TL;DR: The MUST is a problem because even client libs which do HTTP instrumentation and know 404 is not an error can't classify it as such.

@gregkalapos
Copy link
Member

I'd expand on the comment from @jjatria about the already opened PR for perl: I totally acknowledge the use-case described here, and it's ok if we try to make SemConv allow for those use-cases.

The biggest issue I see with just making it from MUST to SHOULD is exactly PRs like the perl PR above: by doing so, anyone could open a PR against any of the well known general purpose HTTP libraries and change the meaning of all 4xx status codes from error to non-error. That would be totally compliant with SemConv since it's just a SHOULD, but at the same time that would open up the possibility that instrumentations of different general purpose HTTP libraries may end up behaving differently (some accept such PR, some don't). I don't think that would be ideal.

If people agree that this is a potential risk we care about, then maybe a suggestion: We could slice the spec into 2 parts:

  • Base case where we follow HTTP conventions because we don't know anything about the higher layer. This would apply to the general purpose HTTP libraries and cover PRs like the one above. For such cases it's still a MUST. This will apply to most instrumentations.
  • And then specialized use-cases where the instrumentation knows the meaning of the HTTP response codes from the application's point of view. In that case, SHOULD is fine and that part would be where we accommodate for the use-case discussed here.

@jjatria
Copy link

jjatria commented May 9, 2024

I confess I don't know much about the full downstream consequences of a span being marked as OK or as an error, but if the problem is as @lmolkova states that

[users] don't want to see 404 as an error. It still confuses them and automated tooling that blindly shows error rates.

then I feel like this is not a problem with the spans, but with the way the tooling around them interprets them.

I think on this I side with @jsuereth in that

Vendors / Users of client-side spans should look at the overall trace to understand overall failure vs. a single span.

I don't think the MUST is a bug, but don't object to it being a SHOULD. However, I'm not convinced that making the change suggested here would have the consequences that motivate it.

@pyohannes
Copy link
Contributor

if users do something like createIfNotExist they don't want to see 404 as an error. It still confuses them and automated tooling that blindly shows error rates.

Is it our aim to support automated tooling that blindly shows error rates? It definitely should be our aim to provide clarity to confused users, but there might be other ways to do that besides putting a band-aid on HTTP semantic conventions.

The problem is more general: an error might happen at any layer in an application, and it can be handled successfully on a different layer (or even be expected by that different layer). That should be clear from looking at a trace: in the example @lmolkova gave some HTTP operations finished with an error, while the parent createIfNotExist operation finished successfully.

Confusion on the user side might come from the notion that any errors happening in a trace are problematic. If we want to support this notion, than fixes like this are the logical consequence.

I think solving this problem on the instrumentation level will only work for a few cases, as it's unrealistic to expect that the instrumented library can know whether an error is expected or handled by any application layer above it.

@lmolkova
Copy link
Contributor

lmolkova commented May 9, 2024

I don't want to dismiss overwhelming user feedback.

The retried connection drop during an operation is fine to mark as an error, 429 or 503 is fine too. What my example shows that some 'error' are not errors.

The 404 during createIfNotExist is not an error. What happens now that spec does not let me change it.
We don't need to solve the bigger problem to fix it.

@trask
Copy link
Member

trask commented May 9, 2024

The biggest issue I see with just making it from MUST to SHOULD is exactly PRs like the perl PR above: by doing so, anyone could open a PR against any of the well known general purpose HTTP libraries and change the meaning of all 4xx status codes from error to non-error. That would be totally compliant with SemConv since it's just a SHOULD, but at the same time that would open up the possibility that instrumentations of different general purpose HTTP libraries may end up behaving differently (some accept such PR, some don't).

@gregkalapos does #281 address your concerns?

@tj-cisco
Copy link
Author

tj-cisco commented May 9, 2024

https://github.com/elastic/apm/blob/main/specs/agents/tracing-instrumentation-http.md#outcome

That spec is used by all Elastic APM Agents, and basically all set client spans with HTTP 4xx to failure - that should is actually a must 🙃 . What I know for sure is that there is no Elastic APM agent that would accommodate the custom interpretation of 4xx status codes - all 4xx are treated the same.

In this case the documentation must be updated with rfc2119 in mind.

The biggest issue I see with just making it from MUST to SHOULD is exactly PRs like the perl PR above: by doing so, anyone could open a PR against any of the well known general purpose HTTP libraries and change the meaning of all 4xx status codes from error to non-error. That would be totally compliant with SemConv since it's just a SHOULD, but at the same time that would open up the possibility that instrumentations of different general purpose HTTP libraries may end up behaving differently (some accept such PR, some don't). I don't think that would be ideal.

My PR here was the incorrect approach, if this change get accepted then I will create a way to pass a mapping of response codes and what the span status should be. I agree that blindly changing the all responses to be none errors is the wrong solution.

The 404 during createIfNotExist is not an error. What happens now that spec does not let me change it. We don't need to solve the bigger problem to fix it.

404 not being an error seems to be a fairly consistent theme here:

elastic/apm-agent-python#1760
https://discuss.elastic.co/t/apm-capture-as-error-elasticsearch-get-404-on-client/248121/2

It seems that people are really not interested in seeing the noise in the reporting tools from none errors being reported as errors and the related alert fatiuge associated with them.

@gregkalapos
Copy link
Member

The biggest issue I see with just making it from MUST to SHOULD is exactly PRs like the perl PR above: by doing so, anyone could open a PR against any of the well known general purpose HTTP libraries and change the meaning of all 4xx status codes from error to non-error. That would be totally compliant with SemConv since it's just a SHOULD, but at the same time that would open up the possibility that instrumentations of different general purpose HTTP libraries may end up behaving differently (some accept such PR, some don't). I don't think that would be ideal.

My PR here was the incorrect approach, if this change get accepted then I will create a way to pass a mapping of response codes and what the span status should be. I agree that blindly changing the all responses to be none errors is the wrong solution.

I completely agree with what you say here. At the same time, I'd still remind that just by changing from MUST to SHOULD you open up the door for PRs like your original one (that is blindly changing the all responses to be none errors as you say) and it'll be totally fine to accept that PR in some of the libraries and reject in others (since it'd be based on aSHOULD). That would make SemConv less useful, because there won't be any harmonisation on how HTTP errors are interpreted. That is why I suggest to make the spec more precise (details above) - that'd still enable what you suggest for this special case, but it'd be still totally clear for the base case, when the library does not have any info about a higher layer.

@gregkalapos does #281 address your concerns?

Not sure about that. From what I understand it's still not documented and there is no tooling around it forcing instrumentations to document when they are not following a SHOULD from the spec. So I still think the end result would be different libraries reporting errors differently based on status codes and for a user to understand that, the only way would be to look into every single instrumentation (or look at its GitHub and look for issues/PRs).

I 💯% agree with @jjatria's comment:

However, I'm not convinced that making the change suggested here would have the consequences that motivate it.

My understanding is this: Some 4xxs are not always errors and in some cases, instrumentation has information about this which could be used to set or not set the span as an error (this I'd call edge-case which the spec could accommodate). In other cases (which I'd call base-case), the instrumentation does not have this info - so best we can do is to follow the HTTP RFC and here 4xx MUST be an error. This way instrumentations will still behave similarly.

So why don't we just enable to use this information in the spec?

@tj-cisco
Copy link
Author

My understanding is this: Some 4xxs are not always errors and in some cases, instrumentation has information about this which could be used to set or not set the span as an error (this I'd call edge-case which the spec could accommodate). In other cases (which I'd call base-case), the instrumentation does not have this info - so best we can do is to follow the HTTP RFC and here 4xx MUST be an error. This way instrumentations will still behave similarly.

You have 2 conflicting statements here, you can't use the word MUST and allow the instrumentation to set it to something that is not included in the MUST.

So why don't we just enable to use this information in the spec?

This is what this whole issue is about, changing MUST to SHOULD will allow this. Alternatively it could be written in one of the 2 following ways:

For a 4xx response, the span status MAY be set to ok/unknown if the instrumentation knows that response code is not an error, in all other cases it MUST be set to error.

For a 4xx response code, the span status SHOULD be set to error. The instrumentation MAY choose set the span status to ok/unknown if it knows the response code is known not to be an error.

IMHO having MAY and MUST in the same definition is confusing, MAY and SHOULD are far more compatible terms.

@gregkalapos
Copy link
Member

I think both of your proposals here are fine - both accommodate what you suggest here and don't contaminate the base case.

This is what this whole issue is about, changing MUST to SHOULD will allow this.

And would allow many other things - e.g. it'd also allow in a specific implementation to not set any of the 4xxs to error.

My suggestion was to add the info into the spec that it's ok to not set 4xxs to errors IF the instrumentation has the knowledge that it's in fact not an error and not merge this into the base case.

Let's see what others think - I think both of your proposals above are fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage:needs-triage
Projects
None yet
Development

No branches or pull requests

8 participants