Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL query string values should be redacted by default #961

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

trask
Copy link
Member

@trask trask commented Apr 24, 2024

Fixes #860

Changes

Query string values are now redacted by default due to concerns around leaking sensitive data.

This is not considered a breaking change because the OTel semantic conventions definition of stability specifically allows changing attribute values:

Things not listed in the above are not expected to remain stable via semantic convention and are allowed (or expected) to change. A few examples:

  • The values of attributes

docs/database/elasticsearch.md Outdated Show resolved Hide resolved
@trask trask marked this pull request as ready for review April 24, 2024 14:56
@trask trask requested review from a team as code owners April 24, 2024 14:56
model/registry/url.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but it would also be great to update examples (or add new REDACTED ones)

docs/http/http-spans.md Outdated Show resolved Hide resolved
Copy link
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Left a nit suggestion about using the word SHOULD for instrumentations to provide option to not redact.

# your pull request title with [chore] or use the "Skip Changelog" label.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: bug_fix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognize that the specification allows this as a non-breaking change because it is an attribute value, but I am very concerned about the end user experience of this change. This change will break alerts/slos/boards that expect query parameters to be present in the attribute named url.full or url.query.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially painful for users who are treating the query parameters as non-sensitive data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in practice this can be a pretty substantial breaking change for end users, most of whom likely don't have sensitive information in these query params

@austinlparker
Copy link
Member

I would be more comfortable with keeping the current language and providing an opt-in redaction processor so as not to break existing users.

@jsuereth
Copy link
Contributor

I think it's fair to leave the registry docs the same as they were and put the opt-out redaction as an augmented description for http semconv.

@Kielek
Copy link
Contributor

Kielek commented Apr 25, 2024

Is there any chance that you provide environmental variable name as an option to disable redaction? Without this I expect couple of different implementations in various languages.

docs/url/url.md Outdated
Comment on lines 42 to 43
**[3]:** Query string values SHOULD be redacted by default and replaced by the value `REDACTED`, e.g. `q=REDACTED&v=REDACTED` (the query string keys SHOULD be preserved).
Instrumentation MAY provide a configuration option to capture the full query string without any redaction.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make specific suggestions, I'd prefer this state that query string values MAY be redacted instead of SHOULD and going as far as stating that Instrumentation MUST provide a way to redact query string.

@svrnm
Copy link
Member

svrnm commented Apr 25, 2024

Is there any chance that you provide environmental variable name as an option to disable redaction? Without this I expect couple of different implementations in various languages.

Based on this point from @Kielek a broader proposal: we will run into questions around sensitivity again and again, and right now it is solved selectively via inline comments in the specification (there are other ones for enduser.* and db.* properties), while this needs a broader solution. I suggest that such a solution could be designed in a way that it is decoupled from the attribute requirements and by that "solves" this problem. Here is a rough outline:

  • The semantic convention is kept as is (here: collect url.query by default)
  • There are "privacy requirement profiles" that end-users can/must provide when running OpenTelemetry SDKs. There is a "non-production" profile that collects everything and is not applying any redaction/filtering/etc. There is a "production" profile that applies certain level of redaction/filtering, and there may be other profiles for stricter requirements (e.g. in regulated environments, or for regions with stronger privacy legislation, etc.)
  • Those profiles can be provided via an environment variable as @Kielek suggested.
  • The filtering is applied before the telemetry leaves via an exporter.

I anticipate that this comes with a set of downsides (SDKs need to implement that filtering,redaction,etc., / those processes may be resource incentive for a busy node / ...), but this is something that in my experience end-user will ask for eventually anyhow (at least from my indirect experience with APM agents).
Off-loading this to the collector may be the preferred solution, but depending on the privacy requirements that might not even be "good enough" (again, from my experience with APM agents, end-users in certain environments expect the data to be sanitized before leaving the application)

@lmolkova
Copy link
Contributor

lmolkova commented Apr 26, 2024

The semantic convention is kept as is (here: collect url.query by default)

It is important for native instrumentations to redact sensitive data or make users opt into the collection. For example, when writing logs, OTel is just one of possible logging providers and there are no guarantees secrets (if any) will be scrubbed by something in the pipeline.

I can see the world where:

  • instrumentations should redact by default. It's configurable
  • OTel SDK/distro has means to disable redaction depending on the profile it was able to detect

This allows instrumentations to protect themselves by having safe default (if my Azure SDK instrumentation leaked a credential, I'm responsible, not the OTel). If the distro has user permission or is brave enough, it can enable collection for all instrumentations.

@svrnm
Copy link
Member

svrnm commented Apr 26, 2024

The semantic convention is kept as is (here: collect url.query by default)

It is important for native instrumentations to redact sensitive data or make users opt into the collection. For example, when writing logs, OTel is just one of possible logging providers and there are no guarantees secrets (if any) will be scrubbed by something in the pipeline.

I can see the world where:

* instrumentations should redact by default. It's configurable

* OTel SDK/distro has means to disable redaction depending on the profile it was able to detect

This allows instrumentations to protect themselves by having safe default (if my Azure SDK instrumentation leaked a credential, I'm responsible, not the OTel). If the distro has user permission or is brave enough, it can enable collection for all instrumentations.

I disagree: instrumentation libraries and native instrumentations should not carry the responsibility to redact sensitive data or provide users to opt into the collection from an OpenTelemetry perspective. If they use other logging providers or if they provide APIs to other tracing/metric providers, they can (and may be somtimes have to) do their own redaction, but that's not what we (OpenTelemetry) should expect them to do.

I argue that filtering and redaction should be a responsibility of the otel SDK and in no way the instrumentation library or the API should be expected to do the filtering:

  • The opentelemetry API by default is "noop", that means if it carries the responsibility for filtering and redaction it is either done "for nothing" or the availability of the SDK needs to be checked. I think this situation is comparable to sampling. The API might provide some properties that can help to identify if telemetry needs to be redacted/filtered, but the SDK needs to do the work.
  • For the reasons stated above the instrumentation library should also not carry that responsibility (again, they can, but they shouldn't be expected to do so). The argument for "who is responsible" works in both directions: if the azure SDK leaks credentials you are responsible, but if all HTTP instrumentation libraries following the otel standard leak credentials, the otel community is responsible.
  • A weaker argument in the same context is that responsible library authors implementing opentelemetry might be more carless around that sensitive data and either not think about it or expect opentelemetry to treat that data properly.
  • At the end the responsibility to protect sensitive data lays by the end-user consuming the instrumentation library/native instrumentation and the OpenTelemetry API/SDK in their application. Furthermore they are the only ones who have a clear understanding of their privacy needs, e.g. a purely internal application may be fine with collecting all the telemetry, an app in a staging/pre-prod environment as well, but an application that runs somewhere in production needs more data to be redacted, and if the application is used in regulated environments (medical, governmental, EU:-P, ...) that need is even higher.

To double down on this: I think that sampling and redaction share a lot of requirements, and therefore redaction should live in the Tracing/Logging/Metrics/Profiling SDK. There may be different strategies for redaction as well and there is room for innovation (see probabilistic sampling) and with the development of technology (and here: changes of legislation) requirements may change independent of the semantic conventions and specification

model/registry/url.yaml Outdated Show resolved Hide resolved
@trask
Copy link
Member Author

trask commented Apr 26, 2024

@svrnm are you saying that database queries should also not be redacted by default?

also, this PR is about security (leaking credentials) and not PII data, so I believe it applies to both production and non-production environments

@svrnm
Copy link
Member

svrnm commented Apr 26, 2024

@svrnm are you saying that database queries should also not be redacted by default?

What I am saying is that the approach we take for sensitive data redaction needs to be approached differently. Right now we have places in the spec that say "redact this data" or "do not collect this data", but no guidance on where and how this data should be treated accordingly. Right now it seems that this is a responsibility of the instrumenting library, so each HTTP instrumentation library will need to manage the redaction of url.query and provide configuration for that. That does not seem right to me and also not to fit into end-user expectations. If my application uses 12 different libraries as dependency, HTTP Client, HTTP Server, DB Client, etc.) I am expected to provide 12 different configurations for data redaction?

also, this PR is about security (leaking credentials) and not PII data, so I believe it applies to both production and non-production environments

I understand, but at the end it is about both. I also understand that this issue here is urgent and might need to be merged regardless to mitigate the security bug, but it it's still not clear then who is responsible for complying with this default?

@austinlparker
Copy link
Member

I anticipate that this comes with a set of downsides (SDKs need to implement that filtering,redaction,etc., / those processes may be resource incentive for a busy node / ...), but this is something that in my experience end-user will ask for eventually anyhow (at least from my indirect experience with APM agents).

Yes, this. The SDK needs to build this in, and we should make a best effort to redact sensitive strings (we should probably have a way for instrumentations to communicate to the SDK what is sensitive, as well), but we cannot simply make a breaking change in one of the most widely used instrumentation paths like this, especially given that I think the majority of the values passed in through query strings in the world today are not sensitive.

I would suggest a staged rollout for these profiles as well; Perhaps initially we print a warning on initialization that query strings are passed without redaction and in the future this will change? Stage 2 would be an opt-in redaction processor, Stage 3 is redaction by default?

@reyang
Copy link
Member

reyang commented Apr 26, 2024

Some rough thinking:

  1. In many cases, being secure, being compliant and being able to observe things have conflict of interests (for example, URL query string values should be redacted by default #961 (comment)), the perfect solution might not exist, the balance keeps changing due to the changes in the surrounding environments (e.g. it changed a lot when GDPR came).
  2. For people coming from different industry/domain, the definition of "what is a good balance" itself is very subjective.
  3. The default behavior should set most users for success without them doing any extra things (e.g. the default "profile" will do proper redaction for the most common problems).
  4. For users who want to change the default behavior, OpenTelemetry should provide a consistent way instead of let the user struggle in a jungle of 20+ instrumentation libraries where each have different ways of doing redaction.
  5. There are also cases when the user wants to do special customization for a particular library rather than applying a rule to all the instrumentation libraries, or even a particular exporter path rather than all exporters (e.g. I want to redact the user email address when I exporter data to my normal logging backend, meanwhile I need to keep user email address for critical audit logs per requirement from the National Security Agency).
  6. Supporting 4 & 5 will take time due to the size of the scope and the fact that people with different background have different understanding/positions, meanwhile we can decide either to wait for now or do something accepting that we're not perfect (and probably will never be perfect), as long as we know that OpenTelemetry can evolve.

@reyang
Copy link
Member

reyang commented Apr 26, 2024

The SDK needs to build this in, and we should make a best effort to redact sensitive strings

This could also be a dangerous direction - for example, the pattern keeps changing when it comes "what looks like a credential leak", doing it inside the SDK has serviceability challenges (requiring the service to update dependency and redeploy). Either users don't use it, or they use it and have to do frequent patches in order to keep up with the latest rules, or we'll have to invent a system where these rules can be updated dynamically via some configuration.

@lmolkova
Copy link
Contributor

lmolkova commented Apr 26, 2024

Leaving the defaults and responsibilities aside for a moment.

I would be interested to learn from vendors/instrumentation authors on

  1. previous users complaints about sensitive data/potential leaks
  2. do they redact in their legacy products or otel-based offerings

My list of anecdotal evidence:

Random links from the CNCF blog
https://www.cncf.io/blog/2023/02/07/migrating-from-opentracing-to-opentelemetry/

Oxeye’s research team discovered several scenarios where sensitive data was leaked through tracing and telemetry collection within cloud-native applications. They found one that belonged to a leading online payment services company among the many deployments.

https://www.oxeye.io/resources/how-insecure-application-tracing-and-telemetry-may-lead-to-sensitive-data-and-pii-leakage

@austinlparker
Copy link
Member

austinlparker commented Apr 26, 2024

The SDK needs to build this in, and we should make a best effort to redact sensitive strings

This could also be a dangerous direction - for example, the pattern keeps changing when it comes "what looks like a credential leak", doing it inside the SDK has serviceability challenges (requiring the service to update dependency and redeploy). Either users don't use it, or they use it and have to do frequent patches in order to keep up with the latest rules, or we'll have to invent a system where these rules can be updated dynamically via some configuration.

I mean, it could be a regular expression (or even some simple heurestics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I would point out that Datadog offers two options - redact query string, and redact paths with digits (https://docs.datadoghq.com/tracing/configure_data_security/?tab=http#trace-obfuscation) but they are both disabled by default. In addition, as other comments have pointed out, we ship a system for redaction today, it's in the collector, and our official guidance on production deploys is to use a collector.

I am not arguing that we shouldn't have a redaction option, I am arguing that we should not default it to 'on'. I would point out that this proposed new behavior was opposed by comments from the community on the .NET repo where it originated (open-telemetry/opentelemetry-dotnet#5532 (comment)).

edit: another example from the datadog docs, specifically around their library: https://docs.datadoghq.com/tracing/configure_data_security/?tab=http#library. they redact 'suspicious-looking' values through a regex, the default of which is below:

(?:(?:"|%22)?)(?:(?:old[-_]?|new[-_]?)?p(?:ass)?w(?:or)?d(?:1|2)?|pass(?:[-_]?phrase)?|secret|(?:api[-_]?|private[-_]?|public[-_]?|access[-_]?|secret[-_]?|app(?:lication)?[-_]?)key(?:[-_]?id)?|token|consumer[-_]?(?:id|key|secret)|sign(?:ed|ature)?|auth(?:entication|orization)?)(?:(?:\s|%20)*(?:=|%3D)[^&]+|(?:"|%22)(?:\s|%20)*(?::|%3A)(?:\s|%20)*(?:"|%22)(?:%2[^2]|%[^2]|[^"%])+(?:"|%22))|(?:bearer(?:\s|%20)+[a-z0-9._\-]+|token(?::|%3A)[a-z0-9]{13}|gh[opsu]_[0-9a-zA-Z]{36}|ey[I-L](?:[\w=-]|%3D)+\.ey[I-L](?:[\w=-]|%3D)+(?:\.(?:[\w.+/=-]|%3D|%2F|%2B)+)?|-{5}BEGIN(?:[a-z\s]|%20)+PRIVATE(?:\s|%20)KEY-{5}[^\-]+-{5}END(?:[a-z\s]|%20)+PRIVATE(?:\s|%20)KEY(?:-{5})?(?:\n|%0A)?|(?:ssh-(?:rsa|dss)|ecdsa-[a-z0-9]+-[a-z0-9]+)(?:\s|%20|%09)+(?:[a-z0-9/.+]|%2F|%5C|%2B){100,}(?:=|%3D)*(?:(?:\s|%20|%09)+[a-z0-9._-]+)?)

@cartermp
Copy link
Contributor

@lmolkova

I would be interested to learn from vendors/instrumentation authors on

My experience in the past ~3 years in my capacity working for a vendor (Honeycomb) has been a mix:

  • Vast majority of customers have no issues with this at all. They recognize that debugging information means there may be sensitive information, including possible credentials. SOC II compliance, encryption at rest and in transit, the ability to sign a DPA, and auditable, proper least-privilege controls for their data are all table stakes.
  • Most customers who operate on data with a degree of sensitivity typically have their own, internal mechanisms to ensure nothing bad leaks. Things still leak of course, and when it happens they ask us to delete the data at the requested level of granularity. Note that the 60-day retention period of data (after which data is deleted permanently) alleviates many of these concerns by default.
  • Enterprise customers dealing with this kind of stuff -- when they know it's "a thing" -- ask what their options are for data redaction as a last-step measure before egress from their own network at various stages, but it's typically before signing up to instrument beyond a POC. Thus far, the redactionprocessor has been sufficient for all but one customer.
  • One customer is extremely careful about privacy and has built their own encryption proxy that ensures no real names ever leave their network, and uses another tool to decrypt values in the browser when using Honeycomb.
  • I have spoken with one customer who was concerned about how their URL representations could leak sensitive information, both in query params and in the URL itself. They acknowledged this is very much a "we have a lot of work to do to clean things up ourselves" situation, and wanted to make sure there was some consumable information about how to redact information, including how to configure specific instrumentations so that sensitive data doesn't get leaked to begin with.

@austinlparker
Copy link
Member

  • Vast majority of customers have no issues with this at all. They recognize that debugging information means there may be sensitive information, including possible credentials. SOC II compliance, encryption at rest and in transit, the ability to sign a DPA, and auditable, proper least-privilege controls for their data are all table stakes.

I would also add that in five years at Lightstep I don't recall anyone having an issue with this. Any time we did have a customer send PII or other sensitive data, we had ways for them to delete it, and we offered redaction as a part of our collection pipeline for known sensitive keys/strings.

@lmolkova
Copy link
Contributor

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this.

Could we all agree with something like:

  • this set of query params (other properties) is safe, they should be on by default
  • this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)
  • For the rest, we want SDK/distro to be able to apply a common default

?

@austinlparker
Copy link
Member

austinlparker commented Apr 26, 2024

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this.

Could we all agree with something like:

  • this set of query params (other properties) is safe, they should be on by default
  • this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)
  • For the rest, we want SDK/distro to be able to apply a common default

?

I would support this.

@trask
Copy link
Member Author

trask commented Apr 26, 2024

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this.
Could we all agree with something like:

  • this set of query params (other properties) is safe, they should be on by default
  • this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)
  • For the rest, we want SDK/distro to be able to apply a common default

?

I would support this.

I've sent a PR to motivate further discussion of this proposal: #971

Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 12, 2024
@svrnm
Copy link
Member

svrnm commented May 13, 2024

Commenting to unstale

@trask trask removed the Stale label May 13, 2024
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Http client and server span default collection behavior for url.full and url.query attributes