New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Add `process.entity_id` to attribute registry #983

Closed

mjwolf wants to merge 8 commits into open-telemetry:main from mjwolf:process-entity_id

Contributor

mjwolf commented Apr 29, 2024

Changes

Add process.entity_id attribute, from the ECS field of the same name.

entity_id is a globally unique identifier for a process. The existing attributes may no be sufficent to uniquely identify a process; PIDs can be re-used, and other attributes, like start time, might collide, especially when VMs running identical images are started at the same time.

entity_id could be any random or application-generated UUID, but it would be preferable to use process derived values to generate, when possible. By using process attributes, and a shared algorithm, different collectors running on the same system and observing the same process would generate the same ID, and users could later know that it was indeed the same process seen by both observers.

These are some notes on the recommended method I have here:

AFAIK, only Linux has unique boot IDs. Windows and macos have machine IDs, which could used in a similar way, but some OSs restrict reading the machine ID, to prevent user tracking. Without having a unique boot id, the PID and process start time could collide, so this method won't work for other OSs
system uptime was used for time instead of wall time, as the wall time could be moved backwards, and cause a collision in the three values.
MD5 is used because the shorter hash length will reduce data volumes, and security isn't strongly needed here.

Merge requirement checklist

CONTRIBUTING.md guidelines followed.
Change log entry added, according to the guidelines in When to add a changelog entry.
- If your PR does not need a change log, start the PR title with [chore]
~~[ ] schema-next.yaml updated with changes to existing conventions.~~

mjwolf added 2 commits

April 29, 2024 15:16


          Add process.entity_id to registry

0e94fd5

Add `process.entity_id` based on the ECS field of the same name.


          Update note

15c57e2

mjwolf requested review from a team as code owners

April 29, 2024 23:40

github-actions bot assigned arminru

mjwolf added 3 commits

April 29, 2024 16:41


          update changelog

398e142


          revert CHANGELOG.md

9ee5887


          fix changelog

f92e96c

trisch-me reviewed

View reviewed changes

model/registry/process.yaml Outdated Show resolved Hide resolved

trisch-me approved these changes

View reviewed changes

pyohannes reviewed

View reviewed changes

docs/attributes-registry/process.md Outdated Show resolved Hide resolved

lmolkova reviewed

View reviewed changes

docs/attributes-registry/process.md Outdated Show resolved Hide resolved

docs/attributes-registry/process.md Outdated Show resolved Hide resolved

docs/attributes-registry/process.md Outdated

+              By following this method, the same entity_id will be generated by all
+              observers of a given process.
+              **[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.

Contributor

lmolkova May 7, 2024

Suggested change

      
            **[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
          
            **[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within this process namespace.

docs/attributes-registry/process.md Outdated

+              | Key | Description | Example Value |
+              | --- | --- | --- |
+              | `pid` | PID | "1201" |
+              | `starttime` | start time in seconds since boot | "3029" |

Contributor

lmolkova May 7, 2024

Suggested change

      
            | `starttime` | start time in seconds since boot | "3029" |
          
            | `starttime` | The number of seconds since the last boot | "3029" |

Contributor Author

mjwolf May 8, 2024

I've change this to The process start time in seconds since system boot, just to make it clear that's what time it's referring to

docs/attributes-registry/process.md Outdated

+              | --- | --- | --- |
+              | `pid` | PID | "1201" |
+              | `starttime` | start time in seconds since boot | "3029" |
+              | `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |

Contributor

lmolkova May 7, 2024

Suggested change

      
            | `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |
          
            | `bootid` | The boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |

docs/attributes-registry/process.md Outdated

@@ @@ -12,6 +12,7 @@ @@
               | `process.command_args` | string[] | All the command arguments (including the command/executable itself) as received by the process. On Linux-based systems (and some other Unixoid systems supporting procfs), can be set according to the list of null-delimited strings extracted from `proc/[pid]/cmdline`. For libc-based executables, this would be the full argv vector passed to `main`. | `[cmd/otecol, --config=config.yaml]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
               | `process.command_line` | string | The full command used to launch the process as a single string representing the full command. On Windows, can be set to the result of `GetCommandLineW`. Do not set this if you have to assemble it just for monitoring; use `process.command_args` instead. | `C:\cmd\otecol --config="my directory\config.yaml"` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
               | `process.creation.time` | string | The date and time the process was created, in ISO 8601 format. | `2023-11-21T09:25:34.853Z` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
+              | `process.entity_id` | string | A unique identifier of the process. [1] | `9f26bf6c-2508-43e7-a048-ed1822d9ff52`; `903b0daafe95e8ed9c9c0c9070259c29` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

Contributor

lmolkova May 7, 2024

can we give it a more descriptive name? E.g. process.global.id ? Entity id seems to be very broad and does not seem to match the description

Contributor Author

mjwolf May 8, 2024

I could rename this, but do you think we should make a namespace for one field? Or can you think of other things we might want to put in global in the future?

Contributor

trisch-me May 14, 2024

it is not global, it should be unique for given process but I don't think we can claim it is global

Also this field is used in 23 elastic integrations as well as internally. I wouldn't change it without sufficient solid grounds

docs/attributes-registry/process.md Outdated Show resolved Hide resolved

docs/attributes-registry/process.md Outdated

+              **[1]:** A globally unique identifier of the process. This can be any format
+              of unique indentifier.
+              On Linux, where possible, it is RECOMMENDED that this is calculated

Contributor

lmolkova May 7, 2024 •

edited

Other other limitations except Linux processes?

Can we do either "On Linux it is RECOMMENDED" or "Where possible it is RECOMMENDED"?

Contributor Author

mjwolf May 8, 2024

Since all OSs should have PID and process start time, the only thing that's needed for this to work is a unique boot id which can be read by all collectors. AFAIK, only Linux has a boot UUID, but there could be other ways to generate one. Maybe some sort of IPC, where the first running collector on the system generates a boot UUID, and then the others read it. But that will start to get more complicated, and I think less likely that the recommended method is adopted.

I also don't know Windows or macos very well, if anyone knows of a way to get a boot UUID on those systems, I could add it to the note, and this should work for those too.

For now, I've changed this to "Where possible it is RECOMMENDED..."

model/registry/process.yaml Outdated Show resolved Hide resolved

arminru reviewed

View reviewed changes

docs/attributes-registry/process.md Outdated

+              of unique indentifier.
+              On Linux, where possible, it is RECOMMENDED that this is calculated
+              MD5 hash of the string `<pid>_<starttime>_<bootid>`.

Member

arminru May 7, 2024

Is this the first time we define an attribute value as a hash? I couldn't find any other instance at this point but if we do have a similar definition elsewhere, we should consistently stick to one algorithm so that instrumentation libraries can use the same hashing library and don't have to resort to multiple different ones. If it's the first time we define one, we should be fine with sticking to MD5 also for all other such hash-derived values.

Contributor

jsuereth May 8, 2024

service.instance.id is kind of a hash. It's not MD5 though.

Personally I'd prefer NOT to specify which hash algorithm is used, instead requirements (e.g. <N% collision over X cardinality). OR - we can agree that recommended algorithms can be updated as non-breaking changes - otherwise I think it risks change in the future.

Contributor Author

mjwolf May 8, 2024

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change. Since this recommended algorithm can't be used in all situations (or some implementors could choose not to use it), it won't be possible for anything to rely on the entity_id being identical between collectors.

Member

arminru May 13, 2024

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Makes sense.

What would you think of taking the tuple string directly instead of using a hash?

Since attribute value size can be of concern, we could consider using the last fragment of boot_id only, which would yield a string shorter than the MD5 hash would be. Not sure if the format is well-defined and portable, though.
man for random(4) only "defines" it by means of an example:

   uuid and boot_id
         These read-only files contain random strings like
         6fd5a44b-35f4-4ad4-a9b9-6b9be13e1fe9.  The former is
         generated afresh for each read, the latter was generated
         once.

Member

arminru May 13, 2024

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change.

There isn't a syntax in YAML to define it like that but we could consider putting that in a note.
Your collectors might not run the same version but I'd hope we'd only change the hash algorithm (if we use one) once or twice in years, so it shouldn't be a large concern.

Contributor

trisch-me May 14, 2024

I'm in favor of adding a note to the field describing hashing algorithm
I wouldn't use tuple though. Strings are easier to compare and hash is also a very common thing to produce such ids.
Internally in one of security product we use for example hash of instrumentation uid and process uid.

Contributor Author

mjwolf May 14, 2024

What would you think of taking the tuple string directly instead of using a hash?

Some Elastic products are actually currently using a tuple, but we'd like to move away from it. Since systems can generate a very large number of process events, there's value in trying to keep event logs as small as possible, and we haven't ever really had a use for the data within the entity_id tuple, so we think it's better to use a hash.

I think using a tuple with only partial uuid segments would increase the chances of collision, although I'm not sure by how much.

Member

arminru May 28, 2024

With tuple I wasn't actually referring to a semantic combination of multiple attributes but actually using a single attribute with a single string value that's composed of the individual components (by concatenating them with some character, just like the input used for the hash function, i.e.. "<pid>_<starttime>_<bootid>").

Member

trask commented May 7, 2024

cc @open-telemetry/semconv-system-approvers

mjwolf and others added 3 commits

May 8, 2024 11:45


          Merge remote-tracking branch 'upstream/main' into process-entity_id

469653d


          Update notes

2ff65cc


          Merge branch 'main' into process-entity_id

07a739b

mjwolf mentioned this pull request

REQUEST: New membership for @mjwolf open-telemetry/community#2116

Closed

6 tasks

Member

joaopgrassi commented May 28, 2024 •

edited by arminru

I wonder how this attribute related/conflicts with the entities OTEP (open-telemetry/oteps#256). @jsuereth @tigrannajaryan.

tigrannajaryan requested changes

View reviewed changes

Member

tigrannajaryan left a comment •

edited

Blocking this since this is not in agreement with what Entities SIG is doing for identifiers.

Please see the "Entity Identification" section of OTEP 0252: https://github.com/open-telemetry/oteps/blob/main/text/entities/0256-entities-data-model.md#entity-identification on what we think is the way going forward to identify all entities, including processes.

Specifically, the OTEP says that UUIDs are not a good idea for entity ids so this PR goes against that.

[EDIT to add a bit more color].

Apologies for being late to the party (thanks @joaopgrassi for pinging me).

I disagree with the proposed way to identify the process id and it is in direct disagreement with the Entities data model OTEP. The OTEP's suggested way to identify processes would be to use (pid, start_time) + any GID of the IDCONTEXT where the process runs (read the OTEP for more details). This PR's suggestion to use the locally generated bootid as a seed to ID generation goes against the design principles of the OTEP that were reviewed and accepted by the Entities SIG:

Meaningful (especially human-readable) IDs are more valuable than random-generated IDs. Long-lived IDs that survive state changes (e.g. entity restarts) are more valuable than short-lived, ephemeral IDs. See the need for navigation.

Contributor Author

mjwolf commented May 28, 2024

Hi @tigrannajaryan, I wasn't aware of the Entities SIG, or that OTEP until now.

I think this PR will need to wait until the multiple observers question in the OTEP is resolved since that is also what the proposed entity_id algorithm here is attempting to resolve. I can hold off on further work on this PR until that's resolved.

Contributor Author

mjwolf commented May 29, 2024

I'll close this PR, it'll be a very different change to implement the entities OTEP

mjwolf closed this

Member

tigrannajaryan commented May 29, 2024

@mjwolf you are also welcome to join the Entities SIG if you are interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

jsuereth jsuereth left review comments

lmolkova lmolkova left review comments

arminru arminru left review comments

trisch-me trisch-me approved these changes

pyohannes pyohannes left review comments

tigrannajaryan tigrannajaryan requested changes