Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add process.entity_id to attribute registry #983

Closed
wants to merge 8 commits into from

Conversation

mjwolf
Copy link
Contributor

@mjwolf mjwolf commented Apr 29, 2024

Changes

Add process.entity_id attribute, from the ECS field of the same name.

entity_id is a globally unique identifier for a process. The existing attributes may no be sufficent to uniquely identify a process; PIDs can be re-used, and other attributes, like start time, might collide, especially when VMs running identical images are started at the same time.

entity_id could be any random or application-generated UUID, but it would be preferable to use process derived values to generate, when possible. By using process attributes, and a shared algorithm, different collectors running on the same system and observing the same process would generate the same ID, and users could later know that it was indeed the same process seen by both observers.

These are some notes on the recommended method I have here:

  • AFAIK, only Linux has unique boot IDs. Windows and macos have machine IDs, which could used in a similar way, but some OSs restrict reading the machine ID, to prevent user tracking. Without having a unique boot id, the PID and process start time could collide, so this method won't work for other OSs
  • system uptime was used for time instead of wall time, as the wall time could be moved backwards, and cause a collision in the three values.
  • MD5 is used because the shorter hash length will reduce data volumes, and security isn't strongly needed here.

Merge requirement checklist

Add `process.entity_id` based on the ECS field of the same name.
@mjwolf mjwolf requested review from a team as code owners April 29, 2024 23:40
docs/attributes-registry/process.md Outdated Show resolved Hide resolved
docs/attributes-registry/process.md Outdated Show resolved Hide resolved
By following this method, the same entity_id will be generated by all
observers of a given process.

**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within the process namespace that the process exists within.
**[2]:** The process ID within a PID namespace. This is not necessarily unique across all processes on the host but it is unique within this process namespace.

| Key | Description | Example Value |
| --- | --- | --- |
| `pid` | PID | "1201" |
| `starttime` | start time in seconds since boot | "3029" |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `starttime` | start time in seconds since boot | "3029" |
| `starttime` | The number of seconds since the last boot | "3029" |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've change this to The process start time in seconds since system boot, just to make it clear that's what time it's referring to

| --- | --- | --- |
| `pid` | PID | "1201" |
| `starttime` | start time in seconds since boot | "3029" |
| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `bootid` | boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |
| `bootid` | The boot ID, as read from `/proc/sys/kernel/random/boot_id` | "16b6f4a4-179e-4480-9561-0df471d8c6d4" |

@@ -12,6 +12,7 @@
| `process.command_args` | string[] | All the command arguments (including the command/executable itself) as received by the process. On Linux-based systems (and some other Unixoid systems supporting procfs), can be set according to the list of null-delimited strings extracted from `proc/[pid]/cmdline`. For libc-based executables, this would be the full argv vector passed to `main`. | `[cmd/otecol, --config=config.yaml]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.command_line` | string | The full command used to launch the process as a single string representing the full command. On Windows, can be set to the result of `GetCommandLineW`. Do not set this if you have to assemble it just for monitoring; use `process.command_args` instead. | `C:\cmd\otecol --config="my directory\config.yaml"` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.creation.time` | string | The date and time the process was created, in ISO 8601 format. | `2023-11-21T09:25:34.853Z` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `process.entity_id` | string | A unique identifier of the process. [1] | `9f26bf6c-2508-43e7-a048-ed1822d9ff52`; `903b0daafe95e8ed9c9c0c9070259c29` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we give it a more descriptive name? E.g. process.global.id ? Entity id seems to be very broad and does not seem to match the description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could rename this, but do you think we should make a namespace for one field? Or can you think of other things we might want to put in global in the future?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not global, it should be unique for given process but I don't think we can claim it is global

Also this field is used in 23 elastic integrations as well as internally. I wouldn't change it without sufficient solid grounds

docs/attributes-registry/process.md Outdated Show resolved Hide resolved
**[1]:** A globally unique identifier of the process. This can be any format
of unique indentifier.

On Linux, where possible, it is RECOMMENDED that this is calculated
Copy link
Contributor

@lmolkova lmolkova May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other other limitations except Linux processes?

Can we do either "On Linux it is RECOMMENDED" or "Where possible it is RECOMMENDED"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all OSs should have PID and process start time, the only thing that's needed for this to work is a unique boot id which can be read by all collectors. AFAIK, only Linux has a boot UUID, but there could be other ways to generate one. Maybe some sort of IPC, where the first running collector on the system generates a boot UUID, and then the others read it. But that will start to get more complicated, and I think less likely that the recommended method is adopted.

I also don't know Windows or macos very well, if anyone knows of a way to get a boot UUID on those systems, I could add it to the note, and this should work for those too.

For now, I've changed this to "Where possible it is RECOMMENDED..."

model/registry/process.yaml Outdated Show resolved Hide resolved
of unique indentifier.

On Linux, where possible, it is RECOMMENDED that this is calculated
MD5 hash of the string `<pid>_<starttime>_<bootid>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the first time we define an attribute value as a hash? I couldn't find any other instance at this point but if we do have a similar definition elsewhere, we should consistently stick to one algorithm so that instrumentation libraries can use the same hashing library and don't have to resort to multiple different ones. If it's the first time we define one, we should be fine with sticking to MD5 also for all other such hash-derived values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

service.instance.id is kind of a hash. It's not MD5 though.

Personally I'd prefer NOT to specify which hash algorithm is used, instead requirements (e.g. <N% collision over X cardinality). OR - we can agree that recommended algorithms can be updated as non-breaking changes - otherwise I think it risks change in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change. Since this recommended algorithm can't be used in all situations (or some implementors could choose not to use it), it won't be possible for anything to rely on the entity_id being identical between collectors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to specify what hash algorithm to use, otherwise different collectors could use different algorithms, and the resulting values wouldn't match for the same process. The whole intention of this recommended method is to have the identical values generated for a process. The algorithm used doesn't matter as much. Maybe it should be something other than MD5, if we use only one, and it can be used for other things that could be more affected by collisions.

Makes sense.

What would you think of taking the tuple string directly instead of using a hash?

Since attribute value size can be of concern, we could consider using the last fragment of boot_id only, which would yield a string shorter than the MD5 hash would be. Not sure if the format is well-defined and portable, though.
man for random(4) only "defines" it by means of an example:

   uuid and boot_id
         These read-only files contain random strings like
         6fd5a44b-35f4-4ad4-a9b9-6b9be13e1fe9.  The former is
         generated afresh for each read, the latter was generated
         once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place we can specify that entity_id might not match between versions? If so, then I think we can say that changing the algorithm between versions is a non-breaking change.

There isn't a syntax in YAML to define it like that but we could consider putting that in a note.
Your collectors might not run the same version but I'd hope we'd only change the hash algorithm (if we use one) once or twice in years, so it shouldn't be a large concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding a note to the field describing hashing algorithm
I wouldn't use tuple though. Strings are easier to compare and hash is also a very common thing to produce such ids.
Internally in one of security product we use for example hash of instrumentation uid and process uid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think of taking the tuple string directly instead of using a hash?

Some Elastic products are actually currently using a tuple, but we'd like to move away from it. Since systems can generate a very large number of process events, there's value in trying to keep event logs as small as possible, and we haven't ever really had a use for the data within the entity_id tuple, so we think it's better to use a hash.

I think using a tuple with only partial uuid segments would increase the chances of collision, although I'm not sure by how much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With tuple I wasn't actually referring to a semantic combination of multiple attributes but actually using a single attribute with a single string value that's composed of the individual components (by concatenating them with some character, just like the input used for the hash function, i.e.. "<pid>_<starttime>_<bootid>").

@trask
Copy link
Member

trask commented May 7, 2024

cc @open-telemetry/semconv-system-approvers

@joaopgrassi
Copy link
Member

joaopgrassi commented May 28, 2024

I wonder how this attribute related/conflicts with the entities OTEP (open-telemetry/oteps#256). @jsuereth @tigrannajaryan.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking this since this is not in agreement with what Entities SIG is doing for identifiers.

Please see the "Entity Identification" section of OTEP 0252: https://github.com/open-telemetry/oteps/blob/main/text/entities/0256-entities-data-model.md#entity-identification on what we think is the way going forward to identify all entities, including processes.

Specifically, the OTEP says that UUIDs are not a good idea for entity ids so this PR goes against that.

[EDIT to add a bit more color].

Apologies for being late to the party (thanks @joaopgrassi for pinging me).

I disagree with the proposed way to identify the process id and it is in direct disagreement with the Entities data model OTEP. The OTEP's suggested way to identify processes would be to use (pid, start_time) + any GID of the IDCONTEXT where the process runs (read the OTEP for more details). This PR's suggestion to use the locally generated bootid as a seed to ID generation goes against the design principles of the OTEP that were reviewed and accepted by the Entities SIG:

Meaningful (especially human-readable) IDs are more valuable than random-generated IDs. Long-lived IDs that survive state changes (e.g. entity restarts) are more valuable than short-lived, ephemeral IDs. See the need for navigation.

@mjwolf
Copy link
Contributor Author

mjwolf commented May 28, 2024

Hi @tigrannajaryan, I wasn't aware of the Entities SIG, or that OTEP until now.

I think this PR will need to wait until the multiple observers question in the OTEP is resolved since that is also what the proposed entity_id algorithm here is attempting to resolve. I can hold off on further work on this PR until that's resolved.

@mjwolf
Copy link
Contributor Author

mjwolf commented May 29, 2024

I'll close this PR, it'll be a very different change to implement the entities OTEP

@mjwolf mjwolf closed this May 29, 2024
@tigrannajaryan
Copy link
Member

@mjwolf you are also welcome to join the Entities SIG if you are interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

9 participants