Is htop safe from PID reuse? #1441

Closed
giampaolo opened this issue Apr 8, 2024 · 16 comments
Labels
support request This is not a code issue but merely a support request. Please use the mailing list or IRC instead.

Comments

@giampaolo

giampaolo commented Apr 8, 2024

Hi. This is a question, so sorry in advance if this is not appropriate for the bug tracker.
Is htop safe from PID reuse? E.g. if a PID is reused and I SIGTERM it via htop, is there a risk that I terminate the wrong (new) process?
Furthermore, I'm not sure if some htop columns are cached (COMMAND column?). If they are: is there a risk that htop keeps showing the wrong column for the reused PID / process?

Thanks

BenBE added the "support request" label Apr 8, 2024
@BenBE
Member

BenBE commented Apr 8, 2024

> Hi.
> This is a question, so sorry in advance if this is not appropriate for the bug tracker.

That's what we got the "Support Request" tag for … ;-)

> Is htop safe from PID reuse? E.g. if a PID is reused and I SIGTERM it via htop, is there a risk that I terminate the wrong (new) process?

That question is not easy to answer in general, but as there are currently no mechanisms in place to track short-lived processes or process terminations, there is a real chance of accidentally sending a signal to a new process. The specifics would need some in-depth investigation, but the preconditions are basically:

  1. The old process/thread with a given PID recently died
  2. A new process/thread is created that re-uses that PID

As those two conditions are not "lockable" from userspace, ALL process monitors are likely to have this issue in one form or another. On some OSes (for example Windows) there are workarounds that could mitigate this (by keeping a process handle, thus forcing the PID to remain "dormant"). On Linux and *BSD this is AFAIK not an option.

> Furthermore, I'm not sure if some htop columns are cached (COMMAND column?).

To reduce the resource usage most columns that rarely change are somewhat cached or refreshed only at a slower rate. The two most prominent columns this applies to are the process' command line and the shared memory usage.

For the command line (plus: thread names, executable name, and current working directory) there's a setting to force refresh in every update cycle.

For the shared memory usage things are updated roughly every 2-3 update cycles to reduce load. This was done as processes rarely change their set of loaded libraries drastically over a short time.

> If they are: is there a risk that htop keeps showing the wrong column for the reused PID / process?

A chance for this to happen exists IFF the refresh setting mentioned above is turned off (default AFAIR), the old process exits, AND a new process re-using that PID starts ALL between two refresh cycles of htop (by default 1.5 seconds, minimum 0.1 seconds, maximum infinite).

The chance for this to happen is negligible on mostly idle systems and fairly small on busy ones unless there's really high load with many processes starting/stopping each second.

> Thanks

You're welcome.

@giampaolo
Author

giampaolo commented Apr 8, 2024

To be entirely honest and for full disclosure: I'm the author of an htop-like Python library called psutil, which is why I showed up here. I was hoping htop had solved this issue, which I don't know how to solve (see giampaolo/psutil#2396). I think I can return the favor though.

> That question is not easy to answer in general

I know. :)

> there currently are no mechanisms in place to track short-lived processes nor process terminations, there is a valid chance to accidentally send a signal to a new process.

Actually there is a solution to this.

  1. On all platforms (including Windows), when you list a process for the first time, you can save its PID + creation time internally. From then on, the pair serves as the process's unique identifier. Right before you send a signal to the process you can get the PID's creation time once again, and if it's different than before then it means that the PID has been reused. Reason: a PID may be recycled, but its creation time will necessarily be different (higher) than its predecessor's. Here are the relevant parts of the psutil source code:
    https://github.com/giampaolo/psutil/blob/841902c1c342121ee8d07d4b061c23de43de050a/psutil/__init__.py#L608-L614
    https://github.com/giampaolo/psutil/blob/841902c1c342121ee8d07d4b061c23de43de050a/psutil/__init__.py#L375-L379

  2. Very recently (today =)) I discovered 2 syscalls on Linux and FreeBSD which prevent this race condition from happening. See discussion at: [Linux / FreeBSD] evaluate using pidfd_send_signal() for signaling processes giampaolo/psutil#2400. EDIT: it cannot work because you can only open 1024 fds.

I believe you can implement either solution 1 or 2 in htop as well. Solution 1 works on all platforms including Windows, and it has been battle tested in psutil for years, so I would recommend this one.
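
A minimal sketch of the PID + creation time check from solution 1, assuming Linux's `/proc/<pid>/stat` layout. The function names (`read_start_ticks`, `remember`, `kill_if_same`) are hypothetical; this is neither psutil's nor htop's actual code, and the small window between the check and `os.kill()` is still there.

```python
import os
import signal


def read_start_ticks(pid: int) -> int:
    """Return the process start time in clock ticks since boot (Linux only)."""
    with open(f"/proc/{pid}/stat", "rb") as f:
        data = f.read()
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')' characters, so split after the LAST ')'.
    fields = data[data.rindex(b")") + 2:].split()
    # fields[0] is field 3 (state), so field 22 (starttime) is fields[19].
    return int(fields[19])


def remember(pid: int) -> tuple[int, int]:
    """Record (pid, start time) when the process is first listed."""
    return pid, read_start_ticks(pid)


def kill_if_same(identity: tuple[int, int], sig: int = signal.SIGTERM) -> bool:
    """Send sig only if the PID still has the recorded start time."""
    pid, start = identity
    try:
        if read_start_ticks(pid) != start:
            return False  # the PID now belongs to a different process
        # A small race window remains between the check above and this call.
        os.kill(pid, sig)
    except (FileNotFoundError, ProcessLookupError):
        return False  # the process is gone
    return True
```

A monitor would call `remember()` the first time a PID shows up in its list and `kill_if_same()` when the user asks for a signal to be sent.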

@natoscott
Member

| Actually there is a solution to this.

Hmm, I'm not convinced.

In option #1 there remains a race condition between the second PID creation time check and the time when you send the signal. I guess you could check creation time again after sending the signal, and then say "oops, sorry, I may have done the wrong thing, not sure" to the user if the PID changed in-between ... but the problem isn't solved AFAICT. And since it's common to be signaling with SIGTERM / SIGKILL, any subsequent check is going to be very unreliable anyway.

Option #2 sounds more feasible but I still wonder if this is primarily a theoretical issue? The kernels' PID selection strategies make rapid reuse unlikely, so I think this may be a "solution looking for a problem" in system tools like ours - has anyone ever reported this issue occurring? I can definitely see a rationale for that syscall in other situations, but not so much for system tools that are sampling PIDs frequently.

@giampaolo
Author

giampaolo commented Apr 8, 2024

> In option #1 there remains a race condition between the second PID creation time check and the time when you send the signal.

Theoretically you are correct, there is a (very small) time window during which the PID could be reused, see giampaolo/psutil#2400. I would speculate that the kernel is smart enough not to reuse the same PID that quickly though.

> Option #2 sounds more feasible.

There is a downside to using this option that I didn't mention because I realized it just now. To use solution 2 you have to pre-emptively save pidfd_open's fd for all PIDs, in order to use it later on kill(), but there is a limit to the number of fds that you can open or you'll get EMFILE (too many open files). On Linux the limit is:

$ ulimit -Sn
1024
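
As a rough illustration of that limit, the sketch below compares the soft limit on open file descriptors with the number of processes currently visible in `/proc`. It assumes Linux; `fd_budget` is a hypothetical helper name. Threads, which htop can also list, are not counted here.

```python
import os
import resource


def fd_budget() -> tuple[int, int]:
    """Return (soft limit on open fds, number of processes listed in /proc)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    n_procs = sum(1 for name in os.listdir("/proc") if name.isdigit())
    return soft, n_procs


soft, n_procs = fd_budget()
if soft == resource.RLIM_INFINITY:
    print(f"{n_procs} processes, no soft limit on open fds")
else:
    # One pidfd per process only fits while n_procs stays below the soft
    # limit, and the monitor needs some descriptors for its own use too.
    print(f"{n_procs} processes, soft fd limit {soft}")
```

The soft limit is the same value `ulimit -Sn` reports; a process may raise it up to the hard limit, which only moves the ceiling.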

@natoscott
Member

| I would speculate that the kernel is smart enough not to reuse the same PID that quickly though.

100% agreed. And given we sample every 1-2 seconds by default, this whole issue is likely a non-problem in practice.

| There is a downside to using this option

+1

It could be solved though. We typically only signal one PID at a time (requires UI selection/interaction), so if we went this path (I'm definitely not advocating for it!) it could be done in a way that only selected processes have open FDs associated with them.
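
A sketch of what "an open FD only for the selected process" could look like, assuming Linux 5.3 or later and Python 3.9 or later (`os.pidfd_open`, `signal.pidfd_send_signal`). The helper names are hypothetical and the start time is assumed to have been recorded when the process was first listed. Checking the start time after the pidfd is open means the check and the signal refer to the same process, up to the resolution of the start time.

```python
import os
import signal


def read_start_ticks(pid: int) -> int:
    """Return field 22 (starttime) of /proc/<pid>/stat, in clock ticks."""
    with open(f"/proc/{pid}/stat", "rb") as f:
        data = f.read()
    # Skip past the command name, which may contain spaces or ')'.
    fields = data[data.rindex(b")") + 2:].split()
    return int(fields[19])


def signal_selected(pid: int, start_ticks: int, sig: int = signal.SIGTERM) -> bool:
    """Signal pid only if it is still the process recorded with start_ticks."""
    try:
        pidfd = os.pidfd_open(pid)
    except ProcessLookupError:
        return False  # no such process any more
    try:
        # The pidfd refers to whichever process owned the PID when it was
        # opened. If the start time read now still matches the recorded one,
        # the recorded process was alive at open time, so the pidfd is its.
        if read_start_ticks(pid) != start_ticks:
            return False
        signal.pidfd_send_signal(pidfd, sig)
        return True
    except (FileNotFoundError, ProcessLookupError):
        return False
    finally:
        os.close(pidfd)  # only one descriptor is ever held, and only briefly
```

Only one descriptor is open at a time, so the `EMFILE` concern does not apply to this variant.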

@Explorer09
Contributor

Explorer09 commented Apr 10, 2024

AFAIK, there is no general solution. In Unix-like systems, the PID of a dead process stays reserved only until the parent reaps it, and that's what "zombie processes" refers to. Once the PIDs are freed (i.e. zombies reaped by the parent process) another process can be allocated the same PID, even though the OS would try to avoid that whenever possible. Since the PID space is limited, your only chance of minimizing PID collisions is raising the pid_max limit.
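
For reference, a small sketch that reads the current limit, assuming Linux, where it is exposed at `/proc/sys/kernel/pid_max`. The helper name `read_pid_max` is hypothetical.

```python
from pathlib import Path


def read_pid_max(path: str = "/proc/sys/kernel/pid_max") -> int:
    """Return the upper bound on PIDs the kernel will hand out before wrapping."""
    return int(Path(path).read_text())


print(f"pid_max = {read_pid_max()}")
```

Raising the value requires root, for example via `sysctl -w kernel.pid_max=<value>`. A larger PID space makes the counter wrap around less often, which makes reuse of a given PID rarer but not impossible.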

And by the way: if the OS reserved PIDs for process managers like htop, we could end up with a lot of PIDs reserved and turned into "zombies" whenever a process manager is not responsive.

@Explorer09
Contributor

Explorer09 commented Apr 10, 2024

> 1. On all platforms (including Windows), when you list a process for the first time, you can save its PID + creation time internally. From then on, they will represent the process unique identifier. Right before you send signal to the process you can get PID's creation time once again, and if it's different than before then it means that PID has been reused.

As mentioned in the comments above, there is a TOCTOU race. Unless your OS has a kill() API that also takes creation time as an argument, it won't help.

> 2. Very recently (today =)) I discovered 2 syscalls on Linux and FreeBSD which prevent this race condition from happening. See discussion at: [Linux / FreeBSD] evaluate using pidfd_send_signal() for signaling processes giampaolo/psutil#2400.

This "pidfd" solution sounds bad because it forces the OS to reserve a process reference for us (in a file descriptor rather than a PID). We would then need to manage the "FDs" ourselves to avoid internal resource leaks. It complicates the OS side of managing resources as well, because a process can be "opened" by multiple process managers (including multiple htop instances), resulting in more uses of the "FDs" than necessary. I think the OS not reserving PID resources for a process manager should be a feature, not an error.

@natoscott
Member

I think we're all in agreement there's nothing we should change in htop here (if I got that wrong, please reopen & let's discuss further)

@Explorer09
Contributor

While this support question can be closed, I have a feature proposal in case you guys are interested: #1442

@giampaolo
Author

giampaolo commented Apr 11, 2024

> I think we're all in agreement there's nothing we should change in htop here (if I got that wrong, please reopen & let's discuss further)

FWIW, I think htop should use creation time as I previously described in #1441 (comment) (solution 1). This is racy like any other user-space solution based on PID alone, but it gives a high level of reliability because the race condition is extremely unlikely to occur in practice.

Having zero checks in place that try to prevent killing a reused PID may lead to data loss, DoS or have security implications.

@Explorer09
Contributor

Explorer09 commented Apr 11, 2024

@giampaolo No. That's a false sense of security as you didn't eliminate the race totally.
And when you talk about how unlikely the race is in practice, keep in mind that PID reuse and collision are unlikely already, and adding a creation time doesn't help anything.

You should raise the pid_max limit if you are truly afraid of this. Another mitigation for the issue is to limit the user's privilege when killing a process, so that the user won't accidentally kill a process owned by someone else.

By the way, #1442 would also be a partial solution. There is still a race between a process entry being last updated and the process file descriptor being opened. But the usability could be a little better as the user can review again what processes they are killing.

@giampaolo
Author

giampaolo commented Apr 11, 2024

Discussion is getting split. :)

The problem with #1442 is that it checks for process identity very late in the lifetime of the process. If htop has been open for 10 minutes and PID reuse happened 5 minutes ago, you will not know.

Detecting PID reusage is closely related to identifying a process uniquely over time. You can't use just the PID, so you have to add something else to the mix. That can be PID creation time or pidfd, it doesn't matter, but you have to do that ASAP, meaning on startup and every time a new PID shows up, and do it for all PIDs. And when you do that you want to store creation time or pidfd internally, so that you can use it (much) later on kill().

The downside of pidfd is that you'll soon run out of FDs and it's Linux only, which is why I deem creation time a better solution.

@Explorer09
Contributor

@giampaolo

> If htop has been open for 10 minutes and PID reuse happened 5 minutes ago, you will not know.

htop periodically updates the process list. Unless you pause the update, you would notice it already.
There is no way to track process death or PID reuse in any Unix-like system asynchronously, except for the parent process, which can receive SIGCHLD from its children. This is by design.

> Detecting PID reusage is closely related to identifying a process uniquely over time. You can use creation time or pidfd, it doesn't matter, but you have to do that ASAP and for all processes, meaning on startup and every time a new PID shows up.

Keep in mind that the "process ID + creation time" combination does not make the process unique. There is a precision issue in time measurements, and a process can spawn and die very quickly between time measurements, so an identifier like this won't be as unique as you think. (Even the v1 UUID format needs to avoid the issue where two ID generation requests happen very quickly.) There is no "truly unique" identifier for processes as far as I can think of.

This is a non-issue. It's more of a limitation due to OS design, and a process manager like htop can't do anything about it.

@giampaolo
Author

> htop periodically updates the process list. Unless you pause the update, you would notice it already.

No you won't. htop simply sees that PID X existed before and after the update. It doesn't check whether that PID belongs to a different process now, so from htop's perspective nothing changed. It will even show the old process CMDLINE, since it's cached (which is fine, I'm merely talking about making kill() safe).

> Keep in mind that the "process ID + creation time" combination does not make the process unique. There is a precision issue in time measurements, and a process can spawn and die very quickly between time measurements, so an identifier like this won't be as unique as you think.

Agreed. It's a compromise. Linux reports creation time with a precision of two decimal digits (e.g. 432904.78). That means the creation time strategy guarantees that you won't kill the wrong process unless the OS reused the same PID within the same hundredth of a second, so that the two creation times differ only from the third decimal digit on (e.g. 432904.785). That is way better than having no check at all IMO.
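
The two-digit precision comes from the kernel reporting start times in clock ticks. A small sketch, assuming Linux, where `/proc/<pid>/stat` gives the start time in ticks and `os.sysconf("SC_CLK_TCK")` gives the ticks per second (commonly 100). The helper name `ticks_to_seconds` is hypothetical.

```python
import os

TICKS_PER_SECOND = os.sysconf("SC_CLK_TCK")  # commonly 100 on Linux
RESOLUTION = 1.0 / TICKS_PER_SECOND          # hence commonly 0.01 s


def ticks_to_seconds(ticks: int) -> float:
    """Convert a /proc/<pid>/stat starttime value to seconds since boot."""
    return ticks / TICKS_PER_SECOND


print(f"start time resolution: {RESOLUTION} s")
```

Two processes whose start times fall within the same tick report the same value, so start time alone cannot tell them apart.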

@BenBE
Member

BenBE commented Apr 11, 2024

@giampaolo

> > If htop has been open for 10 minutes and PID reuse happened 5 minutes ago, you will not know.

> htop periodically updates the process list. Unless you pause the update, you would notice it already. There is no way to track process death or PID reuse in any Unix-like system asynchronously, except for the parent process, which can receive SIGCHLD from its children. This is by design.

There is some limited way with kprobes/event triggers. We will eventually need to implement these for tracking short-lived processes, but for the feature suggested here they are pure overkill …

> > Detecting PID reusage is closely related to identifying a process uniquely over time. You can use creation time or pidfd, it doesn't matter, but you have to do that ASAP and for all processes, meaning on startup and every time a new PID shows up.

As said, you can do this with kernel tracing, but this is overkill if it were just for this one (rare) situation …

> Keep in mind that the "process ID + creation time" combination does not make the process unique. There is a precision issue in time measurements, and a process can spawn and die very quickly between time measurements, so an identifier like this won't be as unique as you think. (Even the v1 UUID format needs to avoid the issue where two ID generation requests happen very quickly.) There is no "truly unique" identifier for processes as far as I can think of.

See above …

@Explorer09
Contributor

> No you won't. htop simply sees that PID X existed before and after the update. It doesn't check whether that PID belongs to a different process now, so from htop's perspective nothing changed. It will even show the old process CMDLINE, since it's cached (which is fine, I'm merely talking about making kill() safe).

Would you report this as a bug?

I think there should at least be a sanity check to ensure the process CMDLINE or username or whatever is the same as before. The process CMDLINE can change after an execve(2) call. And the username check might help avoid the "killing the wrong process" problem you mentioned above.

(Perhaps we should also detect changes of PGRP (process group ID) and session. These are unlikely to collide even when a PID is reused within a short time.)
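
A sketch of such a sanity check, assuming Linux, where the owner of the `/proc/<pid>` directory normally reflects the process's effective UID. `Fingerprint`, `fingerprint` and `still_matches` are hypothetical names, and a mismatch is only a hint that the PID changed hands, not proof either way.

```python
import os
from typing import NamedTuple


class Fingerprint(NamedTuple):
    uid: int      # owner of /proc/<pid>
    pgrp: int     # process group ID
    session: int  # session ID


def fingerprint(pid: int) -> Fingerprint:
    """Collect attributes that rarely coincide when a PID is reused."""
    uid = os.stat(f"/proc/{pid}").st_uid
    return Fingerprint(uid, os.getpgid(pid), os.getsid(pid))


def still_matches(pid: int, before: Fingerprint) -> bool:
    """Return True if pid still shows the fingerprint recorded earlier."""
    try:
        return fingerprint(pid) == before
    except (FileNotFoundError, ProcessLookupError):
        return False  # the process is gone
```

A process can legitimately change its own process group, session or UID, so a monitor would probably ask the user to confirm on a mismatch rather than silently refuse.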

I mentioned user privilege being one possible way to mitigate the issue.
