
Git.execute's kill_after_timeout callback assumes procps #1756

Open
EliahKagan opened this issue Dec 3, 2023 · 9 comments

@EliahKagan (Contributor)

Background

Calling Git.execute with a non-None kill_after_timeout—whether directly, or indirectly through the dynamic attributes of a Git instance—creates a timer on a separate thread that calls the local kill_process function. This callback uses os.kill to kill the process. Before killing it, it enumerates the process's direct children; if sending the first signal succeeds (essentially, if the parent process still existed), it also attempts to kill those child processes.

The children are enumerated with ps --ppid:

GitPython/git/cmd.py

Lines 1010 to 1014 in fe082ad

p = Popen(
    ["ps", "--ppid", str(pid)],
    stdout=PIPE,
    creationflags=PROC_CREATIONFLAGS,
)

The problem

The --ppid option is not POSIX. Most GNU/Linux systems have procps, whose ps implementation supports --ppid. I am unsure if any other implementations of ps support it. The procps tools generally run only on Linux-based systems, because they use the /proc filesystem (and assume it is laid out as in Linux). Although they can run on any such system, Alpine Linux and some minimal GNU/Linux environments do not ship them, defaulting to ps from busybox instead.

As demonstrated below, macOS ps does not support --ppid. Nor do FreeBSD, NetBSD, OpenBSD, or DragonFly. AIX does not have --ppid. illumos does not have --ppid; nor does Solaris, though -ppid (with one -) can be used in 11.4.27 or higher. Although Cygwin mimics Linux where feasible, its /proc filesystem is different, and its ps does not support --ppid either (nor even some important POSIX options like -o).

The callback parses stdout from that ps command, but does not examine the exit status or stderr. The effect is that, on a system without procps (or another ps supporting --ppid, if one exists), an error message is printed and the parent process is still sent SIGKILL, but its children are never found or sent signals.
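To make the failure mode concrete, here is a hedged sketch (not GitPython's actual code; the helper name is hypothetical) of what checking the exit status that the current callback ignores would look like:

```python
import subprocess

def list_children_with_error_check(pid):
    """Hypothetical sketch: run the same non-portable command, but
    surface its exit status and stderr instead of ignoring them."""
    proc = subprocess.run(
        ["ps", "--ppid", str(pid)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if proc.returncode != 0:
        # On macOS/BSD and other non-procps systems this branch is
        # taken; the current callback never looks at the status, so
        # the failure passes silently.
        return None
    return proc.stdout
```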

As detailed below, although the fetch, pull, and push methods of the Remote class accept a kill_after_timeout argument, they do not use Git.execute, so they are unaffected by this bug.

Steps to reproduce

On macOS 13 (on a GitHub Actions CI runner with tmate), I created this script in a directory in $PATH, named it git-sleep, and marked it executable:

#!/bin/sh
sleep "$@"

Then I called sleep on a Git instance with a kill_after_timeout argument specifying a shorter duration than the sleep:

bash-3.2$ python
Python 3.12.0 (v3.12.0:0fb18b02c8, Oct  2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from git import Git
>>> Git().sleep(10, kill_after_timeout=5)
ps: illegal option -- -
usage: ps [-AaCcEefhjlMmrSTvwXx] [-O fmt | -o fmt] [-G gid[,gid...]]
          [-g grp[,grp...]] [-u [uid,uid...]]
          [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]]
       ps [-L]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 741, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1320, in _call_process
    return self.execute(call, **exec_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/GitPython/GitPython/git/cmd.py", line 1117, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-9)
  cmdline: git sleep 10
  stderr: 'Timeout: the command "git sleep 10" did not complete in 5 secs.'
>>>

Impact

1. The other kill_after_timeout is unaffected

There are two callables defined in git/cmd.py that accept an optional kill_after_timeout argument: the "internal" top-level handle_process_output function that is not listed in __all__ but is used throughout GitPython, and the public Git.execute method (also used when dynamic Git methods are called). The meaning of this argument is subtly different, and the associated implementations completely different.

This bug affects only the one in the Git class. Thus it does not affect common uses of timeouts in interacting with remotes: the Remote.fetch, Remote.push, and Remote.pull methods accept kill_after_timeout arguments, but they forward them to handle_process_output.

2. But this one should work on all Unix-like systems

From context, I think it is unintended not to support common Unix-like systems such as macOS. The Git.execute docstring says "This feature is not supported on Windows" and makes no other claims about compatibility, from which I think readers will reasonably infer that other platforms are believed supported. When called on a native Windows system (not Cygwin) with a non-None value for kill_after_timeout, it raises a GitCommandError. Other systems, including Cygwin, raise no exception and register the kill_process callback. kill_after_timeout is thus in effect documented to work on all systems except native Windows.

3. What happens if the child processes aren't sent SIGKILL?

I don't know how much of a problem it is for SIGKILL to be sent only to the parent and not to its direct children. I am not confident I know why that is being done, as opposed to killing only the parent process, or attempting to kill its entire process tree. My guess is that this is because many git commands use a subprocess to do their work. If so, then it may in practice be important—in situations where people pass kill_after_timeout—that the child processes are killed as well.

However, git subprocesses do sometimes use their own subprocesses:

ek@Glub:~$ pstree -a
init(Ubuntu)
  ├─SessionLeader
  │   └─Relay(9)
  │       ├─bash
  │       │   └─git clone https://github.com/huggingface/transformers.git
  │       │       └─git remote-https origin https://github.com/huggingface/transformers.git
  │       │           └─git-remote-http origin https://github.com/huggingface/transformers.git
...

In that example, the git-remote-http process may not receive SIGKILL. I am unsure how much this matters, but if it is a problem, then the more severe it is, the less severe this bug is, because the intended behavior wouldn't help anyway. Likewise, in situations where killing the parent process is sufficient, this bug also does not cause a problem.

That lower descendants are not killed has been reported as #895. That was observed in GitPython 2.0.2, which had the current approach of killing just the direct child processes.

A minor race condition…

One thing I'm a little worried about is a race condition that is currently present, and that I think may not be possible to fix, but that I worry finding child processes in a more portable way may exacerbate. Unless it can be solved or mitigated more deeply, it is a reason, unrelated to performance, to prefer that a portable substitute for the existing use of ps --ppid not be too much slower than the current way. (I likewise worry that if the approach were changed to kill all descendants, then the added time to traverse the whole subtree might exacerbate this race condition.)

Suppose we plan to kill a process P and all its direct child processes including Q, and we find the PID of Q, but before killing Q, all the following happen:

  1. Q dies.
  2. Q is reaped. That is, it is wait(2)ed by its parent--which is either its original parent P or, if P has died, then init--causing its entry in the process table to be removed and its PID to be available for use by a future process.
  3. A new process, R, is created and assigned Q's old PID.

Then when we try to kill Q, we kill R.

This situation is rare, because in practice the time between when a process is reaped and when a new process is given its PID is only short when the process table is nearly full, leaving the kernel no less recently relinquished PIDs to give out. But I think it would be best to avoid increasing the risk of it.

There may be other related race conditions, but this is the one that seems it could be worsened by replacing the existing unportable use of ps --ppid with some other technique, if that other technique is markedly slower.

Finding/killing the subprocesses portably

I am unsure if this should be done, because it is not clear to me that killing the parent process and its direct child processes, as is currently attempted (generally successfully on GNU/Linux and unsuccessfully elsewhere), is necessarily what should happen. Doing anything else might risk incompatibility for some existing use cases on some systems, so I would want to be cautious about doing something altogether different, but I think it should still be considered before proceeding.

However, assuming the current approach of killing the child processes should be preserved, I think there are three cases:

  1. Systems with pgrep/pkill.
  2. Systems whose ps is POSIX-compliant, or at least supports -A and -o.
  3. Cygwin. (And the like, e.g., MSYS2. But sys.platform == "cygwin" still covers that.)

Case 1 could be folded into case 2 if a speed regression is acceptable (but see above on the race condition), or if testing reveals that using pgrep or pkill is not significantly faster. Case 3 could be dropped in favor of modifying the docstring to document that kill_after_timeout is less effective on Cygwin, if reducing complexity is regarded as more important than covering it.

Whether to cover case 3 or not is more a matter of code complexity than time to write and review the code. With or without it, I think most of the time and effort would be on the tests. Currently none cover passing kill_after_timeout to Git.execute or to a dynamic method of a Git object. Only the other kill_after_timeout—of handle_process_output—has test coverage. Because this project has CI on Cygwin, I don't think the tests have to do much to accommodate it—its challenges are ready-made.

(An alternative to dealing with these details is to use psutil, but I'm unsure if the impact of this issue is sufficient to justify adding it as a dependency. It doesn't support all systems, but systems it doesn't support are rare. I think it could be made conditional on the systems it is installable on, and the features that use it be documented as unavailable on other systems. I think this is probably not worth doing just for this, but if it turns out it would help in various other places, and increasing rather than decreasing OS compatibility—as it would here—then it might make sense to consider it. On the other hand, one benefit of GitPython is that it has very few dependencies.)

1. If we have pgrep/pkill

pgrep and pkill are not POSIX, but they are available on many more systems than ps --ppid. Furthermore, it is likely that all systems supporting ps --ppid also have pgrep and pkill, because not only are they very common, but procps (which provides the only ps with --ppid I can find, as discussed above) includes an implementation of them. Of course, it's possible (odd, but possible) for a distribution to use procps for ps but not include pgrep and pkill. Whether pkill can be used to consolidate the steps, or pgrep must be used together with something like what is already there, is a design decision that should be influenced by a decision about the best order for sending SIGKILL.

If it's acceptable to send SIGKILL to the child processes first, then instead of running ["ps", "--ppid", str(pid)] and most of what comes after it, one option is to run ["pkill", "-P", str(pid)] and then:

  • If it succeeds, immediately call os.kill(pid, signal.SIGKILL) and kill_check.set().
  • If it fails due to pkill not existing—or if it was checked first and found absent—proceed to case 2 (using ps).
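The pkill-first flow above can be sketched as follows (the helper name is hypothetical, and sending SIGKILL to the children before the parent is assumed acceptable):

```python
import os
import signal
import subprocess

def kill_parent_and_children_via_pkill(pid):
    """Hypothetical sketch of the pkill-first flow: kill the direct
    children of `pid`, then `pid` itself.

    Returns False if pkill is not installed, signalling that the
    caller should fall back to the ps-based approach (case 2).
    """
    try:
        # pkill -P matches processes by parent PID; exit status 1
        # just means no processes matched, i.e. there were no children.
        subprocess.run(["pkill", "-9", "-P", str(pid)], check=False)
    except FileNotFoundError:
        return False  # pkill absent: fall back to case 2 (ps).
    try:
        os.kill(pid, signal.SIGKILL)
    except OSError:
        pass  # The parent may already have exited.
    return True
```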

If it's not acceptable to send SIGKILL to the child processes first, or if either order is acceptable but it is desirable to share more code with the fallback case 2, then instead of running ["ps", "--ppid", str(pid)], run ["pgrep", "-P", str(pid)], then:

  • If it failed due to pgrep not existing—or if it was checked first and found absent—proceed to case 2.
  • Treat an exit status of 0 or 1 as success; 1 is when there were no children. (This seems to hold across different implementations, but I'll want to look into it further, since these tools are not POSIX or XSI.)
  • Parse the output—each line is just a PID, with no headers, no other columns—into child_pids.
  • Continue with the rest of the kill_process function as it already exists.
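The pgrep branch just described might look like this sketch (the function name is hypothetical; None signals that the fallback should run):

```python
import subprocess

def children_via_pgrep(pid):
    """Hypothetical sketch of the pgrep branch: return a list of
    direct child PIDs, or None if pgrep is absent (or misbehaved)
    and the ps fallback (case 2) should be used instead."""
    try:
        proc = subprocess.run(
            ["pgrep", "-P", str(pid)],
            stdout=subprocess.PIPE,
            text=True,
        )
    except FileNotFoundError:
        return None  # pgrep not installed: fall back to case 2.
    if proc.returncode not in (0, 1):  # 1 just means "no children".
        return None
    # Each output line is just a PID: no header, no other columns.
    return [int(line) for line in proc.stdout.split()]
```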

2. If ps supports -A and -o

This is almost every Unix-like system used today; POSIX requires these options.

Instead of running ["ps", "--ppid", str(pid)], run ["ps", "-A", "-o", "pid,ppid"], then:

  • Check that the first row is PID and PPID to safeguard against unexpectedly nonstandard ps.
  • Each remaining row should have child and parent process IDs. Filter for the rows where the parent (second column) is what we passed, and populate child_pids with the PIDs from the first column.
  • Continue with the rest of the kill_process function as it already exists.
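The parsing steps above can be sketched as a pure function over the `ps -A -o pid,ppid` output (the function name is hypothetical):

```python
def children_from_ps_table(output, parent_pid):
    """Sketch of the parsing above: filter `ps -A -o pid,ppid` output
    for rows whose second column matches `parent_pid`."""
    lines = output.strip().splitlines()
    # Guard against an unexpectedly nonstandard ps implementation.
    if lines[0].split() != ["PID", "PPID"]:
        raise ValueError("unexpected ps header: " + lines[0])
    child_pids = []
    for line in lines[1:]:
        pid, ppid = line.split()[:2]
        if int(ppid) == parent_pid:
            child_pids.append(int(pid))
    return child_pids
```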

If using this is as fast as pkill/pgrep, or slower but not by a lot, or code simplicity is considered more important than the small worsening of the rare race condition, then this could be used on all systems except Cygwin. The truth is that it is only out of fear of worsening things in weird situations on GNU/Linux systems with procps that I have even proposed case 1. This is the portable way to do it (except on Cygwin).

It may be possible to optimize this with -U to filter the real user ID to os.getuid(), or -u to filter the effective user ID to os.geteuid(), though -u seems to be an XSI extension. I don't know if this would actually make things faster. I don't know if the added complexity, though modest, would be worthwhile even if it does. When doing this, -A would not also be passed.

The reason not to simply omit -A without replacing it, which gets processes that share the caller's EUID, is that it also only shows processes with the same controlling terminal. The reason not to use -a instead is that it doesn't show processes not associated with any terminal. The reason I prefer -A to its synonym -e is that -e seems to be an XSI extension.

3. Cygwin

Running ps on Cygwin gives output that looks like:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     1641       1    1641      27868  ?         197609   Nov 14 /usr/bin/ssh-agent
     2201       1    2201      27112  ?         197609 02:35:26 /usr/bin/mintty
     2202    2201    2202      41336  pty0      197609 02:35:26 /usr/bin/bash
     2304    2202    2304      47276  pty0      197609 14:47:12 /usr/bin/ps

This can be modified by various options, but -o is not supported. (-A is not supported either, but it is not needed.)

Instead of running ["ps", "--ppid", str(pid)], we can run ["ps"], then:

  • Check the first row headers, or at least the leading PID and PPID headers that we are going to use, to safeguard against unexpectedly non-Cygwin ps or future changes to Cygwin ps.
  • Continue as in case 2 after the check, making sure to use only the first two fields.
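Assuming the Cygwin ps output shown above, the branch can be sketched like this (the function name is hypothetical; only the first two columns are used):

```python
def children_from_cygwin_ps(output, parent_pid):
    """Sketch of the Cygwin branch: parse plain `ps` output whose
    first two columns are PID and PPID, ignoring the other fields."""
    lines = output.strip().splitlines()
    # Guard against an unexpectedly non-Cygwin ps or a future change
    # to Cygwin's ps output format.
    if lines[0].split()[:2] != ["PID", "PPID"]:
        raise ValueError("unexpected ps header: " + lines[0])
    return [
        int(fields[0])
        for fields in (line.split() for line in lines[1:])
        if int(fields[1]) == parent_pid
    ]
```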

Perspective

I think the what is more important than the how, because:

Test coverage

Unlike the other kill_after_timeout (in handle_process_output), the code path where Git.execute is passed kill_after_timeout has no test coverage. It would be good to test it even if this bug is not fixed. But at the root of both is figuring out whether killing the parent process and its direct child processes is what is wanted.

Maintainability

The kill_process callback is never called on (native) Windows, where calling Git.execute with a non-None value for kill_after_timeout raises GitCommandError. But it contains what seem to be the remains of an attempt to support Windows: it passes PROC_CREATIONFLAGS (which is 0 except on Windows) when running ps, and it falls back to signal.SIGTERM when signal.SIGKILL is absent (which it is on Windows).

I discovered this whole issue because I want to remove that code, which I think could lead to future bugs, and I was looking into whether there is any reason not to. A possible reason not to is if kill_process can be easily modified to support Windows—which it could, if it is acceptable to kill either only the parent process, or the whole process tree, though whether it should is another question. Because figuring out what to do about this issue entails figuring that out too, it would open kill_process up to that improvement—dropping its vestigial Windows code if it is not going to support Windows—and possibly others.

@Byron (Member)

Byron commented Dec 4, 2023

Thanks a lot for bringing this up, and for all the detective-work that went into this incredibly thorough analysis!

From today's point of view, I think trying to kill child processes like that is utterly unacceptable as it's obviously racy. From my experience, on macOS, sending a signal to the parent process also sends it to child processes - gitoxide for instance has to manage filter programs or longer-running git invocations to serve up a pack for cloning, and sending a signal to gix (which it intercepts) leads to gix failing with an IO error as the spawned child process terminated unexpectedly, so it tries to keep reading from a closed pipe. Thus I believe trying to meddle with child processes shouldn't be needed in the first place if GitPython would send a signal to its own PID, and then ignore it. I'd expect Windows to work similarly, but also wouldn't be surprised if it does things differently - it's a separate problem though.

Thus, I truly hope there are ways to use signals properly instead of trying to be even more elaborate here.

If for some reason it's not possible to send signals to the parent process, one could at least get the IDs of spawned child processes for killing them later. The Rust standard library makes it as easy as calling kill() and I'd expect Python to have something similar.

All in all, I really hope that kill_after_timeout can be unified so the existing tests cover the only remaining implementation, and improved so as not to be racy while maybe even gaining complete platform support.


On macOS 13 (on a GitHub Actions CI runner with tmate), I created this script in a directory in $PATH, named it git-sleep, and marked it executable:

Just wanted to amend that I love this approach - reverse shells are so powerful and even though I never tried it, I absolutely will once an opportunity presents itself. Thanks so much for sharing!

@EliahKagan (Contributor, Author)

EliahKagan commented Dec 5, 2023

From my experience, on MacOS, sending a signal to the parent process also sends it to child processes - gitoxide for instance has to manage filter programs or longer-running git invocations to serve up a pack for cloning, and sending a signal to gix (which it intercepts) leads to gix failing with an IO error as the spawned child process terminated unexpectedly, so it tries to keep reading from a closed pipe.

I don't think sending a signal to a specific process should automatically cause its children to receive it. Is it possible that you were sending it in a way that really sent it to a process group? For example, pressing Ctrl+C in a terminal sends SIGINT to that terminal's foreground process group. Then all the processes in the group receive it. When a child process is created, it has the same process group as its parent, which remains the case unless/until its group is changed (usually with setpgid(2) or, to also make it the leader of a new session, with setsid(2)). A process group ID is equal to the process ID of the first process in the group, and kill(2) will send a signal to process group n if it is passed -n as its sig argument, so this is another way sending a signal to a single process can resemble sending one to a process group.

This small C program demonstrates that, with a process tree P → Q → R, when P terminates Q, R continues running. This seems to work the same on macOS as other Unix-like systems. However, none of what I am saying here is necessarily applicable to Windows.

Thus I believe trying to meddle with child processes shouldn't be needed in the first place if GitPython would send a signal to its own PID, and then ignore it.

GitPython's own PID is really the PID of whatever application is using the GitPython library, so I think anything that could cause all of that program's subprocesses to be terminated, including subprocesses that are not related to GitPython, should be strongly avoided.

  • Sending a signal to the process group of the process in which GitPython's code is running, with the idea of terminating any processes that do not ignore the signal, should likewise be avoided. For situations where nothing will be harmed by sending a signal to the process's whole process group, calling kill with 0 as the pid will do this. But no matter how it is done, although subprocesses in separate process groups would be spared, GitPython-unrelated subprocesses in the same process group (as most would be) would be terminated, as would non-descendant processes in the same process group, such as the parent, siblings, etc., of the process using GitPython.
  • Putting subprocesses in a new process group--as shells do when they are running with job control ("monitor mode") enabled--and sending a signal to that group, would avoid this problem. But that would make it so a user or separate program that kills the process group in which GitPython's code is running would fail to kill those subprocesses. I don't want to say of this much safer approach that it definitely shouldn't be done, but it's less than ideal, because those subprocesses are effectively workers of GitPython. Commands that daemonize can, of their own accord, split off into a new process group if that is appropriate, but otherwise one would expect invocations of git commands to run in the same process group as the caller.
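The "new process group" idea in the second bullet can be demonstrated with a small sketch (Unix-only; on Linux and macOS, start_new_session uses setsid(2), which also puts the child in a new process group):

```python
import os
import signal
import subprocess

# A child started with start_new_session=True becomes the leader of a
# new session and hence of a new process group, so its PGID equals its
# PID. Signaling that group does not touch our own process group.
child = subprocess.Popen(["sleep", "60"], start_new_session=True)
assert os.getpgid(child.pid) == child.pid

# kill(2) signals a whole group when given a negative pid; os.killpg
# is the direct spelling of the same operation.
os.killpg(child.pid, signal.SIGTERM)
assert child.wait() == -signal.SIGTERM
```

As the bullet notes, the trade-off is that an outside signal sent to the GitPython-using program's process group would no longer reach such a child.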

If for some reason it's not possible to send signals to the parent process, one could at least get the IDs of spawned child processes for killing them later.

If you're talking about direct children of the process using GitPython (the git processes run directly by GitPython), then in addition to being feasible (GitPython is already doing it), this carries no inherent race condition. A PID can only be reassigned once the process's entry in the system process table has been removed. That only happens when its parent has waited on it. Zombie processes--those that have died but have not yet been reaped by their parents waiting on them--are retained in the process table to ensure their PIDs cannot be reused prematurely. (If the parent dies first, the child is reparented to the init process, which waits on it.) GitPython calls wait on Popen objects representing child processes. Although I am not confident that there are no specific areas where a signal could be sent after waiting instead of before, it seems to me that the design itself is generally okay, and if any such areas do exist, they could be fixed.
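The zombie-process guarantee described above can be observed directly (a Unix-only sketch; "true" is assumed to be on PATH):

```python
import os
import subprocess
import time

# Until the parent waits on it, a dead child stays in the process
# table as a zombie, so its PID cannot be reassigned out from under us.
child = subprocess.Popen(["true"])
time.sleep(0.5)  # give the child time to exit; it has not been reaped

# Signal 0 is an existence check: it succeeds even on a zombie,
# showing the PID is still reserved for this child.
os.kill(child.pid, 0)

assert child.wait() == 0  # reaping happens here, freeing the PID
```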

This is in contrast to the handling of indirect subprocesses, where because the process in which GitPython runs did not create them, it is not the process that waits for them, and therefore it cannot ensure they have not died and been waited on, when sending signals to them by PID.

The Rust standard library makes it as easy as calling kill() and I'd expect Python to have something similar.

Yes, GitPython is using os.kill when working with PIDs in the code this issue concerns. (Nearby and elsewhere, it uses the kill and terminate methods of Popen objects when working with such objects. Like os.kill, those use kill(2) under the hood on Unix-like systems.) As far as I know, GitPython never kills processes on a Unix-like system by calling external commands; it only uses the external ps command to find out about processes. In the kill_process callback this issue concerns:

GitPython/git/cmd.py

Lines 1025 to 1030 in f0e7e41

os.kill(pid, sig)
for child_pid in child_pids:
    try:
        os.kill(child_pid, sig)
    except OSError:
        pass

Just wanted to amend that I love this approach - reverse shells are so powerful and even though I never tried it, I absolutely will once an opportunity presents itself. Thanks so much for sharing!

You're welcome! In case you're interested, the repository for that C program I mentioned has the tmate debugger set up for optional use when using the workflow_dispatch trigger. Starting the "Run experiment" workflow from the Actions tab has a checkbox to specify whether the tmate step should run or not. However, GitHub Actions might not offer you a way to use workflow_dispatch; you might have to fork the repository and run it in your fork. Of course you are not obligated to do this! Also, this may not necessarily be the most appealing or interesting way to try out reverse shells with GitHub Actions. (In particular, for GNU/Linux, an easier cloud-based way to use it is a codespace, which also doesn't require forking the repo.) But it's set up in that repository, so I figured I should mention it, given what you had said.

@Byron (Member)

Byron commented Dec 5, 2023

Thanks so much for the clarification!

Indeed, I believe I was confused by process groups, which naturally do the right thing. And even if that didn't happen, sub-processes would naturally shut down once they fail to write their output because the parent process closed the pipe.

And yes, it's quite foolish to assume a library should send signals to its parent process under the assumption it owns it.

However, it seems like process groups, assuming these are inherited automatically even by processes that aren't spawned by GitPython, could be a good way to send signals to everyone concerned in a race-free manner. Probably there are a lot of intricacies to figure out, but at least in theory, it should be one way to avoid having to 'list-and-kill' anything. Even if that makes sub-processes unkillable by others, having it as an option seems like a step forward.

Maybe there are other combinations of features that I am not seeing that will work, and will even be portable to Windows in some shape or form.

As a personal note, it's amazing how I keep forgetting about process groups and all the 'magic' shells are doing to make process control seem so natural. It's easy to conflate this with the much more naive process control that one then applies in applications or libraries.

Thanks as well for the test-program and the elaborate CI setup, it was a pleasure to take a look.

@EliahKagan (Contributor, Author)

Thanks so much for the clarification!

You're welcome! Double-checking the details of this was also a good review for me.

However, it seems like process-groups, assuming these are inherited automatically even for processes that aren't spawned by GitPython, could be a good way to send signals to everyone concerned in a race-free manner.

This may be the best available option. I expect that, in addition to avoiding a race condition on PIDs, it would also allow the code to be simpler than other approaches. They will be inherited automatically by indirect subprocesses, such as when one git command calls another. The disadvantage is that it breaks that generally correct assumption for GitPython, or, rather, for the program using GitPython. Sending a signal to that program would ordinarily be expected to send it to the GitPython-spawned git processes and their own helper subprocesses, yet that would no longer automatically happen. As noted (and as you allude to), to avoid sending signals to subprocesses that should not be terminated, we must not send a signal to our own process group. So a descendant we want to send signals to (along with its descendants) would go in a separate process group, if we are sending them to whole groups.

Probably there are a lot of intricacies to figure out, but at least in theory, it should be one way to avoid having to 'list-and-kill' anything. Even if that makes sub-processes unkillable by others, it seems like having it as option seems like a step forward.

My guess is that the main intricacy to figure out might be whether GitPython should install handlers for signals like SIGTERM that forward the signal to subprocesses in those otherwise less readily killed process groups. Of course, this would not apply to signals that can't be trapped, such as SIGKILL. I think I haven't read all the relevant signal-handling code in gitoxide--and also I don't know Rust, so my comprehension is decidedly imperfect--but what I did read of it reminded me that registering signal handlers in a library may need to be opt-in, since it could interfere with some ways of handling or ignoring signals that some applications using GitPython might be doing.

Maybe there are other combinations of features that I am not seeing that will work, and will even be portable to Windows in some shape or form.

Windows works differently in some significant ways, but it looks like this approach of putting the subprocess in its own process group is already being used on Windows:

GitPython/git/cmd.py

Lines 231 to 236 in 2b69bac

if os.name == "nt":
    # CREATE_NEW_PROCESS_GROUP is needed to allow killing it afterwards. See:
    # https://docs.python.org/3/library/subprocess.html#subprocess.Popen.send_signal
    PROC_CREATIONFLAGS = subprocess.CREATE_NO_WINDOW | subprocess.CREATE_NEW_PROCESS_GROUP
else:
    PROC_CREATIONFLAGS = 0

Based on the note in the linked documentation, however, I am unsure exactly what is going on there, because it doesn't seem to say terminate only works in that situation. That seems more related to SIGINT. I'll try to look into whether it's obsolete and whether, if not, it can be documented more clearly, as well as more broadly how closely process groups in Windows work to those in a Unix-like system. Physical signs in my copy of Windows via C/C++ suggest I have read and taken the time to understand the Processes chapter, but my recall of their details on Windows suggests otherwise. In any case, I expect that (and the API reference) to shed some light on that. However, I'm unsure of the time frame in which I will look into this, and there is some other stuff--including related to GitPython--that I would likely do first.

Thanks as well for the test-program

You're welcome! Thanks for taking a look.

and the elaborate CI setup

This may not be what you're referring to, and I don't know if you had looked at it during the time I had some really overcomplicated and bad YAML code for trying to customize a step title based on the event trigger, but I have fortunately fixed that. :)

@Byron (Member)

Byron commented Dec 8, 2023

--but what I did read of it reminded me that registering signal handlers in a library may need to be opt-in, since it could interfere with some ways of handling or ignoring signals that some applications using GitPython might be doing.

Indeed, signal handling support is opt-in in gitoxide and needs to be feature-toggled on, i.e. chosen at compile-time.

For GitPython, it would probably be the same, but if so, the existing code-path would have to remain as an alternative. Then it's a question of how many people opt in to the new behaviour; even if it's better, they might simply not try it.

It's also still a bit strange to imagine what would happen if the parent process is terminated, even though it has a child process group that it manages. I would expect something to happen to that group, too, automatically, to honor the parent-child relationship.

Physical signs in my copy of Windows via C/C++ suggest I have read and taken the time to understand the Processes chapter, but my recall of their details on Windows suggests otherwise.

😁

I think I'd need to write my own cross-platform shell to finally understand how all that is really working, and that's not going to happen anytime soon 😅.

However, I'm unsure of the time frame in which I will look into this, and there is some other stuff--including related to GitPython--that I would likely do first.

Yes, I agree that this topic here is sufficiently complex to better find lower-hanging and maybe even more valuable topics to work on. I will surely be learning a lot once you do tackle this issue, so I am looking forward to when that happens.

@EliahKagan
Copy link
Contributor Author

EliahKagan commented Dec 8, 2023

It's also still a bit strange to imagine what would happen if the parent-process is terminated, even though it should have a child-process group that it manages. I would expect something to happen with that one, too, automatically, to honor the parent-child relationship.

I wouldn't expect anything to happen automatically. Unless you mean you would expect the parent process to install a signal handler to do something on a best-effort basis.

Although this is a contrived example, this sort of thing can be useful:

ek@Kip:~$ sh -mc 'python3.11 -c "import time; time.sleep(1000)" & ps -o pid,ppid,pgid,cmd'
  PID  PPID  PGID CMD
 3204 17907  3204 sh -mc python3.11 -c "import time; time.sleep(1000)" & ps -o pid,ppid,pgid,cmd
 3205  3204  3205 python3.11 -c import time; time.sleep(1000)
 3206  3204  3206 ps -o pid,ppid,pgid,cmd
17907 17906 17907 -bash
ek@Kip:~$ ps -o pid,ppid,pgid,cmd
  PID  PPID  PGID CMD
 3205     1  3205 python3.11 -c import time; time.sleep(1000)
 3211 17907  3211 ps -o pid,ppid,pgid,cmd
17907 17906 17907 -bash

Whether or not a child process is in a new process group, I don't think it would be a generally desirable default for the system to send a signal to it automatically when its parent terminates. This at least seems to me to be at odds with the Unix design decisions made in how signals work with parent and child processes. The way the parent-child relationship is reflected in signal handling is instead the other way around: when a child process exits (including due to receiving a signal), is stopped, or is resumed, its parent is sent SIGCHLD. (This is one of the few signals that is ignored by default.)
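That direction of notification can be demonstrated directly on a POSIX system. This is a minimal sketch of my own (Unix-only; the sleep durations are arbitrary): the parent installs a SIGCHLD handler and reaps the child when the kernel notifies it, while nothing is ever sent to the child about the parent.

```python
import os
import signal
import subprocess
import time

reaped = []

def on_sigchld(signum, frame):
    # The kernel sends SIGCHLD to the *parent* when a child exits, is
    # stopped, or is resumed; reap any exited children non-blockingly.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # children exist, but none have changed state
        reaped.append(pid)

signal.signal(signal.SIGCHLD, on_sigchld)
child = subprocess.Popen(["sleep", "0.1"])
time.sleep(0.5)  # the handler runs asynchronously when the child exits
print(child.pid, reaped)
```

(SIGCHLD being ignored by default is why a parent that doesn't care can simply do nothing.)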

I think I'd need to write my own cross-platform shell to finally understand how all that is really working, and that's not going to happen anytime soon 😅.

I think that, in practice, if by cross-platform you mean "not just Unix," then cross-platform shells abstract away from the way processes and signals work, for at least some of the operating systems they target. For Unix-style shells, the high-level abstractions shells provide, like pipes and jobs, tend to correspond roughly to the abstractions provided by the kernel. Native ports of such shells to Windows work around the gaps one way or another. Often the ports are not native, but instead rely on a separate translation layer, like cygwin1.dll or msys-2.0.dll.

It is true, though, that writing a shell should entail engaging with any of those details that are not handled by a translation layer, on whatever platforms are targeted. Writing a fully functional shell is, I believe, quite difficult. Sometimes people write very basic shells. A less time-consuming option, if one is interested, might be to examine the code of a shell that is production-quality but far simpler than most shells, such as the Almquist Shell implementations dash or busybox ash. Of course, these are not cross-platform shells. I've seen that a few people have written, or partly written, POSIX shells in Rust, which interests me--as a way of learning about Rust--but I have not looked into them.

Yes, I agree that this topic here is sufficiently complex to better find lower-hanging and maybe even more valuable topics to work on.

I think the greatest complexity pertains to avoiding a breaking change, predicting the effect on existing use, and dealing with unanticipated breakages (that might be a result of an unintended breaking change and thus a bug, or a result of ill-founded assumptions baked into some code that uses GitPython, or a combination of the two).

It might be worthwhile to insert an additional caveat into the part of the Git.execute docstring that documents kill_after_timeout, about how it is not guaranteed to kill indirect subprocesses, or varies across operating systems in how well it does so, etc. But I have not thus far thought of a clear and succinct way of expressing this that also decisively avoids both (a) making or appearing to make more promises than the docstring already makes, and (b) presenting details that should not be relied on because they could change in a patch version and that would also impose a greater documentation maintenance burden. If I think of something for this, maybe I'll open a PR for such a docstring change.
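One possible shape for such a caveat--purely a sketch of wording against a hypothetical stand-in function, not proposed final text and not GitPython's actual docstring:

```python
def execute(command, kill_after_timeout=None):
    """Illustrative stand-in for Git.execute, showing only the caveat.

    :param kill_after_timeout:
        Seconds after which the command is killed if it has not finished.
        This is best-effort only: direct children of the command are also
        killed, but deeper descendants may survive, and how reliably
        children are enumerated varies by operating system. Not supported
        on Windows.
    """
```

The aim is to hedge without enumerating OS-specific details that could change in a patch version.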

In addition, since really improving this area of the code may wait a while, I may try to figure out a way to make clearer that the code in kill_process must not be used on Windows. No code path currently causes that to happen (aside from extreme acts like monkey-patching os.name, which GitPython can't be responsible for anyway). But I am worried that, as it currently stands, such a bug may eventually arise inadvertently as a result of future changes. This is in view of the appearance that it does work on Windows: special-casing for Windows in process creation flags and, even more explicitly, falling back to signal.SIGTERM when signal.SIGKILL is absent as it is on Windows, with a comment saying this is being done for Windows.

Running ps, as done there, and attempting to kill processes whose PIDs appear in the first column of the output, is dangerous on Windows, for multiple reasons. One reason is that, while no ps command usually exists on Windows, when a ps command does exist on a Windows system, I believe it is in practice often provided by a Cygwin-like environment (Cygwin, MSYS2, Git Bash, etc.). Those list both PIDs and WINPIDs, with their own PIDs in the first column by default, and those PIDs are not actual Windows PIDs, so killing them may kill some totally different process, no race condition required. Fortunately, passing --ppid produces an error, but if the code were changed not to use --ppid and some code path called it on Windows (changes that I would expect to end up being done together), this would send SIGTERM to arbitrary processes on Windows.
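For reference, direct children can be enumerated with only POSIX-specified ps options (-A and -o, with "=" to suppress headers), avoiding the procps-only --ppid. This is a sketch of my own, not GitPython's code, and it deliberately refuses to run on Windows for the reasons above:

```python
import os
import subprocess

def child_pids(ppid: int) -> list[int]:
    """Enumerate direct children of ppid using only POSIX ps options.

    -A and -o (with "=" to suppress headers) are POSIX-specified, unlike
    procps's --ppid. This is still unsafe on Windows, where a ps found on
    PATH is often Cygwin/MSYS and prints translation-layer PIDs rather
    than Windows PIDs.
    """
    if os.name != "posix":
        raise AssertionError("child_pids must not be used on Windows")
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=,ppid="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [
        int(pid)
        for line in out.splitlines()
        for pid, ppid_col in [line.split()]
        if int(ppid_col) == ppid
    ]

# Example: a freshly spawned child should appear among our children.
child = subprocess.Popen(["sleep", "5"])
print(child.pid in child_pids(os.getpid()))
child.kill()
child.wait()
```

Whether busybox ps accepts the "=" header-suppression form should still be verified on the targets that matter; the option letters themselves are in POSIX.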

This would still not be a problem when actually running in a Cygwin build of the Python interpreter, because facilities such as os.kill use the kill "system call" provided by cygwin1.dll. But one may add such environments' bin directories to one's PATH for convenience; for example, ps.exe on my Windows system is C:\msys64\usr\bin\ps.exe, and its PIDs are concocted by the msys-2.0.dll translation layer.

I will surely be learning a lot once you do tackle this issue, so I am looking forward to when that happens.

I don't promise definitely to do so. For one thing, perhaps someone else will come along and contribute the improvement first! If not, however, then I hope to do it eventually.

@Byron
Copy link
Member

Byron commented Dec 9, 2023

Thanks so much for sharing your insights on signals and shells. As always, way ahead of me :)!

If I think of something for this, maybe I'll open a PR for such a docstring change.

This feels like a good first step towards enabling changes to that machinery: filling in documentation that better explains the current implementation along with the shortcomings that, one way or another, people might be relying on. Of course, it only helps if people read it and check its applicability to their own usage, and even then it will be unclear whether they unknowingly rely on side effects. No matter how I think about it, avoiding accidental breakage or surprises when touching this topic seems like a gamble for which one would wish to have a beta track of sorts, or an opt-in to possible future features, in the hope that people use such a track and provide feedback. I'll let that topic rest here :D.

I may try to figure out a way to make clearer that the code in kill_process must not be used on Windows.

This sounds like it's definitely valuable!

EliahKagan added a commit to EliahKagan/GitPython that referenced this issue Dec 10, 2023
This changes the code in Git.execute's local kill_process function,
which it uses as the timed callback for kill_after_timeout, to
remove code that is unnecessary because kill_process doesn't
support Windows, and to avoid giving the false impression that its
code could be used unmodified on Windows without serious problems.

- Raise AssertionError explicitly if it is called on Windows. This
  is done with "raise" rather than "assert" so its behavior doesn't
  vary depending on "-O".

- Don't pass process creation flags, because they were 0 except on
  Windows.

- Don't fall back to SIGTERM if Python's signal module doesn't know
  about SIGKILL. This was specifically for Windows which has no
  SIGKILL.

See gitpython-developers#1756 for discussion.
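Taken together, the commit's changes amount to something like the following sketch of the local kill_process callback. The names and structure follow the commit message's description; this is not the exact GitPython code, and it intentionally keeps the non-portable --ppid that this issue is about:

```python
import os
import signal
from subprocess import PIPE, Popen

def kill_process(pid: int) -> None:
    """Timed callback for kill_after_timeout; POSIX-only by design."""
    if os.name == "nt":
        # "raise", not "assert", so running with -O cannot strip the check.
        raise AssertionError("Bug: kill_process is not designed to work on Windows")
    # Enumerate direct children before killing the parent. --ppid is a
    # procps extension: exactly the portability problem in this issue.
    # No creationflags are passed; they were 0 except on Windows.
    ps = Popen(["ps", "--ppid", str(pid)], stdout=PIPE)
    child_pids = [
        int(fields[0])
        for line in ps.stdout
        if (fields := line.split()) and fields[0].isdigit()
    ]
    ps.wait()
    try:
        # SIGKILL always exists on POSIX, so no SIGTERM fallback is needed.
        os.kill(pid, signal.SIGKILL)
        for child_pid in child_pids:
            try:
                os.kill(child_pid, signal.SIGKILL)
            except OSError:
                pass
    except OSError:
        pass  # The parent was already gone; leave the children alone.
```

This sketch still inherits the races and the grandchild blind spot discussed above; it only removes the false impression of Windows support.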
@EliahKagan
Copy link
Contributor Author

I've proposed both such changes in #1761. They could be refined further if need be.

Separately, I've noticed another bug related to when kill_process is used at all, which I don't think reasonably falls under this issue, so I've opened #1762 for it.

@EliahKagan
Copy link
Contributor Author

Possibly related to the limitations noted here and in #1762, it occurs to me to ask if using kill_after_timeout should be expected to fix (or, well, work around) problems like #1642.
