Defunct wal-redo processes #2761
Looking into this.
My best lead on this is the following scenario:
Since we keep each tenant's wal-redo process around, Step 7 will cause a SIGPIPE / EPIPE in the wal-redo process. It will exit. But the pageserver doesn't wait() on it, so it lingers as a defunct (zombie) process.

Validating the hypothesis

TBD. Do we need to? Seems pretty obvious to me.

Fixing the problem

Approach 1 (quick fix)

We can add a …

Approach 2 (more robust, long-term maintainable)

The alternative I have in mind: the Handle pattern. Usage: …
Implementation sketch:
If we think threads are too heavyweight, I guess I could have a look at ways to do it in async Rust. Opinions, @hlinnaka? (He wrote the current implementation.)
Some things to consider:
Approach 3:
I did a quick experiment with the … Here's what the linker error looks like:
Before spending more time on reverse-mapping the linker error to the drop site, I wanted to make sure that we would be ready to take action once we have the drop site. Here's what that looks like in the …
What we see there is that the dropper was the GC thread. We could now modify the GC thread to do explicit teardown, like so:
But I think this breaks encapsulation, because we'd have to do that in every place that might drop a Tenant/Timeline/WalRedoManager. My take-away from this is that … So, we shouldn't pursue Approach 3, even if the linker errors were more useful by pointing to the drop site.
+1 for managing with a tokio task. I remember starting to look, at the offsite, at the differences between the current implementation and a tokio-managed process, along the lines of Approach 2 but with a tokio task instead of a separate thread. I didn't understand the signal-safe fd closing then, and I still don't. While adding the handle and a channel will add some memcpy overhead, it will play nicely into making the processes poolable in the future.
… path

If we're not calling kill() before dropping the PostgresRedoProcess, we currently leak it. That's most likely the root cause for #2761.

This patch

1. adds an error log message for that case, and
2. adds error handling for all errors on the kill() path.

If we're a `testing` build, we panic. Otherwise, we log an error and leak the process.

The error handling changes (2) are necessary to conclusively state that the root cause for #2761 is indeed (1). If we didn't have them, the root cause could be missing error handling instead.

To make the log messages useful, I've added tracing::instrument attributes that log the tenant_id and PID. That helps map the PID of `defunct` processes back to pageserver log messages. Note that a defunct process's `/proc/$PID/` directory isn't very useful; we are left with little more than its PID.

Once we have validated the root cause, we'll find a fix, but that's still an ongoing discussion.

refs #2761
closes #2769
The change #2769 that adds more logging got deployed. Analyzing logs on one of the affected pageserver hosts:

```console
admin@zenith-us-stage-ps-4:~$ journalctl --since 08:00 --until 09:00 -u pageserver --no-pager | grep "dropping PostgresRedoProcess that wasn't killed, likely causing defunct postgres process" | perl -n -e '/pid=(\d+)/ && print $1 . "\n" ' | sort > /tmp/candidates.txt
admin@zenith-us-stage-ps-4:~$ ps -e -o pid,stime,args | grep postgres | grep defunct | grep ' 08:' | awk '{print $1}' | sort > /tmp/defunct.txt
admin@zenith-us-stage-ps-4:~$ diff -u /tmp/defunct.txt /tmp/candidates.txt
admin@zenith-us-stage-ps-4:~$ echo $?
0
admin@zenith-us-stage-ps-4:~$ wc -l /tmp/defunct.txt
250 /tmp/defunct.txt
```

What this means: all 250 defunct postgres processes that were started between 08:00 and 09:00 were caused by the problem described above. I'm optimistic that we have found the root cause.
This change wraps the std::process::Child that we spawn for WAL redo into a type that ensures that we try to SIGKILL + waitpid() on it.

If there is no explicit call to kill_and_wait(), the Drop implementation will spawn a task that does it in the BACKGROUND_RUNTIME. That's an ugly hack, but I think it's better than doing kill+wait synchronously from Drop, since the general assumption in the Rust ecosystem is that Drop doesn't block. Especially since the drop sites can be _any_ place that drops the last Arc<PostgresRedoManager>, e.g., compaction or GC.

The benefit of having the new type over just adding a Drop impl to PostgresRedoProcess is that we can construct it earlier than the full PostgresRedoProcess in PostgresRedoProcess::launch(). That allows us to correctly kill+wait the child if there is an error in PostgresRedoProcess::launch() after spawning it.

I also took a stab at a regression test. I manually verified that it fails before the fix to walredo.rs.

fixes #2761
closes #2776
Currently on prod: