zdaemon transcript thread dies if disk is full #1

Open
mgedmin opened this issue Mar 14, 2013 · 4 comments

@mgedmin
Member

mgedmin commented Mar 14, 2013

Two days ago I had a bit of a problem with a full disk. Today my entire web server hung and stopped processing requests.

Long story short: when zdaemon's Transcript thread gets an IOError while writing to a log file, it just dies. zdaemon itself and the child program it is managing remain running, but the child's stdout/stderr now point to a pipe that is still open at both ends and that no process ever reads from. Things seem to run fine for a couple of days, then the kernel's pipe buffer fills up and the child blocks in write() while holding the logging mutex. Fun for the whole family, conveniently delayed from the root cause you think you already fixed.
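
For context, this is the generic failure shape of an uncaught exception silently killing a daemon thread. A stand-alone toy sketch (not zdaemon code) reproduces it: a reader thread that dies on an I/O error, after which the writer keeps going until the kernel pipe buffer fills:

import os
import threading

r, w = os.pipe()

def transcript_thread():
    # Stands in for zdaemon's Transcript thread: read from the pipe and
    # "log" it; the simulated disk-full error kills the thread silently.
    os.read(r, 8192)
    raise IOError("No space left on device")

t = threading.Thread(target=transcript_thread)
t.daemon = True
t.start()

# Stands in for the managed child: keep writing to its stdout pipe.
# Writes succeed until the kernel pipe buffer (about 64 KiB on Linux) is
# full, then os.write() blocks forever -- the delayed hang described above.
while True:
    os.write(w, b"x" * 8192)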

@tseaver
Member

tseaver commented Apr 23, 2013

Would changing the code in copy()[1] to use the lock as a context manager make the pain go away? E.g.:

def copy(self):
    # Runs in the Transcript thread: pump the child's stdout/stderr pipe
    # into the transcript file (os and select are module-level imports).
    lock = self.lock
    i = [self.read_from]
    o = e = []
    while 1:
        # Block until the child has written something to the pipe.
        ii, oo, ee = select.select(i, o, e)
        # The with-statement releases the lock even if self.write() raises.
        with lock:
            for fd in ii:
                self.write(os.read(fd, 8192))

[1] https://github.com/zopefoundation/zdaemon/blob/master/src/zdaemon/zdrun.py#L602

@tseaver
Member

tseaver commented Apr 23, 2013

@mgedmin
Member Author

mgedmin commented Apr 24, 2013

It's a good change, but it wouldn't help with this issue. The deadlock was not caused by waiting on a lock; it was caused by the child process blocking on a write to a full pipe with no readers.

What might help would be a try/finally (or a with block) that closes the self.read_from pipe. Then the child process would get OSError(errno.EPIPE) (or SIGPIPE) on its next write instead of blocking at some undetermined later time.

To actually make zdaemon survive a temporary disk-full condition we'd have to add a loop around self.write that retries until the write succeeds. I'm not sure that's even possible using Python file objects (do we get back the number of bytes successfully written by the underlying syscall?).
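
A rough sketch of both ideas, assuming the copy() loop from zdrun.py and writing via os.write() on the transcript file's descriptor (which does return the byte count from the underlying syscall); _retry_write and self.file are assumptions here, not the actual zdaemon names:

import errno
import os
import select
import time

def _retry_write(fd, data, delay=1.0):
    # Hypothetical helper: keep writing until every byte is out.
    # os.write() returns how many bytes the write() syscall accepted,
    # so we can resume after a transient ENOSPC (disk full).
    while data:
        try:
            n = os.write(fd, data)
            data = data[n:]
        except OSError as e:
            if e.errno != errno.ENOSPC:
                raise
            time.sleep(delay)  # wait for some space to be freed, then retry

def copy(self):
    try:
        while 1:
            ii, _, _ = select.select([self.read_from], [], [])
            with self.lock:
                for fd in ii:
                    _retry_write(self.file.fileno(), os.read(fd, 8192))
    finally:
        # If the thread dies anyway, close the read end of the pipe so the
        # child gets EPIPE/SIGPIPE on its next write instead of blocking
        # once the kernel pipe buffer fills up.
        os.close(self.read_from)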

mgedmin added a commit that referenced this issue Apr 15, 2015
This could prevent deadlocks like #1, although not that specific one.
mgedmin added a commit that referenced this issue Apr 15, 2015
If file permissions change, reopening will fail, and we'll end up
getting an exception, a closed self.file, and a background thread that
will die next time it tries to write to the now-closed file, leaving the
child process writing to a pipe until the pipe buffer fills up at some
point in the future (see also bug #1).
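
One way to avoid ending up with a closed self.file is to open the replacement file first and only swap it in on success; a minimal sketch, assuming the transcript object keeps its filename and lock around (not necessarily what the commit actually does):

def reopen(self):
    # Open the new transcript file before touching the old one, so a
    # failed open (e.g. changed permissions) raises here and leaves
    # self.file intact and usable instead of closed.
    new_file = open(self.filename, "ab")
    with self.lock:
        old_file, self.file = self.file, new_file
    old_file.close()
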
mgedmin added a commit that referenced this issue Apr 15, 2015
Test plan:

- mount a small tmpfs with 'sudo mount -t tmpfs none /mnt -o size=4096'

- create a zdaemon config file
    <runner>
      program yes
      transcript /mnt/transcript.log
    </runner>

- run zdaemon -C conf start

- wait a few milliseconds for /mnt to fill up

- pgrep yes

If the 'yes' program is still running, we have a deadlock (strace it and
you'll find it blocked on write()).  This is the situation before this
patch, as described in bug #1.

If the 'yes' program is dead, the deadlock is fixed.

- run zdaemon -C conf status

The daemon manager should be stopped (it's not functional without the
dead transcript thread).

Automating this test is left as an exercise for the reader.  :(
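
A rough automation of the manual plan above, assuming it is run as root (for the tmpfs mount), that zdaemon is on the PATH, and that no other 'yes' processes are running; paths and timings are guesses:

import subprocess
import textwrap
import time

MNT = "/mnt"
CONF = "/tmp/zdaemon-test.conf"

# Small tmpfs so the transcript fills it almost immediately (needs root).
subprocess.check_call(["mount", "-t", "tmpfs", "none", MNT, "-o", "size=4096"])
try:
    with open(CONF, "w") as f:
        f.write(textwrap.dedent("""\
            <runner>
              program yes
              transcript {mnt}/transcript.log
            </runner>
        """).format(mnt=MNT))
    subprocess.check_call(["zdaemon", "-C", CONF, "start"])
    time.sleep(2)  # give /mnt time to fill up
    # pgrep exits 0 if a 'yes' process is still running, i.e. the deadlock.
    deadlocked = subprocess.call(["pgrep", "-x", "yes"]) == 0
    print("DEADLOCK: child still running" if deadlocked else "ok: child died")
finally:
    # Note: in the deadlocked case the stop command may itself hang.
    subprocess.call(["zdaemon", "-C", CONF, "stop"])
    subprocess.call(["umount", MNT])
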
@mgedmin
Member Author

mgedmin commented Apr 15, 2015

I'm not 100% happy with this: the death of the transcript thread ought to produce a log message somewhere (ha! as if that's possible if the disk is full, but anyway, there might be other causes) and it should cause the daemon manager to die ASAP, instead of waiting for the child process to try and log something to stdout. But I'm out of round tuits for today: I came to fix #8, and ended up spending all day on zdaemon.

@mgedmin mgedmin closed this as completed Apr 15, 2015
@mgedmin mgedmin reopened this Apr 15, 2015