Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IOError: (9, 'Bad file descriptor') gunicorn-19.5.0 #1259

Closed
papachoco opened this issue May 12, 2016 · 18 comments · Fixed by #1263
Closed

IOError: (9, 'Bad file descriptor') gunicorn-19.5.0 #1259

papachoco opened this issue May 12, 2016 · 18 comments · Fixed by #1263

Comments

@papachoco
Copy link

Hi,

There is an exception that does not happen in 19.4.5 when trying to create a unix socket

In my application I am trying to create a two socket.

[('', 8081), '/Users/csanchez/var/dataserver.sock']

The first socket is created correctly and the second does not. The offending line is at 114. The reason may be because the fd is None

     buildout/eggs/gunicorn-19.5.0-py2.7.egg/gunicorn/sock.py(112)__init__()
        110                     raise ValueError("%r is not a socket" % addr)
        111         self.parent = os.getpid()
    --> 112         super(UnixSocket, self).__init__(addr, conf, log, fd=fd)
        113         # each arbiter grabs a shared lock on the unix socket.
        114         fcntl.lockf(self.sock, fcntl.LOCK_SH | fcntl.LOCK_NB)

fd is None because the call to create the sockets never sets it.

/Users/csanchez/Documents/workspace/nti.dataserver-buildout/eggs/gunicorn-19.5.0-py2.7.egg/gunicorn/sock.py(227)create_sockets()
225 for i in range(5):
226 try:
--> 227 sock = sock_type(addr, conf, log)

...

In gunicorn-19.4.5 the call fcntl.lockf(self.sock, fcntl.LOCK_SH | fcntl.LOCK_NB) never happens

Thanks,

Carlos
trace.txt

@benoitc
Copy link
Owner

benoitc commented May 12, 2016

Can you paste the full error log you get?

@benoitc
Copy link
Owner

benoitc commented May 12, 2016

also what does mean:

In my application I am trying to create a two socket.

[('', 8081), '/Users/csanchez/var/dataserver.sock']

Do you try to embed gunicorn?

@benoitc
Copy link
Owner

benoitc commented May 12, 2016

@papachoco forget my second comment :) On which OS do you get the error right now?

@jamadden
Copy link
Collaborator

The error is being seen on OS X, El Capitan, I believe.

@jamadden
Copy link
Collaborator

I suspect you've already done this, but I can also reproduce the error on OS X. There is no specific error message (which seems wrong), instead you just get this:

$ cat /tmp/gun.py
bind = [":8081", "unix://tmp/dataserver.sock"]
$ gunicorn -c /tmp/gun.py myapp:myapp
[2016-05-12 06:13:34 -0500] [83702] [INFO] Starting gunicorn 19.5.0
[2016-05-12 06:13:34 -0500] [83702] [ERROR] Retrying in 1 second.
[2016-05-12 06:13:35 -0500] [83702] [ERROR] Retrying in 1 second.
[2016-05-12 06:13:36 -0500] [83702] [ERROR] Retrying in 1 second.
[2016-05-12 06:13:37 -0500] [83702] [ERROR] Retrying in 1 second.
[2016-05-12 06:13:38 -0500] [83702] [ERROR] Retrying in 1 second.

If I edit gunicorn/sock.py to change log.error("Retrying...") to log.exception, we get much more useful output:

[2016-05-12 06:17:08 -0500] [83832] [ERROR] Retrying in 1 second.
Traceback (most recent call last):
  File "/.../lib//python3.4/site-packages/gunicorn/sock.py", line 227, in create_sockets
    sock = sock_type(addr, conf, log)
  File "/.../lib/python3.4/site-packages/gunicorn/sock.py", line 114, in __init__
    fcntl.lockf(self.sock, fcntl.LOCK_SH | fcntl.LOCK_NB)
OSError: [Errno 9] Bad file descriptor

@benoitc
Copy link
Owner

benoitc commented May 12, 2016

@jamadden actually i'm only reproducing it on osx :) this is why I ask. So I guess the lock needs to be handled diffrently on OSX but not sure how yet.

@jamadden
Copy link
Collaborator

jamadden commented May 12, 2016

As far as I can determine experimentally, OS X just doesn't support locking on unix domain sockets.

The call to fcntl.lockf boils down to the system call fcntl(fd, F_SETLK, ...) Using dtruss, I can see gunicorn make that call:

fcntl(0x7, 0x8, 0x7FFF55593BA0)      = -1 Err#9

0x7 is our fileno, 0x8 is F_SETLK, and errno 9 is EBADF.

Here's a minimal C program that demonstrates the same failure on OS X:

#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    int fd = socket(PF_UNIX, SOCK_STREAM, 0);
    printf("Opened fd %d\n", fd);

    struct sockaddr_un addr = { .sun_path = "/tmp/sock.sock"};
    addr.sun_len = strlen(addr.sun_path); // OS X Only!
    addr.sun_family = PF_UNIX;

    int bound = bind(fd, &addr, sizeof(addr));
    printf("Bound %d\n", bound);

    struct flock lock = {0, 0, getpid(), F_RDLCK, 0};
    int lock_res = fcntl(fd, F_SETLK, &lock);
    printf("Lock %d\n", lock_res);

    if (lock_res < 0)
        perror("Failed to lock:");

    return 0;
}

When run, this outputs:

Opened fd 3
Bound 0
Lock -1
Failed to lock:: Bad file descriptor

I'm looking for where this is documented. The closest I've got so far is the bottom of the fcntl man page:

[EBADF]     Fildes is not a valid open file descriptor.
            The argument cmd is F_SETLK or F_SETLKW, the type of lock (l_type)
            is a shared lock (F_RDLCK), and fildes is not a valid file descriptor open for reading.

@jamadden
Copy link
Collaborator

jamadden commented May 12, 2016

The POSIX specification for fcntl says:

Record locking shall be supported for regular files, and may be supported for other files.

A socket is not a regular file, so locking isn't required to be supported on them.

@jamadden
Copy link
Collaborator

jamadden commented May 12, 2016

Changing fcntl.lockf to fcntl.flock (which the man page recommends because of its better close semantics) results in a more clear error message:

[2016-05-12 06:59:42 -0500] [85492] [ERROR] Retrying in 1 second.
Traceback (most recent call last):
  File "//lib/python3.4/site-packages/gunicorn/sock.py", line 229, in create_sockets
    sock = sock_type(addr, conf, log)
  File "//lib/python3.4/site-packages/gunicorn/sock.py", line 116, in __init__
    fcntl.flock(self.sock, fcntl.LOCK_SH | fcntl.LOCK_NB)
OSError: [Errno 45] Operation not supported

man flock says:

 [ENOTSUP]          The referenced descriptor is not of the correct type.

So I guess we just can't do this on OS X with the socket descriptor. We could potentially create a regular file next to the socket to lock on...

@jamadden
Copy link
Collaborator

I also get the same failure on FreeBSD, so probably any member of the BSD family is affected.

@jamadden
Copy link
Collaborator

I also get the same failure on Solaris. I think locking on sockets is not very portable.

@benoitc
Copy link
Owner

benoitc commented May 13, 2016

@jamadden right. I will have a closer look today on that topic. We need a more portable way to fix #1185.

The code responsible of unlinking is the following:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/sock.py#L136

The thing I don't understand right now is why the file is deleted while another process is using it. Reading unix(7) [1]:

Binding to a socket with a filename creates a socket in the file system that must be deleted by the caller when it is no longer needed (using unlink(2)). The usual UNIX close-behind semantics apply; the socket can be unlinked at any time and will be finally removed from the file system when the last reference to it is closed.

and the related unlink(2) [2] manual:

unlink() deletes a name from the file system. If that name was the last link to a file and no processes have the file open the file is deleted and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have the file open the file will remain in existence until the last file descriptor referring to it is closed.

So as I read it the file shouldn't be deleted if the new process created via USR2 is using it. Which led to the question if there is a bug/race condition either inside Gunicorn when racing to open a new process and launching new accepting workers. Or a bug in Python somewhere.

[1] http://linux.die.net/man/7/unix
[2] http://linux.die.net/man/2/unlink

@jamadden
Copy link
Collaborator

jamadden commented May 13, 2016

The file isn't deleted, but the name of the file is. That is, the data (if there were any) still exists on disk, and can be read/written with open file descriptors, but trying to open or stat that name again will fail until someone re-creates it. Here's a demo that sets up this scenario:

# lnk.py
import os
# get a filename
tname = os.tmpnam()

# open and create it
tfile = open(tname, 'w')

rpipe, wpipe = os.pipe()

pid = os.fork()
if pid != 0:
    # parent
    os.read(rpipe, 1)
    try:
        print(os.stat(tname))
    finally:
        os.wait()
else:
    # child
    # remove the file
    # *that we have open*
    os.unlink(tname)
    # alert the parent
    os.write(wpipe, b'.')
    # die

After the second process unlinks the name, when the first process tries to stat (or do any operation on the name of the file), it fails:

$ python2.7 lnk.py
Traceback (most recent call last):
  File "/tmp/lnk.py", line 15, in <module>
    print(os.stat(tname))
OSError: [Errno 2] No such file or directory: '/var/tmp/tmp.0.txHLl2'

This is why functions that take a file descriptor, like fstat exist.

So for #1185, we have a race condition:

  1. First arbiter starts (socket file created)
  2. New arbiter starts via USR2 (all is well)
  3. First arbiter exits via TERM (socket file is removed)

After 3., the file name is gone. The new arbiter is still connected to the same file as the first arbiter was, but anyone subsequently trying to bind to that file name will get a new file created with the same name (if they have permissions) or an error (if they don't), and connect will fail because the file name doesn't exist (IIRC how unix domain sockets work).

@jamadden
Copy link
Collaborator

Or, here's the missing (implied) text from the manual:

unlink() deletes a name from the file system. After deleting the name, If that name was the last link to a file and no processes have the file open the file is deleted and the space it was using is made available for reuse.

@benoitc
Copy link
Owner

benoitc commented May 13, 2016

@jamadden indeed. In fact, the thing I forgot , is the case where an external application try to reopen this socket file via its name which doesn't exist anymore....

so the simplest solution fix that comes to my mind for now would be the following:

def lock_unix_socket(Path):
    """ create a file name `Path.LOCK` """"

def unlock_unix_socket(Path) ->
   """ delete a file name `Path.LOCK` """

we could then using fcntl.flock over this "normal file". The value retrieved from the file give us the number of processes that use it.

The workflow will be the following:

  1. When the socket is created (no fd passed), a lock file is created, if one exists it is simply removed. 1 is the reference count written on that file.
  2. On USR2, increment the reference count in the "lock" file
  3. When stopping a process, decrement the reference count from 1 in the "lock" file.
  4. If the reference count is 0, then unlink the path.

Other way would be retrieving from the system the number of processes actually using the file descriptor. This solution would be the best imo but I'm not sure how it's feasible cross-platform.

Thoughts?

@jamadden
Copy link
Collaborator

Other way would be retrieving from the system the number of processes actually using the file descriptor. This solution would be the best imo but I'm not sure how it's feasible cross-platform.

gevent does that in its test cases, and I can tell you that it's not portable :) The psutil package helps, but it's basically unix-only (which would be OK here).

That said, I'm not sure the refcount is needed. Wouldn't the same basic read/exclusive lock strategy work for the lock file, just like we were trying to do on the socket? I mean, we're going to have to take exclusive locks to increment/decrement the count safely as it is.

I also think it's safest/simplest to not ever try to unlink the lock file, just use the locks. Each process opens(O_CREAT) the file, sets the fd to not-inheritable, and takes a read lock; that should help eliminate race conditions. But people may object to the extra file lying around.

Of course, people may also object to the extra file just existing at all; I wonder if they'll want an option to control where it lives?

@benoitc
Copy link
Owner

benoitc commented May 14, 2016

we can create a new config setting for it. Right now i'm asking myself if wee need to create a new lockfile or improve the pidfile handling ?

benoitc added a commit that referenced this issue May 14, 2016
This change add proper file locking to gunicorn. By default "gunicorn.lock" is created in the temporary directory when a unix socket is bound.  In case someone want to fix the lock file path or use multiple gunicorn instance the "--lock-file" setting can be used to set the path of this file.

fix #1259
benoitc added a commit that referenced this issue May 14, 2016
This change add proper file locking to gunicorn. By default "gunicorn.lock" is created in the temporary directory when a unix socket is bound.  In case someone want to fix the lock file path or use multiple gunicorn instance the "--lock-file" setting can be used to set the path of this file.

fix #1259
benoitc added a commit that referenced this issue May 14, 2016
This change add proper file locking to gunicorn. By default "gunicorn.lock" is created in the temporary directory when a unix socket is bound.  In case someone want to fix the lock file path or use multiple gunicorn instance the "--lock-file" setting can be used to set the path of this file.

fix #1259
@benoitc
Copy link
Owner

benoitc commented May 14, 2016

@jamadden can you test #1263 and let me know?

benoitc added a commit that referenced this issue May 14, 2016
This change add proper file locking to gunicorn. By default "gunicorn.lock" is created in the temporary directory when a unix socket is bound.  In case someone want to fix the lock file path or use multiple gunicorn instance the "--lock-file" setting can be used to set the path of this file.

fix #1259
mjjbell pushed a commit to mjjbell/gunicorn that referenced this issue Mar 16, 2018
This change add proper file locking to gunicorn. By default "gunicorn.lock" is created in the temporary directory when a unix socket is bound.  In case someone want to fix the lock file path or use multiple gunicorn instance the "--lock-file" setting can be used to set the path of this file.

fix benoitc#1259
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants