FileResponse + S3File + gzip + WSGI = 💥 #1348

bblanchon · 2024-01-11T14:14:28Z

Hi,

Thank you very much for this project.
I've been using it for years to store uploaded files on S3.

Please apologize for the long text. I'm not used to this format, but I wanted to include all the relevant information so you don't have to do this research again. I also wanted to include some workarounds until a definite solution can be implemented in the library.

Description of the problem

In the past, I used querystring_auth to let users download private files from S3, but nowadays, I prefer using a Django view as a proxy.

@login_required
def user_biography(request):
    return FileResponse(request.user.biography.open())

As you can see, it's a straightforward view that verifies user permission and sends the file's content. Of course, it slightly increases the server load compared to querystring_auth, but it allows me to offer nice, non-expiring URLs.

I've been using this solution for almost three years and had no problem until I tried to enable the gzip compression (using AWS_IS_GZIPPED).

Indeed, when compression is enabled, the HTTP response body contains the compressed file, but the Content-Length is set to the uncompressed file size, and the Content-Encoding header is not set.

This problem only occurs when using a WSGI server and not with Django's runserver command.

How is this possible?

The documentation of FileResponse mentions wsgi.file_wrapper, an optimization that allows the WSGI server to send the file directly. Let's see how this feature ends up with the compressed file sent in the response.

It all begins when FileResponse sets an attribute file_to_stream pointing to the S3File. Django's WSGI layer detects this attribute and wraps the S3File with the class "wsgi.file_wrapper" provided by the WSGI server.

Now let's see what happens in the WSGI server; I'll take gunicorn as an example, but it's safe to assume that all WSGI implementations will have the same problem.

gunicorn detects that the response contains a wsgi.file_wrapper and checks that the provided file-like object supports fileno. gunicorn then happily sends the file using sendfile(). As we'll see in a moment, this fileno corresponds to the temporary file downloaded from S3. This file is still compressed, which explains why the server returns a compressed file.

So how could gunicorn get the fileno of the compressed file?
As we saw, it received the S3File wrapped with wsgi.file_wrapper.
S3File derives from File.
File derives from FileProxyMixin.
FileProxyMixin forwards all calls to the underlying file object, including fileno.

But what is the underlying file object?
If the S3 file is not compressed, the underlying object is a SpooledTemporaryFile, and everything works fine.
But if the file is compressed, the underlying object is a GzipFile wrapping the SpooledTemporaryFile.
When asked for a fileno, the GzipFile returns the one of the underlying object.
In other words, it returns the fileno of the uncompressed file, which explains how gunicorn got it.

How to fix this issue?

Ideally, we would need to remove the fileno property from S3File when the file is compressed. I don't think one can dynamically remove a property from a class, but we could remove fileno from S3File and add it only when the file is not compressed. However, doing so would require breaking the inheritance from File, and I don't know the implications.

Looking at the code of gunicorn, you could be tempted to throw an exception from fileno. Unfortunately, this behavior isn't mentioned in the specification and isn't supported by other WSGI implementations.

How to work around this issue?

Until there is a proper fix, here are a few ways to work around this problem.

Workaround 1: Disable GZIP compression

Either set the gzip storage option or AWS_IS_GZIPPED to False.

Cons: Files aren't compressed anymore

Workaround 2: Use `StreamingHttpResponse` instead of `FileResponse`.

@login_required
def user_biography(request):
    return StreamingHttpResponse(request.user.biography.open())

Cons: The WSGI server can no longer use sendfile, even for uncompressed files.

You could also use FileResponse by default and switch to StreamingHttpResponse for compressed files, but it's a bit more complicated and strongly couples your code with the storage backend.

Workaround 3: Disable `sendfile` on the WSGI server

Pass --no-sendfile to gunicorn, or --disable-sendfile to uwsgi.

Cons: The WSGI server can no longer use sendfile.

Conclusion

As a conclusion to this extremely long issue (or should I say blog post), let me tell you how I fixed my existing apps:

For the applications where the gzip compression is essential (CSV files, for example), I used workaround 2: StreamingHttpResponse.
For the applications where gzip compression is not essential (images, for example), I use the workaround 1: disabling the compression.

That's all I had to say about that.
I hope it will benefit other users of django-storages, and who knows, maybe inspire a proper fix in the library.

Best regards,
Benoit

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FileResponse + S3File + gzip + WSGI = 💥 #1348

FileResponse + S3File + gzip + WSGI = 💥 #1348

bblanchon commented Jan 11, 2024

FileResponse + S3File + gzip + WSGI = 💥 #1348

FileResponse + S3File + gzip + WSGI = 💥 #1348

Comments

bblanchon commented Jan 11, 2024

Description of the problem

How is this possible?

How to fix this issue?

How to work around this issue?

Workaround 1: Disable GZIP compression

Workaround 2: Use StreamingHttpResponse instead of FileResponse.

Workaround 3: Disable sendfile on the WSGI server

Conclusion

Workaround 2: Use `StreamingHttpResponse` instead of `FileResponse`.

Workaround 3: Disable `sendfile` on the WSGI server