Weasyprint consumes a lot of memory for long documents #671

vojkny · 2018-08-16T17:30:58Z

This is probably something in combination with gunicorn, most probably not having to do anything with weasyprint, but I am stuck and have no idea which way to go.

import logging

from flask import Flask, request, make_response, jsonify
from weasyprint import HTML

app = Flask('pdf')


@app.before_first_request
def setup_logging():
    logging.addLevelName(logging.DEBUG, "\033[1;36m%s\033[1;0m" % logging.getLevelName(logging.DEBUG))
    logging.addLevelName(logging.INFO, "\033[1;32m%s\033[1;0m" % logging.getLevelName(logging.INFO))
    logging.addLevelName(logging.WARNING, "\033[1;33m%s\033[1;0m" % logging.getLevelName(logging.WARNING))
    logging.addLevelName(logging.ERROR, "\033[1;31m%s\033[1;0m" % logging.getLevelName(logging.ERROR))

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s: %(message)s '
        '[in %(pathname)s:%(lineno)d]'
    ))
    app.logger.addHandler(handler)
    app.logger.setLevel(logging.DEBUG)


@app.route('/', methods=['POST'])
def generate():
    name = request.args.get('filename', 'export.pdf')
    app.logger.info('PDF %s' % " ".join(request.data.decode('utf-8').split()).replace("> <", "><")[0:100])
    html = HTML(string=request.data)
    pdf = html.write_pdf()
    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    return response

if __name__ == '__main__':
    app.run()

I run this script in a Docker image, which runs mostly fine, but I suspect all the WeasyPrint logs are thrown away, as I never see them logged. Especially in case of an error, I just get an empty result with no error.

What kind of logging tool does WeasyPrint use? Is it somehow compatible with the way gunicorn logs stuff?


vojkny · 2018-08-16T18:06:54Z

Also, is there a way to make WeasyPrint more verbose programmatically?

liZe · 2018-08-16T19:19:28Z

What kind of logging tool does WeasyPrint use? Is it somehow compatible with the way gunicorn logs stuff?

WeasyPrint uses the default Python logger, you'll find more information in the Logging section of the documentation.

Also, is there a way to make WeasyPrint more verbose programmatically?

Everything can be customized with the logging module. WeasyPrint is not really verbose, but it should be enough for your needs.

Especially in case of an error, I just get an empty result with no error.

Please check the logs and tell me if it helps!

vojkny · 2018-08-16T19:22:18Z

Everything can be customized with the logging module. WeasyPrint is not really verbose, but it should be enough for your needs.

I am just looking to emulate the -v switch on the command line, which at least prints:

INFO: Step 1 - Fetching and parsing HTML - foo.html
INFO: Step 3 - Applying CSS
INFO: Step 4 - Creating formatting structure
INFO: Step 5 - Creating layout - Page 1
INFO: Step 6 - Drawing
INFO: Step 7 - Adding PDF metadata

liZe · 2018-08-16T19:26:39Z

I am just looking to emulate the -v switch on the command line, which at least prints:

The -v option only sets the level to DEBUG, these lines come from the logging system. Just set the logging level to DEBUG, add a file handler and you'll get these lines in a log file!

vojkny · 2018-08-16T20:01:46Z

Would this be enough? (as in code above)

app.logger.setLevel(logging.DEBUG)

Because this doesn't work :/

liZe · 2018-08-16T20:08:56Z

Would this be enough?

No, it wouldn't. You have to use these lines (from the documentation):

import logging
logger = logging.getLogger('weasyprint')
logger.addHandler(logging.FileHandler('/path/to/weasyprint.log'))

And if you want to get debug messages:

logger.setLevel(logging.DEBUG)

app.logger is not WeasyPrint's logger, it's your Flask app's logger. Please read the documentation carefully (for both WeasyPrint and Python's logging module). It's a bit painful to understand how logging works in Python, but everything's there.
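
For a Flask app running under gunicorn in Docker, the same idea can be pointed at the console instead of a file. The following is a minimal sketch, not taken from the WeasyPrint documentation: the helper name `configure_weasyprint_logging` is hypothetical, the format string is reused from the Flask handler in the original post, and only the `'weasyprint'` logger name comes from the lines above.

```python
import logging
import sys


def configure_weasyprint_logging(level=logging.DEBUG, stream=None):
    """Attach a console handler to WeasyPrint's logger (hypothetical helper)."""
    if stream is None:
        stream = sys.stderr
    # This is the library's logger, which is separate from Flask's app.logger.
    logger = logging.getLogger('weasyprint')
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s: %(message)s '
        '[in %(pathname)s:%(lineno)d]'
    ))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```

Calling it once when the WSGI module is imported should be enough; calling it repeatedly would attach duplicate handlers and repeat every log line.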

vojkny · 2018-08-16T20:18:17Z

Perfect, this is exactly what I was looking for. Sorry, I am not a python person, I am Java guy, so these basics really help.

vojkny · 2018-08-16T20:27:50Z

Okay, so I might need a bit more support, because in the logs I am getting this:

2018-08-16 20:25:29,803 INFO: PDF <!DOCTYPE html><html><head><meta charset="utf-8"/><style type="text/css">@page{size:69mm 146mm;margi [in /usr/src/app/wsgi.py:39]
2018-08-16 20:25:29,803 INFO: Step 1 - Fetching and parsing HTML - HTML string [in /usr/local/lib/python3.6/site-packages/weasyprint/__init__.py:85]
2018-08-16 20:25:30,799 INFO: Step 2 - Fetching and parsing CSS - CSS string [in /usr/local/lib/python3.6/site-packages/weasyprint/__init__.py:256]
2018-08-16 20:25:30,805 WARNING: Ignored `-moz-box-sizing:border-box` at 1:143, unknown property. [in /usr/local/lib/python3.6/site-packages/weasyprint/css/validation.py:2198]
2018-08-16 20:25:30,805 WARNING: Ignored `-webkit-box-sizing:border-box` at 1:170, unknown property. [in /usr/local/lib/python3.6/site-packages/weasyprint/css/validation.py:2198]
2018-08-16 20:25:30,881 INFO: Step 3 - Applying CSS [in /usr/local/lib/python3.6/site-packages/weasyprint/css/__init__.py:862]
2018-08-16 20:25:34,849 INFO: Step 4 - Creating formatting structure [in /usr/local/lib/python3.6/site-packages/weasyprint/document.py:318]
[2018-08-16 20:25:41 +0000] [507] [INFO] Booting worker with pid: 507

I am trying to compile quite a big PDF (200 pages) and I guess there might be some issues with resources. But the gunicorn worker dies without saying anything, and one can see it being restarted immediately afterwards. It is hard to debug the problem with no errors :/

vojkny · 2018-08-16T21:13:00Z

Okay, I think I finally see the issue. To generate 1500 pages, WeasyPrint consumes nearly 3 GB of memory (test file attached).

tst.htm.zip

Couldn't it be more effective and maybe write to disk instead of to memory? Maybe config option?

liZe · 2018-08-16T23:16:52Z

Couldn't it be more effective and maybe write to disk instead of to memory? Maybe config option?

WeasyPrint is known to be slow and to consume a lot of memory with big documents. Many things have been done to improve performance (see #70 for example). There's no config option to improve this, we need time to improve the current code, increase speed and decrease memory consumption.

liZe · 2019-01-03T13:16:30Z

Speed is much better but memory use is still bad…

WeasyPrint 0.42.3: 2.6 GB, 72s
WeasyPrint master: 2.6 GB, 39s
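
Figures like these can be reproduced roughly with a small timing wrapper. This is a sketch under assumptions, not necessarily how the numbers above were obtained: `measure` is a hypothetical helper, and it relies on the Unix-only `resource` module, whose `ru_maxrss` value is reported in kilobytes on Linux and in bytes on macOS.

```python
import resource
import time


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_rss)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the high-water mark of resident memory for the whole
    # process so far, not only for this call.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss
```

With WeasyPrint installed, something like `measure(HTML(filename='tst.htm').write_pdf)` would time a full render, where `tst.htm` stands for the attached test file. Because the peak is process-wide, each measurement is best taken in a fresh interpreter.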

Afoucaul · 2019-01-21T07:01:21Z

@liZe I know it's not the most satisfying solution, but couldn't you split your HTML in several parts, generate distinct PDFs, and then merge them back together? With templates, you could do this in a clean way, and even parallelize the generation of the distinct bits.

I'm assuming that WeasyPrint keeps all the document in memory until the end of the operation, which would explain the increasing consumption of RAM.
If I'm right, then generating the doc one smaller bit at a time would reduce the memory usage over the generation operation...

liZe · 2019-01-21T09:35:48Z

I've created a small website with memory and speed graphs, it gives a good idea about improvements already made, but we have a lot of work left.

@liZe I know it's not the most satisfying solution, but couldn't you split your HTML in several parts, generate distinct PDFs, and then merge them back together? With templates, you could do this in a clean way, and even parallelize the generation of the distinct bits.

It sounds appealing, but some details prevent us from doing this in an easy way. To render page 2, you need to know where page 1 ends, and thus need to render the whole first page beforehand. Another example: you can include the total number of pages in your first page (like "page 1 / 10"), but to get the total number of pages, you first need to render the whole document.

The only way to know where pages end is to go through the whole layout step 😒.
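
The dependency between pages can be illustrated with a toy model. The sketch below is not WeasyPrint's layout algorithm; `paginate`, the block heights, and the page height are all made up for illustration. It only shows that a "page n / total" footer needs the full layout, and that laying out chunks independently can change the page count.

```python
def paginate(block_heights, page_height):
    """Greedily place blocks on pages in order; return a list of pages."""
    pages = [[]]
    used = 0
    for index, height in enumerate(block_heights):
        # Start a new page when the block does not fit on the current one.
        if used + height > page_height and pages[-1]:
            pages.append([])
            used = 0
        pages[-1].append(index)
        used += height
    return pages


heights = [40, 40, 40, 40]  # made-up block heights
whole = paginate(heights, page_height=100)
# The footer text is only known once every block has been placed.
footers = [f"page {n} / {len(whole)}" for n in range(1, len(whole) + 1)]

# Rendering two chunks separately forces a page break at the chunk boundary.
first, second = paginate(heights[:3], 100), paginate(heights[3:], 100)
split_total = len(first) + len(second)
```

Here the whole document fits on 2 pages, while the two chunks laid out independently take 3, so the merged result would not match a single-pass render.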

Couldn't it be more effective and maybe write to disk instead of to memory? Maybe config option?

It could be fun to write memory to the disk and play with pickle…

liZe · 2021-05-05T15:49:17Z

Closing, but the discussion can continue in #578.

vojkny closed this as completed Aug 16, 2018

vojkny reopened this Aug 16, 2018

liZe added the performance Too slow renderings label Aug 16, 2018

liZe changed the title ~~Weasyprint in combination with gunicorn produces empty response with no error log~~ Weasyprint consumes a lot of memory for long documents Aug 17, 2018

liZe closed this as completed May 5, 2021
