Weasyprint consumes a lot of memory for long documents #671

Closed
vojkny opened this issue Aug 16, 2018 · 14 comments
Labels
performance Too slow renderings

Comments

@vojkny

vojkny commented Aug 16, 2018

This is probably caused by something in combination with gunicorn, and most probably has nothing to do with WeasyPrint, but I am stuck and have no idea which way to go.

import logging

from flask import Flask, request, make_response, jsonify
from weasyprint import HTML

app = Flask('pdf')


@app.before_first_request
def setup_logging():
    logging.addLevelName(logging.DEBUG, "\033[1;36m%s\033[1;0m" % logging.getLevelName(logging.DEBUG))
    logging.addLevelName(logging.INFO, "\033[1;32m%s\033[1;0m" % logging.getLevelName(logging.INFO))
    logging.addLevelName(logging.WARNING, "\033[1;33m%s\033[1;0m" % logging.getLevelName(logging.WARNING))
    logging.addLevelName(logging.ERROR, "\033[1;31m%s\033[1;0m" % logging.getLevelName(logging.ERROR))

    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s: %(message)s '
        '[in %(pathname)s:%(lineno)d]'
    ))
    app.logger.addHandler(handler)
    app.logger.setLevel(logging.DEBUG)


@app.route('/', methods=['POST'])
def generate():
    name = request.args.get('filename', 'export.pdf')
    app.logger.info('PDF %s' % " ".join(request.data.decode('utf-8').split()).replace("> <", "><")[0:100])
    html = HTML(string=request.data)
    pdf = html.write_pdf()
    response = make_response(pdf)
    response.headers['Content-Type'] = 'application/pdf'
    response.headers['Content-Disposition'] = 'inline;filename=%s' % name
    return response

if __name__ == '__main__':
    app.run()

I run this script in a Docker image, which runs mostly fine, but I suspect all the WeasyPrint logs are thrown away, as I never see them logged. Especially in case of an error, I just get an empty result with no error.

What kind of logging tool does WeasyPrint use? Is it somehow compatible with the way gunicorn logs stuff?

@vojkny
Author

vojkny commented Aug 16, 2018

Also, is there a way to make WeasyPrint more verbose programmatically?

@liZe
Member

liZe commented Aug 16, 2018

What kind of logging tool does WeasyPrint use? Is it somehow compatible with the way gunicorn logs stuff?

WeasyPrint uses the default Python logger, you'll find more information in the Logging section of the documentation.

Also, is there a way to make WeasyPrint more verbose programmatically?

Everything can be customized with the logging module. WeasyPrint is not really verbose, but it should be enough for your needs.

Especially in case of an error, I just get an empty result with no error.

Please check the logs and tell me if it helps!

@vojkny
Author

vojkny commented Aug 16, 2018

Everything can be customized with the logging module. WeasyPrint is not really verbose, but it should be enough for your needs.

I am just looking for a way to emulate the -v switch on the command line, which at least prints:

INFO: Step 1 - Fetching and parsing HTML - foo.html
INFO: Step 3 - Applying CSS
INFO: Step 4 - Creating formatting structure
INFO: Step 5 - Creating layout - Page 1
INFO: Step 6 - Drawing
INFO: Step 7 - Adding PDF metadata

@liZe
Member

liZe commented Aug 16, 2018

I am just looking for a way to emulate the -v switch on the command line, which at least prints:

The -v option only sets the level to DEBUG; these lines come from the logging system. Just set the logging level to DEBUG, add a file handler, and you'll get these lines in a log file!

@vojkny
Author

vojkny commented Aug 16, 2018

Would this be enough? (as in code above)

app.logger.setLevel(logging.DEBUG)

Because this doesn't work :/

@liZe
Member

liZe commented Aug 16, 2018

Would this be enough?

No, it wouldn't. You have to use these lines (from the documentation):

import logging
logger = logging.getLogger('weasyprint')
logger.addHandler(logging.FileHandler('/path/to/weasyprint.log'))

And if you want to get debug messages:

logger.setLevel(logging.DEBUG)

app.logger is not WeasyPrint's logger, it's your Flask app's logger. Please read the documentation carefully (for WeasyPrint and Python's logging module). It's a bit painful to understand how logging works in Python, but everything's there.
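
A minimal sketch of the point above, assuming you want WeasyPrint's messages to appear next to your application's own instead of in a separate file: attach one stream handler to both named loggers. The logger name 'weasyprint' comes from the documentation lines quoted above; the helper `attach_handler` and the application logger name 'pdf' are hypothetical and only serve the illustration.

```python
import logging


def attach_handler(handler, logger_names, level=logging.DEBUG):
    """Attach one handler to several named loggers and set their level."""
    loggers = []
    for name in logger_names:
        logger = logging.getLogger(name)
        logger.addHandler(handler)
        logger.setLevel(level)
        loggers.append(logger)
    return loggers


# StreamHandler() writes to stderr by default.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s %(name)s %(levelname)s: %(message)s'
))

# 'weasyprint' is the library's logger name; 'pdf' is an assumed name
# standing in for the application's own logger.
attach_handler(handler, ['weasyprint', 'pdf'])

# Anything logged under the 'weasyprint' name now reaches the handler.
logging.getLogger('weasyprint').info('test message: handler attached')
```

Setting the level on app.logger alone has no effect on the 'weasyprint' logger, because the two are separate named loggers that each carry their own level and handlers.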

@vojkny
Author

vojkny commented Aug 16, 2018

Perfect, this is exactly what I was looking for. Sorry, I am not a python person, I am Java guy, so these basics really help.

@vojkny vojkny closed this as completed Aug 16, 2018
@vojkny
Author

vojkny commented Aug 16, 2018

Okay, so I might need a bit more support, because in the logs I am getting this:

2018-08-16 20:25:29,803 INFO: PDF <!DOCTYPE html><html><head><meta charset="utf-8"/><style type="text/css">@page{size:69mm 146mm;margi [in /usr/src/app/wsgi.py:39]
2018-08-16 20:25:29,803 INFO: Step 1 - Fetching and parsing HTML - HTML string [in /usr/local/lib/python3.6/site-packages/weasyprint/__init__.py:85]
2018-08-16 20:25:30,799 INFO: Step 2 - Fetching and parsing CSS - CSS string [in /usr/local/lib/python3.6/site-packages/weasyprint/__init__.py:256]
2018-08-16 20:25:30,805 WARNING: Ignored `-moz-box-sizing:border-box` at 1:143, unknown property. [in /usr/local/lib/python3.6/site-packages/weasyprint/css/validation.py:2198]
2018-08-16 20:25:30,805 WARNING: Ignored `-webkit-box-sizing:border-box` at 1:170, unknown property. [in /usr/local/lib/python3.6/site-packages/weasyprint/css/validation.py:2198]
2018-08-16 20:25:30,881 INFO: Step 3 - Applying CSS [in /usr/local/lib/python3.6/site-packages/weasyprint/css/__init__.py:862]
2018-08-16 20:25:34,849 INFO: Step 4 - Creating formatting structure [in /usr/local/lib/python3.6/site-packages/weasyprint/document.py:318]
[2018-08-16 20:25:41 +0000] [507] [INFO] Booting worker with pid: 507

I am trying to compile quite a big PDF (200 pages) and I guess there might be some issues with resources. But the gunicorn worker dies without saying anything, and one can see it being restarted immediately afterwards. It is hard to debug the problem with no errors :/

@vojkny vojkny reopened this Aug 16, 2018
@vojkny
Author

vojkny commented Aug 16, 2018

Okay, I think I finally see the issue. To generate 1500 pages, WeasyPrint consumes nearly 3 GB of memory. (testing file attached)

tst.htm.zip

Couldn't it be more efficient and maybe write to disk instead of to memory? Maybe a config option?
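
For anyone wanting to reproduce a figure like this from inside the process, here is a rough sketch using the standard library's tracemalloc module. The helper `measure_peak` and the stand-in workload `build_big_list` are hypothetical. Note the assumptions: tracemalloc only counts memory allocated through Python's allocator, so it will under-report anything allocated by native libraries, and tracing slows the measured call down noticeably.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Call func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_big_list(n):
    """Stand-in workload; replace with the real rendering call."""
    return [str(i) for i in range(n)]


result, peak = measure_peak(build_big_list, 100_000)
print(f'peak traced memory: {peak / 1e6:.1f} MB')
```

With WeasyPrint installed, the rendering call from the script at the top of this issue could be wrapped the same way, for example `measure_peak(HTML(string=data).write_pdf)`.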

@liZe
Member

liZe commented Aug 16, 2018

Couldn't it be more efficient and maybe write to disk instead of to memory? Maybe a config option?

WeasyPrint is known to be slow and to consume a lot of memory with big documents. Many things have been done to improve performance (see #70 for example). There's no config option to improve this, we need time to improve the current code, increase speed and decrease memory consumption.

@liZe liZe added the performance Too slow renderings label Aug 16, 2018
@liZe liZe changed the title Weasyprint in combination with gunicorn produces empty response with no error log Weasyprint consumes a lot of memory for long documents Aug 17, 2018
@liZe
Member

liZe commented Jan 3, 2019

Speed is much better but memory use is still bad…

  • WeasyPrint 0.42.3: 2.6 GB, 72s
  • WeasyPrint master: 2.6 GB, 39s

@Afoucaul

Afoucaul commented Jan 21, 2019

@liZe I know it's not the most satisfying solution, but couldn't you split your HTML into several parts, generate distinct PDFs, and then merge them back together? With templates, you could do this in a clean way, and even parallelize the generation of the distinct bits.

I'm assuming that WeasyPrint keeps all the document in memory until the end of the operation, which would explain the increasing consumption of RAM.
If I'm right, then generating the doc one smaller bit at a time would reduce the memory usage over the generation operation...
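
A rough sketch of what this could look like, under strong assumptions: every part can start on a fresh page, nothing in the document depends on the total page count, and merging the resulting files is left to a separate PDF tool that is not shown here. The helpers `build_documents`, `render_part`, and `render_in_parts`, as well as the `part-NNNN.pdf` file naming, are hypothetical. Only `HTML(string=...)` and `write_pdf(target)` are WeasyPrint calls.

```python
from concurrent.futures import ProcessPoolExecutor


def build_documents(head, sections, size):
    """Group sections into standalone HTML documents sharing one <head>."""
    docs = []
    for i in range(0, len(sections), size):
        body = ''.join(sections[i:i + size])
        docs.append(f'<html><head>{head}</head><body>{body}</body></html>')
    return docs


def render_part(job):
    """Render one standalone HTML document to its own PDF file."""
    index, html = job
    from weasyprint import HTML  # imported inside the worker process
    path = f'part-{index:04d}.pdf'  # hypothetical output naming
    HTML(string=html).write_pdf(path)
    return path


def render_in_parts(head, sections, size, workers=2):
    """Render each group of sections in a separate worker process."""
    jobs = list(enumerate(build_documents(head, sections, size)))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_part, jobs))
```

Each worker only ever holds one part, so its peak memory is bounded by the largest part instead of the whole document. When calling `render_in_parts` from a script, do so under an `if __name__ == '__main__':` guard so that worker processes can import the module safely.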

@liZe
Member

liZe commented Jan 21, 2019

I've created a small website with memory and speed graphs. It gives a good idea of the improvements already made, but we have a lot of work left.

@liZe I know it's not the most satisfying solution, but couldn't you split your HTML into several parts, generate distinct PDFs, and then merge them back together? With templates, you could do this in a clean way, and even parallelize the generation of the distinct bits.

It sounds appealing, but some details prevent us from doing this in an easy way. To render page 2, you need to know where page 1 ends, and thus need to render the whole first page beforehand. Another example: you can include the total number of pages in your first page (like "page 1 / 10"), but to get the total number of pages, you first need to render the whole document.

The only way to know where pages end is to go through the whole layout step 😒.

Couldn't it be more efficient and maybe write to disk instead of to memory? Maybe a config option?

It could be fun to write memory to the disk and play with pickle
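
To make the idea concrete, here is a purely illustrative sketch of spilling objects to disk with pickle. Nothing in this thread says WeasyPrint works this way, and whether its internal layout objects can be pickled at all is an open assumption. The `DiskSpool` class is hypothetical: it keeps only a counter in memory, appends pickled objects to a temporary file, and reads them back one at a time.

```python
import os
import pickle
import tempfile


class DiskSpool:
    """Append-only store that keeps pickled objects in a temporary file."""

    def __init__(self):
        self._file = tempfile.TemporaryFile()
        self._count = 0

    def append(self, obj):
        # Always write at the end, even if a previous read moved the cursor.
        self._file.seek(0, os.SEEK_END)
        pickle.dump(obj, self._file, protocol=pickle.HIGHEST_PROTOCOL)
        self._count += 1

    def __len__(self):
        return self._count

    def __iter__(self):
        # Read objects back one at a time, in insertion order.
        # Appending while iterating is not supported in this sketch.
        self._file.seek(0)
        for _ in range(self._count):
            yield pickle.load(self._file)

    def close(self):
        self._file.close()


spool = DiskSpool()
for number in range(3):
    spool.append({'page': number, 'boxes': ['box'] * 4})

for page in spool:
    print(page['page'], len(page['boxes']))
```

The trade-off is the usual one: memory use stays flat, but every object pays a serialization cost on the way out and again on the way back in.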

@liZe
Member

liZe commented May 5, 2021

Closing, but the discussion can continue in #578.

@liZe liZe closed this as completed May 5, 2021