Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bleach seems to be significantly slower than lxml in 7.1.x+ #1892

Open
MgenGlder opened this issue Oct 27, 2022 · 16 comments
Open

Bleach seems to be significantly slower than lxml in 7.1.x+ #1892

MgenGlder opened this issue Oct 27, 2022 · 16 comments

Comments

@MgenGlder
Copy link

MgenGlder commented Oct 27, 2022

Description

After updating nbconvert from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch to bleach for sanitization is the culprit here. I'm wondering if we could provide a configuration for users to pass
that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.

In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we'd love to keep in close sync with the latest releases, a couple steps behind if not update to date.

I recognize this may have been a deliberate design decision, I'm interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .

Relevant PR: #1854

Screenshots

7.0.0 Profile
Screen Shot 2022-10-27 at 2 32 22 PM

7.0.0 Flamegraph
Screen Shot 2022-10-27 at 2 31 18 PM

7.2.2 Profile
Screen Shot 2022-10-27 at 12 26 31 PM
7.2.2 Flamegraph
Screen Shot 2022-10-27 at 12 26 14 PM

@MgenGlder
Copy link
Author

Found the relevant PR here: #1854

@gwincr11
Copy link
Contributor

Interestingly we (GitHub) looked at using bleach as a second sanitization step on this project and decided against it because it caused too many notebooks to timeout and increased render times my more than 75%.

@bollwyvl
Copy link
Contributor

Along the same lines of python-fast-jsonschema on nbformat, it seems reasonable to have a hard dependency on bleach, and nbconvert[fast] to force the install of lxml (or whatever ends up being the fastest, maybe some rust-based thing). This would prefer the faster library, or behind NBCONVERT_FAST=0 and/or a traitlet.

The two could then be pitted against each other during test, probably after some further homogenization.

lxml isn't really that uncommon a dependency, and is rather widely supported (including pyodide).

@akx
Copy link
Contributor

akx commented Oct 27, 2022

Hi, author of #1854 here. Didn't imagine I'd open such a can of worms (#1863, #1849...)... The issue is lxml is presently a bit finicky to compile, and they don't currently e.g. ship wheels for macOS, so it felt like a good idea to switch from a binary library where only one function was being used to another library that was shipped with nbconvert anyway...

@bollwyvl
Copy link
Contributor

No worries. We can't make everybody happy all of the time. Given "simple" and "fast," the former should win as a minimum-testable-product, but having a documented approach for the latter is important.... so many different use cases.

ship wheels for macOS

I guess with free software on non-free operating systems, you get what you pay for.

finicky to compile

Of note, the lxml fields are fairly green over in conda-forge-land, and it's likely to have even 3.11 builds for many tricky python dependencies before they are canonically available as wheels. Indeed, the migration is just up to lxml right now... only a few thousand packages to go... but likely will reach a halting state in a reasonable amount of time. Meanwhile, thousands of upstream maintainers are unlikely to hop to building wheels for your chosen operating system. Indeed, the conda-forge migration will likely identify internally-consistent patches to overcome temporary setbacks, as its unlikely everyone has been testing against the rcs... these patches are often PR'd upstream, and sometimes even merged.

@akx
Copy link
Contributor

akx commented Oct 28, 2022

@bollwyvl Meh, the "get what you pay for" argument is a bit naff, since lxml used to ship wheels for macOS too up until there was a problem with wheels marked as multiarch not actually being so, so they were pulled and... https://bugs.launchpad.net/lxml/+bug/1913032
I should also note that lxml wheels are shipped for Windows (and actually pypy on macOS too...)

There's also a bit of discussion (can't find a link right now) around whether or not libxml, libxslt and the other C friends required by lxml should be linked statically into the lxml module shipped by a PyPI wheel – this is of extra interest when there are (security) bugs in those libraries, see e.g. GHSA-wrxv-2j5q-m38w, and if the wheels were dynamically linked, you'd easily bump into ABI incompatibilities (I think this happened with xmlsec).

@akx
Copy link
Contributor

akx commented Oct 28, 2022

For the time being, a workaround, if you're willing to try it, is to patch lxml back into place in the filter dict:

from nbconvert.exporters.templateexporter import default_filters
from lxml.html.clean import clean_html
default_filters["clean_html"] = clean_html

This essentially reverts the change made to the filters dict in b40bb13.

It should be noted, though, that lxml.clean_html may be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539

@gwincr11
Copy link
Contributor

We could help by setting up a devcontainer for codespaces development of this repo, then folks on a mac could just use a pre-configured linux environment for this.

cc @craigpeters @cmuto09

@bollwyvl
Copy link
Contributor

devcontainer for codespaces

Not to be too blunt, but containers and SaaS solutions sweep these problem under the covers, and are not viable in a number of settings or for a number of users. Feel free to propose and maintain such a solution in perpetuity.

But I feel the issue at hand is not ease of development on this repo, but balancing performance and deployment simplicity on millions of devices: nbconvert is used on disconnected RaspberryPis and exotic, airgapped big iron to show what scientists, analysts, and other non-developers, and their machines have been doing.

@gwincr11
Copy link
Contributor

You know far more about how this is used than I do. I just thought it may help in some capacity.

@gwincr11
Copy link
Contributor

I would imagine that Bleach is painfully slow on a Raspberry PI if it is this slow on a powerful machine 😓

@bollwyvl
Copy link
Contributor

painfully slow

But can be easily installed, works, and requires no porting to new platforms where python works.

As suggested, lxml is the low-hanging fruit, and widely used (e.g. by mypy reporting) but as noted, a blocklist-based sanitizer is less desirable for this task.

nh3, a python binding to rust's ammonia claim to be 15x faster than bleach, but would need some more knobs to twiddle to work for the nbconvert use case.

But even it that worked gangbusters, and generated identical output to bleach, rust doesn't even support some architectures, much less would one expect the maintainer to build binary wheels built for them... and this is going to cause trouble for more exotic parts of the jupyter stack in the coming year. But this doesn't mean it should be an option for use cases like yours.

@bollwyvl
Copy link
Contributor

As an update, nh3 0.2.0 is now sufficiently configurable to do what bleach is doing to HTML... except for the contents of style attributes/tags, which can contain all kinds of evil things.

It looks like a maximally secure approach would still have to use bleach if style appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.

Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having lxml around, even if nh3 was used for the cleaning process.

So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.

@MgenGlder
Copy link
Author

That makes sense to me- I imagine if nh3 gives the same structure and allow-list based sanitization then it would make a great optional sanitizer to use in cases like ours where performance for web is the concern.

So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.

On board with this, benchmarking the uses of each library and comparing and contrasting their tradeoffs would be great for confidence and documenting the decision made for future contributors and library consumers that may want to try out different sanitizers.

@bollwyvl
Copy link
Contributor

Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven't yet assessed), but is also now officially deprecated. We'll need to do something within the next year or so.

@xmo-odoo
Copy link

xmo-odoo commented Jan 27, 2023

It looks like a maximally secure approach would still have to use bleach if style appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.

FWIW the underlying Ammonia library allows essentially arbitrary rewriting of attributes, though it has a fair amount of special cases which don't use this engine (e.g. classes rewriting), and no special support for @style that I could see.

For nbconvert two options would be to either have the nh3 API expanded for this (and possibly other things, nh3 is currently quite limited), or to depend on Ammonia directly and build its own sanitizer with the additional build and distribution complexity that entails as nbviewer would now be a non-native wheel. Though this could also solve distribution issues related to nh3: distros which provide nbconvert would need to add nh3 to their packages to support an nh3-based nbconvert. I'm not sure what the policies are for build-time dependencies but e.g. ammonia is already in debian testing.

A third option would be to get lxml to use ammonia for cleaning internally, but

yuvipanda added a commit to yuvipanda/nbconvert that referenced this issue Dec 18, 2023
pygments emits line numbers via tables, with tr and td
elements (https://pygments.org/docs/formatters/#HtmlFormatter).
lxml's clean_html considers table, tr and td as safe elements,
but with jupyter#1854 they
are now considered unsafe. So instead of displaying line numbers,
the table, tr and td elements are escaped, and show up as literal
HTML if trying to enable line numbers via the method introduced
in jupyter#1683.

This PR adds table, tr and td as safe elements so that line numbers
can continue to work.

I know that there are probably plans to move away from bleach
(jupyter#1892), but this is
a small and focused change so hopefully doesn't need to block on
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants