Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace lxml.html.clean_html with bleach; drop lxml dependency #1854

Merged
merged 3 commits into from Sep 6, 2022

Conversation

akx
Copy link
Contributor

@akx akx commented Sep 1, 2022

There's no need to pull in all of lxml for just clean_html, especially since nbconvert already depends on bleach anyway.

This PR:

  1. removes the unused, outdated, auto-generated notebook1.html file from nbconvert/tests/files
  2. simplifies a test that would fail with the next commit
  3. drops the lxml dependency in favor of bleach; see below.

So turns out bleach was a bit stricter by default than lxml's clean_html, so I had to tweak the allowed tags rules a little:

  • The div, pre, code, span tags (used by highlighting) are explicitly allowed
  • The class and id attributes are explicitly allowed.

Beyond that, I manually checked that all of the .ipynb files under nbconvert/examples/tests are equivalently generated aside from " being " and ' being '.

@blink1073
Copy link
Member

I'm a bit concerned about whether it is a drop in replacement. There is one test that is failing that looks to be a result of different filtering.

@akx
Copy link
Contributor Author

akx commented Sep 5, 2022

@blink1073 I'll take a look. Thing is the filter isn't quite documented in any way.

@akx
Copy link
Contributor Author

akx commented Sep 5, 2022

@blink1073 The single test failure was due to a different quoting; bleach doesn't let a bare ' through but converts it into an entity reference.

@blink1073
Copy link
Member

I changed the setting so you won't need approval to run the CI workflows. If there is a slight tweak that can be made to the expected output that doesn't affect security then that seems reasonable.

@akx akx force-pushed the no-lxml branch 2 times, most recently from 5ecd89d to be116fa Compare September 6, 2022 12:09
@akx
Copy link
Contributor Author

akx commented Sep 6, 2022

@blink1073 Thanks – that makes running CI a little easier. Please see the updated PR description, too.

Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thank you!

@blink1073 blink1073 merged commit b40bb13 into jupyter:main Sep 6, 2022
element = str(element)
return bleach.clean(
element,
tags=[*bleach.ALLOWED_TAGS, "div", "pre", "code", "span"],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clean_html function is clearly unsuitable to be called on image/svg+xml elements, since it will escape all xml tags. Unfortunately the base.html.j2 templates call clean_html in this context. See #1849.

Rather than listing tags that are allowed, would it be possible to list tags that are disallowed? (Like I suspect <script> and <iframe> are unwelcome. )

Copy link
Contributor Author

@akx akx Oct 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a bummer. No tests clearly covered this, so that's a regression. Sorry for the trouble... SVG tags are well-known though so they could be added to the allowlist...

akx added a commit to akx/nbconvert that referenced this pull request Oct 26, 2022
akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022
akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022
akx added a commit to akx/nbconvert that referenced this pull request Oct 27, 2022
@yuvipanda
Copy link
Contributor

It looks like table and friends is also something that bleach maybe filters out while lxml passes through. Since pygments supports linenumbers via tables (see #1683), with a new enough version of nbconvert, enabling line numbers as described in the linked PR makes them display like this:

image

With the workaround specified in #1892 (comment), it looks ok again:

image

yuvipanda added a commit to yuvipanda/nbconvert that referenced this pull request Dec 18, 2023
pygments emits line numbers via tables, with tr and td
elements (https://pygments.org/docs/formatters/#HtmlFormatter).
lxml's clean_html considers table, tr and td as safe elements,
but with jupyter#1854 they
are now considered unsafe. So instead of displaying line numbers,
the table, tr and td elements are escaped, and show up as literal
HTML if trying to enable line numbers via the method introduced
in jupyter#1683.

This PR adds table, tr and td as safe elements so that line numbers
can continue to work.

I know that there are probably plans to move away from bleach
(jupyter#1892), but this is
a small and focused change so hopefully doesn't need to block on
@yuvipanda
Copy link
Contributor

Opened #2083 to add those :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants