ImportError: lxml.html.clean module is now a separate project lxml_html_clean over test suite (Ubuntu 20.04 and benchmarks) #6798

Open
dalthviz opened this issue Apr 1, 2024 · 14 comments

@dalthviz
Member

dalthviz commented Apr 1, 2024

🧰 Task

It seems the latest release of lxml (5.2.0) moved a module into an independent package (lxml.html.clean -> new lxml_html_clean package). See https://github.com/napari/napari/actions/runs/8510883283/job/23309875720?pr=6794#step:11:174 and https://github.com/napari/napari/actions/runs/8510883283/job/23309397479?pr=6794#step:8:144

@psobolewskiPhD
Member

psobolewskiPhD commented Apr 1, 2024

I looked into this for a bit more context:
https://lxml.de/5.2/changes-5.2.0.html

The lxml.html.clean implementation suffered from several (only if used) security issues in the past and was now extracted into a separate library:
https://github.com/fedora-python/lxml_html_clean
Projects that use lxml without "lxml.html.clean" will not notice any difference, except that they won't have potentially vulnerable code installed. The module is available as an "extra" setuptools dependency "lxml[html_clean]", so that Projects that need "lxml.html.clean" will need to switch their requirements from "lxml" to "lxml[html_clean]", or install the new library themselves.

Some more discussion:
https://bugs.launchpad.net/lxml/+bug/1958539
Googling around, it looks like some folks are looking into alternatives to html_clean:
psf/requests-html#558
nh3 seems promising:
https://github.com/messense/nh3

Maybe in the short term we should migrate to the html_clean package, just to silence the CI errors but then we may want to consider whether we should replace it?
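
The short-term migration can be sketched as an import fallback. This is a minimal sketch, not napari's actual fix: the helper name `first_importable` is hypothetical, and it assumes only what the changelog quoted above states, namely that the cleaner now lives in `lxml_html_clean` and that importing `lxml.html.clean` raises `ImportError` when that package is missing.

```python
import importlib


def first_importable(candidates):
    """Return the first module in `candidates` that imports, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers a missing package as well as the ImportError from the issue title.
            continue
    return None


# Prefer the new standalone package, then fall back to the legacy location.
clean_module = first_importable(("lxml_html_clean", "lxml.html.clean"))
if clean_module is None:
    print("No HTML cleaner available; install lxml_html_clean or lxml[html_clean]")
else:
    print(f"Using cleaner from {clean_module.__name__}")
```

Trying the new package first means the same code should keep working on environments with an older lxml, where only the legacy module path exists.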

@psobolewskiPhD
Member

I played with nh3, here's a branch using that instead of lxml:
https://github.com/psobolewskiPhD/napari/tree/use_nh3_html_sanitizer

The main difference is handling quotes, where nh3.clean doesn't escape quotes.
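
To make the quote-handling difference concrete, here is what escaping quotes means, using only the standard library's `html.escape` as a stand-in. This is an illustration under that assumption and does not reproduce the actual output of `nh3.clean` or `lxml.html.clean`; the sample string is made up.

```python
import html

sample = 'She said "hi" & left'

# quote=False leaves double quotes untouched and only escapes &, < and >.
unescaped_quotes = html.escape(sample, quote=False)

# quote=True (the default) also converts " to &quot; and ' to &#x27;.
escaped_quotes = html.escape(sample, quote=True)

print(unescaped_quotes)
print(escaped_quotes)
```

Tests that compare sanitized strings exactly would likely need their expected values adjusted when one sanitizer is swapped for another with different quote handling.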

@dalthviz
Member Author

dalthviz commented Apr 2, 2024

Oh I see. It seems that moving away from lxml_html_clean and using an alternative could be quite worthwhile 👍

Czaki pushed a commit that referenced this issue Apr 4, 2024
…ml_html_clean` over test suite (Ubuntu 20.04 and benchmarks) (#6799)

# References and relevant issues

Part of #6798

# Description

Fix test suite by adding new `lxml_html_clean` dependency due to `lxml`
5.2.0 moving the `lxml.html.clean` module to that package
(`lxml_html_clean`)

---------

Co-authored-by: Peter Sobolewski <76622105+psobolewskiPhD@users.noreply.github.com>
@adriens

adriens commented Apr 9, 2024

Currently facing this error on a planned Notebook too

@adriens

adriens commented Apr 9, 2024

Hi guys, how should we overcome this?
image

@adriens

adriens commented Apr 9, 2024

Currently patched with this:

!pip install --upgrade lxml_html_clean

import geograpy

url = 'https://en.wikipedia.org/wiki/2012_Summer_Olympics_torch_relay'
places = geograpy.get_geoPlace_context(url=url)
print(places)

image

@adriens

adriens commented Apr 9, 2024

It worked perfectly: I could update the 🥷 Neo4J Ninjas duckdb dataset 🦆

@psobolewskiPhD
Member

It's impossible to figure out what the issue is.
We'll need the text of the entire traceback. You can post it between sets of three backticks ` so it's formatted as code.
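
If the error only shows up inside a script or notebook, one way to get the entire traceback as text is to capture it with the standard library's `traceback` module. This is a minimal sketch: `traceback_text` and `failing_import` are hypothetical names, and the raised error is a stand-in for whatever actually fails in your environment.

```python
import traceback


def traceback_text(func):
    """Call `func` and return the full traceback as text, or "" if it succeeds."""
    try:
        func()
    except Exception:
        return traceback.format_exc()
    return ""


def failing_import():
    # Hypothetical stand-in for the import that fails in your environment.
    raise ImportError("lxml.html.clean module is now a separate project lxml_html_clean")


print(traceback_text(failing_import))
```

The printed text can then be pasted into a comment as a code block instead of a screenshot.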

@jamesnq

jamesnq commented May 15, 2024

photo_2024-05-15_10-23-13

I am currently facing this error. Is there any way to fix it? Thank you!

@melissawm
Member

@jamesnq are you using napari? There's no way to tell from your screenshot. Please post the entire traceback as text. Thanks!

@jamesnq

jamesnq commented May 15, 2024

> @jamesnq are you using napari? There's no way to tell from your screenshot. Please post the entire traceback as text. Thanks!

Thanks for your reply, I already fixed the error. I installed the lxml_html_clean library and ran with Python 3.10 instead of 3.12, and it works!

@CamachoDejay

I had a similar issue. I am running a Python 3.10 env; to solve the problem I had to run:
conda update napari
pip install "lxml[html_clean]"

@ehlui

ehlui commented May 26, 2024

Try to

pip install lxml_html_clean

It might do the trick. This worked for me
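
Since the fixes reported in this thread differ by environment, it can help to check what is actually installed before and after running the commands above. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the list of distributions to check is an assumption based on the packages named in this thread.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the interpreter version alongside the relevant packages.
print("python", sys.version.split()[0])
for dist in ("lxml", "lxml_html_clean", "napari"):
    print(dist, installed_version(dist) or "not installed")
```

If `lxml` reports 5.2.0 or later and `lxml_html_clean` is not installed, that matches the situation described at the top of this issue.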

@jamesnq

jamesnq commented May 26, 2024 via email
