Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement safe_url_string based on the URL living standard #204

Open
Gallaecio opened this issue Nov 23, 2022 · 4 comments
Open

Reimplement safe_url_string based on the URL living standard #204

Gallaecio opened this issue Nov 23, 2022 · 4 comments

Comments

@Gallaecio
Copy link
Member

Gallaecio commented Nov 23, 2022

#201 implemented an alternative to safe_url_string based on the URL living standard, while maintaining support for older standards still used in the wild, for maximum safety of the resulting URL.

However, when we compared the performance of the new implementation with the old one, the new implementation was up to 200 times slower, and after some effort I only managed to reduce the gap to 40. 40 times slower is still unacceptable.

This task is about resuming the work started on #201, closing the performance gap by achieving a performance very similar or better than that of the current safe_url_string implementation.

How to do that? At the moment we believe Cython is the way to go, i.e. rework the new implementation as Cython code, which hopefully will solve the performance issues.

@wRAR
Copy link
Member

wRAR commented Nov 23, 2022

Two points about Cython: first, I wonder if it's guaranteed to give comparable performance or is just an idea to try? Second, this will make w3lib a binary module, though both parsel and Scrapy already require lxml which is even more complicated.

@kmike
Copy link
Member

kmike commented Nov 23, 2022

I think it might be reasonable to move the URL-related implementation to a separate package, to simplify contribution to the rest of w3lib - to work on w3lib itself it won't be required to have Cython installed. URL parsing according to Live Standard is a very well-defined scope, I think it could makes sense to have it separate even if it's pure Python.

It'd still be a binary dependency, but wheels work well these days, and unlike lxml there are no complications like the libxml dependency (though we might think about wrapping some external library, I haven't looked into it).

@kmike
Copy link
Member

kmike commented Nov 23, 2022

Two points about Cython: first, I wonder if it's guaranteed to give comparable performance or is just an idea to try?

I'd say it's guaranteed that we can get comparable or better performance. With Cython you can go as close to C/C++ as you want.

@lopuhin
Copy link
Member

lopuhin commented Nov 25, 2022

https://github.com/mypyc/mypyc could be another option besides Cython, it may allow to keep the code python-compatible (but a separate package and a binary wheel would still be worthwhile).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants