Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is an html strip-tags method a worth stdlib addition? #14356

Closed
robacarp opened this issue Mar 10, 2024 · 7 comments
Closed

Is an html strip-tags method a worth stdlib addition? #14356

robacarp opened this issue Mar 10, 2024 · 7 comments

Comments

@robacarp
Copy link
Contributor

I'm working on a pure-crystal implementation of an HTML utility function called striptags. Would it be a welcome contribution to stdlib, perhaps under HTML, which currently provides only escape/unescape?

The implementation I'm working is a little more enhanced than just XML.parse(dirty).inner_text (a la crinja's striptags) -- I have added the ability to whitelist tags and attributes which can stay in the document, and it filters out the rest.

I was surprised to see a lack of html strip-tags routines in use in the crystal ecosystem. There are some minimal attempts, in crynja and in a handcrafted rust aggregator -- the latter seemed promising but eventually I realized it is simply wrapping a function from a rust crate.

NEED

This function is important for user contributed content sites, because it allows a non-regex based way of sanitizing user markup while still allowing user contributed markup. Attempts to sanitize html by way of regexes are notorious for being XSS vectors.

Comparison / Community Interest Survey

  • Rust has a crate which provides a striptags implementation, and it was ported over to crystal for the above linked project. There is also a Crate called Ammonia which provides whitelist-style sanitation.
  • Rails ActiveSupport also provides access to a well established gem for which ActiveSupport wraps into a function called striptags. This includes a whitelist, and also filters html attributes.
  • PHP provides this function, and it has a whitelist. It does not filter tag attributes, and it warns that it is insufficient for XSS prevention (likely due to the attribute filter issue).
  • Python seems to have a storied history here. A StackOverflow post became canonical, included in Django, and then was found to be vulnerable to XSS. That SO post was viewed 350k times over 15 years. There is now a library called Bleach (provided by Mozilla - 2.5k stars on github) which provides this feature, including a whitelist. As of about a year ago Bleach is deprecated. There is a new project called nh3 which is importing in the Ammonia HTML sanitizer from Rust.

Risks

  • stdlib bloat -- is it worth it to have this ability in stdlib?
  • getting it wrong -- if it's wrong, and people depend on it, then it'll be the source of a Crystal core CVE. On the other hand, if it's right, people can reach for it instead of resorting to a regex -- I saw one such str =~ /^\<\w+\>/ example in my searches.

Other options

  • It could be published as a shard, as has been done in several of the languages I surveyed. I believe that there is enough interest demonstrated that including it could be part of the "batteries included" notion, if that's still a thing.
  • It could just be published on my blog or in a random SO article, as Python's was for so long.

Cheers!

@straight-shoota
Copy link
Member

straight-shoota commented Mar 10, 2024

This sounds a lot like https://github.com/straight-shoota/sanitize

I published this as a shard for independent development and because of its complexity. I could see this shard (or a similar implementation) becoming a part of stdlib if we consider the use case to be very common. It's certainly a very important feature when you're dealing with untrusted HTML content.
An important aspect to such a security component is auditing. It would be devastating if a sanitization library becomes a liability (like the Python XSS example). So far, I'm not aware that my shard has received any relevant peer review.

@robacarp
Copy link
Contributor Author

Absolutely excellent, and I wish I'd been able to come across that in my searches.

@straight-shoota
Copy link
Member

Yeah, I guess there are quite a lot different terms for this kind of thing. Probably because you can view and resolve the problem from different angles. So that's not ideal for search.

What did you look for and where? Maybe we can improve discoverability a bit.

@robacarp
Copy link
Contributor Author

I was focused on the keyword "striptags". I looked in the HTML parsing section of awesome-crystal, and I search github for lang:Crystal striptags -- which is where I came across the inner-text pattern in crinja I referenced.

@straight-shoota
Copy link
Member

Thanks. lang:Crystal striptags works now 😏

https://shardbox.org/categories/HTML_XML_Parsing would've brought you there as well.

@robacarp
Copy link
Contributor Author

I've forgotten about or never heard of shardbox.org, but that makes me realize that I would have found it with https://shards.info/search?query=html too.

@kostya
Copy link
Contributor

kostya commented Mar 11, 2024

this example also similar to strip tags: https://github.com/kostya/lexbor/blob/master/examples/texts.cr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants