Best practices for white-listing tags #1347

Closed
Hvass-Labs opened this issue May 17, 2023 · 2 comments

@Hvass-Labs

Hello,

I would like to only use a small subset of Markdown in a web-app for untrusted user-comments. I only want the most basic paragraph-formatting, list-items, bold-text, etc. I don't want to allow section-headers, links, images, etc.

I am confused about the best way to do this.

The docs recommend using the Python packages bleach and bleach_whitelist to white/black-list certain HTML tags, but these packages seem to be deprecated. Other Python packages such as lxml can do this too, but that seems to be an awkward way of doing it.

Ideally the Markdown parser would just parse the tags I have white-listed, and render everything else safely as text.

Based on a StackOverflow post and your own docs, I could do something like:

import markdown

md_text = \
"""
# Hello

- fing
- fong
- dong

<a href="hello">bong</a>

Hello **Text**
"""

md = markdown.Markdown()
md.parser.blockprocessors.deregister('hashheader')
md.parser.blockprocessors.deregister('setextheader')
md.parser.blockprocessors.deregister('code')
md.preprocessors.deregister('html_block')
md.inlinePatterns.deregister('html')
print(md.convert(md_text))

which prints:

<p># Hello</p>
<ul>
<li>fing</li>
<li>fong</li>
<li>dong</li>
</ul>
<p>&lt;a href="hello"&gt;bong&lt;/a&gt;</p>
<p>Hello <strong>Text</strong></p>

I think this solution would work for me. The problem is that I don't know which processors to deregister, because they don't seem to be specified in the docs.
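
For readers with the same question, here is a minimal sketch of one way to see what a `Markdown` instance has registered. It assumes Python-Markdown 3.x, where each processor collection is a `Registry` that can be iterated and that accepts a registered name with the `in` operator. The `registries` dict and its labels are made up for this illustration.

```python
import markdown

md = markdown.Markdown()

# The five processor collections on a Markdown instance (labels are ours).
registries = {
    "preprocessors": md.preprocessors,
    "block processors": md.parser.blockprocessors,
    "tree processors": md.treeprocessors,
    "inline patterns": md.inlinePatterns,
    "postprocessors": md.postprocessors,
}

class_names = {}
for label, registry in registries.items():
    # Iterating a Registry yields the processor objects in priority order.
    class_names[label] = [type(item).__name__ for item in registry]
    print(f"{label}: {class_names[label]}")

# The strings passed to deregister() are registered names, not class names.
# A candidate name can be probed with `in` before deregistering it.
for name in ("hashheader", "setextheader", "code"):
    print(name, name in md.parser.blockprocessors)
```

This only shows what is registered in the installed version. It does not tell you which processors are safe to remove.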

I also worry that in a future release you might add other processors that I don't want, so I would have to keep informed about that and deregister those as well.

An easier solution for people like me who don't really know how your code works internally (and don't really want to learn, as I have a million other things to do) would be to provide a config of some kind that specifies which Markdown tags I want to support, like a white-list or allow-list.

Or perhaps you could publish a comprehensive list of the processors you have, so that I could instantiate a Markdown parser with only the few processors I want.

Is this something you would consider making? Or what is your recommendation for only enabling a small subset of Markdown tags in your parser?

Thanks!

@waylan
Member

waylan commented May 18, 2023

I should note that I see Markdown as a whole, not pieces which can be used piecemeal. Therefore, the parser was not designed to be used piecemeal either. We do not support a specific way to do so and have no intention of adding support for such a feature. That said, some users have hacked together solutions with varying success.

You are correct that deregistering is probably not a good solution. The way the parser works, earlier processors will act on some parts of the syntax, which means that syntax is no longer present when a later processor runs. If you remove the earlier processor, the later processor will now get a false positive on that same syntax. The entire system is meant to work as a whole. The extension API exists to add or alter the behavior and is not really intended to reduce the behavior. Although we are not preventing you from using it that way, it is likely that future updates could break your customization as we will not be testing for that use-case.
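
A small sketch of the false-positive effect described above, assuming Python-Markdown 3.x and its built-in `backtick` inline pattern name. The sample text is made up for this illustration. With the default parser, the code span protects the asterisks. With `backtick` deregistered, the emphasis pattern should pick up the same asterisks.

```python
import markdown

# Asterisks inside a code span are normally left alone.
text = "Use `**literal asterisks**` here."

full = markdown.Markdown()

reduced = markdown.Markdown()
# Remove the inline pattern that normally consumes code spans first.
reduced.inlinePatterns.deregister("backtick")

full_html = full.convert(text)
reduced_html = reduced.convert(text)

print(full_html)     # code span handled by the backtick pattern
print(reduced_html)  # the emphasis pattern now sees the asterisks
```

The same kind of interaction can occur between other processors, so removing one can change what the remaining ones match.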

Our recommendation to use Bleach is not intended to serve this purpose either. It is intended to provide security when Markdown content is coming from an untrusted source. The idea is that the full Markdown parser works normally, but the output is run through a sanitizer to ensure no third-party code is injected into the site. Sometimes people incorrectly think that limiting the Markdown syntax will address this, but as this article demonstrates, that is faulty reasoning. You actually need a sanitizer for this purpose.
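
To make the "render fully, then sanitize the output" idea concrete, here is a rough sketch that uses only the standard library's `html.parser`. It is an illustration of allow-list filtering, not a vetted sanitizer and not a substitute for a maintained one. `ALLOWED_TAGS`, `AllowListFilter`, `sanitize`, and the sample `rendered` string are all made up for this example. Allow-listed tags are re-emitted without attributes, and everything else is escaped as text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list for this sketch.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "strong", "em", "br"}


class AllowListFilter(HTMLParser):
    """Re-emit allow-listed tags without attributes; escape all other tags."""

    def __init__(self):
        # Keep entities such as &lt; as entities instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are always dropped
        else:
            self.out.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        closing = f"</{tag}>"
        self.out.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def sanitize(html_text):
    parser = AllowListFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


# Hand-written sample of what a full Markdown render might contain.
rendered = (
    '<h1>Hello</h1>\n'
    '<p><a href="hello">bong</a></p>\n'
    '<p>Hello <strong onclick="x()">Text</strong></p>'
)
cleaned = sanitize(rendered)
print(cleaned)
```

Comments, doctype declarations, and processing instructions are silently dropped here because the sketch does not override those handlers. For real untrusted input, a maintained sanitizer library is the safer choice.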

It is unfortunate that Bleach is deprecated. I was not aware of that. Perhaps more unfortunate is that there are no other pure Python HTML sanitization libraries that I am aware of. That being the case, I have no good suggestions for how to sanitize Markdown's output. Apparently messense/nh3 is a Python binding to a Rust library (Ammonia), but personally I know nothing about it.

So, if you are actually looking for a sanitizer, I'm sorry I can't help as this is out-of-scope for a Markdown parser. And if you are looking to limit the Markdown parser for other reasons, then we do not support that use-case (and have no plans to). Perhaps another parser does; I don't know. Sorry I couldn't be more helpful.

@Hvass-Labs
Author

Thanks for the quick and detailed reply!

It appears there are a few more Python libraries for Markdown rendering, including: mistune, marko, and misaka. So I'll take a look at those.
