Best practices for white-listing tags #1347

Hvass-Labs · 2023-05-17T11:38:31Z

Hello,

I would like to only use a small subset of Markdown in a web-app for untrusted user-comments. I only want the most basic paragraph-formatting, list-items, bold-text, etc. I don't want to allow section-headers, links, images, etc.

I am confused about the best way to do this.

The docs recommend using the Python packages bleach and bleach_whitelist to white/black-list certain HTML tags, but these packages seem to be deprecated. Other Python packages such as lxml can do this too, but that seems to be an awkward way of doing it.

Ideally the Markdown parser would just parse the tags I have white-listed, and render everything else safely as text.

Based on a StackOverflow post and your own docs, I could do something like:

import markdown

md_text = \
"""
# Hello

- fing
- fong
- dong

<a href="hello">bong</a>

Hello **Text**
"""

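md = markdown.Markdown()
# Disable ATX (#) headers, Setext (underlined) headers and indented code blocks.
md.parser.blockprocessors.deregister('hashheader')
md.parser.blockprocessors.deregister('setextheader')
md.parser.blockprocessors.deregister('code')
# Stop raw HTML from passing through, at both block and inline level.
md.preprocessors.deregister('html_block')
md.inlinePatterns.deregister('html')
print(md.convert(md_text))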

which prints:

<p># Hello</p>
<ul>
<li>fing</li>
<li>fong</li>
<li>dong</li>
</ul>
<p>&lt;a href="hello"&gt;bong&lt;/a&gt;</p>
<p>Hello <strong>Text</strong></p>

I think this solution would work for me. The problem is that I don't know which processors to deregister, because they don't seem to be specified in the docs.

I also worry that in a future release you might add other processors that I don't want, so I would have to keep informed about that and deregister those as well.

An easier solution for people like me who don't really know how your code works internally (and don't really want to learn, as I have a million other things to do), would be to provide a config of some kind, to specify which Markdown tags I want to support, like a white-list or allow-list.

Or perhaps if you make a comprehensive list of the processors you have, and then I can instantiate a Markdown parser with only the few processors I want.
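As a rough sketch of how such a list could be produced today: the processor registries on a Markdown instance are iterable, so the class names of the built-in processors can be printed without reading the source. This assumes the registry attributes used in the snippet above, plus md.treeprocessors and md.postprocessors; list_processors is a hypothetical helper name, and nothing here is a supported way to trim the parser.

```python
import markdown


def list_processors(md):
    """Return the class names of the processors in each registry, in priority order."""
    registries = {
        "preprocessors": md.preprocessors,
        "blockprocessors": md.parser.blockprocessors,
        "inlinePatterns": md.inlinePatterns,
        "treeprocessors": md.treeprocessors,
        "postprocessors": md.postprocessors,
    }
    # Iterating a registry yields the registered processor objects.
    return {
        group: [type(proc).__name__ for proc in registry]
        for group, registry in registries.items()
    }


md = markdown.Markdown()
processors = list_processors(md)
for group, names in processors.items():
    print(f"{group}: {', '.join(names)}")

# Registered names (the strings passed to deregister) can be tested with `in`.
print('hashheader' in md.parser.blockprocessors)
```

The printed class names only show what a given release registers; they may well change between versions, which is the maintenance worry described above.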

Is this something you would consider making? Or what is your recommendation for only enabling a small subset of Markdown tags in your parser?

Thanks!


waylan · 2023-05-18T19:21:42Z

I should note that I see Markdown as a whole, not pieces which can be used piecemeal. Therefore, the parser was not designed to be used piecemeal either. We do not support a specific way to do so and have no intention of adding support for such a feature. That said, some users have hacked together solutions with varying success.

You are correct that deregistering is probably not a good solution. The way the parser works, earlier processors will act on some parts of the syntax, which means that that syntax is no longer present when a later processor runs. If you remove the earlier processor, the later processor will now get a false positive on that same syntax. The entire system is meant to work as a whole. The extension API exists to add or alter the behavior and is not really intended to reduce the behavior. Although we are not preventing you from using it that way, it is likely that future updates could break your customization as we will not be testing for that use-case.

Our recommendation to use Bleach is not intended to serve this purpose either. It is intended to provide security when Markdown content is coming from an untrusted source. The idea is that the full Markdown parser works normally, but the output is then run through a sanitizer to ensure no third party code is injected into the site. Sometimes people will incorrectly think that limiting the Markdown syntax will address this, but as this article demonstrates, that is faulty reasoning. You actually need a sanitizer for this purpose.
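To illustrate the shape of that pipeline (render first, sanitize the resulting HTML second), here is a minimal allow-list sketch using only the standard library's html.parser. The names ALLOWED_TAGS, AllowListSanitizer and sanitize are hypothetical, the sample input is hand-written HTML standing in for the parser's output, and this is a toy for showing the idea, not a substitute for a maintained, audited sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list for basic comment formatting; adjust as needed.
ALLOWED_TAGS = {"p", "ul", "ol", "li", "strong", "em", "br"}


class AllowListSanitizer(HTMLParser):
    """Keep allow-listed tags without attributes; escape every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # all attributes are discarded
        else:
            self.parts.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing tags such as <br /> like ordinary start tags.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")
        else:
            self.parts.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        # Text arrives with character references decoded; re-escape it.
        self.parts.append(escape(data))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


# Stand-in for the output of a full Markdown render of an untrusted comment.
rendered = (
    "<h1>Hello</h1>\n"
    '<p>Hello <strong>Text</strong> <a href="javascript:alert(1)">bong</a></p>'
)
print(sanitize(rendered))
```

The design choice is that nothing depends on which Markdown syntax was enabled: whatever HTML comes out, only allow-listed tags survive as markup and the rest is shown as text.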

It is unfortunate that Bleach is deprecated. I was not aware of that. Perhaps more unfortunate is that there are no other pure Python HTML sanitization libraries that I am aware of. That being the case, I have no good suggestions for how to sanitize Markdown's output. Apparently messense/nh3 is a Python binding to a Rust library (Ammonia), but personally I know nothing about it.

So, if you are actually looking for a sanitizer, I'm sorry I can't help as this is out-of-scope for a Markdown parser. And if you are looking to limit the Markdown parser for other reasons, then we do not support that use-case (and have no plans to). Perhaps another parser does; I don't know. Sorry I couldn't be more helpful.

Hvass-Labs · 2023-05-22T13:34:05Z

Thanks for the quick and detailed reply!

It appears there are a few more Python libraries for Markdown rendering, including: mistune, marko, and misaka. So I'll take a look at those.

Hvass-Labs closed this as completed May 22, 2023
