Merge branch 'main' into rm-3.6
willkg committed Feb 9, 2022
2 parents 85b2a04 + f175b7c commit bda7741
Showing 9 changed files with 417 additions and 347 deletions.
1 change: 1 addition & 0 deletions .github/workflows/lint.yml
@@ -1,3 +1,4 @@
---
name: Lint

on: [push, pull_request]
10 changes: 5 additions & 5 deletions .github/workflows/test.yml
@@ -1,3 +1,4 @@
---
name: Test

on: [push, pull_request]
@@ -34,17 +35,16 @@ jobs:
restore-keys: |
${{ matrix.os }}-${{ matrix.python-version }}-
- name: Install dev dependencies
run: |
python -m pip install -r requirements-dev.txt
- name: Print and compare hashes for python and platform specific libraries
run: |
python -m pip install -U pip setuptools>=18.5 pip-tools==6.2.0
pip-compile --generate-hashes requirements-dev.in > requirements-dev.tmp
echo "diffing requirements-dev.txt and requirements-dev.tmp"
diff requirements-dev.txt requirements-dev.tmp || true
- name: Install dev dependencies
run: |
python -m pip install -r requirements-dev.txt
- name: Tests
shell: bash
run: ./scripts/run_tests.sh
9 changes: 0 additions & 9 deletions CODE_OF_CONDUCT.rst

This file was deleted.

8 changes: 6 additions & 2 deletions docs/dev.rst
@@ -19,10 +19,14 @@ To install Bleach to make changes to it:
$ pip install -e .


.. include:: ../CONTRIBUTING.rst
Code of conduct
===============

This project has a `code of conduct
<https://github.com/mozilla/bleach/blob/main/CODE_OF_CONDUCT.md>`_.

.. include:: ../CODE_OF_CONDUCT.rst

.. include:: ../CONTRIBUTING.rst


Docs
1 change: 1 addition & 0 deletions docs/index.rst
@@ -12,6 +12,7 @@ Contents
goals
dev
changes
migrating


Indices and tables
106 changes: 106 additions & 0 deletions docs/migrating.rst
@@ -0,0 +1,106 @@
.. highlight:: python

=====================================
Migrating from the html5lib sanitizer
=====================================

The `html5lib <https://github.com/html5lib/html5lib-python>`_ module `deprecated
<https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#11>`_ its
own sanitizer in version 1.1. The maintainers "recommend users migrate to
Bleach." This page tracks the issues encountered in that migration.

Migration path
==============

If you upgrade to html5lib 1.1+, you may get deprecation warnings when using its
sanitizer. If you follow the recommendation and switch to Bleach for
sanitization, you'll need to spend time tuning the Bleach sanitizer to your
needs because the Bleach sanitizer has different goals and is not a drop-in
replacement for the html5lib one.

Here is an example of replacing the sanitization method:

.. code::

    fragment = "<a href='https://github.com'>good</a> <script>bad();</script>"

    import html5lib
    parser = html5lib.html5parser.HTMLParser()
    parsed_fragment = parser.parseFragment(fragment)
    print(html5lib.serialize(parsed_fragment, sanitize=True))

    # '<a href="https://github.com">good</a> &lt;script&gt;bad();&lt;/script&gt;'

    import bleach
    print(bleach.clean(fragment))

    # '<a href="https://github.com">good</a> &lt;script&gt;bad();&lt;/script&gt;'

Escaping differences
====================

While html5lib will leave 'single' and "double" quotes alone, Bleach will escape
them as the corresponding HTML entities (``'`` becomes ``&#39;`` and ``"``
becomes ``&#34;``). Browsers decode these entities back to the original
characters, so this should be fine in most rendering contexts.
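
The claim that the escaped quotes are harmless can be checked without Bleach.
The sketch below uses only the standard library's ``html.unescape`` to show that
the numeric entities decode back to the original characters. The sample string
is hand-written for illustration, not produced by calling Bleach.

```python
import html

# Text in the escaped form described above (hand-written, an assumption about
# what the sanitized output looks like for this input).
escaped = "&#39;single&#39; and &#34;double&#34;"

# An HTML parser decodes numeric character references back to plain quotes.
decoded = html.unescape(escaped)
print(decoded)
```

If your pipeline compares sanitized output byte-for-byte (for example in tests
or caches), the comparison will need to account for the entity form.
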

Different allow lists
=====================

By default, html5lib and Bleach "allow" (i.e. don't sanitize) different sets of
HTML elements, HTML attributes, and CSS properties. For example, html5lib will
leave a ``<u>`` element alone, while Bleach will escape it:

.. code::

    fragment = "<u>hi</u>"

    import html5lib
    parser = html5lib.html5parser.HTMLParser()
    parsed_fragment = parser.parseFragment(fragment)
    print(html5lib.serialize(parsed_fragment, sanitize=True))

    # '<u>hi</u>'

    import bleach
    print(bleach.clean(fragment))

    # '&lt;u&gt;hi&lt;/u&gt;'

If you wish to retain html5lib's behaviour for specific HTML elements, allow
them with the ``tags`` argument (see the :ref:`chapter on clean()
<clean-chapter>` for more info):

.. code::

    import bleach

    fragment = "<u>hi</u>"
    print(bleach.clean(fragment, tags=['u']))

    # '<u>hi</u>'

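
Because ``tags`` replaces Bleach's default allow list rather than extending it,
it can help to work out which elements you need to add. The sketch below does
this with plain sets. ``BLEACH_DEFAULT_TAGS`` is copied by hand from Bleach 4.x
and ``TAGS_IN_USE`` is a hypothetical list of elements your content relies on;
both are assumptions to verify against your installed version and your data.

```python
# Assumption: Bleach's default tag allow list as of Bleach 4.x. Check
# bleach.sanitizer.ALLOWED_TAGS in the version you have installed.
BLEACH_DEFAULT_TAGS = {
    "a", "abbr", "acronym", "b", "blockquote", "code",
    "em", "i", "li", "ol", "strong", "ul",
}

# Hypothetical: elements that your existing content actually uses.
TAGS_IN_USE = {"a", "b", "div", "p", "u"}

# Elements Bleach would escape by default but that you want to keep.
extra_tags = sorted(TAGS_IN_USE - BLEACH_DEFAULT_TAGS)
print(extra_tags)  # ['div', 'p', 'u']

# The full list to pass as ``tags``: the defaults plus the extras.
tags_for_clean = sorted(TAGS_IN_USE | BLEACH_DEFAULT_TAGS)
```

The resulting ``tags_for_clean`` list would then be passed as
``bleach.clean(fragment, tags=tags_for_clean)``.
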
If you want to stick to the html5lib sanitizer's allow lists, get them from the
`sanitizer code
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/filters/sanitizer.py>`_.
It's probably best to copy them as static lists (as opposed to importing the
module and reading them dynamically) because:

* the lists are not part of the html5lib API
* the sanitizer module is already deprecated and might disappear
* importing the sanitizer module gives the deprecation warning (unless you take
  the effort to filter it)


.. code::

    import bleach

    SAFE_ELEMENTS = ["b", "p", "div"]
    SAFE_ATTRIBUTES = ["style"]
    SAFE_CSS_PROPERTIES = ["color"]

    fragment = "some unsafe html"
    safe_html = bleach.clean(
        fragment,
        tags=SAFE_ELEMENTS,
        attributes=SAFE_ATTRIBUTES,
        styles=SAFE_CSS_PROPERTIES
    )

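
If you do decide to read the lists dynamically despite the caveats above, the
import-time deprecation warning can be filtered with the standard ``warnings``
module. This is a minimal sketch: ``import_without_deprecation_warning`` is a
hypothetical helper name, not part of html5lib or Bleach.

```python
import importlib
import warnings


def import_without_deprecation_warning(module_name):
    """Import a module, ignoring DeprecationWarning raised during the import."""
    with warnings.catch_warnings():
        # The filter applies only inside this block; the previous warning
        # configuration is restored on exit.
        warnings.simplefilter("ignore", DeprecationWarning)
        return importlib.import_module(module_name)


# Intended usage (requires html5lib to be installed):
# sanitizer = import_without_deprecation_warning("html5lib.filters.sanitizer")
```

Copying the lists into your own code remains the more robust option, since the
module may be removed in a later html5lib release.
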
1 change: 1 addition & 0 deletions requirements-dev.in
@@ -1,6 +1,7 @@
# Requirements for installing other packages
pip
setuptools>=18.5
pip-tools==6.5.0

# Requirements to run the test suite
pytest