Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert & to & as a Characters token #429

Merged
merged 1 commit into from Jan 4, 2019
Merged

Convert & to & as a Characters token #429

merged 1 commit into from Jan 4, 2019

Conversation

willkg
Copy link
Member

@willkg willkg commented Jan 3, 2019

This fixes a problem in LinkifyFilter when using it with the Cleaner where
the Cleaner sets up the tokenizer to not consume entities. So character
entities end up in their own Entity tokens and Linkifyfilter can't match
links that cross token boundaries. If there's a &, then LinkifyFilter
won't match across that.

This fixes that by converting & to & in the sanitizer when it's pulling out
entities and putting them in separate Entity tokens. The & Characters tokens
will get merged by BleachSanitizerFilter.__iter__ and & will get converted
back to & in the serialier.

Fixes #422

This fixes a problem in LinkifyFilter when using it with the Cleaner where
the Cleaner sets up the tokenizer to not consume entities. So character
entities end up in their own Entity tokens and Linkifyfilter can't match
links that cross token boundaries. If there's a &, then LinkifyFilter
won't match across that.

This fixes that by converting & to & in the sanitizer when it's pulling out
entities and putting them in separate Entity tokens. The & Characters tokens
will get merged by BleachSanitizerFilter.__iter__ and & will get converted
back to & in the serialier.

Fixes #422
@willkg willkg requested a review from g-k January 3, 2019 23:08
@willkg
Copy link
Member Author

willkg commented Jan 3, 2019

These fiddly problems drive me crazy. This one took hours to work out. I think this is the best approach. I looked at a few others involving modifying the tokenizer, the parser, and wrapping LinkifyFilter in a merge_charaters kind of iterator. All those options had various problems from being hard to reason about or adding another performance hit or I couldn't make them work. Ugh.

One nice thing about this approach is that if we hit other character entities that need to be expanded, we can do that in this section and it'll "just work".

If you see anything better, I'm all ears.

Copy link
Collaborator

@g-k g-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ lgtm

@willkg willkg merged commit 93a060e into mozilla:master Jan 4, 2019
@willkg
Copy link
Member Author

willkg commented Jan 4, 2019

Thank you!

@willkg willkg deleted the 422-amp branch January 7, 2019 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LinkifyFilter with clean breaks links at &
2 participants