Stripping non-tags #79

spekary · 2019-03-11T17:15:56Z

bluemonday.StrictPolicy().Sanitize("a<b")

returns "a".

Is there a reason it's not looking for an actual tag, or is this a mistake?


spekary · 2019-03-11T17:43:27Z

Actually, the UGCPolicy is doing the same thing. If someone were in a chat conversation about something math-related, like "if a<b and b<c, then a<c", everything after the first a gets stripped.

pauln · 2019-08-14T03:26:49Z

Further to this, it only happens if the < is immediately followed by a letter.

a<b becomes a
a < b becomes a < b
a<3 becomes a<3

marotpam · 2020-02-17T10:12:45Z

Hi @buro9 any updates on this? Or has this problem been tackled in any other issue?

theoptz · 2022-12-27T15:40:28Z

I have the same issue:

package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	p := bluemonday.NewPolicy().AllowElements("p")
	fmt.Println(p.Sanitize("<p>hello</p>"))   // got "<p>hello</p>"
	fmt.Println(p.Sanitize("<p>< hello</p>")) // got "<p>&lt; hello</p>"
	fmt.Println(p.Sanitize("<p><hello</p>"))  // got "<p>" but expected "<p>&lt;hello</p>"
}

buro9 · 2022-12-27T15:50:45Z

This stems from the Go HTML tokenizer, and it is performing how I think it should. I'd note that the tokenizer is not part of bluemonday; we just depend on it. It's maintained by the core Go team.

Specifically, it's this package: golang.org/x/net/html

And what is happening depends on what follows a <.

If "< " is encountered, that is "lt space", then it is tokenized as a less-than character followed by whitespace.

If "<a" is encountered, that is "lt letter-a", then it is tokenized as the start of a tag. In this case a would be anchor, so the start of an anchor tag. From that point the tokenizer treats the input as an HTML element declaration and keeps consuming input while it waits for the closing > of that tag.
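The rule described above can be sketched with the standard library alone. This is a simplified illustration, not the actual code in golang.org/x/net/html: classifyLT and isASCIILetter are hypothetical helpers invented for this example, and they only look at the character (or two) after a <.

```go
package main

import "fmt"

// isASCIILetter reports whether c is in a-z or A-Z.
func isASCIILetter(c byte) bool {
	return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

// classifyLT is a hypothetical helper (not part of bluemonday or
// golang.org/x/net/html). Given an input and the index of a '<', it reports
// how a tokenizer following the rule above would treat that '<'.
func classifyLT(s string, i int) string {
	if i < 0 || i >= len(s) || s[i] != '<' {
		return "not a '<'"
	}
	if i+1 >= len(s) {
		return "text" // a trailing '<' has nothing after it
	}
	switch c := s[i+1]; {
	case isASCIILetter(c):
		return "start tag" // "lt letter": the tokenizer enters a tag
	case c == '/':
		if i+2 < len(s) && isASCIILetter(s[i+2]) {
			return "end tag"
		}
		return "other (not modelled in this sketch)"
	case c == '!' || c == '?':
		return "comment or doctype"
	default:
		return "text" // e.g. "lt space" or "lt digit"
	}
}

func main() {
	// Inputs taken from the examples in this thread.
	for _, s := range []string{"a<b", "a < b", "a<3", "<p><hello</p>"} {
		for i := 0; i < len(s); i++ {
			if s[i] == '<' {
				fmt.Printf("%q at index %d: %s\n", s, i, classifyLT(s, i))
			}
		}
	}
}
```

Under this sketch, the < in a<b and in <hello is classified as the start of a tag, while the < in "a < b" and a<3 is classified as text, which matches the outputs reported earlier in the thread.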

To the examples given:

By pauln on 14th Aug 2019: it only happens when the < is followed by a letter, because the rules on how to parse HTML are contained in the HTML tokenizer package, and it understands that HTML element names start with letters.
By theoptz on 27th Dec 2022: it only happens in the third example, as the lt followed by h is interpreted as an opening HTML tag.

bluemonday checks the token type of each token it reads, and it receives "lt letter" as the start of an HTML tag.

This behaviour is expected.

spekary changed the title ~~StrictPolicy strips non-tags.~~ Stripping non-tags Mar 11, 2019
