Stripping non-tags #79

Open
spekary opened this issue Mar 11, 2019 · 5 comments

@spekary

spekary commented Mar 11, 2019

bluemonday.StrictPolicy().Sanitize("a<b")

returns "a".

Is there a reason it's not looking for an actual tag, or is this a mistake?

spekary changed the title from "StrictPolicy strips non-tags." to "Stripping non-tags" on Mar 11, 2019
@spekary
Author

spekary commented Mar 11, 2019

Actually the UGCPolicy is doing the same thing. If someone was in a chat conversation about something math related, like "if a<b and b<c, then a<c", this all gets stripped after the first a.

@pauln
Contributor

pauln commented Aug 14, 2019

Further to this, it only happens if the < is immediately followed by a letter.

a<b becomes a
a < b becomes a &lt; b
a<3 becomes a&lt;3

@marotpam

Hi @buro9 any updates on this? Or has this problem been tackled in any other issue?

@theoptz

theoptz commented Dec 27, 2022

have the same issue

package main

import (
 "fmt"

 "github.com/microcosm-cc/bluemonday"
)

func main() {
 p := bluemonday.NewPolicy().AllowElements("p")
 fmt.Println(p.Sanitize("<p>hello</p>"))   // got "<p>hello</p>"
 fmt.Println(p.Sanitize("<p>< hello</p>")) // got "<p>&lt; hello</p>"
 fmt.Println(p.Sanitize("<p><hello</p>"))  // got "<p>" but expected "<p>&lt;hello</p>"
}

@buro9
Member

buro9 commented Dec 27, 2022

This stems from the Go HTML tokenizer, and it is behaving as I think it should. I'd note that the tokenizer is not part of bluemonday; we just depend on it. It's maintained by the core Go team.

Specifically it's this package golang.org/x/net/html

And what is happening depends on what follows a <.

If < followed by a space is encountered, that is "lt space", then it is tokenized as a less-than character followed by whitespace.

If <a is encountered, that is "lt letter-a", then it is tokenized as the start of a tag. In this case a would be anchor, so the start of an anchor tag. What happens next follows from the tokenizer now believing it is inside an HTML element declaration: it consumes input while waiting for the character that closes the tag, a >.
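The rule described above can be illustrated with a minimal, standard-library-only sketch. This is not the code from golang.org/x/net/html; the helper names isASCIILetter and looksLikeTagStart are hypothetical, and the sketch deliberately ignores end tags, comments and doctypes, which the real tokenizer also handles. It only shows why the three inputs reported earlier in the thread are treated differently.

```go
package main

import (
	"fmt"
	"strings"
)

// isASCIILetter reports whether c is in the range A-Z or a-z.
func isASCIILetter(c byte) bool {
	return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

// looksLikeTagStart is a simplified stand-in for the tokenizer's decision:
// a '<' at index i is treated as the start of a tag only when the very next
// byte is an ASCII letter. Anything else leaves the '<' as plain text.
func looksLikeTagStart(s string, i int) bool {
	if i < 0 || i >= len(s) || s[i] != '<' {
		return false
	}
	return i+1 < len(s) && isASCIILetter(s[i+1])
}

func main() {
	for _, in := range []string{"a<b", "a < b", "a<3"} {
		i := strings.IndexByte(in, '<')
		fmt.Printf("%q: treated as tag start = %v\n", in, looksLikeTagStart(in, i))
	}
}
```

Under this simplified rule only "a<b" is classified as opening a tag, which matches the observation that "a < b" and "a<3" come back escaped while "a<b" is cut off after the a.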

To the examples given:

  • By pauln on 14th Aug 2019: it only happens when followed by a letter, because the rules on how to parse HTML are contained in the HTML tokenizer package, and it understands that HTML element names start with letters.
  • By theoptz 7 minutes ago: it only happens in the example given because the lt followed by h is interpreted as an opening HTML tag.

bluemonday reads the token type of each token it is given, and it receives "lt letter" sequences as the start of an HTML tag.

This behaviour is expected.
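For readers who hit this with input that is meant to be plain text only (the chat message example earlier in the thread), one possible approach, not proposed by anyone in this thread and not a replacement for bluemonday on real HTML input, is to escape the text with the standard library instead of sanitizing it. The sketch below assumes the input should contain no markup at all; escapePlainText is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"html"
)

// escapePlainText treats the whole input as text rather than HTML.
// html.EscapeString replaces <, >, &, ' and " with character references,
// so nothing is ever interpreted as a tag and nothing is dropped.
// Assumption: the caller never wants any markup to survive.
func escapePlainText(s string) string {
	return html.EscapeString(s)
}

func main() {
	fmt.Println(escapePlainText("if a<b and b<c, then a<c"))
	// prints: if a&lt;b and b&lt;c, then a&lt;c
}
```

The trade-off is that this keeps every character but allows no HTML through, so it only fits cases where StrictPolicy was being used to render untrusted plain text.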
