Should enforce that tags start with ascii characters only #1006

Closed
jackila opened this issue Jan 16, 2018 · 6 comments

@jackila

jackila commented Jan 16, 2018

When I use Jsoup.parse() to parse this HTML:

<p>5.到推机制,推荐会员匹配成功超过24小时不打款帐号会被冻结,扣领导人
<会员挂单金额5%>动态做为惩罚</p>
<p><br></p>

The result comes out like this:

<p>5.到推机制,推荐会员匹配成功超过24小时不打款帐号会被冻结,扣领导人
<会员挂单金额5%>
动态做为惩罚
</会员挂单金额5%></p>
<p><br /></p>

Apparently <会员挂单金额5%> is not a tag. Is there any way to solve this?

@krystiangorecki
Contributor

If that's the only problematic tag you can escape it manually:

html = html.replaceAll("<会员挂单金额5%>", "&lt;会员挂单金额5%&gt;");

@ghost

ghost commented Jun 27, 2018

How can this be handled when there are multiple problematic tags?
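
One possible approach when there are many such tags, offered only as a sketch: pre-process the input and escape every `<` that cannot open a tag under the HTML tokenizer rules, that is, a `<` not followed by an ASCII letter, `/`, `!`, or `?`. The class and method names below are hypothetical, and the snippet uses only `java.util.regex`. It does not special-case `<script>`, `<style>`, or comment content, where a raw `<` would also be rewritten.

```java
import java.util.regex.Pattern;

class EscapeNonAsciiTags {
    // Matches "<" when the next character is NOT an ASCII letter, "/", "!" or "?".
    // (Hypothetical helper, not part of jsoup.)
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    // Replaces every such "<" with the entity "&lt;" so a parser treats it as text.
    static String escapeStrayLt(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        String html = "<p>扣领导人<会员挂单金额5%>动态做为惩罚</p><p><br></p>";
        // The escaped string can then be handed to Jsoup.parse(...).
        System.out.println(escapeStrayLt(html));
    }
}
```

Real tags such as `<p>`, `</p>`, and `<br>` are left untouched because they start with an ASCII letter or `/`. The trailing `>` of the escaped pseudo-tag stays as it is, which is valid as plain text in HTML.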

@SliverySky

In this version, Chinese characters are also accepted as legal tag names, e.g. <一></一>, which Chrome does not allow. The string "<一>" is mistaken for an unclosed tag and replaced by <一></一>. The reason is that Character.isLetter() treats letters from every language as letters, not only ASCII ones. See https://github.com/jhy/jsoup/pull/1390

jhy changed the title from "How to ignore illegal tag in html body" to "Should enforce that tags start with ascii characters only" on Jan 11, 2021
jhy added the "bug" (Confirmed bug that we should fix) label on Jan 11, 2021
@jhy
Owner

jhy commented Jan 11, 2021

Per the HTML spec, and from checking in Chrome: a tag name must start with an ASCII letter, and any character after that is allowed.

See tag open state followed by tag name state.

So, valid tags should be:

<a>
<a会员挂单金额5>
<table(╯°□°)╯>

Invalid:

<一>
<会员挂单金额5>
<(╯°□°)╯>
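
The rule above can be sketched as a small check, shown next to `Character.isLetter()` for comparison. This is only an illustration under stated assumptions: the class and method names are hypothetical, and this is not jsoup's actual tokenizer code.

```java
class TagStartCheck {
    // ASCII-only letter test, matching the spec's "tag open state" requirement.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // True if the text begins like a start tag: "<" followed by an ASCII letter.
    static boolean opensTag(String s) {
        return s.length() >= 2 && s.charAt(0) == '<' && isAsciiLetter(s.charAt(1));
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a>", "<a会员挂单金额5>", "<table(╯°□°)╯>",   // listed as valid
            "<一>", "<会员挂单金额5>", "<(╯°□°)╯>"          // listed as invalid
        };
        for (String s : samples) {
            // Character.isLetter() also accepts CJK ideographs, which is the bug's cause.
            System.out.println(s + " -> opensTag=" + opensTag(s)
                + ", Character.isLetter=" + Character.isLetter(s.charAt(1)));
        }
    }
}
```

Only the first character after `<` is checked here; the remaining characters of the name are not validated, in line with the comment that anything may follow.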

@SliverySky

SliverySky commented Jan 11, 2021 via email

jhy closed this as completed in e6b11b0 on Aug 12, 2021
jhy self-assigned this on Aug 12, 2021
jhy added this to the 1.14.2 milestone on Aug 12, 2021
@jhy
Owner

jhy commented Aug 12, 2021

Thanks, fixed now

jhy added the "fixed" label on Aug 12, 2021