Should enforce that tags start with ascii characters only #1006
Comments
If that's the only problematic tag you can escape it manually:
|
How to handle this issue if there are multiple problematic tags? |
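One possible workaround for input with many such pseudo-tags is to escape every `<` that cannot start markup before the text reaches the parser. The sketch below is not part of jsoup; the class `StrayAngleEscaper` and the method `escapeStrayLt` are hypothetical names chosen for illustration. It assumes that a `<` followed by anything other than an ASCII letter, `/`, `!` or `?` is meant as literal text, and it would also rewrite such a `<` inside script, style, or comment content, so it only suits simple fragments like the one in this issue.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not a jsoup API.
final class StrayAngleEscaper {
    // Matches '<' unless the next character is an ASCII letter, '/', '!' or '?'.
    // Those four cases are left alone because they can begin a start tag,
    // an end tag, a comment/doctype, or a processing-instruction-like construct.
    private static final Pattern STRAY_LT = Pattern.compile("<(?![A-Za-z/!?])");

    // Replaces every stray '<' with the entity "&lt;" and leaves the rest untouched.
    static String escapeStrayLt(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        // \u4f1a\u5458 are the first two characters of the pseudo-tag from this issue.
        String input = "<p>x<\u4f1a\u54585%>y</p><p><br></p>";
        System.out.println(escapeStrayLt(input));
    }
}
```

The escaped string can then be handed to the parser as usual; real tags such as `<p>` and `<br>` pass through unchanged because they start with an ASCII letter.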
In this version, Chinese characters are also accepted as the start of a tag name, e.g. <一></一>, which Chrome does not allow. The string "<一>" is mistaken for an unclosed tag and replaced by <一></一>. The reason is that Character.isLetter() treats letters from every language as letters. https://github.com/jhy/jsoup/pull/1390 |
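The difference between `Character.isLetter()` and an ASCII-only check can be seen in a few lines. This is a minimal sketch, not the code from the linked PR; the class `TagStartCheck` and the methods `isAsciiLetter` and `canOpenTag` are hypothetical names, and `canOpenTag` only inspects the first character after `<` rather than tokenizing the whole tag.

```java
// Hypothetical illustration, not jsoup's tokenizer.
final class TagStartCheck {
    // True only for a-z and A-Z, unlike Character.isLetter, which accepts
    // letters from any script (including CJK ideographs).
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // True if the string begins with '<' followed by an ASCII letter,
    // i.e. the first character of the name permits a start tag.
    static boolean canOpenTag(String s) {
        return s.length() >= 2 && s.charAt(0) == '<' && isAsciiLetter(s.charAt(1));
    }

    public static void main(String[] args) {
        char cjk = '\u4e00'; // the character used in the "<...>" example above
        System.out.println(Character.isLetter(cjk)); // letter in the Unicode sense
        System.out.println(isAsciiLetter(cjk));      // not an ASCII letter
        System.out.println(canOpenTag("<a" + cjk + ">"));
        System.out.println(canOpenTag("<" + cjk + ">"));
    }
}
```

With a check like this, a name that begins with a non-ASCII letter is treated as text, while names that merely contain non-ASCII characters after an ASCII first letter are still accepted.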
Per the HTML spec, and as checked in Chrome: a tag name must start with an ASCII letter, and any character after that is allowed. See the tag open state followed by the tag name state. So, valid tags should be:
<a一>
<a会员挂单金额5>
<table(╯°□°)╯>
Invalid:
|
Thank you for your reply, which explained the cause of this problem (why `<一>` doesn't work in Chrome). I will also standardize the format of my PRs. Thanks for the reminder!
|
Thanks, fixed now |
When I use Jsoup.parse('') to parse this HTML:
<p>5.到推机制,推荐会员匹配成功超过24小时不打款帐号会被冻结,扣领导人
<会员挂单金额5%>动态做为惩罚</p>
<p><br></p>
The result comes out like this:
<p>5.到推机制,推荐会员匹配成功超过24小时不打款帐号会被冻结,扣领导人
<会员挂单金额5%>
动态做为惩罚
</会员挂单金额5%></p>
<p><br /></p>
Apparently <会员挂单金额5%> is not a tag. Is there any way to solve this?