Force JSoup to ignore custom HTML tags #2101

Open
SkyAphid opened this issue Jan 10, 2024 · 2 comments

Comments


SkyAphid commented Jan 10, 2024

I'm working with some code that parses HTML. This API has worked great so far for digging out the data I need and reading it easily, but I've run into an issue where JSoup inserts HTML into the text where it's neither wanted nor needed. This is supposedly a feature, but unfortunately it's completely ruining my entire implementation.

Here is the text:
<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>

It's pretty normal outside of the custom tag that's been added. The program is meant to parse that itself and change it into a name. I had presumed that if JSoup did not recognize a tag, it would leave it alone. But instead, it's mangling the text into this:

Oh, hello! You must be the person I've been waiting on all morning. You wouldn't happen to be <player>
    would you?
  </player></u></strong> <player>

It seemingly even adds a line break for some reason, and then also randomly adds another player tag onto the end, which confuses the system even more.

Is there a way to toggle this functionality off entirely, and have JSoup stick to tags it specifically recognizes? I'd like to solve this, because if I can't keep our custom tags formatted like HTML tags, I'll have to write a whole other system to parse a different format with something like [[]], which would be a bit redundant.

Also, just to clarify, all I want is for JSoup to ignore my custom tag entirely. I essentially want it to stay in its own lane and only parse pure HTML, and then ignore anything that isn't.

Thank you for your time.

@m-heider

The reason for this behavior is that your HTML is invalid.

Every start tag must have a matching end tag, aside from void elements and a few exceptions listed here:
https://html.spec.whatwg.org/multipage/syntax.html#optional-tags

Also, custom element names must contain a hyphen (e.g. <a-player>), but JSoup does not seem to enforce this.
https://html.spec.whatwg.org/multipage/custom-elements.html#custom-elements-core-concepts

I am not aware of any setting that ignores custom tags but there are two other options:

  1. You escape the angle brackets in <player>:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

Document doc;
String output;

doc = Jsoup.parse("""
                  <!DOCTYPE html>
                  <html lang="en">
                    <head><title>Title</title></head>
                    <body>
                      <u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be &lt;player&gt; would you?</u></strong>
                    </body>
                  </html>
                  """);

// html() serializes the text with entities, so unescape them to get the literal <player> back
output = Parser.unescapeEntities(doc.select("body").html(), true);

Output:

<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>
  2. You deal with the end tag in your program, disable pretty-printing in JSoup (which is what adds the extra line breaks), and hope that future versions of JSoup will enforce neither the end-tag rules nor the hyphen rule for custom elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc;
String output;
Document.OutputSettings outputSettings;

doc = Jsoup.parse("""
                  <!DOCTYPE html>
                  <html lang="en">
                    <head><title>Title</title></head>
                    <body>
                      <u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>
                    </body>
                  </html>
                  """);
                  
// Turn off pretty-printing so JSoup does not add line breaks or indentation to the output
outputSettings = new org.jsoup.nodes.Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);

output = doc.select("body").html().trim();

Output:

<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</player></u></strong>

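Either option could be wrapped in a small helper so the calling code does not have to handle the placeholder by hand. The following is only a minimal sketch in plain Java, not part of JSoup: the class name PlaceholderTags and the methods escape and resolve are hypothetical, and it assumes placeholders are always written as bare tags without attributes. escape covers option 1 (run it on the input before Jsoup.parse), and resolve covers option 2 (run it on the serialized output).

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for placeholder tags such as <player>; not a JSoup API. */
final class PlaceholderTags {
    // Matches a bare start or end tag such as <player> or </player>.
    // Tags that carry attributes never match and are left untouched.
    private static final Pattern BARE_TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9-]*)>");

    private PlaceholderTags() {}

    /** Option 1: escape the listed placeholder tags so an HTML parser treats them as text. */
    static String escape(String html, Set<String> names) {
        Matcher m = BARE_TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            boolean placeholder = names.contains(m.group(2).toLowerCase(Locale.ROOT));
            String replacement = placeholder
                    ? "&lt;" + m.group(1) + m.group(2) + "&gt;"
                    : m.group(); // ordinary HTML tag: keep as-is
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Option 2: drop the end tags the parser added, then substitute each placeholder's value. */
    static String resolve(String serializedHtml, Map<String, String> values) {
        String result = serializedHtml;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            String name = entry.getKey();
            result = result.replace("</" + name + ">", "")
                           .replace("<" + name + ">", entry.getValue());
        }
        return result;
    }
}
```

For example, PlaceholderTags.escape(text, Set.of("player")) turns <player> into &lt;player&gt; and leaves <u> and <strong> alone, while PlaceholderTags.resolve(output, Map.of("player", name)) removes the added </player> and inserts the name. Note that resolve inserts the value verbatim, so a name containing < or & would need to be escaped first.
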

SkyAphid commented Feb 6, 2024

Thank you for the response, and I apologize for my late one.

I was aware that it was identifying it as a tag and trying to treat it as such. Thankfully, I was able to work around it in my program and circumvent the entire thing. Unfortunately, it ended up massively overcomplicating my code.

The problem with this API, in my opinion, is that there is no way to turn off the autocorrection in the parse function. I'm not asking for the API to ignore custom tags entirely, but there should be a way to have JSoup parse the strings without calling whatever function inserts text into my Strings without my permission. It's made worse by the fact that I have no control over it whatsoever. Even a callback that fires when it edits the string would be nice, mostly so I could override it and have it not touch the string.

If this project is ever updated, I suggest the feature to work something like this:
JSoup.setFixErrors(false);

If this is set to false, then the code that inserts the end tag automatically will simply not be called, and the text will not be parsed by the system. Ideally, it'd also include an optional callback that catches the "error" and feeds it into the function.

If I could please be directed to the code in this API that handles this autocorrecting functionality, perhaps I could look into adding the support to help out, or at least have the change locally.

Thank you again for your time!
