Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#to_text doesn't handle <br> elements well. #225

Closed
der-flo opened this issue Feb 11, 2022 · 4 comments · Fixed by #226
Closed

#to_text doesn't handle <br> elements well. #225

der-flo opened this issue Feb 11, 2022 · 4 comments · Fixed by #226

Comments

@der-flo
Copy link

der-flo commented Feb 11, 2022

#to_text doesn't handle <br> elements explicitly, so phrases like "foo<br>bar" get sanitzed to "foobar". In our use-case a line break ("foo\nbar") is a more practical result.

So I wrote a custom scrubber:

class HTMLToPlainText
  class BreakElementScrubber < Loofah::Scrubber
    def scrub(node)
      return CONTINUE unless node.name == 'br'
 
      node.add_previous_sibling Nokogiri::XML::Text.new("\n", node.document)
      node.remove
    end
  end

  def call(html_string)
    Loofah::HTML::Document
      .parse(html_string)
      .scrub!(:prune)
      .scrub!(BreakElementScrubber.new)
      .to_text(encode_special_chars: false)
      .strip
  end
end

Is this something valuable which should find it's way in the project via a pull request? If yes, could a maintainer give me implementation hints etc.? I'm willing to help.

@flavorjones
Copy link
Owner

Hi, @der-flo! Thanks for pointing this out, that seems like a really good feature to add. I don't think we need a new scrubber for it, though, I think we should modify the current behavior of #to_text (or, really, the behavior of the underlying scrubber, Loofah::Scrubbers::NewlineBlockElements.

That scrubber currently adds newlines for "block" elements in HTML4 and HTML5 plus a loose set of elements that can contain block elements. But it doesn't pay attention to the element br which is the direct analog of a newline in HTML.

I'll write up a PR for this now, it should be a very small change.

flavorjones added a commit that referenced this issue Feb 11, 2022
which probably should have always been the desired behavior

Closes #225
@flavorjones
Copy link
Owner

Loofah 2.14.0 has been released with #to_text inserting newlines for <br> elements. I hope this meets your needs!

@der-flo
Copy link
Author

der-flo commented Feb 23, 2022

Your solution works fine, thanks!

@flavorjones
Copy link
Owner

Happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants