feat: encapsulate some whitespace-handling into a scrubber (or scrubbers) #279

Open
flavorjones opened this issue Dec 4, 2023 · 3 comments

@flavorjones

From a Slack thread at https://rubyonrails-link.slack.com/archives/C05054QPL/p1700056469860939:

Has anyone found a way to use Nokogiri (or Loofah) to replace double-break tags with closing/opening paragraph tags? I have a lot of code in the database with this madness, and I would like to scrub it back out:

<p>Some text here in a logical paragraph.
  <br>
  <br>
  Some more text, apparently a second paragraph.
  <br>
  <br>
  Et cetera...
</p>

and I replied with:

#!/usr/bin/env ruby

require "nokogiri"

html = <<~HTML
<p>Some text here in a logical paragraph.
  <br>
  <br>
  Some more text, apparently a second paragraph.
  <br>
  <br>
  Et cetera...
</p>
<p>foo
  <br id=1>
  <br id=2>
  bar
  <br id=11>
  <br id=12>
  bar
</p>
<p>baz
  <br id=3>
</p>
<notp>foo
  <br id=4>
  <br id=5>
</notp>
HTML

doc = Nokogiri::HTML5::Document.parse(html)
puts doc.to_html

p_with_brs = doc.xpath(%q{//p[br[following-sibling::br]]})

p_with_brs.each do |p|
  new_p = p.add_previous_sibling("<p>").first

  # remove blank text nodes
  p.children.each do |c|
    c.unlink if c.text? && c.blank?
  end

  p.children.each do |c|
    next if c.parent.nil? # already unlinked
    # `&.` guards against a trailing <br> whose next_sibling is nil
    if c.name == "br" && c.next_sibling&.name == "br"
      new_p = p.add_previous_sibling("<p>").first
      c.next_sibling.unlink
      c.unlink
    else
      c.parent = new_p
    end
  end

  p.unlink
end

puts doc.to_html

which outputs the following from the second puts, after the transformation:

<html><head></head><body><p>Some text here in a logical paragraph.
  </p><p>
  Some more text, apparently a second paragraph.
  </p><p>
  Et cetera...
</p>
<p>foo
  </p><p>
  bar
  </p><p>
  bar
</p>
<p>baz
  <br id="3">
</p>
<notp>foo
  <br id="4">
  <br id="5">
</notp>
</body></html>

I think this could be useful in a scrubber if it's something people commonly do.

cc @walterdavis

@josecolella

@torihuang and @josecolella are working on this

@torihuang

Our initial thought is that we should implement a new scrubber, invoked like doc.scrub!(:breakpoint), which would remove all instances of <br>.

@josecolella

This is the PR that should get us almost all the way there: #284
