feat: encapsulate some whitespace-handling into a scrubber (or scrubbers) #279

flavorjones · 2023-12-04T22:32:00Z

From a Slack thread at https://rubyonrails-link.slack.com/archives/C05054QPL/p1700056469860939:

Has anyone found a way to use Nokogiri (or Loofah) to replace double-break tags with closing/opening paragraph tags? I have a lot of code in the database with this madness, and I would like to scrub it back out:
Some text here in a logical paragraph.
 <br>
 <br>
 Some more text, apparently a second paragraph.
 <br>
 <br>
 Et cetera...

and I replied with:

#!/usr/bin/env ruby

require "nokogiri"

html = <<~HTML
<p>Some text here in a logical paragraph.
  <br>
  <br>
  Some more text, apparently a second paragraph.
  <br>
  <br>
  Et cetera...
</p>
<p>foo
  <br id=1>
  <br id=2>
  bar
  <br id=11>
  <br id=12>
  bar
</p>
<p>baz
  <br id=3>
</p>
<notp>foo
  <br id=4>
  <br id=5>
</notp>
HTML

doc = Nokogiri::HTML5::Document.parse(html)
puts doc.to_html

p_with_brs = doc.xpath(%q{//p[br[following-sibling::br]]})

p_with_brs.each do |p|
  new_p = p.add_previous_sibling("<p>").first

  # remove blank text nodes
  p.children.each do |c|
    c.unlink if c.text? && c.blank?
  end

  p.children.each do |c|
    next if c.parent.nil? # already unlinked
    # &. avoids a NoMethodError when a <br> is the last remaining child
    if c.name == "br" && c.next_sibling&.name == "br"
      new_p = p.add_previous_sibling("<p>").first
      c.next_sibling.unlink
      c.unlink
    else
      c.parent = new_p
    end
  end

  p.unlink
end

puts doc.to_html

which outputs:

<html><head></head><body><p>Some text here in a logical paragraph.
  </p><p>
  Some more text, apparently a second paragraph.
  </p><p>
  Et cetera...
</p>
<p>foo
  </p><p>
  bar
  </p><p>
  bar
</p>
<p>baz
  <br id="3">
</p>
<notp>foo
  <br id="4">
  <br id="5">
</notp>
</body></html>

I think this could be useful in a scrubber if it's something people commonly do.

cc @walterdavis


josecolella · 2024-05-08T15:39:40Z

@torihuang and @josecolella are working on this

torihuang · 2024-05-08T15:45:58Z

Our initial thought is that we should implement a new scrubber, invoked like doc.scrub!(:breakpoint), which would remove all instances of <br><br>.

josecolella · 2024-05-08T19:09:51Z

This is the PR that should get us almost all the way there: #284

flavorjones added the feature label Dec 4, 2023

josecolella mentioned this issue May 8, 2024

[RubyConf] Create scrubber for replacing double breakpoints into paragraph nodes #284

Open
