feat: Node#to_text replaces <br> with a newline
which probably should have always been the desired behavior

Closes #225
flavorjones committed Feb 11, 2022
1 parent eee3e65 commit c37bba7
Showing 6 changed files with 37 additions and 13 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,12 @@
 # Changelog

+## next / unreleased
+
+### Features
+
+* The `#to_text` method on `Loofah::HTML::{Document,DocumentFragment}` replaces `<br>` linebreak elements with a newline. [[#225](https://github.com/flavorjones/loofah/issues/225)]
+
+
 ## 2.13.0 / 2021-12-10

 ### Bug fixes
9 changes: 4 additions & 5 deletions README.md
@@ -133,13 +133,12 @@ and `text` to return plain text:
 doc.text # => "ohai! div is safe "
 ```

-Also, `to_text` is available, which does the right thing with
-whitespace around block-level elements.
+Also, `to_text` is available, which does the right thing with whitespace around block-level and line break elements.

 ``` ruby
-doc = Loofah.fragment("<h1>Title</h1><div>Content</div>")
-doc.text # => "TitleContent" # probably not what you want
-doc.to_text # => "\nTitle\n\nContent\n" # better
+doc = Loofah.fragment("<h1>Title</h1><div>Content<br>Next line</div>")
+doc.text # => "TitleContentNext line" # probably not what you want
+doc.to_text # => "\nTitle\n\nContent\nNext line\n" # better
 ```

 ### Loofah::XML::Document and Loofah::XML::DocumentFragment
7 changes: 5 additions & 2 deletions lib/loofah/elements.rb
@@ -70,8 +70,6 @@ module Elements
 video
 ]

-STRICT_BLOCK_LEVEL = STRICT_BLOCK_LEVEL_HTML4 + STRICT_BLOCK_LEVEL_HTML5
-
 # The following elements may also be considered block-level
 # elements since they may contain block-level elements
 LOOSE_BLOCK_LEVEL = Set.new %w[dd
@@ -86,7 +84,12 @@ module Elements
 tr
 ]

+# Elements that aren't block but should generate a newline in #to_text
+INLINE_LINE_BREAK = Set.new(["br"])
+
+STRICT_BLOCK_LEVEL = STRICT_BLOCK_LEVEL_HTML4 + STRICT_BLOCK_LEVEL_HTML5
 BLOCK_LEVEL = STRICT_BLOCK_LEVEL + LOOSE_BLOCK_LEVEL
+LINEBREAKERS = BLOCK_LEVEL + INLINE_LINE_BREAK
 end

 ::Loofah::MetaHelpers.add_downcased_set_members_to_all_set_constants ::Loofah::Elements
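
Note on the set composition in this file: the new constants are built by set union, so `LINEBREAKERS` contains every block-level name plus `br`, while `BLOCK_LEVEL` itself is unchanged. The following is a minimal sketch of that relationship using Ruby's standard `Set`. The module name `ElementSetsSketch` and the abbreviated element lists are assumptions made for illustration; they are not the real contents of `Loofah::Elements`.

```ruby
require "set"

# Hypothetical, abbreviated stand-ins for the sets in Loofah::Elements.
module ElementSetsSketch
  STRICT_BLOCK_LEVEL = Set.new(%w[div h1 p])
  LOOSE_BLOCK_LEVEL  = Set.new(%w[dd li tr])
  INLINE_LINE_BREAK  = Set.new(["br"])

  # Set#+ is an alias for Set#| (union); each expression returns a new Set
  # and leaves its operands untouched.
  BLOCK_LEVEL  = STRICT_BLOCK_LEVEL + LOOSE_BLOCK_LEVEL
  LINEBREAKERS = BLOCK_LEVEL + INLINE_LINE_BREAK
end

puts ElementSetsSketch::LINEBREAKERS.include?("br")   # => true
puts ElementSetsSketch::BLOCK_LEVEL.include?("br")    # => false
puts ElementSetsSketch::LINEBREAKERS.include?("span") # => false
```

Keeping `br` out of `BLOCK_LEVEL` means any other code that consults `BLOCK_LEVEL` keeps its existing behavior; only callers that opt into `LINEBREAKERS` see the line break element.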
8 changes: 4 additions & 4 deletions lib/loofah/instance_methods.rb
@@ -112,11 +112,11 @@ def text(options = {})
 # Returns a plain-text version of the markup contained by the
 # fragment, with HTML entities encoded.
 #
-# This method is slower than #to_text, but is clever about
-# whitespace around block elements.
+# This method is slower than #text, but is clever about
+# whitespace around block elements and line break elements.
 #
-#   Loofah.document("<h1>Title</h1><div>Content</div>").to_text
-#   # => "\nTitle\n\nContent\n"
+#   Loofah.document("<h1>Title</h1><div>Content<br>Next line</div>").to_text
+#   # => "\nTitle\n\nContent\nNext line\n"
 #
 def to_text(options = {})
   Loofah.remove_extraneous_whitespace self.dup.scrub!(:newline_block_elements).text(options)
9 changes: 7 additions & 2 deletions lib/loofah/scrubbers.rb
@@ -240,8 +240,13 @@ def initialize
   end

   def scrub(node)
-    return CONTINUE unless Loofah::Elements::BLOCK_LEVEL.include?(node.name)
-    node.add_next_sibling Nokogiri::XML::Text.new("\n#{node.content}\n", node.document)
+    return CONTINUE unless Loofah::Elements::LINEBREAKERS.include?(node.name)
+    replacement = if Loofah::Elements::INLINE_LINE_BREAK.include?(node.name)
+      "\n"
+    else
+      "\n#{node.content}\n"
+    end
+    node.add_next_sibling Nokogiri::XML::Text.new(replacement, node.document)
     node.remove
   end
 end
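
Note on the new branch in `scrub`: a line break element has no content of its own, so it is replaced by a bare newline, while a block-level element is replaced by its content wrapped in newlines. The decision can be sketched without Nokogiri as below. `ScrubSketch`, `FakeNode`, and `replacement_text` are hypothetical names used only for illustration, and the element sets are abbreviated; the real scrubber operates on Nokogiri nodes and returns `CONTINUE` for nodes it skips.

```ruby
require "set"

module ScrubSketch
  # Hypothetical stand-in for a node: only a tag name and its text content.
  FakeNode = Struct.new(:name, :content)

  # Abbreviated sets, for illustration only.
  BLOCK_LEVEL       = Set.new(%w[div h1 p])
  INLINE_LINE_BREAK = Set.new(["br"])
  LINEBREAKERS      = BLOCK_LEVEL + INLINE_LINE_BREAK

  # Returns the text that would replace the node, or nil when the node
  # is not a linebreaker and would be left in place.
  def self.replacement_text(node)
    return nil unless LINEBREAKERS.include?(node.name)

    if INLINE_LINE_BREAK.include?(node.name)
      "\n"                      # line break element: just a newline
    else
      "\n#{node.content}\n"     # block element: content wrapped in newlines
    end
  end
end

p ScrubSketch.replacement_text(ScrubSketch::FakeNode.new("br", ""))       # => "\n"
p ScrubSketch.replacement_text(ScrubSketch::FakeNode.new("div", "hello")) # => "\nhello\n"
p ScrubSketch.replacement_text(ScrubSketch::FakeNode.new("span", "x"))    # => nil
```

For the enclosing `<div>` in the new tests to produce both lines, the `<br>` has to become a newline before the block's content is read. That presumably depends on the scrubber visiting children before their parents, which is not visible in this hunk.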
10 changes: 10 additions & 0 deletions test/integration/test_html.rb
@@ -51,6 +51,11 @@ class IntegrationTestHtml < Loofah::TestCase
         html = Loofah.fragment "<div>tweedle\n\n\t\n\s\nbeetle</div>"
         assert_equal "\ntweedle\n\nbeetle\n", html.to_text
       end
+
+      it "replaces <br> with newlines" do
+        html = Loofah.fragment("hello<div>first line<br>second line</div>goodbye")
+        assert_equal("hello\nfirst line\nsecond line\ngoodbye", html.to_text)
+      end
     end

     context "with an `encoding` arg" do
@@ -84,6 +89,11 @@ class IntegrationTestHtml < Loofah::TestCase
         html = Loofah.document "<div>tweedle\n\n\t\n\s\nbeetle</div>"
         assert_equal "\ntweedle\n\nbeetle\n", html.to_text
       end
+
+      it "replaces <br> with newlines" do
+        html = Loofah.document("<body>hello<div>first line<br>second line</div>goodbye</body>")
+        assert_equal("hello\nfirst line\nsecond line\ngoodbye", html.to_text)
+      end
     end
   end
 end
