allow custom scrubbers to leverage the HTML5lib scrubbing already written #14

flavorjones · 2010-01-28T07:59:29Z

A couple of commonly requested features:

add or remove attributes from the whitelists
turn off CSS scrubbing
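
As a rough illustration of the first request, here is a minimal plain-Ruby sketch of deriving a custom allowlist from a default one by adding and removing attribute names. `DEFAULT_ATTRIBUTES` and `customize` are hypothetical names invented for this sketch; they are not part of Loofah's API, and the sketch does not call Loofah at all.

```ruby
require "set"

# Hypothetical default allowlist (an assumption, not Loofah's real constant).
DEFAULT_ATTRIBUTES = Set["href", "src", "title", "class"].freeze

# Build a new allowlist from a base set without mutating the base.
def customize(base, add: [], remove: [])
  (base | add.to_set) - remove.to_set
end

custom = customize(DEFAULT_ATTRIBUTES, add: ["rel"], remove: ["class"])
puts custom.to_a.inspect
```

Returning a new set rather than editing the default in place means two callers can hold different allowlists without affecting each other.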

ruckus · 2010-01-28T19:07:35Z

+1 on this ticket / request. I wanted more custom control over which elements/attributes are in the whitelist set, and I had to achieve it like so:

http://gist.github.com/289027

wbharding · 2010-05-27T06:28:14Z

I'm trying to find a good way to add to the whitelist attributes right now and am coming up empty on a straightforward way to monkeypatch. I just want to add a single element, but it seems excessively hard: whitelist.rb declares the constants and then digests them permanently via a method in the same file, so I can't even seem to monkeypatch it.

flavorjones · 2010-05-28T13:14:54Z

I hear you! I'll be working on Loofah a bit over the next couple of weeks, and this will be one of the things I'll work on.

wbharding · 2010-05-28T16:42:38Z

fwiw, I did figure out how to monkeypatch it. Just add a new key/value to the HashedWhitelist. But of course it's always a tad nicer when one doesn't need to monkeypatch.
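
One plausible reading of why this works, sketched in plain Ruby: if the whitelist constants are converted into a lookup hash once at load time, later edits to the source constants never reach the hash, so the hash itself has to be patched. The module and constant names below (`Allowlist`, `HashedAllowlist`, `ALLOWED_ELEMENTS`) are hypothetical stand-ins, not Loofah's real internals.

```ruby
require "set"

# Hypothetical stand-in for the constants declared in whitelist.rb.
module Allowlist
  ALLOWED_ELEMENTS = Set["a", "p", "em"]
end

# Hypothetical stand-in for the "digested" form: built once, at load time.
module HashedAllowlist
  ALLOWED_ELEMENTS = Allowlist::ALLOWED_ELEMENTS.to_h { |name| [name, true] }
end

# Editing the source set after load does not reach the digested hash...
Allowlist::ALLOWED_ELEMENTS << "video"
seen_before = HashedAllowlist::ALLOWED_ELEMENTS.key?("video") # => false

# ...so the workaround is to add the key to the digested hash directly.
HashedAllowlist::ALLOWED_ELEMENTS["video"] = true
seen_after = HashedAllowlist::ALLOWED_ELEMENTS.key?("video") # => true

puts "before patch: #{seen_before}, after patch: #{seen_after}"
```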

electrum · 2010-10-25T22:38:36Z

Any thoughts or progress on this? I need to add and remove some whitelist attributes.

flavorjones · 2010-10-26T05:14:08Z

Just released 1.0.0; this is probably my next priority.

Any thoughts on what you think the API should look like to control whitelists?

bf4 · 2012-03-19T00:48:14Z

Just FYI: I have some almost-complete work on a whitelist for elements and attributes (the use case of a valid element containing an invalid element that in turn contains a valid one is still broken). The code is at https://github.com/bf4/Notes/blob/master/code/ruby/html_processing.rb and I'll open a pull request when it's ready.

flavorjones · 2012-03-20T21:26:02Z

It's worth noting that I've got a branch somewhere that I started, which implements a Rails-internals-compatible implementation of whitelists. This is so that, at some point, Loofah may be a pluggable sanitizer for any Rails app.

I should probably finish that up. ;)

bf4 · 2013-04-09T19:50:51Z

I still need to write a pull request, but the WhiteListTagScrubber really does work: https://github.com/bf4/Notes/blob/loofah-testing/code/ruby/html_processing.rb

# usage
# all_attributes = ['id','class']
# tags_we_want =
#   {
#   'br' => [],
#   'ol' => all_attributes,
#   'ul' => all_attributes,
#   'li' => all_attributes,
#   'strong' => all_attributes,
#   'p' => all_attributes,
#   'i' => all_attributes,
#   'em' => all_attributes,
#   'a' => ['href','rel'].concat(all_attributes)
# }
# updater = CustomScrubber.new
# updater.clean_html(message_dirty, tags_we_want.keys, tags_we_want) do |html|
#      updater.line_breaks_to_br(html)
# end


class WhiteListTagScrubber < Loofah::Scrubber
  attr_reader :tags, :attributes
  def initialize(options = {}, &block)
    @tags = Array(options.delete(:tags))
    @attributes = options.delete(:attributes) || {}
    super(options, &block)
  end
  def debug(type,&block)
    if ENV['DEBUG'] =~ /true/i
      puts "**** #{type}, #{block.call.inspect}"
    end
  end
  def scrub(node)
    debug("processing") {  "#{node.type}: #{node.name}, namespaces #{node.namespaces.inspect}" }
    case node.type
    when Nokogiri::XML::Node::ELEMENT_NODE

      # see strip: return CONTINUE if html5lib_sanitize(node) == CONTINUE
      if tags.include? node.name
        # remove all attributes except the ones we whitelisted per tag
        clean_with_attributes(node,true)
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      else
        # remove all attributes
        clean_with_attributes(node,false)
        # remove the node and its contents entirely.
        # there's nothing good in these
        if %w{script style meta link}.include?(node.name)
          node.remove
        else
          # remove this undesired node and scrub each child node
          remove_node_and_add_children(node)
        end
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      end
    when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
      return Loofah::Scrubber::CONTINUE
    end
    node.remove
    Loofah::Scrubber::STOP
  end
  def remove_node_and_add_children(node)
    # alternatively see :strip
    # node.before node.children
    current_node = node
    node.children.each do |kid|
      previous_node = current_node
      current_node = current_node.add_next_sibling(kid)
      scrub(previous_node) unless previous_node == node
    end
    scrub(current_node) unless current_node == node
    node.remove
  end
  # Strip every attribute not whitelisted for this tag; when use_attributes
  # is false, strip all attributes.
  def clean_with_attributes(node,use_attributes=true)
    allowed = use_attributes ? Array(attributes[node.name]) : []
    node.attributes.each_key { |name| node.remove_attribute(name) unless allowed.include?(name) }
  end
end

class CustomScrubber
  # uses Loofah
  def clean_html(html, tags=[],attributes={})

    yield Loofah.fragment(html).scrub!(scrub_tags_except(tags,attributes)).to_s

  end
  # perhaps also see the scrubber
  # :newline_block_elements
  def line_breaks_to_br(html)
    html.gsub(/\r?\n/,'<br>')
  end
  # tags in an array of tags
  # attributes is a hash of the previous tags with an array of their whitelisted attributes
  # needs to be DRYed
  def scrub_tags_except(tags,attributes)
    options = {:tags => tags, :attributes => attributes }
    WhiteListTagScrubber.new(options)
  end
end

abitdodgy · 2013-11-25T14:07:33Z

Curious, anything new on this issue? What's the current way of handling custom scrubbers? The solutions here seem a bit laborious (relative to how Sanitize handles custom configs).

saneshark · 2014-10-08T18:52:54Z

👍 Completely agree with @abitdodgy

Just take a look at how simple and straightforward this DSL is: https://github.com/rgrove/sanitize/blob/master/lib/sanitize/config/relaxed.rb

Having a means of processing something like that, and perhaps even applying additional regexes to attribute values (background image src, etc.), would be a big win. I would just use Sanitize, but seeing as this is getting merged into Rails 4.2, I thought it would be a useful addition.
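
To make the idea concrete, here is a minimal plain-Ruby sketch of a declarative config with optional per-attribute value patterns. The config keys, the `CONFIG` constant, and `attribute_allowed?` are all assumptions made up for this sketch; it is not Loofah's or Sanitize's actual schema or API, and the patterns are deliberately simplistic.

```ruby
# Hypothetical declarative config: allowed elements, and for each element the
# allowed attribute names mapped to an optional value pattern (nil = any value).
CONFIG = {
  elements: %w[a p img],
  attributes: {
    "a"   => { "href" => %r{\Ahttps?://}i, "rel" => nil },
    "img" => { "src" => %r{\Ahttps://}i, "alt" => nil }
  }
}.freeze

# An attribute passes if its element and name are listed and, when a pattern
# is given, the value matches that pattern.
def attribute_allowed?(config, element, name, value)
  return false unless config[:elements].include?(element)

  rules = config[:attributes].fetch(element, {})
  return false unless rules.key?(name)

  pattern = rules[name]
  pattern.nil? || pattern.match?(value)
end

puts attribute_allowed?(CONFIG, "a", "href", "https://example.com")
puts attribute_allowed?(CONFIG, "a", "href", "javascript:alert(1)")
```

A scrubber could consult a check like this for each attribute on each node; real URL validation would need to be considerably stricter than these patterns.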

DanDevine · 2018-03-19T19:33:31Z

+1, would really like this feature.

jemminger · 2022-01-05T22:15:04Z

+1 too, 12 years later 🙁

flavorjones · 2022-01-06T14:05:53Z

@jemminger Please consider using https://github.com/rgrove/sanitize for a customizable sanitizer.

We found that Rails' HTML sanitizer does more than we want the Richtext sanitization to do: it does not just remove nodes that are not in the safelist, it also escapes some markup (especially in links). This introduces a custom Loofah "scrubber" that only cares about the element safelist. The `sanitized_body` attribute is not for escaping at the view layer, where all these safety precautions are necessary, but just for making sure admins don't use iframes when we don't want them to. See the following related issues and commits: rails/rails-html-sanitizer@f3ba1a8 sparklemotion/nokogiri#3104 sparklemotion/nokogiri#969 (comment) flavorjones/loofah#14 (comment)

nzgrover mentioned this issue May 25, 2016

Allow regex in attribute whitelist. refinery/refinerycms#3183

Closed

flavorjones added the feature label Nov 13, 2017

mamhoff mentioned this issue Jan 19, 2024

Implement custom scrubber for Alchemy::Ingredients::Richtext AlchemyCMS/alchemy_cms#2700

Closed

3 tasks
