Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow custom scrubbers to leverage the HTML5lib scrubbing already written #14

Open
flavorjones opened this issue Jan 28, 2010 · 14 comments
Labels

Comments

@flavorjones
Copy link
Owner

A couple of commonly requested features:

  • add or remove attributes from the whitelists
  • turn off CSS scrubbing
@ruckus
Copy link

ruckus commented Jan 28, 2010

  • 1 on this ticket / request. I wanted more custom control of my elements/attributes from the whitelist set and I had to achieve it like so:

http://gist.github.com/289027

@wbharding
Copy link

I'm trying to find a good way to add to the whitelist attributes right now and am coming up empty on a straightforward way to monkeypatch. I just want to add a single element, but it seems excessive hard given the way that whitelist.rb declares the constants and then digetsts them permanently via the method in whitelist.rb such that I can't even seem to monkeypatch it.

@flavorjones
Copy link
Owner Author

I hear you! I'll be working on Loofah a bit over the next couple of weeks, and this will be one of the things I'll work on.

@wbharding
Copy link

fwiw, I did figure out how to monkeypatch it. Just add a new key/value to the HashedWhitelist. But of course it's always a tad nicer when one doesn't need to monkeypatch.

@electrum
Copy link

Any thoughts or progress on this? I need to add and remove some whitelist attributes.

@flavorjones
Copy link
Owner Author

Just release 1.0.0, this is probably my next priority.

Any thoughts on what you think the API should look like to control whitelists?

@bf4
Copy link

bf4 commented Mar 19, 2012

I have some almost complete work I've been doing on a whitelist for elements and attributes, just fyi (the usecase of valid with nested invalid with nested valid is broken still) https://github.com/bf4/Notes/blob/master/code/ruby/html_processing.rb when it's ready for a pull request, I'll do that. in the meantime, just an fyi

@flavorjones
Copy link
Owner Author

It's worth noting that I've got a branch somewhere that I started, which implements a Rails-internals-compatible implementation of whitelists. This is so that, at some point, Loofah may be a pluggable sanitizer for any Rails app.

I should probably finish that up. ;)

@bf4
Copy link

bf4 commented Apr 9, 2013

I still need to write a pull request, but the WhitelistTagScrubber really does work https://github.com/bf4/Notes/blob/loofah-testing/code/ruby/html_processing.rb

# usage
# all_attributes = ['id','class']
# tags_we_want =
#   {
#   'br' => [],
#   'ol' => all_attributes,
#   'ul' => all_attributes,
#   'li' => all_attributes,
#   'strong' => all_attributes,
#   'p' => all_attributes,
#   'i' => all_attributes,
#   'em' => all_attributes,
#   'a' => ['href','rel'].concat(all_attributes)
# }
# updater = CustomScrubber.new
# updater.clean_html(message_dirty, tags_we_want.keys, tags_we_want) do |html|
#      updater.line_breaks_to_br(html)
# end


class WhiteListTagScrubber < Loofah::Scrubber
  attr_reader :tags, :attributes
  def initialize(options = {}, &block)
    @tags = Array(options.delete(:tags))
    @attributes = options.delete(:attributes) || {}
    super(options, &block)
  end
  def debug(type,&block)
    if ENV['DEBUG'] =~ /true/i
      puts "**** #{type}, #{block.call.inspect}"
    end
  end
  def scrub(node)
    debug("processing") {  "#{node.type}: #{node.name}, namespaces #{node.namespaces.inspect}" }
    case node.type
    when Nokogiri::XML::Node::ELEMENT_NODE

      # see strip: return CONTINUE if html5lib_sanitize(node) == CONTINUE
      if tags.include? node.name
        # remove all attributes except the ones we whitelisted per tag
        clean_with_attributes(node,true)
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      else
        # remove all attributes
        clean_with_attributes(node,false)
        # remove the node and its contents entirely.
        # there's nothing good in these
        if %w{script style meta link}.include?(node.name)
          node.remove
        else
          # remove this undesired node and scrub each child node
          remove_node_and_add_children(node)
        end
        return Loofah::Scrubber::CONTINUE if node.namespaces.empty?
      end
    when Nokogiri::XML::Node::TEXT_NODE, Nokogiri::XML::Node::CDATA_SECTION_NODE
      return Loofah::Scrubber::CONTINUE
    end
    node.remove
    Loofah::Scrubber::STOP
  end
  def remove_node_and_add_children(node)
    # alternatively see :strip
    # node.before node.children
    current_node = node
    node.children.each do |kid|
      previous_node = current_node
      current_node = current_node.add_next_sibling(kid)
      scrub(previous_node) unless previous_node == node
    end
    scrub(current_node) unless current_node == node
    node.remove
  end
  def clean_with_attributes(node,use_attributes=true)
    attr_array = use_attributes ? attributes[node.name] : nil
    node.attributes.each { |attr| node.remove_attribute(attr.first) unless Array(attr_array).include?(attr.first)}
  end
end

class CustomScrubber
  # uses Loofah
  def clean_html(html, tags=[],attributes={})

    yield Loofah.fragment(html).scrub!(scrub_tags_except(tags,attributes)).to_s

  end
  # perhaps also see the scrubber
  # :newline_block_elements
  def line_breaks_to_br(html)
    html.gsub(/\r?\n/,'<br>')
  end
  # tags in an array of tags
  # attributes is a hash of the previous tags with an array of their whitelisted attributes
  # needs to be DRYed
  def scrub_tags_except(tags,attributes)
    options = {:tags => tags, :attributes => attributes }
    WhiteListTagScrubber.new(options)
  end
end

@abitdodgy
Copy link

Curious, anything new on this issue? What's the current way of handling custom scrubbers? They seems a bit laborious (relative to how Sanitize handles custom configs), the solutions here.

@saneshark
Copy link

👍 Completely agree with @abitdodgy

Just take a look at how simple and straight forward this DSL is: https://github.com/rgrove/sanitize/blob/master/lib/sanitize/config/relaxed.rb

Having a means of being able to process something like that and perhaps even having additional regex on attribute values such as background src image, etc would be a big win. I would just use Sanitize, but seeing as this is getting merged in Rails 4.2 thought it would be a useful addition.

@DanDevine
Copy link

+1, would really like this feature.

@jemminger
Copy link

+1 too, 12 years later 🙁

@flavorjones
Copy link
Owner Author

@jemminger Please consider using https://github.com/rgrove/sanitize for a customizable sanitizer

mamhoff added a commit to mamhoff/alchemy_cms that referenced this issue Jan 19, 2024
We found that using Rails' HTML sanitizer does more than we want the
Richtext sanitization to do: It does not just remove nodes that are not
in the safelist, it also escapes some markup (especially in links).

This introduces a custom Loofah "scrubber" that only cares about the
element safelist.

The `sanitized_body` attribute is not for escaping at the view layer,
where all these safety precautions are necessary, but just for making
sure admin's don't use iframes when we don't want to.

See the following related issues and commits:
rails/rails-html-sanitizer@f3ba1a8
sparklemotion/nokogiri#3104
sparklemotion/nokogiri#969 (comment)
flavorjones/loofah#14 (comment)
mamhoff added a commit to mamhoff/alchemy_cms that referenced this issue Jan 19, 2024
We found that using Rails' HTML sanitizer does more than we want the
Richtext sanitization to do: It does not just remove nodes that are not
in the safelist, it also escapes some markup (especially in links).

This introduces a custom Loofah "scrubber" that only cares about the
element safelist.

The `sanitized_body` attribute is not for escaping at the view layer,
where all these safety precautions are necessary, but just for making
sure admin's don't use iframes when we don't want to.

See the following related issues and commits:
rails/rails-html-sanitizer@f3ba1a8
sparklemotion/nokogiri#3104
sparklemotion/nokogiri#969 (comment)
flavorjones/loofah#14 (comment)
mamhoff added a commit to mamhoff/alchemy_cms that referenced this issue Jan 19, 2024
We found that using Rails' HTML sanitizer does more than we want the
Richtext sanitization to do: It does not just remove nodes that are not
in the safelist, it also escapes some markup (especially in links).

This introduces a custom Loofah "scrubber" that only cares about the
element safelist.

The `sanitized_body` attribute is not for escaping at the view layer,
where all these safety precautions are necessary, but just for making
sure admin's don't use iframes when we don't want to.

See the following related issues and commits:
rails/rails-html-sanitizer@f3ba1a8
sparklemotion/nokogiri#3104
sparklemotion/nokogiri#969 (comment)
flavorjones/loofah#14 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants