Loofah removes &nbsp; #240

Closed
wizardofosmium opened this issue Jul 26, 2022 · 3 comments

Comments

@wizardofosmium

There are times when &nbsp; is actually needed. Unfortunately, Loofah removes it.

> Loofah.fragment("&nbsp; != &#160;").to_s
=> "  !=  "

Could you either:

  • make Loofah not remove them at all, or
  • provide an option to keep them?
@flavorjones
Owner

Hi! Unfortunately, this isn't behavior that Loofah directly controls; it's how libxml2 parses the input:

>> x = Nokogiri::HTML4::DocumentFragment.parse("&nbsp; != &#160;")
=> 
#(DocumentFragment:0xc300 {                                              
...                                                                      
>> x.to_html
=> "  !=  "
>> x
=> 
#(DocumentFragment:0xc300 {                                              
  name = "#document-fragment",                                           
  children = [ #(Text "  !=  ")]                                    
  })                                                                     

although the gumbo parser used by Nokogiri::HTML5 is better:

>> x = Nokogiri::HTML5::DocumentFragment.parse("&nbsp; != &#160;")
=> 
#(DocumentFragment:0xe894 {                                              
...                                                                      
>> x.to_html
=> "&nbsp; != &nbsp;"
>> x
=> 
#(DocumentFragment:0xe894 {          
  name = "#document-fragment",       
  children = [ #(Text "  !=  ")]
  })                                 

Because this behavior is inherited from libxml2, there's nothing we can easily do in Nokogiri or Loofah to change it.

Note that we're planning to update Loofah to use Nokogiri::HTML5 when it's available (see #239), which is blocked on the release of Nokogiri v1.14.0 (soon!).
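
For readers who only need the entity back in the final string, one possible post-processing step is sketched below. It assumes the parser decoded &nbsp; and &#160; to the literal character U+00A0 rather than dropping them, which is what the Text node output above suggests. The helper name reencode_nbsp is hypothetical and is not part of Loofah or Nokogiri.

```ruby
# Hypothetical helper (not a Loofah or Nokogiri API): re-encodes every literal
# no-break space (U+00A0) in an already-serialized string as "&nbsp;".
# Caveat: it cannot tell "&nbsp;" from "&#160;" or from a literal U+00A0 in
# the original input, because all three are the same character after parsing.
def reencode_nbsp(text)
  text.gsub("\u00A0", "&nbsp;")
end

# "\u00A0".ord                      # => 160
# reencode_nbsp("\u00A0 != \u00A0") # => "&nbsp; != &nbsp;"
```

With Loofah installed, this would be applied to the result of Loofah.fragment(...).to_s.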

@wizardofosmium
Author

Thanks for the explanation @flavorjones 👍

It looks like I'll have to hack around it with something like:

string = "&nbsp; != &#160;"
protected_string = string.gsub(/&nbsp;/, "PROTECTEDNBSP").gsub(/&#160;/, "PROTECTED160")
Loofah.fragment(protected_string).to_s.gsub(/PROTECTEDNBSP/, "&nbsp;").gsub(/PROTECTED160/, "&#160;")

(Just have to hope that the input doesn't contain PROTECTEDNBSP or PROTECTED160 😬)

Any other suggestions would be welcome. Cheers!
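
The collision worry above can be reduced by building the placeholders from a random token and checking that the token does not already occur in the input. The following is a minimal sketch of that idea, not something Loofah provides: the helper name with_protected_nbsp is hypothetical, and the sanitizer is passed in as a block so the snippet runs without any gems.

```ruby
require "securerandom"

# Hypothetical helper, not part of Loofah: swaps "&nbsp;" and "&#160;" for
# placeholders built from a random token, yields the protected string to a
# block (where the sanitizer would run), then restores the entities.
def with_protected_nbsp(html)
  token = SecureRandom.hex(8)
  # Regenerate in the very unlikely case the input already contains the token.
  token = SecureRandom.hex(8) while html.include?(token)

  placeholders = {
    "&nbsp;" => "NBSP#{token}",
    "&#160;" => "N160#{token}",
  }

  # gsub with a Hash replaces each match with the value stored under it.
  protected_html = html.gsub(/&nbsp;|&#160;/, placeholders)
  result = yield(protected_html)
  result.gsub(/NBSP#{token}|N160#{token}/, placeholders.invert)
end

# With Loofah installed, the call might look like:
#   with_protected_nbsp(string) { |s| Loofah.fragment(s).to_s }
```

The placeholders are plain alphanumeric text, so they should pass through a sanitizer unchanged as long as the block does not alter letter case.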

@Yegorov

Yegorov commented Sep 14, 2022

Hello everyone, thanks for the answers!
Loofah doesn't only remove &nbsp;. It also removes &mdash;, &laquo;, &raquo;, and others.
In my case the workaround looks like this:

string = "&nbsp; &#160; &mdash; &laquo; &raquo;"
protected_string = string.gsub(/&(.+?);/, '_PROTECTED\1_')
Loofah.fragment(protected_string).to_s.gsub(/_PROTECTED(.+?)_/, '&\1;')
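
One caveat with the generic pattern above: /&(.+?);/ matches any text between an ampersand and the next semicolon, so ordinary prose such as "AT&T rocks;" is also rewritten. A stricter variant is sketched below, assuming only well-formed named, decimal, and hexadecimal references need protecting. The helper names are hypothetical.

```ruby
# Hypothetical helpers (not part of Loofah). The pattern matches only
# well-formed references: named (&mdash;), decimal (&#160;), or hex (&#xA0;).
def entity_body
  /#\d+|#[xX]\h+|[A-Za-z][A-Za-z0-9]*/
end

def protect_entities(html)
  html.gsub(/&(#{entity_body});/) { "_PROTECTED#{$1}_" }
end

def restore_entities(text)
  text.gsub(/_PROTECTED(#{entity_body})_/) { "&#{$1};" }
end

sample = "AT&T rocks; use &nbsp; here"
loose  = sample.scan(/&(.+?);/).flatten              # => ["T rocks", "nbsp"]
strict = sample.scan(/&(#{entity_body});/).flatten   # => ["nbsp"]
```

The sanitizer call would sit between the two helpers, for example restore_entities(Loofah.fragment(protect_entities(string)).to_s) when Loofah is available.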
