Discrepancy with .NET normalization (encoded characters in path) #39

Open
dentarg opened this issue Sep 22, 2015 · 5 comments
@dentarg
Collaborator

dentarg commented Sep 22, 2015

Document 11136870104614552724 has

OriginalUrl https://emani85.wordpress.com/2015/09/22/%e0%a6%87%e0%a6%b8%e0%a6%b0%e0%a6%be%e0%a7%9f%e0%a7%87%e0%a6%b2-%e0%a6%a5%e0%a7%87%e0%a6%95%e0%a7%87-%e0%a6%a1%e0%a7%8d%e0%a6%b0%e0%a7%8b%e0%a6%a8-%e0%a6%95%e0%a6%bf%e0%a6%a8%e0%a6%9b%e0%a7%87/

and

Url (normalized) https://emani85.wordpress.com/2015/09/22/ইসরায়েল-থেকে-ড্রোন-কিনছে

twingly-url can't normalize OriginalUrl to Url; it leaves the path percent-encoded:

$ git diff
diff --git a/spec/lib/twingly/url/normalization_spec.rb b/spec/lib/twingly/url/normalization_spec.rb
index 1b6b4ab..eb7f6a1 100644
--- a/spec/lib/twingly/url/normalization_spec.rb
+++ b/spec/lib/twingly/url/normalization_spec.rb
@@ -227,5 +227,19 @@ describe Twingly::URL::Normalizer do
       url = "Just some text"
       expect(normalizer.normalize_url(url)).to be_nil
     end
+
+    it "handles bengali charachters in path" do
+      url = "https://emani85.wordpress.com/2015/09/22/ইসরায়েল-থেকে-ড্রোন-কিনছে"
+      expected = "https://emani85.wordpress.com/2015/09/22/%e0%a6%87%e0%a6%b8%e0%a6%b0%e0%a6%be%e0%a7%9f%e0%a7%87%e0%a6%b2-%e0%a6%a5%e0%a7%87%e0%a6%95%e0%a7%87-%e0%a6%a1%e0%a7%8d%e0%a6%b0%e0%a7%8b%e0%a6%a8-%e0%a6%95%e0%a6%bf%e0%a6%a8%e0%a6%9b%e0%a7%87/"
+
+      expect(normalizer.normalize_url(url)).to eq(url)
+    end
+
+    it "handles encoded bengali charachters in path" do
+      url = "https://emani85.wordpress.com/2015/09/22/%e0%a6%87%e0%a6%b8%e0%a6%b0%e0%a6%be%e0%a7%9f%e0%a7%87%e0%a6%b2-%e0%a6%a5%e0%a7%87%e0%a6%95%e0%a7%87-%e0%a6%a1%e0%a7%8d%e0%a6%b0%e0%a7%8b%e0%a6%a8-%e0%a6%95%e0%a6%bf%e0%a6%a8%e0%a6%9b%e0%a7%87/"
+      expected = "https://emani85.wordpress.com/2015/09/22/ইসরায়েল-থেকে-ড্রোন-কিনছে"
+
+      expect(normalizer.normalize_url(url)).to eq(expected)
+    end
   end
 end
Failures:

  1) Twingly::URL::Normalizer.normalize_url handles encoded bengali charachters in path
     Failure/Error: expect(normalizer.normalize_url(url)).to eq(expected)

       expected: "https://emani85.wordpress.com/2015/09/22/ইসরায়েল-থেকে-ড্রোন-কিনছে"
            got: "https://emani85.wordpress.com/2015/09/22/%e0%a6%87%e0%a6%b8%e0%a6%b0%e0%a6%be%e0%a7%9f%e0%a7%87%e0%a6%b2-%e0%a6%a5%e0%a7%87%e0%a6%95%e0%a7%87-%e0%a6%a1%e0%a7%8d%e0%a6%b0%e0%a7%8b%e0%a6%a8-%e0%a6%95%e0%a6%bf%e0%a6%a8%e0%a6%9b%e0%a7%87"

       (compared using ==)
     # ./spec/lib/twingly/url/normalization_spec.rb:242:in `block (3 levels) in <top (required)>'

Finished in 0.17025 seconds (files took 0.22043 seconds to load)
71 examples, 1 failure

Failed examples:

rspec ./spec/lib/twingly/url/normalization_spec.rb:238 # Twingly::URL::Normalizer.normalize_url handles encoded bengali charachters in path

Randomized with seed 55710

/Users/dentarg/.rubies/ruby-2.2.3/bin/ruby -I/Users/dentarg/.gem/ruby/2.2.3/gems/rspec-core-3.3.2/lib:/Users/dentarg/.gem/ruby/2.2.3/gems/rspec-support-3.3.0/lib /Users/dentarg/.gem/ruby/2.2.3/gems/rspec-core-3.3.2/exe/rspec --pattern spec/lib/\*\*/\*_spec.rb failed
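
For comparison, here is a minimal sketch of the decoding step that the .NET-normalized Url seems to reflect: percent-escapes in the path are turned back into UTF-8 characters. `percent_decode_utf8` is a hypothetical helper, not part of twingly-url, and it decodes every escape (including reserved characters such as `%2f`), which a real normalizer would probably want to avoid.

```ruby
# Hypothetical helper, not part of twingly-url: percent-decode a URL string
# and interpret the resulting bytes as UTF-8.
def percent_decode_utf8(encoded)
  # Work on raw bytes so multi-byte sequences such as %e0%a6%87 are
  # reassembled byte by byte.
  bytes = encoded.b.gsub(/%(\h\h)/) { $1.hex.chr }
  decoded = bytes.force_encoding(Encoding::UTF_8)

  # Fall back to the input when the escapes do not form valid UTF-8.
  decoded.valid_encoding? ? decoded : encoded
end

puts percent_decode_utf8("/2015/09/22/%e0%a6%87")
```

Falling back to the encoded input on invalid byte sequences keeps the helper from returning broken strings; whether .NET behaves the same way is not something this issue establishes.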
dentarg added the bug label Sep 22, 2015
@dentarg
Collaborator Author

dentarg commented Sep 22, 2015

Pushed a temporary branch with the specs: https://github.com/twingly/twingly-url/commits/tmp/issue/39

@twingly-mob
Member

Idea: take a large number of OriginalUrls from TwinglySearch, normalize them with twingly-url, and compare each result with the corresponding Url (the .NET-normalized URL).
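
A rough sketch of what such a comparison could look like. Everything here is an assumption for illustration: `find_discrepancies` is a hypothetical helper, the sample pairs are made up, and the normalizer is passed in as a callable so the sketch runs without the gem.

```ruby
# Hypothetical helper: returns the pairs where our normalization differs
# from the stored .NET result. `normalizer` is any callable taking a URL string.
def find_discrepancies(pairs, normalizer)
  pairs.each_with_object([]) do |(original_url, dotnet_url), mismatches|
    ours = normalizer.call(original_url)
    next if ours == dotnet_url

    mismatches << { original: original_url, dotnet: dotnet_url, ours: ours }
  end
end

# Stand-in normalizer for the sketch; with the gem loaded this could be
# ->(url) { Twingly::URL.parse(url).normalized.to_s }
strip_trailing_slash = ->(url) { url.chomp("/") }

# Made-up sample data, not real documents.
sample = [
  ["https://example.com/a/", "https://example.com/a"],
  ["https://example.com/%f0%9f%92%a9/", "https://example.com/"],
]

find_discrepancies(sample, strip_trailing_slash).each do |row|
  puts row.inspect
end
```

Collecting the mismatches rather than failing on the first one should make it easier to group them by cause (encoding, trailing slash, truncation) afterwards.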

@dentarg
Collaborator Author

dentarg commented Jul 28, 2016

Yet another example (document 1001201892392545757)

(normalized) Url  https://varnull.adityamukerjee.net/2015/03/17/i-can-text-you-
OriginalUrl       https://varnull.adityamukerjee.net/2015/03/17/i-can-text-you-%f0%9f%92%a9-but-i-cant-write-my-name/
[6] pry(main)> Twingly::URL.parse("https://varnull.adityamukerjee.net/2015/03/17/i-can-text-you-%f0%9f%92%a9-but-i-cant-write-my-name/").normalized.to_s
=> "https://varnull.adityamukerjee.net/2015/03/17/i-can-text-you-%f0%9f%92%a9-but-i-cant-write-my-name"

@dentarg
Collaborator Author

dentarg commented Sep 14, 2016

This is probably related to MySQL and utf8mb4: the Url above looks cut off right where the 4-byte sequence %f0%9f%92%a9 begins in the OriginalUrl. I guess it wasn't percent-encoded when .NET normalized and stored it...
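
To illustrate the guess, a small sketch that cuts a string at its first 4-byte UTF-8 character. It assumes the column used MySQL's older 3-byte utf8 character set rather than utf8mb4 and that the value was truncated at the first character it could not store; neither is confirmed here. `truncate_at_first_4byte_char` is a hypothetical helper.

```ruby
# Hypothetical helper: mimic a 3-byte-per-character column by dropping
# everything from the first character that needs 4 bytes in UTF-8.
def truncate_at_first_4byte_char(str)
  index = str.each_char.find_index { |char| char.bytesize > 3 }
  index ? str[0, index] : str
end

# The OriginalUrl from the example above with its path decoded
# (%f0%9f%92%a9 is U+1F4A9) and the trailing slash removed.
decoded = "https://varnull.adityamukerjee.net/2015/03/17/i-can-text-you-\u{1F4A9}-but-i-cant-write-my-name"

puts truncate_at_first_4byte_char(decoded)
```

Under those assumptions the result is the same string as the stored Url, which would be consistent with the truncation theory, though it does not prove it.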
