FIX: broken onebox images due to escape_uri bugs #17840

SamSaffron · 2022-08-09T04:43:29Z

normalized_encode in addressable is buggy due to:
sporkmonger/addressable#472

New implementation avoids any escaping (and only performs basic normalization)
if URL is already valid.

The vast majority of calls to escape_uri start with valid URLs.

This still leaves an edge case around partially escaped URLs, where some chars
are escaped and others are not. In those cases addressable may corrupt the URL.

Also added support for unicode domain names and emoji domain names
to escape_uri.

This removes an unneeded hack checking for pre-signed URLs, which are now
handled by the general case because they start off valid and are only minimally
normalized. The previous test case continues to pass.

UrlHelper.s3_presigned_url?, which was somewhat broad, was removed.

davidtaylorhq · 2022-08-09T09:05:39Z

lib/url_helper.rb

+    url = uri.to_s
+
+    # edge case where we expect mailto:test%40test.com to normalize to mailto:test@test.com
+    if url.match?(/\A#{URI::regexp}\z/) && !url.match(/\Amailto/)


/\Amailto/ would also catch "mailto.example.com". We should check for the colon as well.

Suggested change

-    if url.match?(/\A#{URI::regexp}\z/) && !url.match(/\Amailto/)
+    if url.match?(/\A#{URI::regexp}\z/) && !url.match(/\Amailto:/)
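
To illustrate the difference between the two patterns, here is a minimal sketch. The constant names `LOOSE` and `STRICT` and the sample strings are assumptions made for this example; they are not part of the PR.

```ruby
# Illustrative only: compare the prefix check with and without the colon.
LOOSE  = /\Amailto/   # matches anything that merely starts with "mailto"
STRICT = /\Amailto:/  # matches only the mailto: scheme prefix

samples = ["mailto:test@test.com", "mailto.example.com"]

# Build a small table of which pattern matches which sample.
results = samples.to_h do |s|
  [s, { loose: s.match?(LOOSE), strict: s.match?(STRICT) }]
end

results.each { |s, r| puts "#{s}: loose=#{r[:loose]} strict=#{r[:strict]}" }
```

Only the stricter pattern tells the two inputs apart, which is the behaviour the suggestion is after.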

will push a patch 👀

Unfortunately, matching URI::regexp does not guarantee that the URL can be parsed.

pry(main)> URI::regexp.match?("https://éxample.com")
=> true
pry(main)> URI.parse("https://éxample.com")
URI::InvalidURIError: URI must be ascii only "https://\u00E9xample.com"

I think we should try URI.parse, and then fall back to addressable if an exception is raised. Will push a patch 👀
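
The parse-then-fall-back idea could look roughly like the sketch below. It is not the code from this PR: `normalize_with_fallback` is a hypothetical helper, and the `fallback` lambda is a stand-in for the addressable-based path so that the example runs with the standard library alone.

```ruby
require "uri"

# Hypothetical helper: try the stdlib parser first, and only call the
# fallback when ::URI rejects the input (e.g. non-ASCII characters).
def normalize_with_fallback(url, fallback:)
  URI.parse(url).normalize.to_s
rescue URI::InvalidURIError
  fallback.call(url)
end

# Stand-in for the addressable path; it only tags the input so we can
# see which branch handled it.
fallback = ->(u) { "fallback:#{u}" }

ascii   = normalize_with_fallback("HTTPS://Example.com/a%20b", fallback: fallback)
unicode = normalize_with_fallback("https://éxample.com", fallback: fallback)

puts ascii
puts unicode
```

In this sketch an already valid ASCII URL never reaches the fallback, so its existing percent-escapes are left alone; only input that `URI.parse` rejects takes the second path.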

normalized_encode in addressable has a number of issues, including sporkmonger/addressable#472.

To temporarily work around those issues for the majority of cases, we try parsing with `::URI`. If that fails (e.g. due to non-ASCII characters), we fall back to addressable.

Hopefully we can simplify this back to `Addressable::URI.normalized_encode` in the future.

This commit also adds support for unicode domain names and emoji domain names with escape_uri.

This removes an unneeded hack checking for pre-signed URLs, which are now handled by the general case because they start off valid and are only minimally normalized. The previous test case continues to pass.

UrlHelper.s3_presigned_url?, which was somewhat broad, was removed.

This is a much better description of its function. It performs idempotent normalization of a URL. If consumers truly need to `encode` a URL (including double-encoding of existing encoded entities), they can use the existing `.encode` method.
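
A small sketch of the distinction drawn here, using only the standard library. The `normalize` and `encode` lambdas are stand-ins chosen for illustration; they are assumptions and not the `UrlHelper` methods themselves.

```ruby
require "uri"

# Stand-in for an idempotent normalizer: ::URI#normalize only lowercases
# the scheme and host and leaves existing percent-escapes untouched.
normalize = ->(url) { URI.parse(url).normalize.to_s }

# Stand-in for a "true" encoder: escapes everything, including "%".
encode = ->(s) { URI.encode_www_form_component(s) }

url   = "https://Example.com/a%20b"
once  = normalize.call(url)
twice = normalize.call(once)

enc_once  = encode.call("a%20b")
enc_twice = encode.call(enc_once)

puts once == twice          # normalizing again changes nothing
puts enc_once               # "%" itself was escaped
puts enc_once == enc_twice  # encoding again keeps changing the string
```

Applying the normalizer a second time is a no-op, while every pass of the encoder escapes the previous pass's `%` signs again.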

ZogStriP approved these changes Aug 9, 2022

davidtaylorhq reviewed Aug 9, 2022

SamSaffron and others added 2 commits August 9, 2022 11:27

davidtaylorhq force-pushed the normalize-implementation branch from b29e7ab to 6062c9a Compare August 9, 2022 10:28

davidtaylorhq merged commit 3c81683 into main Aug 9, 2022

davidtaylorhq deleted the normalize-implementation branch August 9, 2022 10:55
