Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overly-eager decoding of reserved chars when performing HTTP requests #24932

Open
eWalthert opened this issue May 9, 2023 · 14 comments
Open
Labels
bug Something isn't working

Comments

@eWalthert
Copy link

eWalthert commented May 9, 2023

Steps to reproduce the problem

Link to an Eventbrite event in a post:
e.g. https://typo.social/@lttrspc/110344236591977314 which links to https://letterspace39.eventbrite.com/

Expected behaviour

The preview card includes an image fetched from:
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Actual behaviour

Mastodon fails to fetch the image, getting a 403 from Eventbrite because of an incorrect URL (notice that %3A and %2C got decoded even though they are reserved characters):
https://img.evbuc.com/https:%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format,compress&q=75&sharp=10&rect=0,0,2160,1080&s=e263a07d9865502aeaa826c331043fa7

Detailed description

This is more general than preview cards and can be tracked down to Addressable::URI decoding percent-encoded reserved characters (sporkmonger/addressable#472).

Specifications

Mastodon 4.1.2, Ruby 2.7.0p0, PostfreSQL 15.2, Redis 5.0.7, all browsers and apps

@eWalthert eWalthert added the bug Something isn't working label May 9, 2023
@renchap
Copy link
Sponsor Member

renchap commented May 10, 2023

Can you please provide a reproduction, for example an URL we can embed into a post and see the issue?

@renchap
Copy link
Sponsor Member

renchap commented May 10, 2023

Ok, I can reproduce.

According to W3C ampersand should work everywhere since 2014.

Do you have a pointer for this? & is HTML encoding, not URI encoding, so I am not sure where it is specified that URLs can be encoded using HTML encoding.

@eWalthert
Copy link
Author

eWalthert commented May 10, 2023

@renchap I was linked to this on stackoverflow:

https://en.wikipedia.org/wiki/Query_string#Web_forms

One of the original uses was to contain the content of an [HTML form](https://en.wikipedia.org/wiki/Form_(HTML)), also known as web form. In particular, when a form containing the fields field1, field2, field3 is submitted, the content of the fields is encoded as a query string as follows:

field1=value1&field2=value2&field3=value3...
The query string is composed of a series of field-value pairs.
Within each pair, the field name and value are separated by an [equals sign](https://en.wikipedia.org/wiki/Equals_sign), "=".
The series of pairs is separated by the [ampersand](https://en.wikipedia.org/wiki/Ampersand), "&"

I'm not saying &amp should work, but & as a separator should work.
But Mastodon transforms it into a &amp that breaks the URL.

(at least, this is as much I understand. Not a programmer.)

@renchap
Copy link
Sponsor Member

renchap commented May 10, 2023

When you look at the EventBrite page you linked above (https://letterspace39.eventbrite.com), you have this in the HTML:

<meta property="og:image" content="https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&amp;auto=format%2Ccompress&amp;q=75&amp;sharp=10&amp;rect=0%2C0%2C2160%2C1080&amp;s=e263a07d9865502aeaa826c331043fa7">

So the &amp; character comes from their HTML, Mastodon does not transform it at all.

This seems incorrect to me and an issue from EventBrite, they should not HTML-encode the attribute value, as it is contained in a content tag attribute.

@eWalthert
Copy link
Author

@renchap this might be browser or system dependent. Are you by any chance on Linux?

FireFox 112 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Opera 98 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Safari 14 Mac (outdated): (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Crome 112 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

FireFox 112 Win11: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Edge 112 Win11: (Page Source) !!!
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&amp;auto=format%2Ccompress&amp;q=75&amp;sharp=10&amp;rect=0%2C0%2C2160%2C1080&amp;s=e263a07d9865502aeaa826c331043fa7

@renchap
Copy link
Sponsor Member

renchap commented May 11, 2023

Ok this is very weird. In Chrome, when using "View source", the content attribute has &amp; (which is consistent with looking at the real HTTP request). But when using Inspect, the URL is decoded and contains the proper &.

I am not sure if this is in the HTML spec, or a common behaviour.

I think it all comes from a bug in React, where it HTML-encodes the content of those attributes: facebook/react#13838

NextJS issue, with a proposed workaround: vercel/next.js#2006

I am not sure if we need to handle this specific case in Mastodon, as this is not conforming to any spec I know. But on the other hand many sites are most probably showing this issue and dont have a real way to fix it.

@ClearlyClaire do you have an opinion here?

@eWalthert
Copy link
Author

@renchap I’m not sure if this behavior is a bug.

For example; when looking at a Hex-scrambled email address in the inspector, it appears like a proper email.
Only when looking at the page source I’m seeing the &#x69;&#x6e;&#x66; coding.
(Kirby does this as an anti-spam-bot measure)

And yes, 20% of OG-image validators don’t show the picture.
But since the popular social-media platforms don’t have issues with these URLs, Eventbrite does not seem to care.

@ClearlyClaire
Copy link
Contributor

ClearlyClaire commented May 12, 2023

@ClearlyClaire do you have an opinion here?

I was confused because initially reading the report it sounded like Eventbrite was using plain & and Mastodon was transforming them into &amp;. This is not what is happening, but instead, Eventbrite uses &amp; and Mastodon does not interpret character references to replace that to a simple &.

My understanding of HTML is that HTML attributes can contain character references that should be interpreted as such. I'll look into why that isn't happening in this case and see what can be done.

@ClearlyClaire
Copy link
Contributor

Upon investigation, I find that Mastodon does interpret character references properly, but that the image still can't be fetched from a different reason.

From my testing, it appears that Mastodon is actually trying to fetch the following URL:
https://img.evbuc.com/https:%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format,compress&q=75&sharp=10&rect=0,0,2160,1080&s=e263a07d9865502aeaa826c331043fa7

Note that the URL percent-encoding is not exactly the same as in the correct link. I will be investigating why.

@eWalthert how did you observe the “Actual behavior” from your original post?

@ClearlyClaire
Copy link
Contributor

ClearlyClaire commented May 12, 2023

Note that the URL percent-encoding is not exactly the same as in the correct link. I will be investigating why.

This is because we “normalize” the given URL:

@url = Addressable::URI.parse(url).normalize

But… stopping doing that would not prevent the issue as HTTP.rb is doing the same thing (httprb/http#654):
https://github.com/httprb/http/blob/4e0696c6596123c122389402234de057fcc68550/lib/http/request.rb#L88

Disabling normalization completely is possible but pretty awkward:

diff --git a/app/lib/request.rb b/app/lib/request.rb
index 4bde6fc911..98e8427f3d 100644
--- a/app/lib/request.rb
+++ b/app/lib/request.rb
@@ -28,7 +28,7 @@ class Request
     raise ArgumentError if url.blank?
 
     @verb        = verb
-    @url         = Addressable::URI.parse(url).normalize
+    @url         = Addressable::URI.parse(url)
     @http_client = options.delete(:http_client)
     @allow_local = options.delete(:allow_local)
     @options     = options.merge(socket_class: use_proxy? || @allow_local ? ProxySocket : Socket)
@@ -92,7 +92,7 @@ class Request
     end
 
     def http_client
-      HTTP.use(:auto_inflate).timeout(TIMEOUT.dup).follow(max_hops: 3)
+      HTTP.use(:auto_inflate).use(:normalize_uri => { normalizer: ->(x) { x } }).timeout(TIMEOUT.dup).follow(max_hops: 3)
     end
   end
 

I honestly don't know if “normalizing” the URL is advisable or if we are doing it right.

EDIT: The proper RFC to cover that would probably be https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.2 according to it, the normalization process should not touch %3A (:) and %2C (,) because they are reserved characters. See sporkmonger/addressable#472

@eWalthert
Copy link
Author

Dear @ClearlyClaire this was actually obseved by @charakterziffer@typo.social but as you can see, aslo the "other-way" https://typo.social/@charakterziffer/109644820380669393

Maybe one of you can report to support@eventbrite.com, to strengthen the awreness of the issue.
Also the additional information above mighe be helpful for their egineers.

@ClearlyClaire
Copy link
Contributor

@eWalthert the bug is in (a library used by) Mastodon, but it's different than what was originally reported and is more subtle; I'm gonna edit the issue's title and post accordingly

@ClearlyClaire ClearlyClaire changed the title ampersand in URL of og:image Overly-eager decoding of reserved chars when performing HTTP requests May 12, 2023
@eWalthert
Copy link
Author

This has been resolved with the last update 4.1.4 💙

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants