Overly-eager decoding of reserved chars when performing HTTP requests #24932

eWalthert · 2023-05-09T14:29:02Z

Detailed description

This is more general than preview cards and can be tracked down to Addressable::URI decoding percent-encoded reserved characters (sporkmonger/addressable#472).

Specifications

Mastodon 4.1.2, Ruby 2.7.0p0, PostfreSQL 15.2, Redis 5.0.7, all browsers and apps

The text was updated successfully, but these errors were encountered:

renchap · 2023-05-10T09:16:36Z

Can you please provide a reproduction, for example an URL we can embed into a post and see the issue?

eWalthert · 2023-05-10T11:55:12Z

Thank you @renchap for looking into this.

Here is a post:
https://typo.social/@lttrspc/110344236591977314
that links to:
https://letterspace39.eventbrite.com
that gets rewired to:
https://www.eventbrite.com/e/tickets-39-blast-foundry-launch-627086401307
and its og:image is linking to:
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7
that on Mastodon (and some Open Graph validators) results in:
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7
and leads to 403 Forbidden

renchap · 2023-05-10T12:51:22Z

Ok, I can reproduce.

According to W3C ampersand should work everywhere since 2014.

Do you have a pointer for this? & is HTML encoding, not URI encoding, so I am not sure where it is specified that URLs can be encoded using HTML encoding.

eWalthert · 2023-05-10T13:46:43Z

@renchap I was linked to this on stackoverflow:

https://en.wikipedia.org/wiki/Query_string#Web_forms

One of the original uses was to contain the content of an [HTML form](https://en.wikipedia.org/wiki/Form_(HTML)), also known as web form. In particular, when a form containing the fields field1, field2, field3 is submitted, the content of the fields is encoded as a query string as follows:

field1=value1&field2=value2&field3=value3...
The query string is composed of a series of field-value pairs.
Within each pair, the field name and value are separated by an [equals sign](https://en.wikipedia.org/wiki/Equals_sign), "=".
The series of pairs is separated by the [ampersand](https://en.wikipedia.org/wiki/Ampersand), "&"

I'm not saying &amp should work, but & as a separator should work.
But Mastodon transforms it into a &amp that breaks the URL.

(at least, this is as much I understand. Not a programmer.)

renchap · 2023-05-10T14:00:32Z

When you look at the EventBrite page you linked above (https://letterspace39.eventbrite.com), you have this in the HTML:

<meta property="og:image" content="https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7">

So the & character comes from their HTML, Mastodon does not transform it at all.

This seems incorrect to me and an issue from EventBrite, they should not HTML-encode the attribute value, as it is contained in a content tag attribute.

eWalthert · 2023-05-10T14:25:02Z

@renchap this might be browser or system dependent. Are you by any chance on Linux?

FireFox 112 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Opera 98 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Safari 14 Mac (outdated): (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Crome 112 Mac: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

FireFox 112 Win11: (inspect)
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

Edge 112 Win11: (Page Source) !!!
https://img.evbuc.com/https%3A%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format%2Ccompress&q=75&sharp=10&rect=0%2C0%2C2160%2C1080&s=e263a07d9865502aeaa826c331043fa7

renchap · 2023-05-11T08:29:01Z

Ok this is very weird. In Chrome, when using "View source", the content attribute has & (which is consistent with looking at the real HTTP request). But when using Inspect, the URL is decoded and contains the proper &.

I am not sure if this is in the HTML spec, or a common behaviour.

I think it all comes from a bug in React, where it HTML-encodes the content of those attributes: facebook/react#13838

NextJS issue, with a proposed workaround: vercel/next.js#2006

I am not sure if we need to handle this specific case in Mastodon, as this is not conforming to any spec I know. But on the other hand many sites are most probably showing this issue and dont have a real way to fix it.

@ClearlyClaire do you have an opinion here?

eWalthert · 2023-05-11T08:48:20Z

@renchap I’m not sure if this behavior is a bug.

For example; when looking at a Hex-scrambled email address in the inspector, it appears like a proper email.
Only when looking at the page source I’m seeing the inf coding.
(Kirby does this as an anti-spam-bot measure)

And yes, 20% of OG-image validators don’t show the picture.
But since the popular social-media platforms don’t have issues with these URLs, Eventbrite does not seem to care.

ClearlyClaire · 2023-05-12T07:49:06Z

@ClearlyClaire do you have an opinion here?

I was confused because initially reading the report it sounded like Eventbrite was using plain & and Mastodon was transforming them into &. This is not what is happening, but instead, Eventbrite uses & ~~and Mastodon does not interpret character references to replace that to a simple &~~.

My understanding of HTML is that HTML attributes can contain character references that should be interpreted as such. ~~I'll look into why that isn't happening in this case and see what can be done.~~

ClearlyClaire · 2023-05-12T08:22:49Z

Upon investigation, I find that Mastodon does interpret character references properly, but that the image still can't be fetched from a different reason.

From my testing, it appears that Mastodon is actually trying to fetch the following URL:
https://img.evbuc.com/https:%2F%2Fcdn.evbuc.com%2Fimages%2F503170239%2F16956914287%2F1%2Foriginal.20230428-131449?w=1000&auto=format,compress&q=75&sharp=10&rect=0,0,2160,1080&s=e263a07d9865502aeaa826c331043fa7

Note that the URL percent-encoding is not exactly the same as in the correct link. I will be investigating why.

@eWalthert how did you observe the “Actual behavior” from your original post?

ClearlyClaire · 2023-05-12T09:03:24Z

Note that the URL percent-encoding is not exactly the same as in the correct link. I will be investigating why.

This is because we “normalize” the given URL:

mastodon/app/lib/request.rb

Line 31 in bf3ebeb

@url = Addressable::URI.parse(url).normalize

But… stopping doing that would not prevent the issue as HTTP.rb is doing the same thing (httprb/http#654):
https://github.com/httprb/http/blob/4e0696c6596123c122389402234de057fcc68550/lib/http/request.rb#L88

Disabling normalization completely is possible but pretty awkward:

diff --git a/app/lib/request.rb b/app/lib/request.rb
index 4bde6fc911..98e8427f3d 100644
--- a/app/lib/request.rb
+++ b/app/lib/request.rb
@@ -28,7 +28,7 @@ class Request
     raise ArgumentError if url.blank?
 
     @verb        = verb
-    @url         = Addressable::URI.parse(url).normalize
+    @url         = Addressable::URI.parse(url)
     @http_client = options.delete(:http_client)
     @allow_local = options.delete(:allow_local)
     @options     = options.merge(socket_class: use_proxy? || @allow_local ? ProxySocket : Socket)
@@ -92,7 +92,7 @@ class Request
     end
 
     def http_client
-      HTTP.use(:auto_inflate).timeout(TIMEOUT.dup).follow(max_hops: 3)
+      HTTP.use(:auto_inflate).use(:normalize_uri => { normalizer: ->(x) { x } }).timeout(TIMEOUT.dup).follow(max_hops: 3)
     end
   end

I honestly don't know if “normalizing” the URL is advisable or if we are doing it right.

EDIT: The proper RFC to cover that would probably be https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.2 according to it, the normalization process should not touch %3A (:) and %2C (,) because they are reserved characters. See sporkmonger/addressable#472

eWalthert · 2023-05-12T09:29:38Z

Dear @ClearlyClaire this was actually obseved by @charakterziffer@typo.social but as you can see, aslo the "other-way" https://typo.social/@charakterziffer/109644820380669393

Maybe one of you can report to support@eventbrite.com, to strengthen the awreness of the issue.
Also the additional information above mighe be helpful for their egineers.

ClearlyClaire · 2023-05-12T09:34:26Z

@eWalthert the bug is in (a library used by) Mastodon, but it's different than what was originally reported and is more subtle; I'm gonna edit the issue's title and post accordingly

eWalthert · 2023-07-10T21:40:40Z

This has been resolved with the last update 4.1.4 💙

eWalthert added the bug Something isn't working label May 9, 2023

ClearlyClaire changed the title ~~ampersand in URL of og:image~~ Overly-eager decoding of reserved chars when performing HTTP requests May 12, 2023

eWalthert closed this as completed Jul 10, 2023

ClearlyClaire reopened this Jul 28, 2023

ClearlyClaire mentioned this issue Jul 28, 2023

Do not normalize URL before fetching it #26219

Merged

Gargron mentioned this issue Aug 14, 2023

v4.2.0 raises more bad URI error due to encoding/decoding #26494

Open

ClearlyClaire mentioned this issue Apr 22, 2024

Update dependency http to '~> 5.2.0' #30027

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overly-eager decoding of reserved chars when performing HTTP requests #24932

Overly-eager decoding of reserved chars when performing HTTP requests #24932

eWalthert commented May 9, 2023 •

edited by ClearlyClaire

renchap commented May 10, 2023

eWalthert commented May 10, 2023

renchap commented May 10, 2023

eWalthert commented May 10, 2023 •

edited

renchap commented May 10, 2023 •

edited

eWalthert commented May 10, 2023

renchap commented May 11, 2023

eWalthert commented May 11, 2023

ClearlyClaire commented May 12, 2023 •

edited

ClearlyClaire commented May 12, 2023

ClearlyClaire commented May 12, 2023 •

edited

eWalthert commented May 12, 2023

ClearlyClaire commented May 12, 2023

eWalthert commented Jul 10, 2023

Overly-eager decoding of reserved chars when performing HTTP requests #24932

Overly-eager decoding of reserved chars when performing HTTP requests #24932

Comments

eWalthert commented May 9, 2023 • edited by ClearlyClaire

Steps to reproduce the problem

Expected behaviour

Actual behaviour

Detailed description

Specifications

renchap commented May 10, 2023

eWalthert commented May 10, 2023

renchap commented May 10, 2023

eWalthert commented May 10, 2023 • edited

renchap commented May 10, 2023 • edited

eWalthert commented May 10, 2023

renchap commented May 11, 2023

eWalthert commented May 11, 2023

ClearlyClaire commented May 12, 2023 • edited

ClearlyClaire commented May 12, 2023

ClearlyClaire commented May 12, 2023 • edited

eWalthert commented May 12, 2023

ClearlyClaire commented May 12, 2023

eWalthert commented Jul 10, 2023

eWalthert commented May 9, 2023 •

edited by ClearlyClaire

eWalthert commented May 10, 2023 •

edited

renchap commented May 10, 2023 •

edited

ClearlyClaire commented May 12, 2023 •

edited

ClearlyClaire commented May 12, 2023 •

edited