Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent-decoding entire URL components is not valid #180

Open
karwa opened this issue Nov 6, 2022 · 4 comments
Open

Percent-decoding entire URL components is not valid #180

karwa opened this issue Nov 6, 2022 · 4 comments

Comments

@karwa
Copy link

karwa commented Nov 6, 2022

These operations are not valid:

normalize-url/index.js

Lines 178 to 183 in 3c7235e

// Decode URI octets
if (urlObject.pathname) {
try {
urlObject.pathname = decodeURI(urlObject.pathname);
} catch {}
}

normalize-url/index.js

Lines 243 to 245 in 3c7235e

try {
urlObject.search = decodeURIComponent(urlObject.search);
} catch {}

Since URLs are a textual format, certain characters have semantic meaning. Percent-encoding can be used to escape those characters. For example, if we want a single path component named "AC/DC", we'll have a problem, because "/" can mean a path separator:

/music/bands/AC/DC
             ^^^^^ - 2 components! ❌

So instead, we have to escape the use of "/" within name "AC/DC":

/music/bands/AC%2FDC
             ^^^^^^^ - 1 component ✅

If you percent-decode the entire path string, we irreversibly lose the information that "AC/DC" was supposed to be a single path component.

Instead, the correct way to do this is to split the component (still escaped) up in to its constituent parts, decode each component, escape any characters with semantic meaning, and join them up again. For the path, that means breaking it up in to path components and ensuring "/" and "" characters are escaped again in each component before you rebuild the path string. For the query, it means doing the same for each key and value (not each key-value pair - they need to be broken down in to their smallest subcomponents).

@sindresorhus
Copy link
Owner

// @oswaldosalazar @ludofischer

@karwa
Copy link
Author

karwa commented Nov 6, 2022

Ah my bad - the path example is okay because decodeURI won't unescape some reserved characters, like "/", so it will keep the path component as "AC%2FDC". But for the query decoding, decodeURIComponent really does just unescape everything:

var url = new URL("http://example/?show=Tom%26Jerry&episode=3");

url.href;
// 'http://example/?show=Tom%26Jerry&episode=3'
//                          ^^^ - ❗️
url.searchParams.get("show");
// 'Tom&Jerry'

url.search = decodeURIComponent(url.search);

url.href;
// 'http://example/?show=Tom&Jerry&episode=3'
//                          ^ - ❗️
url.searchParams.get("show");
// 'Tom'

@ludofischer
Copy link
Contributor

As far as I remember decodeURIcomponent was introduced because some people complained that nomalize-url would break their URLs when the search parameters contained a forward slash (it would encode the forward slash as a side effect of sorting). As I see it, the library should not change the encoding unless the user asks for it. Since sorting is responsible via searchParams.sort() for changing the encoding, could the solution be to sort them by hand? Or is this more complicated than it looks?
Interesting, it looks like even server-side frameworks that do file-based routing end up with issues with decodeURIComponent. sveltejs/kit#7577

@ludofischer
Copy link
Contributor

The problem is that URL.searchParams gives you the parameter already decoded, so we can't rely on that to preserve the original encoding. We can't use decodeURI instead of decodeURIComponent because decodeURI does not bring back the forward slash (it keeps encoded as %2F).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants