Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

charset=Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters in RegExp #415

Open
codefactor opened this issue Jun 7, 2022 · 2 comments

Comments

@codefactor
Copy link

codefactor commented Jun 7, 2022

Steps to Reproduce:

  1. Have an HTML payload where server gives response header content-type: text/html;charset=Shift_JIS
  2. Include the URI.js file with a script tag, use a compressed version of URI.js (not sure if same issue happens on the uncompressed one)

Unfortunately I can't find an easy way to give a link for this easily, but if it's necessary I could produce one maybe with codesandbox.

Expected:

The Javascript include should run, there should be no errors in the console

Actual:

The Javascript fails to parse with an error in the logs:

Uncaught SyntaxError: Unexpected token ':'

Root Cause:

There are non-ASCII characters inside of a Regular Expression in a couple places, example:

URI.find_uri_expression = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;

The non-ASCII characters «»“”‘’ are interpreted differently when the charset is set to Shift_JIS on the HTML page as a response header, and it causes the regular expression not to be closed properly, running into the next lines making a syntax error in the middle of the JSON. The same behavior is seen in Firefox and Chrome, I have not checked Edge.

Proposed solution:

Don't use non-ASCII characters which are unsafe when charsets are changed on the page, instead use a String that will be constructed with escaped characters:

  URI.find_uri_expression = new RegExp("\\b((?:[a-z][\\w-]+:(?:\\/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}\\/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))", "ig");

One other place:

trim: /[`!()\[\]{};:'".,<>?«»“”„‘’]+$/,

Could update to this:

    trim: new RegExp("[`!()\\[\\]{};:'\".,<>?\xab\xbb\u201c\u201d\u201E\u2018\u2019]+$"),

These are 2 places, there might be more.

@codefactor
Copy link
Author

To facilitate - here is my attempt at a PR:
#416

@codefactor codefactor changed the title charset= Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters charset=Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters Jun 8, 2022
@codefactor codefactor changed the title charset=Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters charset=Shift_JIS will fail to parse URI.js with a syntax error caused by non-ASCII characters in RegExp Jun 8, 2022
@codefactor
Copy link
Author

As an update - from our side, we have other resources that might require UTF-8 charset, so we will fix this issue from our side by consistently using UTF-8 charset.

However, it still might be a good idea to have the Javascript file to contain ASCII characters only so that this syntax error wouldn't come up if for whatever reason the page gets switched to Japanese charset.

However, this support does objectively increase the size of the file by a few bytes - so it's a little bit of a trade off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant