Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make URL case insensitive. #1350

Closed
wants to merge 5 commits into from
Closed

Conversation

roygold7
Copy link

@roygold7 roygold7 commented Oct 4, 2018

Now URL's starting with hTTp or Https will be captured by the regular expression.

Now URL's starting with hTTp or Https will be captured by the regular expression.
@roygold7 roygold7 closed this Oct 4, 2018
@roygold7 roygold7 reopened this Oct 4, 2018
@styfle
Copy link
Member

styfle commented Oct 4, 2018

@roygold7 Thanks for the PR! Can you add a unit test for an uppercase url?

@UziTech Do you think this will affect the email replacement?

@davisjam I don't think this will introduce ReDOS but please review just in case.

lib/marked.js Outdated Show resolved Hide resolved
@davisjam
Copy link
Contributor

davisjam commented Oct 4, 2018

Case-insensitivity does not affect this regex's behavior w.r.t. ReDoS. Though that's not universally true, for some regexes it would matter.

@UziTech
Copy link
Member

UziTech commented Oct 4, 2018

Should other regexps also be case-insensitive? I'm thinking of the block.html, block._tag

Maybe a better question is, are there any regexes that need to be case-sensitive? or should we find a way to make all regexes case-insensitive by default?

@@ -606,7 +606,7 @@ inline.pedantic = merge({}, inline.normal, {
inline.gfm = merge({}, inline.normal, {
escape: edit(inline.escape).replace('])', '~|])').getRegex(),
_extended_email: /[A-Za-z0-9._+-]+(@)[a-zA-Z0-9-_]+(?:\.[a-zA-Z0-9-_]*[a-zA-Z0-9])+(?![-_])/,
url: /^((?:ftp|https?):\/\/|www\.)(?:[a-zA-Z0-9\-]+\.?)+[^\s<]*|^email/,
url: /^((?:ftp|https?):\/\/|www\.)(?:[a-zA-Z0-9\-]+\.?)+[^\s<]*|^email/i,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this is case-insensitive, could you also change a-zA-Z to a-z?

@styfle
Copy link
Member

styfle commented Oct 4, 2018

@UziTech I think it already is invoked as case-insensitive

block.html = edit(block.html, 'i')

@UziTech
Copy link
Member

UziTech commented Oct 4, 2018

looks like this could be solved easier by just changing line 618 from

inline.gfm.url = edit(inline.gfm.url)

to

inline.gfm.url = edit(inline.gfm.url, 'i')

@Martii
Copy link
Contributor

Martii commented Oct 4, 2018

Careful on this PR... See https://url.spec.whatwg.org/#url-writing and the keyword "not" floating around in there for certain schemes.

@UziTech
Copy link
Member

UziTech commented Oct 4, 2018

we are only matching ftp and https? schemes which seem to be fine as case-insensitive

@Martii
Copy link
Contributor

Martii commented Oct 4, 2018

Here's at least one where I am thinking it should be double checked and possibly added to the tests:

a URL-scheme string that is not an ASCII case-insensitive match for a special scheme, followed by U+003A (:) and a relative-URL string

Removing the double negative (emphasized font in immediate previous quote) ... it seems like this use case may need case sensitivity:

a URL-scheme string that is an ASCII case-sensitive match for a special scheme, followed by U+003A (:) and a relative-URL string

In this browser the Address bar does take a mix of upper case absolute URLs but then copying it out it always lower cases it... so that's how this browser handles it at least.

@UziTech
Copy link
Member

UziTech commented Oct 5, 2018

The spec says it "must be one of the following" and ftp/http(s) matches

a URL-scheme string that is an ASCII case-insensitive match for a special scheme and not an ASCII case-insensitive match for "file", followed by U+003A (:) and a scheme-relative-special-URL string

@UziTech
Copy link
Member

UziTech commented Oct 5, 2018

also looks like github doesn't mind if it is not lowercase.

hTtP://example.com

@Martii
Copy link
Contributor

Martii commented Oct 5, 2018

also looks like github doesn't mind if it is not lowercase.

Doesn't mean that GFM is following output guidelines from the W3 for well written HTML code conformance.

The decisions are these:

  1. Fix a nuisance and change the incomplete specification to a potentially larger breaking change (this PR and a move to 0.6.0 of marked with all use case scenarios). This will trickle up the chain to all sanitizers and top-level code (which is why I'm commenting here). Quite frankly in my circles writing hTTp, or whatever variant, is poor coding and communication practice but the W3 says it's okay in certain circumstances. *shrugs*. Wonder how node handles this with plucking individual pieces out in URL API?
  2. Check the input to see if there is a relative url and conditionalize the insensitive vs. sensitive operation to spec... potentially smaller breaking (only a use case scenario) but additive towards the compliance of the spec instead of subtraction. e.g. standard I/O... check the Input for potentially bad input and output the correct syntax specification. The text can always include the multi-cased http but the url itself should be normalized in this scenario.
  3. Keep it as is ... still incomplete specification however unbreaking atm.

We'll accommodate regardless if this goes through but filtering will be affected with items 1 and 2. This is probably why item 3 exists is it is a balance perhaps. Hence why I said be careful and haven't bothered to vote prematurely either way.

@UziTech
Copy link
Member

UziTech commented Oct 5, 2018

Doesn't mean that GFM is following output guidelines from the W3 for well written HTML code conformance.

This only affects inline url when gfm is turned on (which means they want it to act like GitHub).

@Martii
Copy link
Contributor

Martii commented Oct 5, 2018

act like GitHub

One place this could be addressable is in github/cmark however there's no real unit test, that I know of, since the W3 spec mentions bases and relative urls which is the use case exception. I'm not against this PR just noting what effects will happen with it. Filtering is where we will need to adjust to this forcing it to lowercase most likely (for security integrity). Still need to test this (when I get back to dev station next week) in node with the URL API and see if it breaks it natively with "parting it out"... it may not affect it at all or could affect it adversely. Quite honestly I've never seen anyone do a mix of upper and lower case schemes in another language including JavaScript so it caught my primary attention. GFM will most likely be normalized if it isn't. Should probably test in Commonmark too but again don't know if they have a base url floating around to test against from my understanding. :)

@styfle
Copy link
Member

styfle commented Oct 7, 2018

This works just fine:

const str = 'hTtP://example.com';
const url = new URL('hTtP://example.com');
console.log('node\t', process.version);
console.log('input\t', str);
console.log('url\t', url.href);

https://glot.io/snippets/f5if7pn2lf

@Martii
Copy link
Contributor

Martii commented Oct 7, 2018

A little more on the test range:

$ node -v && node -e 'var x = new URL("/doc/this.html", "hTtP://example.com"); console.log(x.protocol)'
v10.11.0
http:

... this passes (pre PR and without this dep)... and by the spec it seems like it shouldn't but that's probably a node issue to be raised.

Haven't had a chance to test the legacy URL API yet and probably older versions of node that are still active... little later perhaps if I have time this evening.


Expect this one to pass since it uses an absolute url and legacy API (again without the PR and this dep):

$ node -v && node -e 'var url = require("url"); var x = url.parse("hTtP://example.com/doc/this.html"); console.log(x.protocol)'
v10.11.0
http:

So basically a GH itself i/o page for GH parsing would be helpful when base is specified... would be helpful for a final test... still contemplating if a local projects i/o page could handle this.

Will test our sanitizer shortly to see if it's handling it the same way as node... our code is using case insensitive tests so we don't have to change that part... just checking sanitizer to be ultra-safe since there could be catastrophic results with across-the-pond projects and up/down-stream.


Pass with filtering on our sanitizer (post PR and with this dep)... albeit this isn't all sanitizers. Doesn't care if it's http or HtTp still strips the href if not allowed at this time.


Point still being more tests are still recommended to be added in case it complies and in case it doesn't. e.g. be careful. :)

Apologies for the reedits... attempting to make this more clear for the masses and to keep the noise level down.

@UziTech UziTech mentioned this pull request Dec 5, 2018
4 tasks
@styfle
Copy link
Member

styfle commented Dec 6, 2018

Closing in favor of #1384

@styfle styfle closed this Dec 6, 2018
zhenalexfan pushed a commit to zhenalexfan/MarkdownHan that referenced this pull request Nov 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants