Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when parsing single-quoted URL strings #20

Closed
mre opened this issue Apr 19, 2021 · 5 comments · Fixed by #21
Closed

Error when parsing single-quoted URL strings #20

mre opened this issue Apr 19, 2021 · 5 comments · Fixed by #21

Comments

@mre
Copy link
Contributor

mre commented Apr 19, 2021

There is a bug with single-quoted URLs.

Input:

'https://example.org''

Expected:

https://example.org

Actually extracted link:

https://example.org''

Downstream issue in lychee: lycheeverse/lychee#230

@robinst
Copy link
Owner

robinst commented Apr 26, 2021

Hmm this is an interesting one. Note that GitHub's own autolinking does the same thing: https://www.google-analytics.com/analytics.js','ga

A couple of different ideas to fix this:

  1. Check if there's a quote before the scheme, and if yes, stop the URL once we encounter one of those quotes. That would mean 'https://example.org'' would not include the quotes, but something like 'https://example.com, https://example.org','foo would still include the quotes in the second URL. Maybe that's good enough? It would fix your specific example.
  2. If the first character after the first ' is some kind of delimiter like , or ; or . (what else?), stop the URL. That would even get the 'https://example.com, https://example.org','foo example right, but maybe it's too restrictive?

Would love to hear your thoughts. Do you have any test cases for lychee where we could see if one of the options works better?

@mre
Copy link
Contributor Author

mre commented Apr 26, 2021

Note that GitHub's own autolinking does the same thing

We are in good company. 😆

Option 1 sounds like a good idea to me. I'd say that 'https://example.com, https://example.org','foo would indicate a syntax error anyway; so perhaps we can ignore that case (for now). There are quite a few quote characters, though. Also, how would we handle escaped quotes (e.g. \'https://example.org'? My suggestion is to create a static array of "quote characters" (or use a crate if we find one) and not care about edge-cases too much right now.

Option 2 sounds more complicated because we could easily miss delimiters or edge-cases. For instance https://example.org'() is a valid URL according to RFC 1738 (Also see this StackOverflow answer) — even though the special characters are not encoded. We might run into other issues down the road.

Do you have any test cases for lychee where we could see if one of the options works better?

Apart from the Google Analytics link above, no. I'll keep an eye out for more.

@mre
Copy link
Contributor Author

mre commented Apr 26, 2021

I've updated the PR at #21 according to option 1.

robinst added a commit that referenced this issue May 18, 2021
Fix parsing quoted strings (#20)
@robinst
Copy link
Owner

robinst commented May 18, 2021

@mre
Copy link
Contributor Author

mre commented May 18, 2021

Yay. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants