fix: improved URL regex #230

Merged
merged 2 commits into from Nov 7, 2021

Conversation

@amadejpapez (Collaborator) commented Nov 6, 2021

⚠ Pull Requests not made with this template will be automatically closed 🔥

Prerequisites

Why do we need this pull request?

This should fix a few issues we were seeing with URLs. I went through the regex and modified some parts. There may still be some edge cases, but with these changes I saw much better results.

I also added more examples, and https://www.google.com now matches fully.

I have written an explanation of the regex, from the start of the URL to the end, to make it easier and quicker to review. Feedback is welcome, so it can get even better. :)

(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?

  • http/https/ftp is still optional at the beginning.
  • Subdomains can contain [a-z0-9-_~] with . in between. Previously this part was matched as a whole, which caused something like wwww.....google.com to be valid.
  • Domain name can contain [a-z0-9-] and is 1-62 characters long.
  • Valid TLD from our list.
  • There can be a port number specified.
  • Path can contain [a-z0-9-_~.] with / in-between.
  • If there is a ? or # character, it matches everything after it until there is a space or line break. I do not think these characters can be limited any further.
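
To make the points above easier to check by hand, here is a minimal standalone sketch that compiles the pattern exactly as written and runs it against a few inputs. The names `URL_REGEX` and `is_full_url` are made up for this illustration and are not part of PyWhat; whole-string matching with `fullmatch` is also an assumption made here, not necessarily how PyWhat applies the pattern.

```python
import re

# Pattern copied verbatim from the description above, split across lines
# only for readability (adjacent string literals are concatenated).
URL_REGEX = re.compile(
    r"(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*"
    r"[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?"
    r"(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?"
)


def is_full_url(text: str) -> bool:
    """Return True if the whole string matches the URL pattern.

    Hypothetical helper for illustration only.
    """
    return URL_REGEX.fullmatch(text) is not None


# Sample inputs chosen for this sketch, one per bullet point where possible.
samples = [
    "https://www.google.com",                 # scheme + subdomain + TLD
    "http://user:pass@sub.example.io/",       # credentials and trailing slash
    "example.org:8080/path/to/page?x=1#top",  # port, path, query and fragment
    "GOOGLE.COM",                             # (?i) makes matching case-insensitive
    "wwww.....google.com",                    # repeated dots between labels
    "example.xyz",                            # TLD not in the list
]

for sample in samples:
    print(f"{sample!r}: {is_full_url(sample)}")
```

With whole-string matching, the first four samples should be accepted and the last two rejected. A plain `search` would still find `google.com` inside `wwww.....google.com`, so the result depends on how the pattern is anchored.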

What GitHub issues does this fix?

Copy / paste of output

Please copy and paste the output of PyWhat with your new addition using an example that tests this addition below:

@codecov-commenter commented Nov 6, 2021

Codecov Report

Merging #230 (42491e8) into main (071a962) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #230   +/-   ##
=======================================
  Coverage   92.60%   92.60%           
=======================================
  Files          15       15           
  Lines        1217     1217           
=======================================
  Hits         1127     1127           
  Misses         90       90           


@bee-san bee-san merged commit a5a4a3b into main Nov 7, 2021
@bee-san bee-san deleted the improve-url branch November 7, 2021 11:16
@ghost ghost left a comment

Please change url generation script

@amadejpapez (Collaborator, Author)

> Please change url generation script

What change is needed? Regex is no longer hard-coded in there.

@ghost commented Nov 7, 2021

> > Please change url generation script
>
> What change is needed? Regex is no longer hard-coded in there.

Oh, that is great!
