Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced http with https in most URLs #2852

Merged
merged 5 commits into from Oct 12, 2021

Conversation

tomaarsen
Copy link
Member

Hello!

Pull request overview

  • Replaced http:// with https:// in certain URLs.
  • Sometimes replaced http:// with https://www..
  • Replaced some old links to nltk.github.com with links to https://www.nltk.org.
  • Replaced some old links to http://nltk.sourceforge.net with links to https://www.nltk.org.

Workflow

I followed this workflow to find out which URLs to update:

  1. Search for all http:// in the codebase.
  2. Open the old website, and open a version with https://.
  3. If both get the exact same result, then replace http:// with https://.
    (Keep in mind, this also means if both websites give a 404.)
  4. If the http:// doesn't load, but the https:// does, then use https://. (This ended up not happening in 500+ links)
  5. Only keep http:// if the http:// loads, and the https:// doesn't!

Exceptions

  • I didn't update e.g. LICENSE.txt.
  • Kept http:// in tweets and twitter.ipynb.

Other changes

I also removed this link, it seems to link to shady websites.

# For more information, see http://lilyx.net/pages/nltkjapanesecorpus.html

Notes

There are 454 more occurrences of http:// in the entire codebase, across just 35 files. Most of these are in the aforementioned twitter.ipynb and the unit test tweets. The remainder are across the entire codebase.
I also removed all uses of http in the website. Most important of this is Example Usage <http://www.nltk.org/howto> in the left menu of the website.

What now?

@stevenbird
Upon merging this PR, you might want to enforce HTTPS for the website. It's simply a button under Settings > Pages in https://github.com/nltk/nltk.github.com, called Enforce HTTPS. I would do this myself, but I don't have the permissions.

Future work

Perhaps we would benefit from making a script that crawls the codebase and queries all URLs. That way, we know which ones work and which ones don't. That said, that information in itself is not immediately useful. We would need to find internet archives of the old websites, or find the new locations of the original websites, if they moved.

cc: @iliakur @purificant

  • Tom Aarsen

@tomaarsen
Copy link
Member Author

It seems there was an obscure bug with the CoreNLP functionality that fails if ran with https://localhost instead of http://localhost. It has been resolved now. I don't intend to make further changes to this PR unless requested.

@purificant
Copy link
Member

@tomaarsen I think all localhost urls should remain http, it does not make sense to use https in a local context where domain name can not be verified.

@tomaarsen
Copy link
Member Author

I agree. I believe to have reverted all cases back to http://localhost like before.

Copy link
Member

@purificant purificant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I went through this PR but hadn't submitted the review, so all my comments are still pending, doing this now to see what is still relevant.

nltk/app/wordnet_app.py Outdated Show resolved Hide resolved
nltk/app/wordnet_app.py Outdated Show resolved Hide resolved
nltk/app/wordnet_app.py Outdated Show resolved Hide resolved
nltk/classify/maxent.py Outdated Show resolved Hide resolved
nltk/classify/senna.py Outdated Show resolved Hide resolved
web/dev/jenkins.rst Outdated Show resolved Hide resolved
web/news.rst Outdated Show resolved Hide resolved
web/news.rst Outdated Show resolved Hide resolved
web/news.rst Outdated Show resolved Hide resolved
web/team.rst Outdated Show resolved Hide resolved
This way, the website can still be recovered using an internet archive
@tomaarsen
Copy link
Member Author

tomaarsen commented Oct 12, 2021

In short, all websites that give a 404, 403, 401, DNS error, timeout, etc. were kept as what they were (generally http). This way, users can use an Internet Archive to find the original website again. Also, http was preserved in XML and HTML.
Beyond that, some other changes were made, e.g. removing links in senna.py and featgram.doctest.

Thanks a ton for reviewing this @purificant!!

@stevenbird stevenbird merged commit 90fa546 into nltk:develop Oct 12, 2021
@stevenbird
Copy link
Member

Thanks @tomaarsen and @purificant – this was long overdue!

@tomaarsen tomaarsen deleted the enhancement/https-ify branch October 12, 2021 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants