This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Crash on invalid robots.txt redirect #363

Closed
ghost opened this issue Apr 3, 2017 · 9 comments

@ghost

ghost commented Apr 3, 2017

Hi!

Thanks for a very useful module. I'm unfortunately hitting an exception when crawling a URL whose robots.txt download redirects to an invalid URL. The source URL is http://99ranch.com/; the robots.txt request redirects to https://99ranch.com:robots.txt, which is not a valid URL (since robots.txt is not a valid port number).

Unfortunately, the result is a stack trace:

net.js:979
      throw new RangeError('"port" option should be >= 0 and < 65536: ' + port);
      ^

RangeError: "port" option should be >= 0 and < 65536: robots.txt
    at lookupAndConnect (net.js:979:13)
    at TLSSocket.Socket.connect (net.js:954:5)
    at Object.exports.connect (_tls_wrap.js:1056:12)
    at Agent.createConnection (https.js:83:22)
    at Agent.createSocket (_http_agent.js:195:26)
    at Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:212:16)
    at Object.request (http.js:26:10)
    at Object.request (https.js:206:15)
    at Crawler.getRobotsTxt (/.../node_modules/simplecrawler/lib/crawler.js:603:32)

In my opinion, the callback should have been called with an error object, just like in all the other error cases. Unfortunately, there doesn't seem to be any validation of the URL when fetching robots.txt.
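
For reference, a minimal sketch (my own illustration, not simplecrawler code) of where the crash comes from: Node's http/https layer throws a RangeError synchronously when the port option is not a number between 0 and 65535, so the process dies unless the caller wraps the request in a try/catch.

"use strict";

var https = require("https");

// With the Node version from the stack trace above, the RangeError is thrown
// synchronously while the request is being created, so it can be caught here.
try {
    https.get({ host: "99ranch.com", port: "robots.txt", path: "/robots.txt" });
} catch (error) {
    // RangeError: "port" option should be >= 0 and < 65536: robots.txt
    console.error(error.message);
}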

@fredrikekelund
Collaborator

Thanks for reporting an issue, @erwinw! Could you please supply a small reproducible test case? I tested with the following code and the crawler seemed to churn away happily (at least for the first minute).

"use strict";

var Crawler = require("simplecrawler"),
    moment = require("moment");

function log() {
    var args = Array.from(arguments),
        now = moment().format("hh:mm:ss");

    args.unshift(now);
    console.log.apply(console, args);
}

var crawler = new Crawler("https://99ranch.com/");

// When this was set to true (as it is by default), the crawler stopped after the first request,
// since the robots.txt disallows all URLs for all user agents
// crawler.respectRobotsTxt = false;

crawler.on("fetcherror", function(queueItem, response) {
    log("fetcherror", queueItem.url);
});

crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    log("fetchcomplete", queueItem.url);
});

crawler.on("crawlstart", function() {
    log("fetchstart");
});

crawler.on("complete", function() {
    log("complete");
});

crawler.start();

@ghost
Author

ghost commented Apr 7, 2017

Thanks for picking up on this, @fredrikekelund. The issue occurs on the http:// link, which (incorrectly) redirects to https://.

@fredrikekelund
Collaborator

Sorry about the painfully slow response here. The cause of this problem is that this particular site returns a faulty Location header: https://99ranch.com: (notice the trailing colon). You are quite right that we don't do any error handling when constructing redirect URLs in Crawler#getRobotsTxt. However, it's not urijs that throws this error, it's Node's http library. urijs ought to throw on this, though, so I have reported an issue with them here: medialize/URI.js#344

I will add a try/catch block around the URI construction in Crawler#getRobotsTxt's redirect handling logic, and emit a robotstxterror event when things go wrong. But for this particular case, we won't be able to resolve the problem until we've gotten to the bottom of the upstream issue.
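
A rough sketch of what that guard could look like (an approximation only, not the actual simplecrawler code; the final names and call sites in lib/crawler.js may differ):

"use strict";

var URI = require("urijs");

// Hypothetical helper mirroring the planned guard around the redirect handling
// in Crawler#getRobotsTxt: resolve the Location header defensively and hand
// any parsing error to the callback instead of letting it crash the process.
function resolveRobotsRedirect(locationHeader, robotsTxtUrl, callback) {
    var redirectTarget;

    try {
        redirectTarget = URI(locationHeader).absoluteTo(robotsTxtUrl).normalize();
    } catch (error) {
        // e.g. Location: https://99ranch.com:robots.txt - with a urijs release
        // that validates ports this throws here; with older releases the bad
        // URL still slips through to Node's http layer (the upstream issue).
        callback(error);
        return;
    }

    callback(null, redirectTarget.toString());
}

// The failure mode from this issue:
resolveRobotsRedirect("https://99ranch.com:robots.txt", "http://99ranch.com/robots.txt", function(error, url) {
    console.log(error ? "robots.txt redirect error: " + error.message : "follow redirect to " + url);
});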

More soon!

@konstantinblaesi
Contributor

konstantinblaesi commented Jul 26, 2017

@fredrikekelund

I came across the same problem when crawling faulty URLs such as

http://server:port/foo/bar

which can occur in forums / discussion threads / wikis.

The stack trace is a little bit different:

RangeError: "port" option should be >= 0 and < 65536: port
    at lookupAndConnect (net.js:960:13)
    at Socket.connect (net.js:935:5)
    at Agent.exports.connect.exports.createConnection (net.js:74:35)
    at Agent.createSocket (_http_agent.js:195:26)
    at Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:159:16)
    at Object.exports.request (http.js:31:10)
    at Crawler.getRobotsTxt (***\node_modules\simplecrawler\lib\crawler.js:610:32)
    at ***\node_modules\simplecrawler\lib\crawler.js:1749:29
    at FetchQueue.oldestUnfetchedItem (***\node_modules\simplecrawler\lib\queue.js:250:13)
    at Crawler.crawl (***\node_modules\simplecrawler\lib\crawler.js:1729:19)

Would you filter out bad URLs like these in Crawler.cleanExpandResources()?
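
For illustration, a rough sketch of that kind of filter (a hypothetical helper, not simplecrawler's actual cleanExpandResources implementation; it assumes uri.js as the parser):

"use strict";

var URI = require("urijs");

// Drop discovered URLs whose authority doesn't parse into a sane
// hostname/port before they are queued.
function hasSaneAuthority(url) {
    var parsed, port;

    try {
        parsed = URI(url);
        port = parsed.port();
    } catch (error) {
        // Newer uri.js releases throw here for invalid ports/hostnames
        return false;
    }

    // Reject placeholder ports like "port" in http://server:port/foo/bar
    if (port !== "" && !/^\d+$/.test(port)) {
        return false;
    }

    return parsed.hostname() !== "";
}

console.log(hasSaneAuthority("http://server:port/foo/bar")); // false
console.log(hasSaneAuthority("http://example.com/foo/bar")); // true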

@ghost
Author

ghost commented Jul 26, 2017

Thanks @fredrikekelund!

@konstantinblaesi
Contributor

@fredrikekelund, in uri.js' current form there is nothing to catch here, right?
As far as I understood your comments, you're considering making a pull request to introduce proper validation in the constructor/factory used to create new URI instances? If that pull request happens, your try/catch makes sense, and I assume you don't want to introduce too much URL validation in simplecrawler itself?
According to uri.js' code, these functions can throw a TypeError, but I think they only do so if you use them as setters. It looks like uri.js does no validation on construction of URI instances or when using the getters only?

uri.js functions that can throw (according to docs):

protocol()
scheme()
host()
port()
authority()
domain()
subdomain()
tld()

Are there better alternatives, validation-wise, than uri.js?
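
As a side note (an assumption on my part, not something covered above): the WHATWG URL parser bundled with recent Node versions rejects non-numeric ports outright, so it could serve as a cross-check independent of uri.js.

"use strict";

var URL = require("url").URL;

// Hypothetical helper: new URL() throws a TypeError when a URL doesn't parse,
// including the non-numeric-port cases from this thread.
function parsesCleanly(url) {
    try {
        new URL(url);
        return true;
    } catch (error) {
        return false;
    }
}

console.log(parsesCleanly("http://server:port/foo/bar"));     // false
console.log(parsesCleanly("https://99ranch.com:robots.txt")); // false
console.log(parsesCleanly("https://99ranch.com/robots.txt")); // true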

@fredrikekelund
Collaborator

@konstantinblaesi, with the latest version of uri.js the constructor throws both on invalid hostnames and on invalid ports, is that correct?

In that case this issue should be resolved, since we now explicitly depend on the latest version of uri.js (or newer) and we already have try/catch blocks around the places where we call the URI constructor. Right?

@konstantinblaesi
Contributor

konstantinblaesi commented Aug 13, 2017

@fredrikekelund this should definitely be resolved now. Ports have to be of type Integer, and hostnames cannot be empty if the protocol specified when constructing the URI.js instance is one of these: https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L248
Hostname validation happens in https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L1034 and port validation in https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L1059
A few test cases were also added to URI.js to ensure this; see my actual changes to URI.js: medialize/URI.js@0657ad8
I have also verified that simplecrawler no longer crashes if it comes across URLs like http://hostname:port or http:// in sitemaps/pages. I think simplecrawler has a try/catch block in all the places where new URLs are discovered and tested.
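
For illustration, a quick way to observe the new behaviour (assuming a urijs release with the validation described above, i.e. 1.18.12 or later, is installed):

"use strict";

var URI = require("urijs");

// Construction now throws instead of silently carrying a bogus port or an
// empty hostname along to Node's http layer.
["http://hostname:port", "http://", "https://99ranch.com:robots.txt"].forEach(function(url) {
    try {
        URI(url);
        console.log(url + " -> parsed");
    } catch (error) {
        console.log(url + " -> " + error.message);
    }
});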

@fredrikekelund
Collaborator

Great! I'll cut a new release in a few hours then 🐿️
