This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Crash on invalid robots.txt redirect #363

Closed
ghost opened this issue Apr 3, 2017 · 9 comments

@ghost

ghost commented Apr 3, 2017

Hi!

Thanks for a very useful module. I'm unfortunately hitting an exception when crawling a URL whose robots.txt download redirects to an invalid URL. The source URL is http://99ranch.com/; the robots.txt request redirects to https://99ranch.com:robots.txt, which is not a valid URL (since robots.txt is not a valid port number).

Unfortunately, the result is a stack trace:

net.js:979
      throw new RangeError('"port" option should be >= 0 and < 65536: ' + port);
      ^

RangeError: "port" option should be >= 0 and < 65536: robots.txt
    at lookupAndConnect (net.js:979:13)
    at TLSSocket.Socket.connect (net.js:954:5)
    at Object.exports.connect (_tls_wrap.js:1056:12)
    at Agent.createConnection (https.js:83:22)
    at Agent.createSocket (_http_agent.js:195:26)
    at Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:212:16)
    at Object.request (http.js:26:10)
    at Object.request (https.js:206:15)
    at Crawler.getRobotsTxt (/.../node_modules/simplecrawler/lib/crawler.js:603:32)

In my opinion, the callback should have been called with an error object, just like in all the other error cases. Unfortunately, there doesn't seem to be any validation of the URL when fetching robots.txt.
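
For reference, a minimal sketch (my own illustration, not simplecrawler code) of where the crash comes from: Node's http/https layer throws a RangeError synchronously when the port option is not a number between 0 and 65535, so the process dies unless the caller wraps the request in a try/catch.

"use strict";

var https = require("https");

// With the Node version from the stack trace above, the RangeError is thrown
// synchronously while the request is being created, so it can be caught here.
try {
    https.get({ host: "99ranch.com", port: "robots.txt", path: "/robots.txt" });
} catch (error) {
    // RangeError: "port" option should be >= 0 and < 65536: robots.txt
    console.error(error.message);
}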

@fredrikekelund
Collaborator

Thanks for reporting an issue, @erwinw! Could you please supply a small reproducible test case? I tested with the following code and the crawler seemed to churn away happily (at least for the first minute).

"use strict";

var Crawler = require("simplecrawler"),
    moment = require("moment");

function log() {
    var args = Array.from(arguments),
        now = moment().format("hh:mm:ss");

    args.unshift(now);
    console.log.apply(console, args);
}

var crawler = new Crawler("https://99ranch.com/");

// When this was set to true (as it is by default), the crawler stopped after the first request,
// since the robots.txt disallows all URLs for all user agents
// crawler.respectRobotsTxt = false;

crawler.on("fetcherror", function(queueItem, response) {
    log("fetcherror", queueItem.url);
});

crawler.on("fetchcomplete", function (queueItem, responseBuffer, response) {
    log("fetchcomplete", queueItem.url);
});

crawler.on("crawlstart", function() {
    log("fetchstart");
});

crawler.on("complete", function() {
    log("complete");
});

crawler.start();

@ghost
Author

ghost commented Apr 7, 2017

Thanks for picking up on this, @fredrikekelund. The issue occurs on the http:// link, which (incorrectly) redirects to https://.

@fredrikekelund
Collaborator

Sorry about the painfully slow response here. The cause of this problem is that this particular site returns a faulty Location header: https://99ranch.com: (notice the trailing colon). You are quite right that we don't do any error handling when constructing redirect URLs in Crawler#getRobotsTxt. However, it's not urijs that throws this error, it's Node's http library. urijs ought to throw on this, though, so I have reported an issue with them here: medialize/URI.js#344

I will add a try/catch block around the URI construction in Crawler#getRobotsTxt's redirect handling logic, and emit a robotstxterror event when things go wrong. But for this particular case, we won't be able to resolve the problem until we've gotten to the bottom of the upstream issue.
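
A rough sketch of what that guard could look like (an approximation only, not the actual simplecrawler code; the final names and call sites in lib/crawler.js may differ):

"use strict";

var URI = require("urijs");

// Hypothetical helper mirroring the planned guard around the redirect handling
// in Crawler#getRobotsTxt: resolve the Location header defensively and hand
// any parsing error to the callback instead of letting it crash the process.
function resolveRobotsRedirect(locationHeader, robotsTxtUrl, callback) {
    var redirectTarget;

    try {
        redirectTarget = URI(locationHeader).absoluteTo(robotsTxtUrl).normalize();
    } catch (error) {
        // e.g. Location: https://99ranch.com:robots.txt - with a urijs release
        // that validates ports this throws here; with older releases the bad
        // URL still slips through to Node's http layer (the upstream issue).
        callback(error);
        return;
    }

    callback(null, redirectTarget.toString());
}

// The failure mode from this issue:
resolveRobotsRedirect("https://99ranch.com:robots.txt", "http://99ranch.com/robots.txt", function(error, url) {
    console.log(error ? "robots.txt redirect error: " + error.message : "follow redirect to " + url);
});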

More soon!

@konstantinblaesi
Contributor

konstantinblaesi commented Jul 26, 2017

@fredrikekelund

I came across the same problem when crawling faulty URLs such as

http://server:port/foo/bar

which can occur in forums / discussion threads / wikis.

The stack trace is a little bit different:

RangeError: "port" option should be >= 0 and < 65536: port
    at lookupAndConnect (net.js:960:13)
    at Socket.connect (net.js:935:5)
    at Agent.exports.connect.exports.createConnection (net.js:74:35)
    at Agent.createSocket (_http_agent.js:195:26)
    at Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:159:16)
    at Object.exports.request (http.js:31:10)
    at Crawler.getRobotsTxt (***\node_modules\simplecrawler\lib\crawler.js:610:32)
    at ***\node_modules\simplecrawler\lib\crawler.js:1749:29
    at FetchQueue.oldestUnfetchedItem (***\node_modules\simplecrawler\lib\queue.js:250:13)
    at Crawler.crawl (***\node_modules\simplecrawler\lib\crawler.js:1729:19)

Would you filter out bad URLs like these in Crawler.cleanExpandResources()?
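
For illustration, a rough sketch of that kind of filter (a hypothetical helper, not simplecrawler's actual cleanExpandResources implementation; it assumes uri.js as the parser):

"use strict";

var URI = require("urijs");

// Drop discovered URLs whose authority doesn't parse into a sane
// hostname/port before they are queued.
function hasSaneAuthority(url) {
    var parsed, port;

    try {
        parsed = URI(url);
        port = parsed.port();
    } catch (error) {
        // Newer uri.js releases throw here for invalid ports/hostnames
        return false;
    }

    // Reject placeholder ports like "port" in http://server:port/foo/bar
    if (port !== "" && !/^\d+$/.test(port)) {
        return false;
    }

    return parsed.hostname() !== "";
}

console.log(hasSaneAuthority("http://server:port/foo/bar")); // false
console.log(hasSaneAuthority("http://example.com/foo/bar")); // true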

@ghost
Author

ghost commented Jul 26, 2017

Thanks @fredrikekelund!

@konstantinblaesi
Contributor

@fredrikekelund, in uri.js' current form there is nothing to catch here, right?
As far as I understood your comments, you're considering making a pull request to introduce proper validation in the constructor/factory used to create new URI instances? If that pull request happens, your try/catch makes sense, and I assume you don't want to introduce too much URL validation in simplecrawler itself?
According to uri.js' code, these functions can throw a TypeError, but I think they only do so if you use them as setters. It looks like uri.js does no validation on construction of URI instances or when using the getters only?

uri.js functions that can throw (according to docs):

protocol()
scheme()
host()
port()
authority()
domain()
subdomain()
tld()

Are there better alternatives, validation-wise, than uri.js?
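
As a side note (an assumption on my part, not something covered above): the WHATWG URL parser bundled with recent Node versions rejects non-numeric ports outright, so it could serve as a cross-check independent of uri.js.

"use strict";

var URL = require("url").URL;

// Hypothetical helper: new URL() throws a TypeError when a URL doesn't parse,
// including the non-numeric-port cases from this thread.
function parsesCleanly(url) {
    try {
        new URL(url);
        return true;
    } catch (error) {
        return false;
    }
}

console.log(parsesCleanly("http://server:port/foo/bar"));     // false
console.log(parsesCleanly("https://99ranch.com:robots.txt")); // false
console.log(parsesCleanly("https://99ranch.com/robots.txt")); // true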

@fredrikekelund
Collaborator

@konstantinblaesi, with the latest version of uri.js the constructor throws both on invalid hostnames and on invalid ports, is that correct?

In that case this issue should be resolved, since we now explicitly depend on the latest version of uri.js (or newer) and we already have try/catch blocks around the places where we call the URI constructor. Right?

@konstantinblaesi
Contributor

konstantinblaesi commented Aug 13, 2017

@fredrikekelund this should definitely be resolved now. Ports have to be of type Integer, and hostnames cannot be empty if the protocol specified when constructing the URI.js instance is one of these: https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L248
Hostname validation happens in https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L1034 and port validation in https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L1059
A few test cases were also added to URI.js to ensure this; see my actual changes to URI.js: medialize/URI.js@0657ad8
I have also verified that simplecrawler no longer crashes if it comes across URLs like http://hostname:port or http:// in sitemaps/pages. I think simplecrawler has a try/catch block in all the places where new URLs are discovered and tested.
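
For illustration, a quick way to observe the new behaviour (assuming a urijs release with the validation described above, i.e. 1.18.12 or later, is installed):

"use strict";

var URI = require("urijs");

// Construction now throws instead of silently carrying a bogus port or an
// empty hostname along to Node's http layer.
["http://hostname:port", "http://", "https://99ranch.com:robots.txt"].forEach(function(url) {
    try {
        URI(url);
        console.log(url + " -> parsed");
    } catch (error) {
        console.log(url + " -> " + error.message);
    }
});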

@fredrikekelund
Collaborator

Great! I'll cut a new release in a few hours then 🐿️
