Crash on invalid robots.txt redirect #363
Comments
Thanks for reporting an issue, @erwinw! Could you please supply a small reproducible test case? I tested with the following code and the crawler seemed to churn away happily (at least for the first minute).

```javascript
"use strict";

var Crawler = require("simplecrawler"),
    moment = require("moment");

function log() {
    var args = Array.from(arguments),
        now = moment().format("hh:mm:ss");

    args.unshift(now);
    console.log.apply(console, args);
}

var crawler = new Crawler("https://99ranch.com/");

// When this was set to false (it is true by default), the crawler stopped
// after the first request, since the robots.txt disallows all URLs for all
// user agents
// crawler.respectRobotsTxt = false;

crawler.on("fetcherror", function(queueItem, response) {
    log("fetcherror", queueItem.url);
});

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
    log("fetchcomplete", queueItem.url);
});

crawler.on("crawlstart", function() {
    log("crawlstart");
});

crawler.on("complete", function() {
    log("complete");
});

crawler.start();
```
Thanks for picking up on this, @fredrikekelund. The issue occurs on the http:// link, which (incorrectly) redirects to https://.
Sorry about the painfully slow response here. The cause of this problem is the fact that this particular site returns a faulty redirect. I will add a try/catch block around the URI construction. More soon!
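A minimal sketch of the guard described above, using Node's built-in `URL` class for illustration (simplecrawler itself uses URI.js, but both throw on this kind of malformed redirect target):

```javascript
// Hypothetical helper: wrap URL construction in try/catch so a faulty
// redirect target surfaces as an error value instead of crashing the process.
function safeParseUrl(raw) {
    try {
        return { url: new URL(raw), error: null };
    } catch (error) {
        return { url: null, error: error };
    }
}

// A valid robots.txt URL parses cleanly
console.log(safeParseUrl("https://99ranch.com/robots.txt").error); // null

// The faulty redirect target from this issue throws inside the try block,
// because "robots.txt" is not a valid port number
console.log(safeParseUrl("https://99ranch.com:robots.txt").error !== null); // true
```

With a guard like this, the crawler can pass the caught error to the fetch callback rather than letting the exception propagate.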
I came across the same problem when crawling faulty URLs, such as can occur in forums / discussion threads / wikis. The stack trace is a little bit different:
Would you filter bad URLs like these in `Crawler.cleanExpandResources()`?
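A hedged sketch of the kind of filtering asked about here: dropping discovered URLs that fail to parse before they are queued. Node's built-in `URL` stands in for URI.js for illustration; `dropUnparseableUrls` is a hypothetical helper, not part of simplecrawler's API.

```javascript
// Hypothetical filter: keep only URLs that can be constructed without
// throwing. This is the shape a filter inside cleanExpandResources
// might take.
function dropUnparseableUrls(urls) {
    return urls.filter(function (raw) {
        try {
            new URL(raw); // throws on invalid hostnames/ports
            return true;
        } catch (error) {
            return false;
        }
    });
}

var cleaned = dropUnparseableUrls([
    "https://example.com/page",
    "https://99ranch.com:robots.txt" // invalid: "robots.txt" is not a port
]);

console.log(cleaned); // [ 'https://example.com/page' ]
```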
Thanks @fredrikekelund! |
@fredrikekelund with uri.js' current form there is nothing to catch here, right? uri.js functions that can throw (according to the docs): `protocol()`. Are there better alternatives, validation-wise, than uri.js?
@konstantinblaesi with the latest version of uri.js, the constructor throws both on invalid hostnames and on invalid ports, is that correct? In that case this issue should be resolved, since we now explicitly depend on the latest version of uri.js (or upwards) and we already have the try/catch blocks around the places where we call the constructor.
@fredrikekelund this should definitely be resolved now. Ports have to be of type Integer, and hostnames cannot be empty if the protocol specified when constructing the URI.js instance was one of these: https://github.com/medialize/URI.js/blob/v1.18.12/src/URI.js#L248
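A quick check of the behavior being relied on here, using Node's built-in `URL` rather than URI.js for illustration: construction of the faulty redirect target from this issue throws, so a surrounding try/catch turns the crash into a catchable error.

```javascript
// Illustrative check: an invalid port in the redirect target makes
// URL construction throw, which a try/catch guard can then intercept.
var threw = false;
try {
    new URL("https://99ranch.com:robots.txt");
} catch (error) {
    threw = true;
}

console.log(threw); // true
```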
Great! I'll cut a new release in a few hours then 🐿️
Hi!
Thanks for a very useful module. I'm unfortunately experiencing an exception when trying to parse a URL where the robots.txt download redirects to an invalid URL. The source URL is `http://99ranch.com/`; the robots.txt link redirects to `https://99ranch.com:robots.txt`, which is not a valid link (since `robots.txt` is not a valid port number). Unfortunately, the result is a stack trace:
In my opinion, the callback should've been called with an error object, just like in all the other cases. Unfortunately, it doesn't seem like there's any validation of the URL when fetching robots.txt.