
Crawler crashes with Invalid input #415

Open
abbasharoon opened this issue Jan 5, 2018 · 4 comments

Comments

@abbasharoon

Hi,

Thanks for the amazing package. It works like a breeze.
I am having a strange problem with a single site that I want to crawl. The URL is http://lesleyevers.com
After it has scanned some pages from the site, the crawler crashes abruptly with the following error:

0|www | at URI.p.href (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:1249:13)
0|www | at new URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:70:10)
0|www | at URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:46:16)
0|www | at /var/www/node_modules/simplecrawler/lib/crawler.js:1744:24
0|www | at FetchQueue.oldestUnfetchedItem (/var/www/node_modules/simplecrawler/lib/queue.js:250:13)
0|www | at Crawler.crawl (/var/www/node_modules/simplecrawler/lib/crawler.js:1738:19)
0|www | at ontimeout (timers.js:386:14)
0|www | at tryOnTimeout (timers.js:250:5)
0|www | at Timer.listOnTimeout (timers.js:214:5)

I added the debug code given on the main page; its last output before the error was:

0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/830/Lucille_on_cyn_1__31127.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/coffee-10.jpg
0|www | fetched 138 of 625 — 2 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/new-years-card.jpg
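
(For reference, that output comes from instrumenting the crawler's event emitter, roughly as the debugging section of the simplecrawler README suggests. A minimal sketch of that kind of instrumentation, assuming an existing simplecrawler instance named crawler:)

const originalEmit = crawler.emit;
crawler.emit = function(evtName, queueItem) {
    // Log every event name together with the URL it concerns, producing
    // lines like "fetchstart http://cdn6.bigcommerce.com/...".
    console.log(evtName, queueItem && queueItem.url ? queueItem.url : "");
    originalEmit.apply(crawler, arguments);
};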

My SimpleCrawler version is 1.1.6.

I tried to gather more data that might help with debugging, but couldn't find anything useful. Kindly let me know if anything else is required from my end.

Thanks :)

@konstantinblaesi
Contributor

Can you provide a code sample that reliably reproduces the error? It would be really helpful to see the values of the variables protocol, hostname and port for the URL causing the crash.
A while ago I merged a change for URI.js that made its URL validation (for hostname and port) stricter, but it was disabled by default a few days/weeks later, because some of its users apparently abuse URI.js as a template library instead of using it for URL validation/construction/tokenization :(
See this comment
To enable the stricter validation you can do this before starting your crawler:

// Enable URI.js's strict hostname/port validation globally
const URI = require('urijs');
URI.preventInvalidHostname = true;

Please let us know if this solves the issue.
Adding another try/catch block in simplecrawler might help as well, but in my opinion this should not be necessary because faulty URLs shouldn't be queued in the first place.
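
(As a usage sketch, here is where that toggle would sit relative to crawler startup, assuming simplecrawler's documented constructor and start() API. One caveat: the stack traces above show simplecrawler loading a nested copy of urijs, so the flag only takes effect if require('urijs') resolves to that same instance.)

const URI = require('urijs');
const Crawler = require('simplecrawler');

// Turn on strict hostname/port validation globally, before any URLs
// are parsed or queued. Note: the stack traces in this thread point at
// simplecrawler/node_modules/urijs (a nested copy), so this flag only
// helps if this require() resolves to that same copy of the module.
URI.preventInvalidHostname = true;

const crawler = new Crawler('http://lesleyevers.com');
crawler.start();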

@abbasharoon
Author

Hi @konstantinblaesi, thanks for the quick response.
I tried enabling preventInvalidHostname, but it didn't prevent the error from happening.
I tried catching the uri details via a try/catch in crawler.js; the following details were logged for the uri variable at line 1775.

{ [String: 'https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865']
  _string: 'https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865',
  _parts:
   { protocol: 'https',
     username: null,
     password: null,
     hostname: null,
     urn: null,
     port: null,
     path: '/cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg',
     query: '2662936180569856865',
     fragment: null,
     preventInvalidHostname: false,
     duplicateQueryParameters: false,
     escapeQuerySpace: true },
  _deferred_build: false }

I think it's the extra slash after the protocol that's causing the error.
Catching the error didn't stop the process from crashing; it failed on another error further down the line:

TypeError: undefined is not a valid argument for URI
    at new URI (/home/vagrant/node/node_modules/simplecrawler/node_modules/urijs/src/URI.js:54:15)
    at URI (/home/vagrant/node/node_modules/simplecrawler/node_modules/urijs/src/URI.js:46:16)
    at /home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:1793:28
    at FetchQueue.oldestUnfetchedItem (/home/vagrant/node/node_modules/simplecrawler/lib/queue.js:250:13)
    at Crawler.crawl (/home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:1769:17)
    at ontimeout (timers.js:365:14)
    at tryOnTimeout (timers.js:237:5)
    at Timer.listOnTimeout (timers.js:207:5)

I am not sure how to proceed on this issue. If I add link verification to the link discovery method, will that help prevent the error?
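
(One possible stopgap along those lines is simplecrawler's addFetchCondition hook, which runs before an item is queued. A sketch, assuming the async fetch-condition signature of simplecrawler 1.x and that the malformed items arrive here with an empty host:)

// Reject any discovered URL whose host never parsed, so items like
// "https:///cdn.shopify.com/..." never enter the queue.
crawler.addFetchCondition(function(queueItem, referrerQueueItem, callback) {
    if (!queueItem.host) {
        console.warn('skipping malformed URL:', queueItem.url);
        return callback(null, false); // false keeps the item out of the queue
    }
    callback(null, true);
});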

@konstantinblaesi
Contributor

As you can see from your logged details, the problem with this URL

https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865

is that URI.js fails to detect the hostname when there is a triple slash after the protocol/scheme. Simplecrawler later tries to construct a new URI.js object with the empty hostname it was given by URI.js earlier. The construction works, but the .href() call on the new object fails. I've reported the issue to URI.js in medialize/URI.js#365. Let's wait for their opinion.
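
(The parse failure is easy to reproduce in isolation with plain URI.js; this sketch simply replays the _parts dump logged above:)

const URI = require('urijs');

// The triple slash after the scheme leaves the hostname empty and
// pushes the would-be host into the path, matching the logged _parts.
const uri = URI('https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865');
console.log(uri.hostname()); // '' (no hostname was detected)
console.log(uri.path());     // '/cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg'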

@abbasharoon
Author

Okay, thanks a lot. I have downgraded to the previous version as a temporary measure; for the time being it's a lifesaver.
I have one other issue with robots.txt; I think it will be better to post it as a separate issue.
