
Crawler crashes with Invalid input #415

Open
abbasharoon opened this issue Jan 5, 2018 · 4 comments

Comments

@abbasharoon

Hi,

Thanks for the amazing package. It works like a breeze.
I am having a strange problem with a single site that I want to crawl. The URL is http://lesleyevers.com
After it has scanned some pages from the site, the crawler crashes abruptly with the following error:

0|www | at URI.p.href (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:1249:13)
0|www | at new URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:70:10)
0|www | at URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:46:16)
0|www | at /var/www/node_modules/simplecrawler/lib/crawler.js:1744:24
0|www | at FetchQueue.oldestUnfetchedItem (/var/www/node_modules/simplecrawler/lib/queue.js:250:13)
0|www | at Crawler.crawl (/var/www/node_modules/simplecrawler/lib/crawler.js:1738:19)
0|www | at ontimeout (timers.js:386:14)
0|www | at tryOnTimeout (timers.js:250:5)
0|www | at Timer.listOnTimeout (timers.js:214:5)

I added the debug code given on the main page; its last output before the error was:

0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/830/Lucille_on_cyn_1__31127.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/coffee-10.jpg
0|www | fetched 138 of 625 — 2 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/new-years-card.jpg
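
(For reference, that output comes from instrumenting the crawler's event emitter, roughly as the debugging section of the simplecrawler README suggests. A minimal sketch of that kind of instrumentation, assuming an existing simplecrawler instance named crawler:)

const originalEmit = crawler.emit;
crawler.emit = function(evtName, queueItem) {
    // Log every event name together with the URL it concerns, producing
    // lines like "fetchstart http://cdn6.bigcommerce.com/...".
    console.log(evtName, queueItem && queueItem.url ? queueItem.url : "");
    originalEmit.apply(crawler, arguments);
};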

My SimpleCrawler version is 1.1.6.

I tried to gather more data that might help with debugging, but couldn't find anything useful. Kindly let me know if anything else is required from my end.

Thanks :)

@konstantinblaesi
Contributor

Can you provide a code sample that reliably reproduces the error? It would be really helpful to see the values of the variables protocol, hostname and port for the URL causing the crash.
A while ago I merged a change for URI.js that made its URL validation (for hostname and port) stricter, but it was disabled by default a few days/weeks later, because some of its users apparently abuse URI.js as a template library instead of using it for URL validation/construction/tokenization :(
See this comment
To enable the stricter validation you can do this before starting your crawler:

// Enable URI.js's strict hostname/port validation globally
const URI = require('urijs');
URI.preventInvalidHostname = true;

Please let us know if this solves the issue.
Adding another try/catch block in simplecrawler might help as well, but in my opinion this should not be necessary because faulty URLs shouldn't be queued in the first place.
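
(As a usage sketch, here is where that toggle would sit relative to crawler startup, assuming simplecrawler's documented constructor and start() API. One caveat: the stack traces above show simplecrawler loading a nested copy of urijs, so the flag only takes effect if require('urijs') resolves to that same instance.)

const URI = require('urijs');
const Crawler = require('simplecrawler');

// Turn on strict hostname/port validation globally, before any URLs
// are parsed or queued. Note: the stack traces in this thread point at
// simplecrawler/node_modules/urijs (a nested copy), so this flag only
// helps if this require() resolves to that same copy of the module.
URI.preventInvalidHostname = true;

const crawler = new Crawler('http://lesleyevers.com');
crawler.start();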

@abbasharoon
Author

Hi @konstantinblaesi, thanks for the quick response.
I tried enabling preventInvalidHostname, but it didn't prevent the error from happening.
I tried catching the uri details via a try/catch in crawler.js; the following details were logged for the uri variable at line 1775.

{ [String: 'https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865']
  _string: 'https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865',
  _parts:
   { protocol: 'https',
     username: null,
     password: null,
     hostname: null,
     urn: null,
     port: null,
     path: '/cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg',
     query: '2662936180569856865',
     fragment: null,
     preventInvalidHostname: false,
     duplicateQueryParameters: false,
     escapeQuerySpace: true },
  _deferred_build: false }

I think it's the extra slash after the protocol that's causing the error.
Catching the error didn't stop the process from crashing; it failed on another error further down the line:

TypeError: undefined is not a valid argument for URI
    at new URI (/home/vagrant/node/node_modules/simplecrawler/node_modules/urijs/src/URI.js:54:15)
    at URI (/home/vagrant/node/node_modules/simplecrawler/node_modules/urijs/src/URI.js:46:16)
    at /home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:1793:28
    at FetchQueue.oldestUnfetchedItem (/home/vagrant/node/node_modules/simplecrawler/lib/queue.js:250:13)
    at Crawler.crawl (/home/vagrant/node/node_modules/simplecrawler/lib/crawler.js:1769:17)
    at ontimeout (timers.js:365:14)
    at tryOnTimeout (timers.js:237:5)
    at Timer.listOnTimeout (timers.js:207:5)

I am not sure how to proceed on this issue. If I add link verification to the link discovery method, will that help prevent the error?
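
(One possible stopgap along those lines is simplecrawler's addFetchCondition hook, which runs before an item is queued. A sketch, assuming the async fetch-condition signature of simplecrawler 1.x and that the malformed items arrive here with an empty host:)

// Reject any discovered URL whose host never parsed, so items like
// "https:///cdn.shopify.com/..." never enter the queue.
crawler.addFetchCondition(function(queueItem, referrerQueueItem, callback) {
    if (!queueItem.host) {
        console.warn('skipping malformed URL:', queueItem.url);
        return callback(null, false); // false keeps the item out of the queue
    }
    callback(null, true);
});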

@konstantinblaesi
Contributor

As you can see from your logged details, the problem with this URL

https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865

is that URI.js fails to detect the hostname when there is a triple slash after the protocol/scheme. Simplecrawler later tries to construct a new URI.js object with the empty hostname it was given by URI.js earlier. The construction works, but the .href() call on the new object fails. I've reported the issue to URI.js in medialize/URI.js#365. Let's wait for their opinion.
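
(The parse failure is easy to reproduce in isolation with plain URI.js; this sketch simply replays the _parts dump logged above:)

const URI = require('urijs');

// The triple slash after the scheme leaves the hostname empty and
// pushes the would-be host into the path, matching the logged _parts.
const uri = URI('https:///cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg?2662936180569856865');
console.log(uri.hostname()); // '' (no hostname was detected)
console.log(uri.path());     // '/cdn.shopify.com/s/files/1/0333/9621/files/forbiiden_banner_small.jpg'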

@abbasharoon
Author

Okay, thanks a lot. I have downgraded to the previous version as a temporary measure; for the time being it's a lifesaver.
I have one other issue with robots.txt; I think it will be better to post it as a separate issue.
