Crawler crashes with Invalid input #415
Comments
Can you provide a code sample that reliably reproduces the error? It would be really helpful to see the values of the variables protocol, hostname and port for the URL causing the crash.
Please let us know if this solves the issue.
Hi @konstantinblaesi, thanks for the quick response.
I think it's the extra slash in the protocol that's causing the error.
I am not sure how to proceed with this issue. If I add link verification in the link discovery method, will that help prevent the error?
As you can see from your logged details, the problem with this URL
is that URI.js fails to detect the hostname when the protocol/scheme is followed by a triple slash. Simplecrawler later tries to construct a new URI.js object with the empty hostname it was given by URI.js. The construction succeeds, but the .href() call on the new object fails. I've reported the issue to URI.js in medialize/URI.js#365. Let's wait for their opinion.
Okay, thanks a lot. I have downgraded to the previous version as a temporary measure. For the time being, it has been a lifesaver.
Hi,
Thanks for the amazing package. It's a breeze to work with.
I am having a strange problem with it on a single site that I want to crawl. The URL is http://lesleyevers.com
After scanning some pages from it, the crawler crashes with the following error:
0|www | at URI.p.href
(/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:1249:13)
0|www | at new URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:70:10)
0|www | at URI (/var/www/node_modules/simplecrawler/node_modules/urijs/src/URI.js:46:16)
0|www | at /var/www/node_modules/simplecrawler/lib/crawler.js:1744:24
0|www | at FetchQueue.oldestUnfetchedItem (/var/www/node_modules/simplecrawler/lib/queue.js:250:13)
0|www | at Crawler.crawl (/var/www/node_modules/simplecrawler/lib/crawler.js:1738:19)
0|www | at ontimeout (timers.js:386:14)
0|www | at tryOnTimeout (timers.js:250:5)
0|www | at Timer.listOnTimeout (timers.js:214:5)
I added the debug code given on the main page; its last output before the error was:
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/830/Lucille_on_cyn_1__31127.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 136 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/831/Lucille_on_Cyn_3__70665.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 137 of 625 — 1 open requests, 0 open listeners
0|www | fetchheaders http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 0 open requests, 0 open listeners
0|www | downloadprevented http://cdn6.bigcommerce.com/s-zwl80h/images/stencil/60x90/products/218/832/Lucille_on_Cyn_4__97434.1509683204.jpg?c=2
0|www | fetched 138 of 625 — 1 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/coffee-10.jpg
0|www | fetched 138 of 625 — 2 open requests, 0 open listeners
0|www | fetchstart https://store-zwl80h.mybigcommerce.com/product_images/uploaded_images/new-years-card.jpg
My SimpleCrawler version is 1.1.6.
I tried to gather more helpful debugging data, but couldn't. Kindly let me know if anything else is required from my end.
Thanks :)