High performance: Store gets flooded when too many pages are crawled #28

Closed
happysalada opened this issue Feb 10, 2020 · 6 comments

@happysalada
Contributor

If I try to launch 100 pages to be crawled (each with depth 5), after a bit the Store process gets flooded and starts dropping messages:

2020-02-10 16:07:42.374 [debug] "Failed to fetch https://mystays.rwiths.net/r-withs/tfi0020a.do?GCode=mystays&ciDateY=2020&ciDateM=02&ciDateD=10&coDateY=2020&coDateM=02&coDateD=11&s1=0&s2=0&y1=0&y2=0&y3=0&y4=0&room=1&otona=4&hotelNo=38599&dataPattern=PL&cdHeyaShu=t2&planId=4114101&f_lang=ja, reason: checkout_timeout"

The upside of having a Registry is that the store is global and the crawler can be run from multiple machines.
The downside is that this single process becomes a bottleneck for high performance.
Would you be open to using Mnesia? (It's a fast, distributed, in-memory DB.)
If you don't need the distributed part, I would use an ETS table for the store, which should be able to handle more load.

The other solution is to break up the crawling of all those URLs and not send them all at the same time.

Let me know if you are open to this; I'm happy to put up a tentative PR.
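
As a rough illustration of the ETS suggestion (an editorial sketch, not code from this repository; the module, table, and function names are made up), a minimal ETS-backed store could look something like this:

```elixir
defmodule Crawler.Store.ETS do
  @moduledoc """
  Hypothetical sketch of an ETS-backed store. A public, write-concurrent
  table lets workers record URLs directly instead of funnelling every
  write through a single Registry/GenServer process.
  """

  @table :crawler_store_ets

  # Create the named table once, e.g. at application start.
  def init do
    :ets.new(@table, [:set, :public, :named_table, write_concurrency: true])
  end

  # Returns true only the first time a URL is seen, so callers can use it
  # to decide whether to enqueue the page.
  def add(url), do: :ets.insert_new(@table, {url, :pending})

  # Mark a URL as processed.
  def processed(url), do: :ets.update_element(@table, url, {2, :processed})

  # Look up the state of a URL (:pending, :processed, or nil if unknown).
  def find(url) do
    case :ets.lookup(@table, url) do
      [{^url, state}] -> state
      [] -> nil
    end
  end
end
```

(Mnesia would give the same table semantics with optional replication across nodes, at the cost of more setup.)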

@fredwu
Owner

fredwu commented Feb 10, 2020

Hi, if you could issue a PR that would be awesome! 👍

@happysalada
Contributor Author

The actual problem comes from httpoison, the underlying library for making the requests.
The checkout_timeout failure means that the connection pool used for making the requests is being flooded:
edgurgel/httpoison#359
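
For context (an editorial illustration, not code from this repository): hackney's default pool is capped at a fixed number of connections (50 by default), and each request has to check a connection out of that pool within a timeout. Firing far more concurrent requests than the pool can serve makes the later checkouts time out, which HTTPoison reports as :checkout_timeout. Roughly:

```elixir
# Rough way to reproduce the symptom against a placeholder URL: many more
# concurrent requests than the default hackney pool can serve, so some
# connection checkouts time out.
urls = List.duplicate("https://example.com/", 500)

failures =
  urls
  |> Task.async_stream(&HTTPoison.get/1, max_concurrency: 500, timeout: :infinity)
  |> Enum.count(fn
    {:ok, {:error, %HTTPoison.Error{reason: :checkout_timeout}}} -> true
    _ -> false
  end)

IO.puts("checkout_timeout failures: #{failures}")
```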

@happysalada
Contributor Author

Checking how the library works: by using a GenServer.cast in the worker
https://github.com/fredwu/crawler/blob/master/lib/crawler/worker.ex#L20
all the requests are asynchronous, but since the hackney pool size is limited, the workers won't find an available connection and the requests will fail.
The surprising thing here is that the HTTP errors are logged as debug messages. Shouldn't they appear as errors, or at least warnings? (Just wondering.)
The other surprising behavior is that the user has to figure out from the logs the proper rate limiting to employ so that requests don't fail with a connection pool error. Perhaps the httpoison call could be made configurable, so you can pass options to use a particular pool (see the sketch below)?
(Not sure what the ideal approach would be here, or whether you agree with my reasoning.)
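
A hedged sketch of what that could look like (the :crawler_pool name is made up, and this is not the crawler library's existing API): hackney lets you start a dedicated pool with more capacity, and HTTPoison accepts a hackney: [pool: ...] option per request.

```elixir
# Hypothetical: start a dedicated hackney pool with a larger capacity.
# The pool name and the numbers here are arbitrary examples.
:hackney_pool.start_pool(:crawler_pool, timeout: 15_000, max_connections: 200)

# Per request, check connections out of that pool instead of the default one.
url = "https://example.com/"
HTTPoison.get(url, [], hackney: [pool: :crawler_pool])
```

In an application the pool could also be started under a supervisor; hackney exposes :hackney_pool.child_spec/2 for that.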

@fredwu
Owner

fredwu commented Feb 24, 2020

Hi @happysalada, thanks for doing more investigation! To be honest I haven't had a chance to use my library for a while, so I don't remember much off the top of my head. I welcome PR fixes! :)

@happysalada
Contributor Author

I'm doing research on what the best options are to pass to hackney.
I'll let you know if I find something worth improving.
Thanks for your reply.

@fredwu
Owner

fredwu commented Sep 28, 2023

So it's been a few years.... cough

I've just pushed up v1.2.0 to address a memory leak.

Also, there have been some updates in httpoison and hackney too: edgurgel/httpoison#414

I couldn't reproduce this issue so I'm assuming it's resolved. Please feel free to reopen if there's more to discuss. :)

@fredwu fredwu closed this as completed Sep 28, 2023