High performance: Store gets flooded when too many pages are crawled #28

Closed
happysalada opened this issue Feb 10, 2020 · 6 comments

@happysalada
Contributor

If I try to launch 100 pages to be crawled (each with depth 5), after a bit the Store process gets flooded and starts dropping messages:

2020-02-10 16:07:42.374 [debug] "Failed to fetch https://mystays.rwiths.net/r-withs/tfi0020a.do?GCode=mystays&ciDateY=2020&ciDateM=02&ciDateD=10&coDateY=2020&coDateM=02&coDateD=11&s1=0&s2=0&y1=0&y2=0&y3=0&y4=0&room=1&otona=4&hotelNo=38599&dataPattern=PL&cdHeyaShu=t2&planId=4114101&f_lang=ja, reason: checkout_timeout"

The upside of having a Registry is that the store is global and the crawler can be run from multiple machines.
The downside is that this single process becomes a bottleneck for high performance.
Would you be open to using Mnesia? (It's a fast, distributed, in-memory DB.)
If you don't need the distributed part, I would use an ETS table for the store, which should be able to handle more load.

The other solution is to break up the crawling of all those URLs and not send them all at the same time.

Let me know if you are open to this; I'm happy to put up a tentative PR.
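
As a rough illustration of the ETS suggestion (an editorial sketch, not code from this repository; the module, table, and function names are made up), a minimal ETS-backed store could look something like this:

```elixir
defmodule Crawler.Store.ETS do
  @moduledoc """
  Hypothetical sketch of an ETS-backed store. A public, write-concurrent
  table lets workers record URLs directly instead of funnelling every
  write through a single Registry/GenServer process.
  """

  @table :crawler_store_ets

  # Create the named table once, e.g. at application start.
  def init do
    :ets.new(@table, [:set, :public, :named_table, write_concurrency: true])
  end

  # Returns true only the first time a URL is seen, so callers can use it
  # to decide whether to enqueue the page.
  def add(url), do: :ets.insert_new(@table, {url, :pending})

  # Mark a URL as processed.
  def processed(url), do: :ets.update_element(@table, url, {2, :processed})

  # Look up the state of a URL (:pending, :processed, or nil if unknown).
  def find(url) do
    case :ets.lookup(@table, url) do
      [{^url, state}] -> state
      [] -> nil
    end
  end
end
```

(Mnesia would give the same table semantics with optional replication across nodes, at the cost of more setup.)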

@fredwu
Owner

fredwu commented Feb 10, 2020

Hi, if you could issue a PR that would be awesome! 👍

@happysalada
Contributor Author

The actual problem comes from httpoison, the underlying library for making the requests.
The checkout_timeout failure means that the connection pool used for making the requests is being flooded:
edgurgel/httpoison#359
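
For context (an editorial illustration, not code from this repository): hackney's default pool is capped at a fixed number of connections (50 by default), and each request has to check a connection out of that pool within a timeout. Firing far more concurrent requests than the pool can serve makes the later checkouts time out, which HTTPoison reports as :checkout_timeout. Roughly:

```elixir
# Rough way to reproduce the symptom against a placeholder URL: many more
# concurrent requests than the default hackney pool can serve, so some
# connection checkouts time out.
urls = List.duplicate("https://example.com/", 500)

failures =
  urls
  |> Task.async_stream(&HTTPoison.get/1, max_concurrency: 500, timeout: :infinity)
  |> Enum.count(fn
    {:ok, {:error, %HTTPoison.Error{reason: :checkout_timeout}}} -> true
    _ -> false
  end)

IO.puts("checkout_timeout failures: #{failures}")
```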

@happysalada
Contributor Author

Checking how the library works: by using a GenServer.cast in the worker
https://github.com/fredwu/crawler/blob/master/lib/crawler/worker.ex#L20
all the requests are asynchronous, but since the hackney pool size is limited, the workers won't find an available connection and the requests will fail.
The surprising thing here is that the HTTP errors are logged as debug messages. Shouldn't they appear as errors, or at least warnings? (Just wondering.)
The other surprising behavior is that the user has to figure out from the logs the proper rate limiting to employ so that requests don't fail with a connection pool error. Perhaps the httpoison call could be made configurable, so you can pass options to use a particular pool (see the sketch below)?
(Not sure what the ideal approach would be here, or whether you agree with my reasoning.)
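
A hedged sketch of what that could look like (the :crawler_pool name is made up, and this is not the crawler library's existing API): hackney lets you start a dedicated pool with more capacity, and HTTPoison accepts a hackney: [pool: ...] option per request.

```elixir
# Hypothetical: start a dedicated hackney pool with a larger capacity.
# The pool name and the numbers here are arbitrary examples.
:hackney_pool.start_pool(:crawler_pool, timeout: 15_000, max_connections: 200)

# Per request, check connections out of that pool instead of the default one.
url = "https://example.com/"
HTTPoison.get(url, [], hackney: [pool: :crawler_pool])
```

In an application the pool could also be started under a supervisor; hackney exposes :hackney_pool.child_spec/2 for that.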

@fredwu
Owner

fredwu commented Feb 24, 2020

Hi @happysalada, thanks for doing more investigation! To be honest I haven't had a chance to use my library for a while, so I don't remember much off the top of my head. I welcome PR fixes! :)

@happysalada
Contributor Author

I'm doing research on what the best options are to pass to hackney.
I'll let you know if I find something worth improving.
Thanks for your reply.

@fredwu
Owner

fredwu commented Sep 28, 2023

So it's been a few years.... cough

I've just pushed up v1.2.0 to address a memory leak.

Also, there have been some updates in httpoison and hackney too: edgurgel/httpoison#414

I couldn't reproduce this issue so I'm assuming it's resolved. Please feel free to reopen if there's more to discuss. :)

@fredwu fredwu closed this as completed Sep 28, 2023