Sporadic re-attempts that initiate new webdriver instances eventually cause performance issues #396

Open
TimelessUsername opened this issue Oct 5, 2023 · 2 comments
Labels
performance (Performance), question (Further information is requested), WhoScored (Issue or pull request related to the WhoScored scraper)

Comments

@TimelessUsername

Hi,

I'm not 100% sure about all of this yet, and further investigation is required. At least when running multiple concurrent scraping instances, the re-attempts that initiate new webdriver instances will eventually (most probably) leave some "ghost" instances of Google Chrome running in the background, which borderline freeze the PC with 100% CPU usage. It might be related to running the code with headless=False, as WhoScored currently requires, or to something else entirely. It is probably not caused by running multiple concurrent processes, but that is where it eventually becomes very apparent. I will post more information once I have investigated further, but if anyone else has had these issues, at least there is a note of them now. This might not be a soccerdata issue at all (or not only), but I will also try to work out a fix. Fixing the issue manually every few hours is easy if you just kill the processes, but long scraping sessions currently end with a nearly frozen PC for me.

@probberechts
Owner

I just would like to point out a few things here.

when running multiple concurrent scraping instances ...

Running multiple concurrent scraping instances is not supported. Moreover, I am a strong advocate of scraping responsibly. Therefore, I do my best to respect each website's scraping policies. For example, FBref only allows up to 20 requests per minute. SoccerData respects this by implementing a delay between requests. If you run multiple concurrent instances, you are no longer respecting this limit. Overloading a site also makes the user experience worse for anyone using that site (think: slow response times), and it gives all forms of web scraping a bad reputation.
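As a rough illustration of the delay-between-requests idea, here is a minimal sketch of a limiter that spaces calls at least 60 / 20 = 3 seconds apart. It is not the code SoccerData uses; the class name `MinIntervalLimiter` and its parameters are hypothetical and chosen only for this example.

```python
import time


class MinIntervalLimiter:
    """Hypothetical helper: keep consecutive calls at least `min_interval` seconds apart."""

    def __init__(self, max_requests, per_seconds, clock=time.monotonic, sleep=time.sleep):
        # 20 requests per 60 seconds means one request every 3 seconds.
        self.min_interval = per_seconds / max_requests
        self._clock = clock  # injectable so the limiter can be tested without real waiting
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = MinIntervalLimiter(max_requests=20, per_seconds=60)
# Call limiter.wait() immediately before each request.
```

A limiter like this only knows about the process it lives in. Two scrapers running side by side each wait 3 seconds on their own, so together they can send about twice the allowed number of requests.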

... every so many hours ...

European law allows scraping web data as long as (a) you don’t scrape a ‘substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database’ and you don’t re-use it (meaning basically selling or publishing it); or (b) the scraping falls under the text and data mining (TDM) exception; or (c) you’ve received an appropriate license. If you scrape data for multiple hours, you are probably violating the first clause.

, the re-attempts that cause new webdriver instances to be initiated will eventually (most probably) cause some "ghost" instances of google chrome to run on the background

A webdriver instance is always properly closed before a new one is initialized, so I do not really see where these ghost instances would originate from.

if hasattr(self, "_driver"):
    self._driver.quit()
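One case the two-line check above does not cover by itself is `quit()` raising an exception. The following is a minimal sketch of a reset that reports such a failure and still starts the replacement driver. `DriverHolder` and `make_driver` are hypothetical names used for illustration; this is not the SoccerData implementation.

```python
class DriverHolder:
    """Hypothetical owner of a single webdriver-like object."""

    def __init__(self, make_driver):
        # make_driver: any zero-argument callable returning an object with quit().
        self._make_driver = make_driver

    def reset_driver(self):
        """Quit the current driver (if any), then create and return a new one."""
        old = getattr(self, "_driver", None)
        if old is not None:
            try:
                old.quit()
            except Exception as exc:
                # A failed quit() is a candidate source of a leftover browser
                # process, so report it instead of hiding it.
                print(f"quit() failed, browser may still be running: {exc!r}")
        self._driver = self._make_driver()
        return self._driver
```

Logging the failed `quit()` calls during a long session would show whether they line up with the moments the ghost instances appear.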

@TimelessUsername
Author

TimelessUsername commented Oct 14, 2023

I'm quite confident I have not done anything illegal. That aside, I have not been able to reliably reproduce the issue, but I can confirm that it is very much present with a single instance. More info will come when I'm able to pinpoint the issue better.

Edit: Because of the above, it is really confusing how exactly it happens, but it might be related to the code erroring out, or to the program going through a list of years, for example. These ghost instances often seem to pop up while loading things purely from memory.
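If the code erroring out is the trigger, one way to rule it out in a test script is to tie the driver's lifetime to a `with` block, so that `quit()` runs even when scraping raises. This is only a sketch under that assumption; `managed_driver` and `make_driver` are hypothetical names, not part of soccerdata.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver and always call quit() on it, even if the body raises."""
    # make_driver: any zero-argument callable returning an object with quit().
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()
```

If ghost instances still appear when every driver is created through a wrapper like this, the error path is probably not the cause.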

@probberechts added the question, performance, and WhoScored labels on Jan 1, 2024