[WhoScored] Unable to select English locale #440

Open
Gibranium opened this issue Dec 14, 2023 · 12 comments
Labels
WhoScored Issue or pull request related to the WhoScored scraper

Comments

@Gibranium

Sorry to bring this up again, but I have not been able to scrape WhoScored for months.
To explain in full: since April everything had been working properly. Then I switched from an Intel MacBook Pro to an M2 Mac mini, reinstalled Tor via Homebrew, and set up Anaconda with a dedicated environment that contains only soccerdata and its dependencies. Since then I have not been able to scrape a single file from WhoScored, while FBref scraping, at least, works flawlessly. I have tried everything that was recommended in previously opened issues about this problem. The only thing I have not tried so far is a VPN, because I would really prefer not to spend money right now to make it work. If anyone can help me get it working, feel free to contact me personally on Twitter: @gualanodavide.
Thanks a lot to anyone who can help.

Screenshot 2023-12-14 alle 16 31 12
Screenshot 2023-12-14 alle 16 31 22

@probberechts
Owner

Can you try to run the code in non-headless mode and check what happens in your browser window? Does it say that your IP is blocked or show a captcha?

import soccerdata as sd
ws = sd.WhoScored("ENG-Premier League", "2223", headless=False, no_cache=True)
leagues = ws.read_leagues()

If that's the case it's a problem with the undetected-chromedriver library, not with soccerdata. You can test with:

import undetected_chromedriver as uc
driver = uc.Chrome(headless=False, use_subprocess=False)
driver.get('https://www.whoscored.com/')

You might find a solution in the issue tracker of the undetected-chromedriver library.

@Gibranium
Author

I did not have any problem in my browser window: it opened WhoScored and did not ask for a captcha. But since I'm in Italy, I think the page does not contain the same names that the code looks for, so it fails. Am I right, and how can I solve it?

Screenshot 2023-12-14 alle 20 28 21

@probberechts
Owner

Oh, but now you have a different error. You got past the error in your first comment. Can you share the "tiers.json" file in "/Users/davidegualona/soccerdata/data/WhoScored"?
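To check quickly whether the cached file holds localized names, something like the following might help. It is only a sketch: the layout of "tiers.json" is not shown in this thread, so the helper simply walks every string in the JSON, and the sample data and the function names are made up for illustration.

```python
import json
from pathlib import Path


def iter_strings(node):
    """Yield every string value found anywhere in a nested JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def mentions(node, name):
    """Return True if any string in the structure is exactly `name`."""
    return any(text == name for text in iter_strings(node))


def load_json(path):
    """Read a JSON file such as the cached tiers.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up sample; the real file layout is an assumption.
sample = json.loads(
    '[{"name": "Inghilterra", "tournaments": [{"name": "Premier League"}]}]'
)
print(mentions(sample, "England"))      # False
print(mentions(sample, "Inghilterra"))  # True
```

If `mentions(load_json(<path to tiers.json>), "England")` comes back `False`, the cache was probably written from a non-English page.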

@Gibranium
Author

Yes, of course.

Here it is:

tiers.json

@probberechts
Owner

You were right, the country names are in Italian in your "tiers.json" file. One option is to add the Italian names in the config/league_dict.json file (see https://soccerdata.readthedocs.io/en/latest/howto/custom-leagues.html). For example,

{
  "ENG-Premier League": {
    "WhoScored": "Inghilterra - Premier League"
  }
}

You might experience more problems in other parts of the code though.
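If you prefer to write that entry from Python instead of editing the file by hand, a minimal sketch could look like this. The helper name `add_league_alias` is hypothetical, and the file location in the trailing comment is an assumption inferred from the data directory mentioned above, so check the linked how-to for the actual path.

```python
import json
from pathlib import Path


def add_league_alias(config_path, league_id, source, name):
    """Add or update one source-specific league name in a league_dict.json file."""
    config_path = Path(config_path)
    leagues = {}
    if config_path.exists():
        # Keep whatever entries are already in the file.
        leagues = json.loads(config_path.read_text(encoding="utf-8"))
    leagues.setdefault(league_id, {})[source] = name
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(
        json.dumps(leagues, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return leagues


# Assumed location (not confirmed in this thread):
# config_file = Path.home() / "soccerdata" / "config" / "league_dict.json"
# add_league_alias(config_file, "ENG-Premier League", "WhoScored",
#                  "Inghilterra - Premier League")
```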

Alternatively, you could try to set the default language of your browser to English or configure selenium accordingly (see https://stackoverflow.com/questions/55150118/trouble-modifying-the-language-option-in-selenium-python-bindings).

Let me know what works.

@Gibranium
Author

I tried the first option, but the code immediately runs into another problem, so I don't think it is viable. As for the other two: I changed the language of Chrome and Safari, but that does not solve it, because the search page result is already in Italian. As for the adjustment from your link, I don't think I have the skills to put together a working fix. I tried with some help from ChatGPT, but after an hour we still could not find a solution. Apparently this:

driver = webdriver.Chrome(chrome_options=options)

needs to be this:

driver = webdriver.Chrome(options=options)

in order to apply the options, but I still don't know how to make the driver work with the scraping part. Nonetheless, ChatGPT had me try this:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Set up the WebDriver with language preference
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'intl.accept_languages': 'en,en_US'})
driver = webdriver.Chrome(options=options)

# Navigate to the WhoScored page using Selenium
driver.get("https://www.whoscored.com/")  # Replace with the actual URL

# Extract the HTML content after the page has loaded
html_content = driver.page_source

# Continue with requests and BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

I wanted to see whether this could work, so that I could later merge the soccerdata part with this adjustment. I found that even though I can make the page load in English, after a second WhoScored refreshes itself and loads in Italian.
So either I am not good enough to pull this off, or I need to get a NordVPN subscription, am I right?

@probberechts
Owner

You can also try to redirect to the English version by simulating a click on the language menu at the top left.

import soccerdata as sd
ws = sd.WhoScored("ENG-Premier League", "2223", headless=False, no_cache=True)
ws._driver.get("https://www.whoscored.com/")
ws._driver.execute_script("location = 'https://whoscored.com/'")
leagues = ws.read_leagues()

@Gibranium
Author

It does what it is supposed to do, but WhoScored still refreshes itself and loads in Italian.

@probberechts
Owner

Is there any way in which you can switch to English when browsing the website manually?

@Gibranium
Author

There's a toggle that lets you choose the language, but if I set it to EN it automatically switches back to IT.

@Gibranium
Author

Anyway, I've resolved it by subscribing to NordVPN; given the effort spent so far, it seems worth the money right now.
I'd ask just one more thing (then you can close the issue if you need to): regarding "[WhoScored] Ignore cached events file if empty #420", has the improvement already been added to soccerdata, or should we write the enhancement ourselves? In that case, where should I do it? Thank you very much for all the help.

@probberechts
Owner

Ok, great! If the locale is hard-coded based on IP location I think the only possible fixes are indeed translating some parts of the implementation or using a VPN.
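As a rough illustration of the translation route, a localized "Country - League" name could be mapped back to English before it is matched. This is only a sketch and not how soccerdata currently works: the lookup table and the function name are hypothetical, only the Italian example comes from this thread, and the English form "England - Premier League" is assumed.

```python
# Hypothetical lookup table; only the Italian entry comes from this thread.
COUNTRY_TO_ENGLISH = {"Inghilterra": "England"}


def to_english_league_name(localized):
    """Translate the country part of a 'Country - League' name, if known."""
    country, sep, league = localized.partition(" - ")
    if not sep:
        # No ' - ' separator: leave the name untouched.
        return localized
    return f"{COUNTRY_TO_ENGLISH.get(country, country)} - {league}"


print(to_english_league_name("Inghilterra - Premier League"))
# England - Premier League
```

Every supported locale would need its own entries, which is why a VPN is the less fragile workaround.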

#420 is not yet released. If you can't wait for the next release, you can install the latest build from test.pypi.

@probberechts added the "WhoScored" label (Issue or pull request related to the WhoScored scraper) on Jan 1, 2024
@probberechts changed the title from "Whoscored problem" to "[WhoScored] Unable to select English locale" on Jan 1, 2024