Gousto.co.uk scraper broken #376

Closed
3 of 5 tasks
frazzyfin opened this issue Apr 30, 2021 · 12 comments
@frazzyfin

Thanks for filing a bug report with us!

If your request is about a website that is not supported, please open a 'new scraper' issue request instead.

To help get the issue fixed, please fill in the information below.

Pre-filing checks

  • I have searched for open issues that report the same problem
  • I have checked that the bug affects the latest version of the library

The URL of the recipe(s) that are not being scraped correctly

The version of Python you're using

Python 3.8.5

The operating system of your environment

Ubuntu

The results you expect to see

After running scraper = scrape_me() on the URL and then calling scraper.title(), I'd expect to see the title of the recipe: Chicken & Stuffing Sarnie With Plum Chutney

The results (including any Python error messages) that you are seeing

>>> scraper.title()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fraser/dev/recipe-scrapers/recipe_scrapers/plugins/exception_handling.py", line 63, in decorated_method_wrapper
    return decorated(self, *args, **kwargs)
  File "/home/fraser/dev/recipe-scrapers/recipe_scrapers/plugins/html_tags_stripper.py", line 74, in decorated_method_wrapper
    decorated_func_result = decorated(self, *args, **kwargs)
  File "/home/fraser/dev/recipe-scrapers/recipe_scrapers/plugins/normalize_string.py", line 33, in decorated_method_wrapper
    return normalize_string(decorated(self, *args, **kwargs))
  File "/home/fraser/dev/recipe-scrapers/recipe_scrapers/plugins/schemaorg_fill.py", line 46, in decorated_method_wrapper
    return decorated(self, *args, **kwargs)
  File "/home/fraser/dev/recipe-scrapers/recipe_scrapers/gousto.py", line 11, in title
    return self.soup.find("h1", {"class": "indivrecipe-title"}).get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
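
The traceback ends in an AttributeError because soup.find() returned None (no h1 with class "indivrecipe-title" exists on the page any more), and .get_text() was then called on that None. As a minimal sketch of one way to turn this into a clearer failure: the helper text_or_raise, the exception ElementNotFound, and the FakeTag stand-in below are all hypothetical names used for illustration, not part of recipe-scrapers or BeautifulSoup.

```python
class ElementNotFound(Exception):
    """Hypothetical exception for illustration; not part of recipe-scrapers."""


def text_or_raise(element, description):
    """Return element.get_text(), or raise a descriptive error if element is None."""
    if element is None:
        raise ElementNotFound(
            f"Could not find {description}; the page layout may have changed"
        )
    return element.get_text()


class FakeTag:
    """Stand-in for a BeautifulSoup Tag so this sketch runs without bs4 installed."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


# Element present: behaves like a plain .get_text() call.
print(text_or_raise(FakeTag("Chicken & Stuffing Sarnie With Plum Chutney"), "recipe title"))

# Element missing (what soup.find() returns when nothing matches): a readable error.
try:
    text_or_raise(None, "recipe title")
except ElementNotFound as exc:
    print(exc)
```

In a scraper, the first argument would be the result of a call such as self.soup.find("h1", {"class": "indivrecipe-title"}). This only improves the error message; it does not make the title available if the site no longer serves that element.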

Can you write Python and would you like to help fix the scraper yourself? We'd be glad for your assistance! We can provide you with guidance and code review in return. If so, tick any of the relevant boxes below:

  • I'd like to try fixing this scraper myself
  • I'd like guidance to help me develop a fix
  • I'd prefer if the recipe-scrapers team try to fix this
@frazzyfin frazzyfin added the bug label Apr 30, 2021
@hhursev hhursev self-assigned this May 1, 2021
hhursev pushed a commit that referenced this issue Oct 10, 2021
* Update test HTML to live site

* Fix title

* Use schema for recipe & update test
@Nelinski

Nelinski commented Dec 6, 2021

Just checking whether this is still on the list to fix, as I'm still having issues with Gousto. Thanks!

@jayaddison
Collaborator

@Nelinski thanks for checking - could you confirm the version of recipe-scrapers you're using, and whether you're seeing the same exception (AttributeError: 'NoneType' object has no attribute 'get_text') or whether there's something else going on too?

@Nelinski

Nelinski commented Dec 7, 2021

I'm trying this via Mealie which uses this scraper. It looks like I'm getting:
AttributeError: 'NoneType' object has no attribute 'get'

Looks like Mealie is using version "13.7.0".

@jayaddison
Collaborator

Ok, great! - any chance you could include an example URL or two? (that'd help replicate the error, and then we can track down the reason the get is failing)

@jayaddison
Collaborator

Hmm, weird.. it looks like Gousto's site may no longer have schema.org JSON in the source; at least that's what I see when browsing one of the recipes myself.

Can anyone else confirm that too? (view source in your preferred browser is probably the easiest way; or by curl'ing or using Python to retrieve the source of one of those URLs)

@Nelinski

Nelinski commented Dec 7, 2021

It looks like they now insert it via JS rather than including it directly in the source (I can't see it there), but it validates OK here:
https://validator.schema.org/#url=https%3A%2F%2Fwww.gousto.co.uk%2Fcookbook%2Fchicken-recipes%2Fchicken-date-tamarind-curry

Edit: When looking at the source via schema.org, relevant snippet below:

</body>
</html>
<!-- Inserted by https://www.gousto.co.uk/cookbook/static/js/5.d02f4471.chunk.js -->
<script type="application/ld+json">
  {
    "@context": "http://schema.org/",
    "@type": "Recipe",
    "name": "Chicken, Date & Tamarind Curry With Kachumber",

@AdityaSoni19031997

AdityaSoni19031997 commented Feb 2, 2022

Just curious, can't we look directly for the "application/ld+json" script while scraping?
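
For reference, a minimal sketch of what pulling the "application/ld+json" block out of a page could look like, using only the Python standard library. The names JsonLdCollector, find_recipes, and SAMPLE_HTML are hypothetical and chosen for illustration; this is not how recipe-scrapers does it internally. The sample markup is modelled on the snippet from the schema.org validator earlier in this thread. Note that this can only work if the script tag is present in the HTML that was actually fetched, which, per the comments above, may not be the case when the tag is inserted by JavaScript.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_recipes(html):
    """Return every JSON-LD object in the HTML whose @type is "Recipe"."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()

    recipes = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the whole page.
        items = data if isinstance(data, list) else [data]
        recipes.extend(
            item for item in items
            if isinstance(item, dict) and item.get("@type") == "Recipe"
        )
    return recipes


SAMPLE_HTML = """
<html><body><h1>Recipe</h1></body></html>
<script type="application/ld+json">
  {
    "@context": "http://schema.org/",
    "@type": "Recipe",
    "name": "Chicken, Date & Tamarind Curry With Kachumber"
  }
</script>
"""

recipes = find_recipes(SAMPLE_HTML)
print(recipes[0]["name"] if recipes else "no Recipe JSON-LD found")
```

An empty result from find_recipes on the HTML returned by a plain HTTP fetch would be consistent with the observation that the JSON-LD is missing from the page source.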

@Nelinski

@PatrickPierce Looks like you created the original scraper for Gousto, any ideas on this one?

@PatrickPierce
Contributor

@PatrickPierce Looks like you created the original scraper for Gousto, any ideas on this one?

Unfortunately I do not. The original scraper has been redesigned to use the schema rather than parsing the HTML. I can confirm that the issue still occurs with 13.20.0 and that the schema validator detects the correct information.

There is an issue with the test, but I do not think that will make the parser fail.

        self.assertEqual(
            "https://test.example.com/", self.harvester_class.canonical_url()
        )

Test URL: https://www.gousto.co.uk/cookbook/pork-recipes/creamy-pork-tagliatelle

@hhursev
Owner

hhursev commented Mar 17, 2022

The problem stems from gousto.co.uk having a "javascript detection" mechanism, which means the HTML is not visible in its entirety when fetched with a simple requests.get() approach. I'll submit an ad-hoc solution this weekend and bump the version.

hhursev added a commit that referenced this issue Mar 19, 2022
@hhursev
Owner

hhursev commented Mar 19, 2022

As of version 13.22.0, gousto.co.uk should be supported again. Let me know in case of any problems, @Nelinski.
