Update Gousto.co.uk scraper. Closes #376 (#511)

Merged 2 commits from gousto-scraper-fix into main on Mar 19, 2022
Conversation

@hhursev (Owner) commented Mar 19, 2022:

No description provided.

@coveralls commented Mar 19, 2022:

Coverage increased (+0.06%) to 95.36% when pulling 7238bbd on gousto-scraper-fix into 2d1175a on main.

@hhursev merged commit e6fe2ae into main on Mar 19, 2022
@hhursev deleted the gousto-scraper-fix branch on March 19, 2022 at 16:19
class GoustoJson(AbstractScraper):
    """
    Ad-hoc solution to https://github.com/hhursev/recipe-scrapers/issues/376
    Let's see if it stands the test of time and reevaluate.
    """
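(For context, a minimal sketch of the multi-request pattern this class represents: deriving an API URL from the page URL and fetching the recipe data as JSON rather than parsing the HTML. The endpoint path and response shape below are illustrative assumptions, not the actual implementation.)

import requests

# Hypothetical sketch only: the API endpoint and JSON shape are assumptions.
def fetch_gousto_recipe_json(page_url, timeout=10):
    # Derive the recipe slug from the last segment of the page URL.
    slug = page_url.rstrip("/").rsplit("/", 1)[-1]
    api_url = f"https://production-api.gousto.co.uk/cmsreadbroker/v1/recipe/{slug}"
    # This is a second network request beyond fetching the page itself,
    # which is what makes it a "multi-request" scraper.
    response = requests.get(api_url, timeout=timeout)
    response.raise_for_status()
    return response.json()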
Collaborator commented on the docstring above:
Possibly time to discuss some of that re-evaluation?

It'd be nice to imagine a future where recipe webpages include relevant metadata on-page for user-agents to collect easily.

What's less clear to me is whether that's already happening as a trend (allowing us to de-prioritize multi-request scrapers) or whether sites might become more evasive over time (meaning that we could help move toward that goal by permitting multi-request scrapers).

...with developer, dependency and ecosystem implications in each case.

@hhursev (Owner, Author) replied:

At this point in time I'm leaning towards an "Offline Use Only" note in README.md and on scrape_me invocation.

Also, updating scrape_html to auto-pick the scraper based on the <link rel="canonical" ...> tag in the HTML.

In short: de-prioritizing multi-request scrapers.

Our package would aim to be good once it gets its hands on valid HTML from the sites listed, not playing it clever by bypassing browser/JavaScript checks or requirements.
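(A minimal sketch of what that auto-picking could look like. The SCRAPERS host-to-scraper mapping and the constructor call are assumptions for illustration, not the package's actual API.)

from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical sketch: dispatch on the canonical URL embedded in the HTML.
# SCRAPERS stands in for a host-to-scraper registry (the name is assumed).
def scrape_html(html, org_url=None):
    link = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
    url = link["href"] if link else org_url
    if url is None:
        raise ValueError('No <link rel="canonical"> found; pass org_url instead.')
    # Normalize the host so it matches the registry keys.
    host = urlparse(url).netloc.removeprefix("www.")
    return SCRAPERS[host](url=url, html=html)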

Collaborator replied:

Ok, yep - reducing the interface to scrape_html only would seem to have a bunch of benefits.

It means we could require/enforce decoded str input, and we'd have fewer dependencies and modules related to networking.

The main drawback I have in mind is that rel="canonical" isn't found on all pages - so perhaps an optional parameter to provide the URL when it isn't available, warning if it differs from a canonical URL that is present in the HTML?
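(That fallback-plus-warning behaviour might look something like this; the function and parameter names here are assumptions, sketching the suggestion rather than an agreed design.)

import warnings

# Hypothetical sketch of the suggested fallback: prefer the canonical URL
# found in the HTML, accept a caller-supplied org_url when it is missing,
# and warn when the two disagree.
def resolve_url(canonical_url, org_url=None):
    if canonical_url is None and org_url is None:
        raise ValueError("HTML has no canonical link; please supply org_url.")
    if canonical_url and org_url and canonical_url != org_url:
        warnings.warn(
            f"org_url {org_url!r} differs from canonical URL {canonical_url!r}; "
            "using the canonical URL."
        )
    return canonical_url or org_url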
