Ideas for developer experience improvements #617

Open · 9 of 12 tasks

jayaddison opened this issue Oct 1, 2022 · 28 comments
Comments

@jayaddison (Collaborator) commented Oct 1, 2022

There are a few places where we could improve the developer experience at the moment. Faster, more consistent, fewer configuration points, more straightforward workflows - that kind of thing.

Some examples from recent experience:

@jayaddison (Collaborator, Author)

Adding one more item: debugging plugin-related issues is tricky at the moment, and I think the reason is that it's not clear/transparent what plugins are enabled for each method call.

They're important because they catch and handle various forms of noisy web-based input across a number of method calls (title, ingredients and instructions in particular), but they're hard to reason about and can produce unexpected behaviour at runtime.

Labelling every method with multiple decorators would look spammy, so I don't think that'd be a great approach. We should also bear in mind that plugins are developer-customizable -- callers may wish to opt in to or out of various post-processing steps. We also shouldn't introduce unnecessary overhead -- but nor should we allow bad/unsafe data in the name of technical minimalism.

Taking HTML unescaping as an example: at the moment I think we HTML unescape twice for fields that we expect may contain HTML content. As far as I can tell, there's one particular site which incorrectly double-escapes HTML entities in their schema.org content, and that's why we do this. Perhaps the default should be to HTML unescape once, but allow decoration with html_unescape(rounds=2) or similar to override that per-site. And if a performance-sensitive caller felt very confident that a particular scraper required no unescaping, they could somehow override method calls to state html_unescape(rounds=0).
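
A minimal sketch of what such a decorator might look like -- html_unescape and its rounds parameter are hypothetical here, not existing library API:

import html
from functools import wraps

def html_unescape(rounds=1):
    # Hypothetical decorator: unescape HTML entities in a method's string result `rounds` times.
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            value = method(self, *args, **kwargs)
            for _ in range(rounds):
                value = html.unescape(value)
            return value
        return wrapper
    return decorator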

Basically I think that the default implementation should be "safe and correct", with the opportunity to skip processing steps for power users (by which I really mean sophisticated, data-heavy users -- callers with enough data to know where and when it makes sense to omit a step for performance reasons).

What this implies, I think, is a pipeline for each individual method. Taken to its logical extreme, perhaps it means that there is no "recipe scraper", but instead a "recipe title scraper", a "recipe instructions scraper", ... - each a defined pipeline for a {website, timerange, conditions} set, with the goal of achieving near-100% (and accurate) coverage of known web content across all websites and timeranges.
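
To make that concrete, a very rough sketch of a per-field pipeline definition (everything here is hypothetical -- it's not how the library works today):

import html

# Hypothetical: "where the data lives" plus the post-processing steps, keyed per site/timerange.
title_pipelines = {
    ("example.com", "2022-"): {
        "locate": lambda soup: soup.find("h1", {"class": "recipe-title"}).get_text(),
        "steps": [str.strip, html.unescape],
    },
}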

I don't think we should attempt that yet - Python classes that wrap entire sites make much more sense in terms of user convenience and developer attention (context-switching). But basically each method here is codifying "here's where we think the remote data provider has put this particular item of information for a given URL", followed by "and here are the steps we wish to apply to convert that into something we believe is safe and accurate for our users".

If we become good enough at that, we can "upstream" any errors and oddities back to the original source and basically create a positive feedback loop that reduces the number of exceptions required (while still ideally detecting them when they occur).

@jayaddison (Collaborator, Author)

A few more notes for a developer guide:

  • It'd be worth including recommendations about what the output of methods -- particularly instructions -- should look like
  • FAQ section?
  • Guidance about exception handling and how to debug scrapers during development

And another idea: should we have a similar code review / maintainer guide?

@bfcarpio (Collaborator) commented Oct 12, 2022

I'm going to add some random, only slightly-thought-out things:

  • Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.
  • Networking code moved to scrape_me: If scrape_me is the officially recommended way of using this package, then why do we put so much code into the scraper constructors? It might help our async friends if they could write a single "networking" adapter and insert it into the factory (see the sketch after this list).
  • Replace requests with urllib: Not really thought out, but we don't use many features other than basic GET.
  • Use poetry: Personal bias and I know there was pushback before. I just don't like pip's version management.
  • Switch to pyproject.toml: Gotta keep up with how the language is evolving and this will clean up our codebase a touch.
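
A sketch of the "networking adapter" idea from the list above -- all names here are illustrative, and the (html, url) constructor shape is an assumption rather than the current API:

from typing import Callable

Retriever = Callable[[str], str]  # url -> html

def make_scraper(url: str, retriever: Retriever, scraper_cls):
    # The adapter owns retrieval (requests, an async wrapper, a message-queue consumer, ...);
    # the scraper class only ever sees the resulting HTML.
    page_html = retriever(url)
    return scraper_cls(page_html, url)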

@jayaddison What's the scraper that incorrectly double-escapes HTML? That might be a good individual issue.

@jayaddison (Collaborator, Author)

Remove schema based scrapers that have no overloads

I mostly agree with this, with a few small notes:

  • This could make incremental mypy progress easier, because we'd have fewer methods to check & update
  • Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers
  • We do still need a hostname-to-scraper mapping somewhere

Networking code moved to scrape_me

Thinking about scraping as a pipeline could help here. There's content retrieval (HTML/WARC/etc.), then parsing, then extraction of metadata, and then mapping the metadata to output format(s) (Python method calls, JSON, ...).

Scraper developers should only really have to care about the extraction part (and only when schema / microdata / foo doesn't already extract what they need). I'm not sure how best to represent that in Python classes. At the moment the retrieval and extraction steps are slightly bundled together.
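
As a sketch of that separation (function names are illustrative; only the extraction stage would be scraper-specific):

from bs4 import BeautifulSoup

def retrieve(url):
    # retrieval: fetch HTML/WARC/etc. -- deliberately left to the caller in this sketch
    raise NotImplementedError

def parse(page_html):
    # parsing: build a queryable document
    return BeautifulSoup(page_html, "html.parser")

def extract_title(soup):
    # extraction: the only part a scraper author should usually need to write
    return soup.find("title").get_text()

def to_output(title):
    # mapping: expose the metadata as Python values, JSON, ...
    return {"title": title}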

Replace requests with urllib

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Use poetry

I'm 50/50 on that. I agree the version management and tracking could be useful. At the moment (after some lxml support issues in #629) I'm more in the mindset of removing dependencies where possible (maybe I should focus on removing the ones that I added, first..)

Switch to pyproject.toml

Agreed, makes sense. Note: we do have a MANIFEST.in file that I think is still used, let's try not to forget about that during migration.

@jayaddison (Collaborator, Author)

What's the scraper that incorrectly double-escapes HTML?

Unfortunately I'm not sure I kept a note of that. From a quick search, I think a few sites are affected (seriouseats, homechef, food52, ...). It's typically in recipe-related element contents rather than the entire page (a developer adding an extra htmlencode(...) for safety, for example).
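
For illustration, this is what double-escaping looks like and why a single unescape pass isn't enough:

import html

raw = "Mac &amp;amp; Cheese"               # double-escaped, as occasionally seen in schema.org data
print(html.unescape(raw))                  # Mac &amp; Cheese  (one pass is not enough)
print(html.unescape(html.unescape(raw)))   # Mac & Cheese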

@bfcarpio (Collaborator) commented Oct 13, 2022

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, put in comments saying to first test with wild mode, or do you mean actually laying down a trail of code that would help someone parse the HTML better?

We do still need a hostname-to-scraper mapping somewhere

What for? If a scraper can entirely be handled via wild_mode then wouldn't the client use AbstractScraper? I've forgotten if that's the correct type that's returned.

At the moment the retrieval and extraction steps are slightly bundled together.

I would think of scrape_me as the default, easy-to-use, batteries-included retrieval implementation. If someone wants to use async code, or to take their input from RabbitMQ, then they can write their own handler and pass it to a dedicated scraper class. I'm assuming here that the data is the entire HTML content of the page, so extraction can proceed as normal.

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Oh, then never mind. It's not that big of a deal.

@jayaddison (Collaborator, Author)

Remove schema based scrapers that have no overloads

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, put in comments saying to first test with wild mode, or do you mean actually laying down a trail of code that would help someone parse the HTML better?

A hypothetical scenario: let's say that a schema-supported scraper becomes nothing more than a class with a host method.

In that case, I don't think we want new pull requests with template-generated scrapers that have a complete set of self.schema.* method calls.

(at the risk of overcomplicating things: perhaps ideally we'd want the generator to attempt to call each schema method, and only include methods for fields that were missing entirely from the input HTML schema and microdata)
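
In other words, a schema-only scraper might reduce to something like the following sketch (it assumes, per the hypothetical scenario above, that the AbstractScraper base supplies schema.org-backed defaults):

from recipe_scrapers._abstract import AbstractScraper

class ExampleSite(AbstractScraper):
    @classmethod
    def host(cls):
        return "example.com"
    # title(), ingredients(), instructions(), ... would all come from the schema.org defaults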

We do still need a hostname-to-scraper mapping somewhere

What for?

A couple of reasons I can think of - potentially resolvable:

  • Unit tests assert on the host() value. We could determine it from the canonical URL (if the HTML contains one), although not in a @classmethod (we'd need the page data for that approach)
  • Backwards compatibility: SCRAPERS is a top-level public export from the module (see the lookup sketch below)
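
For reference, a minimal sketch of how that mapping gets used for lookup (the "www." stripping is deliberately simplified here):

from urllib.parse import urlparse

from recipe_scrapers import SCRAPERS

def scraper_for(url):
    host = urlparse(url).netloc.replace("www.", "")
    return SCRAPERS.get(host)   # None when no scraper is registered for the host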

@jayaddison (Collaborator, Author)

And after a bit more thought, two more reasons for keeping a hostname-to-scraper mapping:

  • For sites that don't include structured recipe metadata on each page, we need a way to discover the relevant scraper to use from an input URL
  • For sites where we want to apply augmentations/corrections to metadata retrieved from the page, then we need a way to override the default wild_mode results - and that also means we need to identify the relevant scraper to use

(those are both fairly important use-cases, I think)

@hhursev (Owner) commented Oct 16, 2022

How about we do the following in '22:

  • remove the "plugins" idea altogether
  • remove v12_settings
  • in case we end up with ~5 configs total, remove settings module altogether. Switch to environment flags (as opposed to the current approach, where an environment variable points to a "settings" file)
  • remove requests. (And always lean towards fewer dependencies going forward)
  • switch to pyproject.toml
  • simplify the issues template (halving it in terms of characters, just to keep a clear aim)

In a separate issue:

  • Move the "networking" code away and allow easy usage of the package with async/proxies/timeouts (we should leave some example code too). No demo of JavaScript/browser-check bypass - just make a [new entity] that does the HTTP requests needed and gets the HTML, then pass that down to our scrapers.

@jayaddison (Collaborator, Author)

Sounds ambitious, and good :)

About the plugins: before removing them, would it make sense to move them onto the AbstractScraper as a default list? People could still override them as necessary per-scraper, or in third-party libraries by subclassing, and I think it'd result in clearer code in this library (as in, fewer lines to read to understand what they're doing).
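
A minimal sketch of that shape (plugin names here are placeholders, not the library's actual plugins):

class SomePlugin:
    pass

class AnotherPlugin:
    pass

class AbstractScraper:
    plugins = [SomePlugin, AnotherPlugin]   # defaults applied to every scraper

class ExampleScraper(AbstractScraper):
    plugins = [AnotherPlugin]               # per-scraper (or third-party subclass) override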

About settings: we only have a small number of them, right?

  • TEST_MODE - maybe we could remove this by refactoring some code
  • META_HTTP_EQUIV - does that need to be a setting? (has it caused problems?)
  • SUPPRESS_EXCEPTIONS - probably makes the library much more usable in production, but makes dev/debugging harder. An environment variable seems good there
  • PLUGINS - handled by the plugins plan

@hhursev (Owner) commented Oct 16, 2022

  • Agree on TEST_MODE
  • If removing META_HTTP_EQUIV doesn't break any tests it shouldn't be a setting
  • Just an environment variable for SUPPRESS_EXCEPTIONS, yes

Seems like the settings module should be removed 😄

Implementing plugins as a default list in AbstractScraper sounds nice to begin with - yep. Let's go down this path.

jayaddison added a commit that referenced this issue Oct 16, 2022: Cleanup task related to #617
@jayaddison (Collaborator, Author)

I'd like to introduce #650 (a migration to use tox), although I'm not totally sure what everyone's opinions might be about it. Feedback appreciated!

@bfcarpio (Collaborator)

I'd like to take a stab at the networking code. I've got a couple of ideas ranging from lazily adding a callback to writing some terrible XFactoryFactory OOP code.

@jayaddison (Collaborator, Author)

in case we end up with ~5 configs total, remove settings module altogether. Switch to environment flags

From attempting some modifications, I'm going to suggest a small change to this part of the plan:

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

The benefit there is that we'll have a single place where settings are defined and read -- avoiding the multiple-config-files situation, and also avoiding multiple os imports appearing around the codebase.
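
A sketch of what that single module could look like (the environment-variable names are illustrative):

# recipe_scrapers/settings.py (sketch)
import os

def _flag(name, default="0"):
    return os.environ.get(name, default) == "1"

SUPPRESS_EXCEPTIONS = _flag("RECIPE_SCRAPERS_SUPPRESS_EXCEPTIONS")
META_HTTP_EQUIV = _flag("RECIPE_SCRAPERS_META_HTTP_EQUIV")
TEST_MODE = _flag("RECIPE_SCRAPERS_TEST_MODE")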

@jayaddison (Collaborator, Author)

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

class Template(AbstractScraper):

    # The self.soup object provides a way to write queries for contents on the page
    # For example: the following retrieves the contents of a 'title' HTML tag
    def title(self):
        return self.soup.find("title").get_text()

    # When a scraper field is accessed, the code in your method always runs first - it's a precise set of instructions provided by you.
    # You can also add generic extractors -- like 'schemaorg' here -- that are used if your code doesn't return a value during scraping.
    @field.schemaorg
    def ingredients(self):
        return self.soup.find(...)

    # Example: since this method is empty ('pass'), the opengraph metadata extractor will always be called
    # If opengraph extraction fails, then schemaorg extraction is called next
    @field.schemaorg
    @field.opengraph
    def image(self):
        pass

@jayaddison (Collaborator, Author)

A pyproject.toml migration is ready for review in #655. After comparing the poetry and setuptools approaches, and learning about the various backwards-compatibility and workflow questions related to them, I think that setuptools is a simpler route for us to take at the moment.

@hhursev (Owner) commented Oct 22, 2022

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

I'd vote to make it just a settings.py/config.py file then, and remove the RecipeScraperSettings class (the idea behind it was to let users write more complex settings -- dict/list structures etc. -- through a file, and to let settings changes take immediate effect without restarting the program/script using recipe_scrapers). We don't need that. FWIW I'd pick config.py for the naming.

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

I don't see it as an improvement. In my opinion:

class Template(AbstractScraper):
    # .....
   
    def ingredients(self):
        return self.schema.ingredients()

    def image(self):
        return super().opengraph_image()

is good enough. If the schema/opengraph image is there, it will work. If not, devs will adjust. I see no real downsides.

After comparing the poetry and setuptools approaches

Let's not switch to poetry for now (ever?). I personally don't see much/any gain in our case. Open for discussion (I use poetry in another project I'm part of and see no real gain in using it here).

@jayaddison (Collaborator, Author)

I'm going to take some time out from maintenance/dev work here for a week or two. I feel like I've been commenting a lot and thinking/attempting a lot of stuff, but it's felt kinda disjointed or rushed at times.

Moving some of the larger/breaking changes here into the v15 branch makes sense I think - makes it easier to develop all that without worrying so much about maintaining compatibility. Hopefully in a few weeks I'll feel a bit refreshed to code up some more stuff.

@hhursev (Owner) commented Oct 29, 2022

Hey @jayaddison take as much time as you need. v15 won't be merged in without your approval 😉

You've done a great job in the last month! Thank you very much. The pyproject.toml introduction, the CI speed-up ⭐, the schema.org parser updates, merging incoming PRs in a speedy manner, and addressing so many of the points in this issue. Prototyping different stuff and being prompt with communication on top of all that.

I really respect the work and time you spend on this project! Kudos to eager people like you supporting OSS.

@jayaddison (Collaborator, Author)

remove the "plugins" idea altogether

Maybe there'd be an alternative implementation, but after getting back into the code again recently, I do think that the plugins are useful. #705 looks like a quick fix made possible by the SchemaOrg plugin, for example.

@jayhale commented Jan 2, 2023

Typing: Considering most developers are using IDEs that make use of type-hinting, it would be helpful if the base API were implemented with type signatures throughout.

@jayaddison (Collaborator, Author)

Maybe a controversial opinion: I think that the v15 branch should include a breaking change where instructions() returns type list[str] (instead of a newline-joined str), mirroring ingredients().

Making those two methods more consistent should, I think, make the developer experience easier, and it's also a slightly better caller/user experience.

I won't lie: sometimes it's difficult to extract multi-item instructions from websites. But from experience in #852 recently, I think it's possible. In rare cases where it's not, or is too difficult, we can return a list containing a single item.
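
A before/after sketch of the proposed signature change (example data only; the list-returning instructions() is the proposal, not the current behaviour):

from typing import List

class CurrentShape:
    def instructions(self) -> str:                 # today: newline-joined steps
        return "Preheat the oven.\nMix.\nBake."

class ProposedShape:
    def ingredients(self) -> List[str]:
        return ["1 cup flour", "2 eggs"]

    def instructions(self) -> List[str]:           # proposal: mirrors ingredients()
        return ["Preheat the oven.", "Mix.", "Bake."]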

@jayaddison (Collaborator, Author)

I've pushed commit 0b87a51 to the v15 branch -- it removes the scrape_me interface, meaning that the library doesn't have any network dependencies.

Is that good/bad? Is it a developer experience improvement? I'm not really sure. It's a bit annoying to have to retrieve some HTML from a URL, and then pass both the HTML and the URL for scraping. Not terrible, but compared to one import and one method call, it's kinda involved.
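
Concretely, the two-step usage looks roughly like this (assuming requests for the fetch and the scrape_html(html, org_url=...) entry point discussed later in this thread; the URL is a placeholder):

import requests

from recipe_scrapers import scrape_html

url = "https://example.com/some-recipe"          # placeholder URL
page_html = requests.get(url, timeout=10).text   # retrieval is now the caller's responsibility
scraper = scrape_html(page_html, org_url=url)    # both the HTML and the URL are passed in
print(scraper.title())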

@jayaddison (Collaborator, Author)

Move away "networking" code and allow easy usage of the package with async/proxies/timeout (we should leave an example code too). no demo with javascript/browser check bypass. Just making a [new entity] that does the HTTP requests needed and gets the HTML. -> pass that down to our scrapers.

I've published a release candidate, v15.0.0-rc1, that includes this change. I'll begin using it for RecipeRadar soon so that it gets some production testing.

@jayaddison (Collaborator, Author) commented Nov 30, 2023

Ah, no I haven't - v15.0.0-rc1 is completely network-isolated, but the suggestion there was to include a utility method to do the HTTP retrieval. I'll do that for rc2 (it'll re-add requests as an optional dependency, and I think that's fine).

@jayaddison (Collaborator, Author)

Also: the scrape_html method can probably be simplified further in v15 to remove some of the proxy/timeout logic. That can move into the suggested utility method.

@jayaddison (Collaborator, Author)

* Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.

@bfcarpio this might be super straightforward now that we use data-driven tests (#944). Would you like to take a look at that? I think a large amount of code could be cleaned up.

@bfcarpio (Collaborator)

I'll admit, I haven't worked on this repo in a long time. I'll be quite busy the next couple months as well. Funny how life challenges your priorities.

Yes, I would like to have this work done and reduce the code, but I can't commit to anything.
