Ideas for developer experience improvements #617

Open · 9 of 12 tasks

jayaddison opened this issue Oct 1, 2022 · 28 comments
Comments

@jayaddison (Collaborator) commented Oct 1, 2022

There are a few places where we could improve the developer experience at the moment. Faster, more consistent, fewer configuration points, more straightforward workflows - that kind of thing.

Some examples from recent experience:

@jayaddison (Collaborator, Author)

Adding one more item: debugging plugin-related issues is tricky at the moment, and I think the reason is that it's not clear/transparent what plugins are enabled for each method call.

They're important because they catch and handle various forms of noisy web-based input across a number of method calls (title, ingredients and instructions in particular), but they're hard to reason about and can produce unexpected behaviour at runtime.

Labelling every method with multiple decorators would look spammy, so I don't think that'd be a great approach. We should also bear in mind that plugins are developer-customizable -- callers may wish to opt in to or out of various post-processing steps. We also shouldn't introduce unnecessary overhead -- but nor should we allow bad/unsafe data in the name of technical minimalism.

Taking HTML unescaping as an example: at the moment I think we HTML unescape twice for fields that we expect may contain HTML content. As far as I can tell, there's one particular site which incorrectly double-escapes HTML entities in their schema.org content, and that's why we do this. Perhaps the default should be to HTML unescape once, but allow decoration with html_unescape(rounds=2) or similar to override that per-site. And if a performance-sensitive caller felt very confident that a particular scraper required no unescaping, they could somehow override method calls to state html_unescape(rounds=0).
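
A minimal sketch of what such a decorator might look like -- html_unescape and its rounds parameter are hypothetical here, not existing library API:

import html
from functools import wraps

def html_unescape(rounds=1):
    # Hypothetical decorator: unescape HTML entities in a method's string result `rounds` times.
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            value = method(self, *args, **kwargs)
            for _ in range(rounds):
                value = html.unescape(value)
            return value
        return wrapper
    return decorator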

Basically I think that the default implementation should be "safe and correct", with the opportunity to skip processing steps for power users (by which I really mean sophisticated, data-heavy users -- callers with enough data to know where and when it makes sense to omit a step for performance reasons).

What this implies, I think, is a pipeline for each individual method. Taken to its logical extreme, perhaps it means that there is no "recipe scraper", but instead a "recipe title scraper", a "recipe instructions scraper", ... - each a defined pipeline for a {website, timerange, conditions} set, with the goal of achieving near-100% (and accurate) coverage of known web content across all websites and timeranges.
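
To make that concrete, a very rough sketch of a per-field pipeline definition (everything here is hypothetical -- it's not how the library works today):

import html

# Hypothetical: "where the data lives" plus the post-processing steps, keyed per site/timerange.
title_pipelines = {
    ("example.com", "2022-"): {
        "locate": lambda soup: soup.find("h1", {"class": "recipe-title"}).get_text(),
        "steps": [str.strip, html.unescape],
    },
}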

I don't think we should attempt that yet - Python classes that wrap entire sites make much more sense in terms of user convenience and developer attention (context-switching). But basically each method here is codifying "here's where we think the remote data provider has put this particular item of information for a given URL", followed by "and here are the steps we wish to apply to convert that into something we believe is safe and accurate for our users".

If we become good enough at that, we can "upstream" any errors and oddities back to the original source and basically create a positive feedback loop that reduces the number of exceptions required (while still ideally detecting them when they occur).

@jayaddison (Collaborator, Author)

A few more notes for a developer guide:

  • It'd be worth including recommendations about what the output of methods -- particularly instructions -- should look like
  • FAQ section?
  • Guidance about exception handling and how to debug scrapers during development

And another idea: should we have a similar code review / maintainer guide?

@bfcarpio (Collaborator) commented Oct 12, 2022

I'm going to add some random, only slightly-thought-out things:

  • Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.
  • Networking code moved to scrape_me: If scrape_me is the officially recommended way of using this package, then why do we put so much code into the scraper constructors? It might help our async friends if they could write a single "networking" adapter and insert it into the factory (see the sketch after this list).
  • Replace requests with urllib: Not really thought out, but we don't use many features other than basic GET.
  • Use poetry: Personal bias and I know there was pushback before. I just don't like pip's version management.
  • Switch to pyproject.toml: Gotta keep up with how the language is evolving and this will clean up our codebase a touch.
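
A sketch of the "networking adapter" idea from the list above -- all names here are illustrative, and the (html, url) constructor shape is an assumption rather than the current API:

from typing import Callable

Retriever = Callable[[str], str]  # url -> html

def make_scraper(url: str, retriever: Retriever, scraper_cls):
    # The adapter owns retrieval (requests, an async wrapper, a message-queue consumer, ...);
    # the scraper class only ever sees the resulting HTML.
    page_html = retriever(url)
    return scraper_cls(page_html, url)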

@jayaddison What's the scraper that incorrectly double-escapes HTML? That might be a good individual issue.

@jayaddison (Collaborator, Author)

Remove schema based scrapers that have no overloads

I mostly agree with this, with a few small notes:

  • This could make incremental mypy progress easier, because we'd have fewer methods to check & update
  • Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers
  • We do still need a hostname-to-scraper mapping somewhere

Networking code moved to scrape_me

Thinking about scraping as a pipeline could help here. There's content retrieval (HTML/WARC/etc.), then parsing, then extraction of metadata, and then mapping the metadata to output format(s) (Python method calls, JSON, ...).

Scraper developers should only really have to care about the extraction part (and only when schema / microdata / foo doesn't already extract what they need). I'm not sure how best to represent that in Python classes. At the moment the retrieval and extraction steps are slightly bundled together.
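
As a sketch of that separation (function names are illustrative; only the extraction stage would be scraper-specific):

from bs4 import BeautifulSoup

def retrieve(url):
    # retrieval: fetch HTML/WARC/etc. -- deliberately left to the caller in this sketch
    raise NotImplementedError

def parse(page_html):
    # parsing: build a queryable document
    return BeautifulSoup(page_html, "html.parser")

def extract_title(soup):
    # extraction: the only part a scraper author should usually need to write
    return soup.find("title").get_text()

def to_output(title):
    # mapping: expose the metadata as Python values, JSON, ...
    return {"title": title}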

Replace requests with urllib

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Use poetry

I'm 50/50 on that. I agree the version management and tracking could be useful. At the moment (after some lxml support issues in #629) I'm more in the mindset of removing dependencies where possible (maybe I should focus on removing the ones that I added, first..)

Switch to pyproject.toml

Agreed, makes sense. Note: we do have a MANIFEST.in file that I think is still used, let's try not to forget about that during migration.

@jayaddison (Collaborator, Author)

What's the scraper that incorrectly double-escapes HTML?

Unfortunately I'm not sure I kept a note of that. From a quick search, I think a few sites are affected (seriouseats, homechef, food52, ...). It's typically in recipe-related element contents rather than the entire page (a developer adding an extra htmlencode(...) for safety, for example).
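
For illustration, this is what double-escaping looks like and why a single unescape pass isn't enough:

import html

raw = "Mac &amp;amp; Cheese"               # double-escaped, as occasionally seen in schema.org data
print(html.unescape(raw))                  # Mac &amp; Cheese  (one pass is not enough)
print(html.unescape(html.unescape(raw)))   # Mac & Cheese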

@bfcarpio (Collaborator) commented Oct 13, 2022

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, put in comments saying to first test with wild mode, or do you mean actually laying down a trail of code that would help someone parse the HTML better?

We do still need a hostname-to-scraper mapping somewhere

What for? If a scraper can entirely be handled via wild_mode then wouldn't the client use AbstractScraper? I've forgotten if that's the correct type that's returned.

At the moment the retrieval and extraction steps are slightly bundled together.

I would think of scrape_me as the default, easy-to-use, batteries-included retrieval implementation. If someone wants to use async code, or to take their input from RabbitMQ, then they can write their own handler and pass it to a dedicated scraper class. I'm assuming here that the data is the entire HTML content of the page, so extraction can proceed as normal.

Yep, sounds reasonable - although I did add responses to our unit test dependencies which may add more work for this :/

Oh, then never mind. It's not that big of a deal.

@jayaddison (Collaborator, Author)

Remove schema based scrapers that have no overloads

Let's try to remember to apply changes like this to the template scraper early in the process -- that way we get ahead of additional incoming scrapers

As in, put in comments saying to first test with wild mode, or do you mean actually laying down a trail of code that would help someone parse the HTML better?

A hypothetical scenario: let's say that a schema-supported scraper becomes nothing more than a class with a host method.

In that case, I don't think we want new pull requests with template-generated scrapers that have a complete set of self.schema.* method calls.

(at the risk of overcomplicating things: perhaps ideally we'd want the generator to attempt to call each schema method, and only include methods for fields that were missing entirely from the input HTML schema and microdata)
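
In other words, a schema-only scraper might reduce to something like the following sketch (it assumes, per the hypothetical scenario above, that the AbstractScraper base supplies schema.org-backed defaults):

from recipe_scrapers._abstract import AbstractScraper

class ExampleSite(AbstractScraper):
    @classmethod
    def host(cls):
        return "example.com"
    # title(), ingredients(), instructions(), ... would all come from the schema.org defaults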

We do still need a hostname-to-scraper mapping somewhere

What for?

A couple of reasons I can think of - potentially resolvable:

  • Unit tests assert on the host() value. We could determine it from the canonical URL (if the HTML contains one), although not in a @classmethod (we'd need the page data for that approach)
  • Backwards compatibility: SCRAPERS is a top-level public export from the module (see the lookup sketch below)
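
For reference, a minimal sketch of how that mapping gets used for lookup (the "www." stripping is deliberately simplified here):

from urllib.parse import urlparse

from recipe_scrapers import SCRAPERS

def scraper_for(url):
    host = urlparse(url).netloc.replace("www.", "")
    return SCRAPERS.get(host)   # None when no scraper is registered for the host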

@jayaddison (Collaborator, Author)

And after a bit more thought, two more reasons for keeping a hostname-to-scraper mapping:

  • For sites that don't include structured recipe metadata on each page, we need a way to discover the relevant scraper to use from an input URL
  • For sites where we want to apply augmentations/corrections to metadata retrieved from the page, then we need a way to override the default wild_mode results - and that also means we need to identify the relevant scraper to use

(those are both fairly important use-cases, I think)

@hhursev (Owner) commented Oct 16, 2022

How about we do the following in '22:

  • remove the "plugins" idea altogether
  • remove v12_settings
  • in case we end up with ~5 configs total, remove settings module altogether. Switch to environment flags (as opposed to the current approach, where an environment variable points to a "settings" file)
  • remove requests. (And always lean towards fewer dependencies going forward)
  • switch to pyproject.toml
  • simplify the issues template (halving it in terms of characters, just to keep a clear aim)

In a separate issue:

  • Move the "networking" code away and allow easy usage of the package with async/proxies/timeouts (we should leave some example code too). No demo of JavaScript/browser-check bypass - just make a [new entity] that does the HTTP requests needed and gets the HTML, then pass that down to our scrapers.

@jayaddison (Collaborator, Author)

Sounds ambitious, and good :)

About the plugins: before removing them, would it make sense to move them onto the AbstractScraper as a default list? People could still override them as necessary per-scraper, or in third-party libraries by subclassing, and I think it'd result in clearer code in this library (as in, fewer lines to read to understand what they're doing).
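
A minimal sketch of that shape (plugin names here are placeholders, not the library's actual plugins):

class SomePlugin:
    pass

class AnotherPlugin:
    pass

class AbstractScraper:
    plugins = [SomePlugin, AnotherPlugin]   # defaults applied to every scraper

class ExampleScraper(AbstractScraper):
    plugins = [AnotherPlugin]               # per-scraper (or third-party subclass) override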

About settings: we only have a small number of them, right?

  • TEST_MODE - maybe we could remove this by refactoring some code
  • META_HTTP_EQUIV - does that need to be a setting? (has it caused problems?)
  • SUPPRESS_EXCEPTIONS - probably makes the library much more usable in production, but makes dev/debugging harder. An environment variable seems good there
  • PLUGINS - handled by the plugins plan

@hhursev (Owner) commented Oct 16, 2022

  • Agree on TEST_MODE
  • If removing META_HTTP_EQUIV doesn't break any tests it shouldn't be a setting
  • Just an environment variable for SUPPRESS_EXCEPTIONS, yes

Seems like the settings module should be removed 😄

Implementing plugins as a default list in AbstractScraper sounds nice to begin with - yep. Let's go down this path.

jayaddison added a commit that referenced this issue Oct 16, 2022: Cleanup task related to #617
@jayaddison (Collaborator, Author)

I'd like to introduce #650 (a migration to use tox), although I'm not totally sure what everyone's opinions might be about it. Feedback appreciated!

@bfcarpio (Collaborator)

I'd like to take a stab at the networking code. I've got a couple of ideas ranging from lazily adding a callback to writing some terrible XFactoryFactory OOP code.

@jayaddison (Collaborator, Author)

in case we end up with ~5 configs total, remove settings module altogether. Switch to environment flags

From attempting some modifications, I'm going to suggest a small change to this part of the plan:

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

The benefit there is that we'll have a single place where settings are defined and read -- avoiding the multiple-config-files situation, and also avoiding multiple os imports appearing around the codebase.
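
A sketch of what that single module could look like (the environment-variable names are illustrative):

# recipe_scrapers/settings.py (sketch)
import os

def _flag(name, default="0"):
    return os.environ.get(name, default) == "1"

SUPPRESS_EXCEPTIONS = _flag("RECIPE_SCRAPERS_SUPPRESS_EXCEPTIONS")
META_HTTP_EQUIV = _flag("RECIPE_SCRAPERS_META_HTTP_EQUIV")
TEST_MODE = _flag("RECIPE_SCRAPERS_TEST_MODE")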

@jayaddison (Collaborator, Author)

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

class Template(AbstractScraper):

    # The self.soup object provides a way to write queries for contents on the page
    # For example: the following retrieves the contents of a 'title' HTML tag
    def title(self):
        return self.soup.find("title").get_text()

    # When a scraper field is accessed, the code in your method always runs first - it's a precise set of instructions provided by you.
    # You can also add generic extractors -- like 'schemaorg' here -- that are used if your code doesn't return a value during scraping.
    @field.schemaorg
    def ingredients(self):
        return self.soup.find(...)

    # Example: since this method is empty ('pass'), the opengraph metadata extractor will always be called
    # If opengraph extraction fails, then schemaorg extraction is called next
    @field.schemaorg
    @field.opengraph
    def image(self):
        pass

@jayaddison (Collaborator, Author)

A pyproject.toml migration is ready for review in #655. After comparing the poetry and setuptools approaches, and learning about the various backwards-compatibility and workflow questions related to them, I think that setuptools is a simpler route for us to take at the moment.

@hhursev (Owner) commented Oct 22, 2022

Let's reduce to exactly one recipe_scrapers.settings module, but have all of the settings within that be read from os.environ.

I'd vote to make it just a settings.py/config.py file then, and remove the RecipeScraperSettings class (the idea behind it was to let users write more complex settings -- dict/list structures etc. -- through a file, and to let settings changes take immediate effect without restarting the program/script using recipe_scrapers). We don't need that. FWIW I'd pick config.py for the naming.

Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

I don't see it as an improvement. In my opinion:

class Template(AbstractScraper):
    # .....
   
    def ingredients(self):
        return self.schema.ingredients()

    def image(self):
        return super().opengraph_image()

is good enough. If the schema/opengraph image is there, it will work. If not, devs will adjust. I see no real downsides.

After comparing the poetry and setuptools approaches

Let's not switch to poetry for now (ever?). I personally don't see much/any gain in our case. Open for discussion (I use poetry in another project I'm part of and see no real gain in using it here).

@jayaddison (Collaborator, Author)

I'm going to take some time out from maintenance/dev work here for a week or two. I feel like I've been commenting a lot and thinking/attempting a lot of stuff, but it's felt kinda disjointed or rushed at times.

Moving some of the larger/breaking changes here into the v15 branch makes sense I think - makes it easier to develop all that without worrying so much about maintaining compatibility. Hopefully in a few weeks I'll feel a bit refreshed to code up some more stuff.

@hhursev (Owner) commented Oct 29, 2022

Hey @jayaddison take as much time as you need. v15 won't be merged in without your approval 😉

You've done a great job in the last month! Thank you very much. The pyproject.toml introduction, the CI speed-up ⭐, the schema.org parser updates, merging incoming PRs in a speedy manner, and addressing so many of the points in this issue. Prototyping different stuff and being prompt with communication on top of all that.

I really respect the work and time you spend on this project! Kudos to eager people like you supporting OSS.

@jayaddison (Collaborator, Author)

remove the "plugins" idea altogether

Maybe there'd be an alternative implementation, but after getting back into the code again recently, I do think that the plugins are useful. #705 looks like a quick fix made possible by the SchemaOrg plugin, for example.

@jayhale commented Jan 2, 2023

Typing: Considering most developers are using IDEs that make use of type-hinting, it would be helpful if the base API were implemented with type signatures throughout.

@jayaddison (Collaborator, Author)

Maybe a controversial opinion: I think that the v15 branch should include a breaking change where instructions() returns type list[str] (instead of a newline-joined str), mirroring ingredients().

Making those two methods more consistent should, I think, make the developer experience easier, and it's also a slightly better caller/user experience.

I won't lie: sometimes it's difficult to extract multi-item instructions from websites. But from experience in #852 recently, I think it's possible. In rare cases where it's not, or is too difficult, we can return a list containing a single item.
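
A before/after sketch of the proposed signature change (example data only; the list-returning instructions() is the proposal, not the current behaviour):

from typing import List

class CurrentShape:
    def instructions(self) -> str:                 # today: newline-joined steps
        return "Preheat the oven.\nMix.\nBake."

class ProposedShape:
    def ingredients(self) -> List[str]:
        return ["1 cup flour", "2 eggs"]

    def instructions(self) -> List[str]:           # proposal: mirrors ingredients()
        return ["Preheat the oven.", "Mix.", "Bake."]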

@jayaddison (Collaborator, Author)

I've pushed commit 0b87a51 to the v15 branch -- it removes the scrape_me interface, meaning that the library doesn't have any network dependencies.

Is that good/bad? Is it a developer experience improvement? I'm not really sure. It's a bit annoying to have to retrieve some HTML from a URL, and then pass both the HTML and the URL for scraping. Not terrible, but compared to one import and one method call, it's kinda involved.
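
Concretely, the two-step usage looks roughly like this (assuming requests for the fetch and the scrape_html(html, org_url=...) entry point discussed later in this thread; the URL is a placeholder):

import requests

from recipe_scrapers import scrape_html

url = "https://example.com/some-recipe"          # placeholder URL
page_html = requests.get(url, timeout=10).text   # retrieval is now the caller's responsibility
scraper = scrape_html(page_html, org_url=url)    # both the HTML and the URL are passed in
print(scraper.title())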

@jayaddison (Collaborator, Author)

Move away "networking" code and allow easy usage of the package with async/proxies/timeout (we should leave an example code too). no demo with javascript/browser check bypass. Just making a [new entity] that does the HTTP requests needed and gets the HTML. -> pass that down to our scrapers.

I've published a release candidate, v15.0.0-rc1, that includes this change. I'll begin using it for RecipeRadar soon so that it gets some production testing.

@jayaddison (Collaborator, Author) commented Nov 30, 2023

Ah, no I haven't - v15.0.0-rc1 is completely network-isolated, but the suggestion there was to include a utility method to do the HTTP retrieval. I'll do that for rc2 (it'll re-add requests as an optional dependency, and I think that's fine).

@jayaddison (Collaborator, Author)

Also: the scrape_html method can probably be simplified further in v15 to remove some of the proxy/timeout logic. That can move into the suggested utility method.

@jayaddison (Collaborator, Author)

* Remove schema based scrapers that have no overloads: They generate no value since they use the default schema implementation, but we have the tests that prove they work. Maybe we can just say "our schema based scraper is tested on the following websites, but tends to work everywhere with an embedded schema" and keep the tests.

@bfcarpio this might be super straightforward now that we use data-driven tests (#944). Would you like to take a look at that? I think a large amount of code could be cleaned up.

@bfcarpio (Collaborator)

I'll admit, I haven't worked on this repo in a long time. I'll be quite busy the next couple months as well. Funny how life challenges your priorities.

Yes, I would like to have this work done and reduce the code, but I can't commit to anything.
