Ideas for developer experience improvements #617
Comments
Adding one more item: debugging plugin-related issues is tricky at the moment, and I think the reason is that it's not clear/transparent which plugins are enabled for each method call. They're important because they catch and handle various forms of noisy web-based input for various method calls.

Labelling every method with multiple decorators would look spammy, so I don't think that'd be a great approach. We should also bear in mind that plugins are developer-customizable -- callers may wish to opt in or out of various post-processing steps. And we shouldn't introduce unnecessary overhead -- but nor should we allow bad/unsafe data on the basis of technical minimalism.

Taking HTML unescaping as an example: at the moment I think we HTML-unescape twice for fields that we expect may contain HTML content. As far as I can tell, there's one particular site which incorrectly double-escapes HTML entities in their schema.org content, and that's why we do this. Perhaps the default should be to HTML-unescape once, but allow decoration with …

Basically, I think the default implementation should be "safe and correct", but with the opportunity to skip processing steps for power users (by which I mean sophisticated power-data-users, really -- sites with enough data to know where and when it makes sense to omit a step for performance reasons).

What this implies, I think, is a pipeline for each individual method. Perhaps, taken to its logical extreme, it means that there is no "recipe scraper", but instead a "recipe title scraper", a "recipe instructions scraper", ... - each a defined pipeline for a {website, timerange, conditions} set, with the goal of achieving near-100% (and accurate) coverage of known web content across all websites and timeranges. I don't think we should attempt that yet - Python classes that wrap entire sites make much more sense in terms of user convenience and developer attention (context-switching).

But basically each method here is codifying "here's where we think the remote data provider has put this particular item of information for a given URL", followed by "and here are the steps we wish to apply to convert that into something that we believe is safe and accurate for our users". If we become good enough at that, we can "upstream" any errors and oddities back to the original source and basically create a positive feedback loop that reduces the number of exceptions required (while still ideally detecting them when they occur).
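(To make the opt-out idea concrete -- a minimal sketch of a default-on post-processing step with a per-scraper escape hatch; `html_unescape_once` and `skip_postprocessing` are hypothetical names, not the library's API:)

```python
import functools
import html

def html_unescape_once(func):
    # Hypothetical post-processing step: unescape HTML entities exactly once
    # by default, with an explicit opt-out for callers who trust their data.
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        value = func(self, *args, **kwargs)
        if getattr(self, "skip_postprocessing", False):
            return value  # power-user opt-out: skip the safety step
        return html.unescape(value)
    return wrapper

class ExampleScraper:
    def __init__(self, soup, skip_postprocessing=False):
        self.soup = soup
        self.skip_postprocessing = skip_postprocessing

    @html_unescape_once
    def title(self):
        return self.soup.find("title").get_text()
```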
A few more notes for a developer guide:
And another idea: should we have a similar code review / maintainer guide?
I'm going to add some assorted, only-slightly-thought-out things:
@jayaddison What's the scraper that incorrectly double-escapes HTML? That might be a good individual issue.
I mostly agree with this, with a few small notes:
Thinking about scraping as a pipeline could help here. There's content retrieval (HTML/WARC/etc), then parsing, then extraction of metadata, and then mapping the metadata to output format(s) (python method calls, JSON, ...). Scraper developers should only really have to care about the extraction part (and only when schema / microdata / foo doesn't already extract what they need). How we most easily represent that in Python classes I'm not sure. At the moment the retrieval and extraction steps are slightly bundled together.
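(A rough sketch of that staged decomposition -- the names here are illustrative, not the current API:)

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class ScrapePipeline:
    # Each stage is swappable: retrieval could read from HTTP, WARC, or disk;
    # scraper developers would mostly only customize the `extract` stage.
    retrieve: Callable[[str], str]   # URL -> raw content (HTML/WARC/...)
    parse: Callable[[str], Any]      # raw content -> parsed document
    extract: Callable[[Any], Dict]   # parsed document -> metadata
    present: Callable[[Dict], Dict]  # metadata -> output format(s)

    def run(self, url: str) -> Dict:
        return self.present(self.extract(self.parse(self.retrieve(url))))
```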
Yep, sounds reasonable - although I did add …
I'm 50/50 on that. I agree the version management and tracking could be useful. At the moment (after some …)
Agreed, makes sense. Note: we do have a …
Unfortunately I'm not sure I kept a note of that. From a quick search, I think a few sites are affected (…)
As in, put in comments suggesting to first test with wild mode? Or do you mean actually including a trail of code that would help someone parse the HTML better?
What for? If a scraper can entirely be handled via …
I would think of …
Oh, then never mind. It's not that big of a deal.
A hypothetical scenario: let's say that a schema-supported scraper becomes nothing more than a class with a …

In that case, I don't think we want new pull requests with template-generated scrapers that have a complete set of …

(At the risk of overcomplicating things: perhaps ideally we'd want the generator to attempt to call each schema method, and only include methods for fields that were missing entirely from the input HTML's schema and microdata.)
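(A sketch of that generator idea -- the field list and the probing convention are assumptions here; "probing" just means calling each method on the schema parser and treating an exception as a missing field:)

```python
# Hypothetical generator helper: probe each schema-backed field and report
# which ones the page's schema/microdata could not provide, so that the
# generated template only includes stubs for those fields.
FIELDS = ["title", "total_time", "yields", "ingredients", "instructions", "image"]

def fields_needing_custom_code(schema_parser):
    missing = []
    for field in FIELDS:
        try:
            getattr(schema_parser, field)()  # assumed: raises when absent
        except Exception:
            missing.append(field)
    return missing
```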
A couple of reasons I can think of - potentially resolvable:
And after a bit more thought, two more reasons for keeping a hostname-to-scraper mapping:
(those are both fairly important use-cases, I think)
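(For reference, the mapping itself can stay very simple -- a sketch, with illustrative names:)

```python
from urllib.parse import urlparse

class ExampleScraper:  # placeholder scraper class
    def __init__(self, url):
        self.url = url

# hostname-to-scraper registry: one lookup table for the whole library
SCRAPERS = {
    "example.com": ExampleScraper,
    "www.example.com": ExampleScraper,
}

def scraper_for(url):
    host = urlparse(url).netloc
    if host not in SCRAPERS:
        raise ValueError(f"no scraper registered for {host}")
    return SCRAPERS[host](url)
```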
How about we do the following in '22:
In a separate issue:
Sounds ambitious, and good :)

About the plugins: before removing them, would it make sense to move them onto the …?

About settings: we only have a small number of them, right?
Seems like the settings module should be removed 😄 Implementing plugins as a default list in …
I'd like to take a stab at the networking code. I've got a couple of ideas, ranging from lazily adding a callback to writing some terrible …
From attempting some modifications, I'm going to suggest a small change to this part of the plan: let's reduce to exactly one …

The benefit there is that we'll have a single place where settings are defined and read -- avoiding the multiple-config-files situation, and also avoiding multiple …
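(A sketch of the single-settings-module idea; the setting names are placeholders, not the library's actual configuration keys:)

```python
# settings.py -- the one place where configuration is defined and read.
SUPPRESS_EXCEPTIONS = False  # placeholder setting name
PLUGINS = []  # the default plugin list can live alongside the other settings

def update(overrides):
    # apply caller-provided overrides onto this module's globals
    globals().update(overrides)
```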
Would the following be an improvement for developer experience, or would it be a misguided attempt to make the code more Pythonic?

```python
class Template(AbstractScraper):
    # The self.soup object provides a way to write queries for contents on the page.
    # For example: the following retrieves the contents of a 'title' HTML tag.
    def title(self):
        return self.soup.find("title").get_text()

    # When a scraper field is accessed, the code in your method always runs first -
    # it's a precise set of instructions provided by you.
    # You can also add generic extractors -- like 'schemaorg' here -- that are used
    # if your code doesn't return a value during scraping.
    @field.schemaorg
    def ingredients(self):
        return self.soup.find(...)

    # Example: since this method is empty ('pass'), the opengraph metadata extractor
    # will always be called. If opengraph extraction fails, then schemaorg extraction
    # is called next.
    @field.schemaorg
    @field.opengraph
    def image(self):
        pass
```
I'd vote to make it just a settings.py/config.py file then, and remove the …
I don't see it as an improvement. In my opinion:
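(presumably a plain, schema-delegating method -- the original snippet is reconstructed here as an assumption:)

```python
class Template(AbstractScraper):
    def image(self):
        # assumed reconstruction: delegate directly to the schema.org parser
        return self.schema.image()
```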
is good enough. If the schema/opengraph image is there, it will work. If not, devs will adjust. I see no real downsides.
Let's not switch to poetry for now (ever?). I personally don't see much/any gain in our case. Open for discussion (I use poetry in another project I'm part of, and see no real gain from using it here).
I'm going to take some time out from maintenance/dev work here for a week or two. I feel like I've been commenting a lot and thinking about/attempting a lot of stuff, but it's felt kinda disjointed or rushed at times. Moving some of the larger/breaking changes here into the …
Hey @jayaddison, take as much time as you need. You've done a great job in the last month! Thank you very much. The pyproject.toml introduction, the CI speed-up ⭐, the schema.org parser updates, merging incoming PRs in a speedy manner, and addressing so many of the points in this issue. Prototyping different stuff and being prompt with communication on top of all that. I really respect the work and time you spend on this project! Kudos to eager people like you supporting OSS.
Maybe there'd be an alternative implementation, but after getting back into the code again recently, I do think that the plugins are useful. #705 looks like a quick fix made possible by the …
Typing: considering that most developers use IDEs that make use of type hinting, it would be helpful if the base API were implemented with type signatures throughout.
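(For example -- an assumed subset of the base API, shown with signatures added:)

```python
from typing import List, Optional

class AbstractScraper:
    # Illustrative subset of the base API with type signatures added.
    def title(self) -> str: ...
    def total_time(self) -> int: ...
    def yields(self) -> str: ...
    def ingredients(self) -> List[str]: ...
    def instructions(self) -> str: ...
    def image(self) -> Optional[str]: ...
```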
Maybe a controversial opinion: I think that the …

Making those two methods more equivalent should, I think, make the developer experience easier, and I also think it's a slightly better caller/user experience.

I won't lie: sometimes it's difficult to extract multi-item instructions from websites. But from experience in #852 recently, I think it's possible. In rare cases where it's not, or is too difficult, we can return a list containing a single item.
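(Assuming the two methods in question are `ingredients` and `instructions`, a sketch of list-shaped extraction for both; the CSS selectors are hypothetical:)

```python
class Template(AbstractScraper):
    def ingredients(self):
        # already list-shaped: one entry per ingredient
        return [el.get_text() for el in self.soup.select("li.ingredient")]

    def instructions(self):
        # proposed: list-shaped too, mirroring ingredients(); a site with a
        # single blob of instruction text would yield a one-item list
        steps = [el.get_text() for el in self.soup.select("li.instruction")]
        return steps or [self.soup.select_one("div.instructions").get_text()]
```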
I've pushed commit 0b87a51 to the …

Is that good/bad? Is it a developer experience improvement? I'm not really sure. It's a bit annoying to have to retrieve some HTML from a URL, and then pass both the HTML and the URL for scraping. Not terrible, but compared to one import and one method call, it's kinda involved.
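(The caller flow would look roughly like this -- `scrape_html` is the entrypoint name assumed here, and the URL and User-Agent are placeholders:)

```python
import requests
from recipe_scrapers import scrape_html  # assumed entrypoint name

url = "https://example.com/recipe"  # placeholder URL
html = requests.get(url, headers={"User-Agent": "example-agent"}).text

scraper = scrape_html(html, org_url=url)
print(scraper.title())
```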
I've published a release candidate, v15.0.0-rc1 that includes this change. I'll begin using that for RecipeRadar soon so that it gets some production testing.
Ah, no I haven't - v15.0.0-rc1 is completely network-isolated, but the suggestion there was to include a utility method to do the HTTP retrieval. I'll do that for …
@bfcarpio this might be super straightforward now that we use data-driven tests (#944). Would you like to take a look at that? I think a large amount of code could be cleaned up.
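(A sketch of the data-driven shape -- the file layout and the `scrape_to_dict` helper are hypothetical:)

```python
import json
import pathlib
import unittest

class DataDrivenTests(unittest.TestCase):
    # Hypothetical layout: tests/data/<site>.testhtml paired with
    # tests/data/<site>.json containing the expected scraper output.
    def test_expected_outputs(self):
        for expected_path in pathlib.Path("tests/data").glob("*.json"):
            html_path = expected_path.with_suffix(".testhtml")
            expected = json.loads(expected_path.read_text())
            actual = scrape_to_dict(html_path.read_text())  # hypothetical helper
            self.assertEqual(expected, actual)
```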
I'll admit, I haven't worked on this repo in a long time, and I'll be quite busy over the next couple of months as well. Funny how life challenges your priorities. Yes, I would like to have this work done and to reduce the code, but I can't commit to anything.
There are a few places where we might be able to spruce up and improve the developer experience at the moment. Faster, more consistent, fewer configuration points, more straightforward workflows - that kind of stuff.
Some examples from recent experience:
- tox to provide a consistent unit test and linting approach (motivation: added .venv to flake8 ignored directories #615) (status: done in Migration: use tox to run unit tests and linting #650)
- pytest: do we still need it? (status: removed in Cleanup: remove pytest and online testing mode #659)
- run_tests.py: do we still need it? (status: removed in Remove unittest-parallel dependency #660)
- Use the same method order in template scraper and template unit tests (motivation: see Add scraper for smulweb.nl #673 (comment)) (this was a misunderstanding - Python's unittest framework sorts test case methods by name before running them)
- sphinx-lint for the README file?
- mypy: type-check method signatures on subclasses of AbstractScraper (this is most of them!) - depends on untyped defs on derived classes python/mypy#3903
- … schema.org scraper would return -- so that we can clean up and keep code minimal
- … normalize_string easier to use, and reducing repetition of it throughout the codebase, or making those code paths more Pythonic