
Add Ploetzblog + configuration for unit tests #1100

Open · wants to merge 3 commits into main

Conversation

mlduff (Contributor) commented Apr 26, 2024

This PR adds support for Ploetzblog. There was an issue when adding this scraper (as detailed here, with discussion following): the assumption that ingredient groups always add up to form the ingredients list is incorrect, because it doesn't account for differing amounts of the same ingredient.

I have basically gone for the "exempt this scraper from that particular test" approach. To do this, I could have converted it to a legacy test and implemented all the tests myself; however, I figured that allowing for configuration in non-legacy unit tests could be beneficial, so I added an _options entry to the unit test case which optionally specifies certain overrides for the test executor. This is done in such a way that someone would have to deliberately disable the test (a rough sketch of the idea follows below).

Resolves #1062
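
For illustration, here is a rough sketch of how a test executor might honour such an _options entry. The option name, the surrounding JSON structure, and the class below are assumptions for the example, not necessarily what this PR implements:

```python
import unittest

# Hypothetical parsed contents of a scraper's test-data JSON file;
# "_options" sits alongside the expected values, and the option name
# below is illustrative only.
CASE = {
    "title": "Some recipe",
    "ingredients": ["500g flour", "300g water"],
    "_options": {"skip_ingredient_group_check": True},
}

class ScraperTest(unittest.TestCase):
    def test_consistent_ingredients_lists(self):
        # The executor only skips when it finds a deliberately-set flag.
        if CASE.get("_options", {}).get("skip_ingredient_group_check"):
            self.skipTest("disabled via _options in the test data")
        # ...otherwise compare ingredients against ingredient_groups...
```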

[Resolved review discussion on tests/__init__.py (outdated)]
jayaddison (Collaborator):

I'm in two minds about this; I agree that selectively bypassing the ingredient-consistency test could make sense for some scraper tests, and I think that the use of a key named _options in the JSON is a neat way to place that somewhere that won't conflict with the test fields. However, I'm not sure whether we want to support per-scraper-test options in general -- and I don't want us to add generic support for that if it makes it more likely to happen. It's an anti-pattern, basically: the JSON files are intended to be test data that we compare against, and adding configuration/conditional flags to that seems non-ideal.

mlduff (Contributor, Author) commented Apr 29, 2024

@jayaddison Those are some good points. Could another option be to have a completely separate file in each scraper folder dedicated to configuration of the test cases? In this way the configuration would apply across all tests for that scraper (if there are multiple JSON/HTML files) and it would be separated from the data. Or is this still an anti-pattern?

jayaddison (Collaborator):

Those are all possibilities, yep! As mentioned in the issue thread, the natural way to handle this would be by marking/overriding the relevant test case and skipping it in Python code. Re-implementing that for our custom data format -- whether in an option, a config file, or elsewhere -- doesn't seem great, because it could be the start of a path down which we end up wanting to re-implement other existing features of unittest / pytest / etc.
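
A minimal sketch of that "skip it in Python code" approach, assuming a shared base class for the data-driven tests (the class names here are illustrative):

```python
import unittest

class DataDrivenScraperTest(unittest.TestCase):
    """Stand-in for the generic data-driven test class."""

    def test_consistent_ingredients_lists(self):
        pass  # generic check that the groups add up to the flat list

class PloetzblogTest(DataDrivenScraperTest):
    # Override the single inapplicable test and mark it as skipped;
    # everything else is inherited unchanged.
    @unittest.skip("duplicate ingredients with differing amounts")
    def test_consistent_ingredients_lists(self):
        pass
```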

mlduff (Contributor, Author) commented Apr 29, 2024

@jayaddison Those are good points again - if we wanted to avoid re-inventing unittest, then I think the only way would be to take Ploetzblog out of the data-driven approach and convert it to a legacy test (I could perhaps still have it load from the custom data format?). My preference would still be to go down the configuration path, as it allows classes to leverage the ease-of-use that the custom data format brings, but I am happy to do whatever you wish.

jayaddison (Collaborator):

Thinking about it: for the scraper tests, there are currently two categories:

  • Data driven tests - most of them.
  • Legacy tests - tests for scrapers that use API/multi-HTTP-requests and may not be supported in future (v15 is moving towards a model where the client requests the HTML, so this library could eventually have no network/online dependencies itself).

What you're suggesting might be a third category - it's a test case for a scraper where for some reason the data-driven tests aren't applicable.

We probably wouldn't need to add those very often, and you could copy it more-or-less from the existing legacy tests (you probably wouldn't need the expected_requests though).

If we do that, perhaps we should try to find a better name than 'legacy tests' for the tests-that-use-multiple-HTTP-requests (and a better name than that, too).

mlduff (Contributor, Author) commented Apr 30, 2024

@jayaddison Is there some way to have a base class that represents all the data-driven test cases? This could then be inherited by the test class (Ploetzblog) and overridden with custom tests. My goal is to have as little duplication as possible when there are minor deviations from the expected format.

This isn't something I know how to do off the top of my head, but if you're happy with that approach I could do some more research into the specifics. Otherwise, I would be happy to just create another legacy test (where I would override the test_consistent_ingredients_lists).
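
If the base-class route were taken, it might look roughly like this, with the JSON loading shared and only the deviating test replaced. Everything named here (class names, file path, fields) is hypothetical:

```python
import json
import unittest
from pathlib import Path

class DataDrivenTestBase(unittest.TestCase):
    """Hypothetical base class: loads the custom data format once and
    defines the generic assertions against it."""

    data_path = None  # subclasses point this at their JSON file

    @classmethod
    def setUpClass(cls):
        if cls.data_path is None:
            raise unittest.SkipTest("abstract base class")
        cls.expected = json.loads(Path(cls.data_path).read_text())

    def test_consistent_ingredients_lists(self):
        pass  # the group-vs-flat-list check that Ploetzblog trips over

class PloetzblogTest(DataDrivenTestBase):
    data_path = "tests/test_data/ploetzblog.json"  # assumed location

    def test_consistent_ingredients_lists(self):
        # Looser, scraper-specific check replacing the generic one.
        self.assertTrue(self.expected["ingredients"])
```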

jayaddison (Collaborator):

@mlduff I think that's possible, yep, but I have to check if I understand the idea: is what you're suggesting somehow similar to the way that we dynamically create a scraper class for some websites when the wild_mode flag is enabled?

jayaddison (Collaborator):

Also, something to bear in mind: a benefit of data-driven tests is that it's possible for people to read the .json files and to check -- without having to know code for any specific scraper -- whether the data matches the corresponding test HTML file. Making it possible to override that on a per-scraper basis would begin to make it harder to reason about what the .json represents -- because the person would first need to check for existence of, and if necessary read, the override behaviours. I don't think that'd be a good process or system design.

mlduff (Contributor, Author) commented May 4, 2024

@jayaddison I have refactored a lot of the data-driven test code into separate functions inside a newly created file, data_utils.py. I then use these functions to define my custom test, under the new data_driven folder (more than happy to think of new names).

This allows for the flexibility of creating custom tests where the assumptions in the generic test are wrong, whilst still encouraging the test case to be written predominantly in JSON (greatly increasing how easy it is to understand, as you mentioned).
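
As a rough illustration of that shape (function names assumed, not necessarily those in the PR), the helpers in data_utils.py might look like:

```python
import json
from pathlib import Path

# data_utils.py (sketch): helpers extracted from the generic data-driven
# runner so that hand-written tests can reuse them.

def load_expected(path):
    """Load the expected-values JSON for one scraper test case."""
    return json.loads(Path(path).read_text())

def assert_fields_equal(testcase, scraper, expected, fields):
    """Compare selected scraper outputs against the expected JSON."""
    for field in fields:
        testcase.assertEqual(expected[field], getattr(scraper, field)())
```

A Ploetzblog-specific test could then call load_expected and assert_fields_equal for the bulk of the checks, and implement only the ingredient-group comparison by hand.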

mlduff requested a review from jayaddison, May 5, 2024 01:23
jayaddison (Collaborator):

@mlduff I'd like to try to keep things as simple as possible for our developers and code reviewers, and after weighing up some of the options you've presented, it feels like the cleanest approach -- if it's possible! -- would be to parse the quantities when testing each scraper and compare ingredients against ingredient_groups.

I've added a comment about that to #1062. I'll try to find some time for a sample implementation of it this week, but don't yet know whether I'll get around to that.
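
A sketch of that quantity-aware comparison, assuming amounts can be parsed from a leading token of each ingredient string and that group objects expose an ingredients list; the parsing below is deliberately naive and purely illustrative:

```python
from collections import defaultdict

def parse_amount(entry):
    """Very naive split of strings like '500g flour' into (amount, name)."""
    head, _, rest = entry.partition(" ")
    try:
        return float(head.rstrip("gml")), rest
    except ValueError:
        return None, entry

def totals(entries):
    """Sum parsed amounts per ingredient name."""
    sums = defaultdict(float)
    for entry in entries:
        amount, name = parse_amount(entry)
        sums[name] += amount or 0.0
    return dict(sums)

def groups_match_ingredients(ingredients, ingredient_groups):
    # Compare per-ingredient totals rather than raw strings, so that one
    # ingredient appearing in two groups with different amounts can still
    # reconcile with a single combined entry in the flat list.
    grouped = [i for group in ingredient_groups for i in group.ingredients]
    return totals(grouped) == totals(ingredients)
```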

mlduff (Contributor, Author) commented May 7, 2024

@jayaddison If you're happy for me to, I'd be glad to have a go at implementing it; if you'd rather I leave it for you, that's fine too, just let me know.

mlduff (Contributor, Author) commented May 9, 2024

Worth pointing out that there is some existing discussion on this topic, @jayaddison: #733

Merging this pull request may close: [Add] Plötzblog Scraper (#1062)