
Add support for kptncook recipes #503

Merged
2 commits merged into hhursev:main from gloriousDan:main on Mar 2, 2022

Conversation

gloriousDan (Contributor) commented on Feb 28, 2022

Add support for kptncook recipes

example: http://mobile.kptncook.com/recipe/pinterest/Low-Carb-Tarte-Flamb%C3%A9e-with-Serrano-Ham-%26-Cream-Cheese/315c3c32?lang=en

On the kptncook website the recipes are only provided in a redacted form (most of the instructions are missing) so that users are incentivized to install the app.
However, after analysing the app I found some API endpoints from which the full recipe data can be requested as JSON.
(My write-up of the results can be found here: https://github.com/gloriousDan/kptncook-api-reverse-engineering)

I hope that this is still within the scope of this scraping framework, since my implementation goes quite a long way beyond simple HTML scraping.

Since this is my first PR ever I'd love to get feedback on my implementation and the merging process :)

jayaddison merged commit badb54c into hhursev:main on Mar 2, 2022
coveralls commented on Mar 2, 2022

Coverage Status

Coverage decreased (-0.2%) to 95.377% when pulling c48ec72 on gloriousDan:main into 55881c7 on hhursev:main.

jayaddison (Collaborator)

Thank you @gloriousDan! This is now published and available in recipe-scrapers v13.19.0.

jayaddison (Collaborator)

Hey @gloriousDan - apologies, I missed your note asking for feedback on the code and the pull request process when I reviewed this. (I was looking at it again in relation to a discussion at #559 (comment) about places where we use "" (empty string) vs None in Python.)

The lack of review comments should, hopefully, be taken as a good sign :) It meant I didn't spot anything unusual or problematic pattern-wise. That said, perhaps I wasn't fully focused, since I didn't read your pull request description completely and didn't notice the empty-string return value.

I'll aim to take another review pass soon and provide more detailed feedback.

gloriousDan (Contributor, Author)

That's how I interpreted it at the time. Feel free to leave comments if you notice anything and I'll change it in another PR.

Review thread on the following diff excerpt:

ingredient.get("measure"),
ingredient["ingredient"]["title"],
]
if self.lang not in IMPERIAL_LANGUAGES
Collaborator

Noting that the API returns both quantity (metric) and quantityUS (U.S. imperial) units in search responses.

Contributor (Author)

But when scraping I need to decide which quantities to include. Or is there a way to include both the metric quantities and the US quantities in the scrape output?
Otherwise, all languages listed in IMPERIAL_LANGUAGES get the imperial measurements.
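
For illustration, a minimal sketch of that selection (the helper name and the contents of IMPERIAL_LANGUAGES below are hypothetical; only the field names come from the KptnCook API data discussed in this thread):

# Hypothetical sketch, not the PR's actual code.
IMPERIAL_LANGUAGES = {"en"}  # illustrative contents only

def select_amount(ingredient: dict, lang: str):
    # Use the US fields for languages treated as "imperial", metric otherwise.
    if lang in IMPERIAL_LANGUAGES:
        return ingredient.get("quantityUS"), ingredient.get("measureUS")
    return ingredient.get("quantity"), ingredient.get("measure")

# e.g. select_amount({"quantity": 100.0, "measure": "g",
#                     "quantityUS": 0.5, "measureUS": "cup(s)"}, "de")
# returns (100.0, "g"); with lang="en" it returns (0.5, "cup(s)").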

Collaborator

It's OK, I don't have a problem with this approach - it's just a subtle detail that I suppose could affect or surprise some users of the library in future, and so I'd like to take a bit of a pause to consider it.

Roughly speaking, it seems like users of the library who want to make use of quantity-related information from ingredients will either be doing that on a small scale (a personal recipe collection, for example) - where having units that are familiar to the reader makes sense -- or on a large scale, where they'll likely already have to choose whether to narrow down to recipes that use a single unit system, or whether to handle ingredient descriptions across various unit systems.

For the small-scale use case: we don't really know the user's locale in the context of this library, but using the best-guess language of the rendered webpage (self.lang) is probably the best approximation we have for it at the moment.

It might be preferable to return the units in the 'native' form they were written by the original recipe author -- because that's likely the best representation of what they intended, before any unit conversion takes place -- although it's unclear to me from the microdata whether we can determine what the original unit system was.

One other consideration is that we could attempt to make life easier for consumers of the library by indicating whether we think that the units are imperial/metric. In this particular situation, we do know that -- but for most scrapers, we don't, and I don't think we should consider that our expertise, so despite considering that possibility, I don't think it makes sense to include a unit-system indicator in ingredient results.

Quite a lot of text for a fairly small item; roughly speaking I think that this is fine as it is. If it becomes clear over time that we can improve the code by making changes here, then we can do that at a later date.

Contributor (Author)

I thought about this as well when first implementing it. Unfortunately I'm not familiar with what measurement types are preferred in which countries.

Kptncook makes choosing the right quantity and measure quite complicated because it provides many different quantities and measurements:

"quantity": 100.0,
"measure": "g",
"metricQuantity": 100.0,
"metricMeasure": "g",
"quantityUS": 0.5,
"measureUS": "cup(s)",
"imperialQuantity": 0.5,
"imperialMeasure": "cup(s)",
"quantityUSProd": 3.5,
"measureUSProd": "oz",
"imperialProductQuantity": 3.5,
"imperialProductMeasure": "oz",

Thus, using the "imperial" measurements does not mean that units like "oz" are used, but rather that cups and tbsp/tsp are used instead of grams and liters.
From some experimentation, I found that some countries, e.g. Spain, also prefer cups, but then again Portugal doesn't.
This makes it really hard/tedious to decide what the default value should be, so I just set it for English.
Since cups, tbsp, etc. are understandable even in countries which use metric units, and that's also true the other way around, I think this is alright.

Maybe we could implement some logic which cross-references the scraped HTML to check which units are used.
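
As a very rough illustration, such cross-referencing could look something like the hypothetical helper below (not part of this PR; it assumes the rendered page text is available, e.g. via soup.get_text(), and simply checks whether the metric quantity and measure appear in it):

def page_appears_metric(page_text: str, ingredient: dict) -> bool:
    # Guess the unit system shown in the HTML by looking for the metric
    # quantity/measure pair (e.g. "100 g") in the rendered page text.
    quantity = ingredient.get("metricQuantity")
    measure = ingredient.get("metricMeasure")
    if quantity is None or not measure:
        return False
    return f"{quantity:g} {measure}" in page_text

In practice the number formatting in the HTML (decimal separators, fractions) would likely differ, which is part of the fragility discussed in the next comment.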

Collaborator

Ah, even more complicated than I realized :)

The idea about cross-referencing the HTML to determine the native format is innovative, although it adds a bit of code complexity and probably makes it slightly more likely that the scraper could break (if the HTML form of the pages changes). Where possible I think simpler approaches are usually best.

From the recipes you've seen so far, are metricQuantity and metricMeasure always available?
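
If they turned out not to be, a hypothetical defensive fallback (again, not part of the PR) could prefer the metric fields and fall back to the plain quantity/measure fields:

def metric_amount(ingredient: dict):
    # Prefer the explicit metric fields; fall back to quantity/measure
    # when the metric-specific keys are missing.
    quantity = ingredient.get("metricQuantity", ingredient.get("quantity"))
    measure = ingredient.get("metricMeasure", ingredient.get("measure"))
    return quantity, measure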

Contributor (Author)

I just had a look in the app and it exposes the following settings:
[Screenshot of the app's language and region settings]

The selectable languages are:
German, English, Spanish, French and Portuguese
Since this list of languages is quite small, I could just look at how the recipes are displayed in each language and set the default accordingly.

The region setting includes a lot of different regions, and I'm not sure how it affects the measurements and quantities, or how to set it in the API.

Setting defaults for scraping with recipe-scrapers is quite difficult, since, especially for the English language, people may expect very different results:

  • In the US: Fahrenheit and imperial units
  • In GB: °C and metric/other imperial units (I'm not sure what the correct units here would be)
  • People from other countries who just want the recipes in English, or whose language isn't available, might want other mixes of settings.

Is there a way to let the user specify any of those settings?

jayaddison (Collaborator) commented on Jul 5, 2022

Thank you for the analysis!

One bit of preamble: the combination of a country (location) and a language is called a 'locale'. Although a lot of software uses en-US (English, USA) by default, there are lots of different locales, with subtle or not-so-subtle cultural differences (pt-BR, pt-PT, en-GB, de-CH, it-CH, fr-CH, ...). Those can imply (but not guarantee) all kinds of things, including user-preferred unit systems.

Measurement Systems
The way I think about it at the moment, there are usually "ground truth" quantities for ingredients within a recipe (they're the quantities as-written in the original source recipe text by the author) and then there is a separate "desired presentation" for a given recipe at render-time (the user's experiential preferences: text/audio/video, localization, unit conversion, formatting, ...).

Is there a way to let the user specify any of those settings?

Currently, no - but to explain in more detail:

Generally recipe-scrapers attempts to provide a structured API to retrieve aspects of the ground truth, but we don't try to do much more than that. Essentially: "provide us with the URL for a recipe, and we'll do our best to return the metadata fields you're interested in".

While it's probably true that most recipe viewer applications want to request and handle the user's locale and display preferences (and likely that individual apps cater to specific requirements), I don't think this library is the place to do that.

Situation for this scraper
kptncook may be an unusual case because it isn't clear what the ground truth quantities-as-originally-written were - there are multiple, using different unit systems -- so we can't (easily?) determine the "canonical" unit system for each recipe.

(note: technically an HTML recipe server could have rendered the source webpage differently depending on the request locale and/or other factors (in other words: ground truth can, in fact, be relative to the user). my opinion here glosses over that, but in practice it could be very important for some sites (and therefore scrapers). as far as I'm aware, we don't have any data about how often each website we support does that, either historically or currently)

Other relevant cases
Running a naive, case-insensitive search for the presence of both metric and imperial across the scraper test HTML (grep -il metric `grep -ril imperial tests/test_data/` ) returns five results out of ~230 scrapers (ls -l tests/test_data/*.testhtml|wc -l). So this multi-unit-system question does occur elsewhere, and not infrequently, but initially it doesn't appear to be an overwhelming priority to handle.
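
A rough stdlib-only Python equivalent of that shell search (assuming the tests/test_data/ layout and the *.testhtml naming mentioned above) would be:

from pathlib import Path

matches = []
for path in Path("tests/test_data").rglob("*.testhtml"):
    text = path.read_text(errors="ignore").lower()
    # Case-insensitive check that both unit-system keywords appear in the page.
    if "imperial" in text and "metric" in text:
        matches.append(path)

print(f"{len(matches)} test pages mention both 'imperial' and 'metric'")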

(note: we as maintainers don't currently have easy-to-use statistics on how often each individual scraper runs - so that estimate is based on giving equal weight to each scraper)

Opinion
As a result my personal opinion is that -- unless it's possible to determine the author's original written quantities reliably -- we should pick a simple and reliable algorithm to select the unit system within kptncook and use that.

I think your existing implementation does that: I might suggest that always-pick-metric would be even simpler, but I don't have a strong enough opinion to want to change the code.

That's not data-or-reality-perfect, I realize - it's an attempt to minimize the implementation and maintenance cost across the range of scrapers while continuing to provide structured data. If this project becomes more important and gains more traction and engineering expertise, I'm happy to be overruled by a more widely-applicable approach.

Collaborator

(also happy to be overruled if I've just misunderstood the situation and/or there are better, easier approaches possible; what's presented is my best attempt to figure out the options and workable, usually-applicable guidelines. I'm frequently wrong about things :))

Contributor (Author)

That's a very good summary of the problems and a detailed explanation. Thank you very much.
I agree with you here, and because of the difficulty of implementing such a cross-referencing mechanism and the maintenance overhead if the site changes, I also prefer simpler mechanisms. As for setting metric as the default: as a German I also prefer metric over imperial, but I can't really speak for what the expected behaviour would be for, e.g., people from the UK. I'm open to changing it to "always metric", but that might upset some people from the US because they're unable to measure in grams. So I would just leave it as is for now, and if someone complains we can still adjust the behaviour.

Collaborator

Thanks @gloriousDan - we're in agreement :) (as someone from the UK, I personally prefer metric - and think it's generally preferred for most measurements with the exception of distance for some reason, but we'll see what people think about those here over the next few months). Thanks for providing the scraper implementation and being involved in the discussion about it.

jayaddison (Collaborator) left a comment

A few details and notes from a subsequent, more detailed review - in particular I'd be keen to see whether we can remove the potential for duplicate HTTP requests on sharing links, if my analysis there is correct. The scraper continues to look fine to me otherwise 👍

(I'd personally hope the world can standardize on metric everywhere at some point... but that's probably beyond the scope of this pull request 😉)

gloriousDan (Contributor, Author)

Thanks for the thorough review, I'll have a look at your notes soon and open an additional PR.
