MarleySpoon: add precautionary check for unexpected API URLs. #1069

jayaddison · 2024-04-19T16:22:30Z

Adds a sense-checking step to ensure that the API URL returned in the MarleySpoon script elements refers to a second-level-domain containing marleyspoon.
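
For illustration only, here is a minimal sketch of this kind of sense-check. It is not the code from this changeset: the helper name `api_url_looks_related` and the example URLs are hypothetical. The "second-level domain" is approximated naively as the label immediately left of the final label, so multi-part suffixes such as `.co.uk` are not handled.

```python
from urllib.parse import urljoin, urlsplit


def api_url_looks_related(page_url: str, api_url: str, brand: str = "marleyspoon") -> bool:
    """Return True if the API host's second-level label contains the brand name.

    Hypothetical helper, written for illustration only.
    """
    # Resolve a relative API URL against the page URL before inspecting the host.
    absolute = urljoin(page_url, api_url)
    host = urlsplit(absolute).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        return False
    # Naive assumption: the label left of the final label is the second-level domain.
    return brand in labels[-2]
```

A brand-name substring test like this is deliberately loose; it only guards against API hosts that look entirely unrelated.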

As far as I know, there's currently no standardized and machine-readable way to declare inter-related ownership of a group of distinct Internet domain names. I could be mistaken though; and if so, perhaps we could use that as a better alternative than checking for the brand name, as found in the scraper name.

jayaddison · 2024-04-19T16:35:35Z

After considering the implications of a comment I wrote at #1064 (comment) I thought it'd be worth checking our existing scrapers for the potential that requests could be made to unexpected domains, and MarleySpoon appeared as a possibility, so this changeset adds a safety measure against requests to API hosts that seem unrelated.

cc @jknndy @hhursev @strangetom for review

recipe_scrapers/marleyspoon.py

jayaddison · 2024-04-19T16:52:55Z

One more note: it should be possible to generalize this check so that it could apply to other scrapers too; I've attempted to write it in a way that would allow for that.

It's only written for MarleySpoon because that's the only place that I found where I felt that this problem could occur; in the other cases where we make multiple requests, as far as I can tell, we always use fully-qualified literal strings (with some allowance for templating) to define the request URL.

In addition, this problem can only currently affect legacy scrapers in the v14 branch of the codebase. That's not intended to be an argument to move to v15! We do lose functionality during that transition. But it helps to narrow the scrapers that should be checked for problems.

strangetom · 2024-04-19T17:50:45Z

Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

jayaddison · 2024-04-22T14:47:31Z

> Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

@strangetom that does appear to be the case, yep. However, I'd be slightly reluctant to hard-code it, given that they've intentionally made it a variable in the page data. There are situations where doing that can allow for load-balancing / migrations / temporary maintenance by sending a portion of traffic to a different API endpoint, and it'd be nice to (safely) continue to respect that if we can.

jayaddison · 2024-04-22T14:52:41Z

Implementing Cross-Origin Resource Sharing adherence in the scraper could be another way to do this, in a more standards-compliant manner. I wasn't able to find any HTTP-client-side Python CORS libraries from a quick search (plenty of server-side ones), but perhaps there are some out there (or it might not be too onerous to implement basic support).
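
As a rough idea of what basic client-side support might involve, the sketch below covers only the simplest part of CORS: comparing the `Access-Control-Allow-Origin` response header against the origin of the recipe page. The function names are hypothetical, and preflight requests, credentials, and the other CORS headers are deliberately ignored, so this is far from a complete implementation.

```python
from typing import Mapping
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Serialize a URL's origin as scheme://host[:port] (userinfo is not handled)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def cors_permits(page_url: str, response_headers: Mapping[str, str]) -> bool:
    """Return True if the response's Access-Control-Allow-Origin admits the page's origin."""
    allowed = None
    for name, value in response_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "access-control-allow-origin":
            allowed = value.strip()
    if allowed is None:
        return False
    return allowed == "*" or allowed == origin_of(page_url)
```

A scraper using this would send an `Origin` header with the API request and discard the response when `cors_permits` returns False; same-origin requests would not need the check at all.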

jayaddison · 2024-04-30T12:05:19Z

Alternatively, perhaps we could check whether the URL found in the JavaScript configuration corresponds to an entry in the SCRAPERS map for the same scraper instance?

Explained alternatively:

1. Input an html and org_url as usual.
2. Map org_url to a ScraperCls from SCRAPERS as usual.
3. If ScraperCls wants to make an additional HTTP request:
   - Store the request URL in next_url.
   - Map next_url to a NextScraperCls from SCRAPERS (as in step 2).
   - Is NextScraperCls the same scraper as ScraperCls?
     - If so, allow the HTTP request to proceed.
     - If not, reject the HTTP request.
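
The steps above can be sketched roughly as follows. This is a standalone illustration under stated assumptions: the `SCRAPERS` dict here is a small stand-in for the library's host-to-class map, the classes are empty placeholders, and `scraper_for` and `may_request` are hypothetical names.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


class MarleySpoon:
    """Placeholder for a scraper class."""


class OtherSite:
    """Placeholder for an unrelated scraper class."""


# Stand-in for the library's SCRAPERS map; these entries are illustrative only.
SCRAPERS: Dict[str, type] = {
    "marleyspoon.com": MarleySpoon,
    "marleyspoon.de": MarleySpoon,
    "othersite.example": OtherSite,
}


def scraper_for(url: str) -> Optional[type]:
    """Map a URL to a scraper class by exact host name, or None if unsupported."""
    host = urlsplit(url).hostname or ""
    return SCRAPERS.get(host)


def may_request(org_url: str, next_url: str) -> bool:
    """Allow the follow-up request only if both URLs map to the same scraper class."""
    scraper_cls = scraper_for(org_url)
    next_scraper_cls = scraper_for(next_url)
    return scraper_cls is not None and next_scraper_cls is scraper_cls
```

With an exact-host lookup like this one, a subdomain such as `api.marleyspoon.com` is rejected unless it has its own entry in the map.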

jayaddison · 2024-04-30T12:46:14Z

recipe_scrapers/marleyspoon.py

+        scraper_name = self.__class__.__name__
+        try:
+            next_url = urljoin(self.url, api_url)
+            host_name = get_host_name(next_url)
+            next_scraper = type(None)
+            # check: api.foo.xx.example, foo.xx.example, xx.example
+            while host_name and host_name.count("."):
+                next_scraper = SCRAPERS.get(host_name)
+                if next_scraper:
+                    break
+                _, host_name = host_name.split(".", 1)
+            if not isinstance(self, next_scraper):
+                msg = f"Attempted to scrape using {next_scraper} from {scraper_name}"
+                raise ValueError(msg)
+        except Exception as e:
+            raise RecipeScrapersExceptions(f"Unexpected API URL: {api_url}") from e


My attempt to translate this code into a natural-language description:

When scraping a website, ensure that any additional page requests are to hosts that belong to the set of domains supported by the scraper and its subclasses.
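
To make the domain-climbing behaviour easier to try outside the scraper, here is a self-contained approximation of the same idea. The `SCRAPERS` dict, the placeholder classes, and the helper names (`candidate_hosts`, `resolve_scraper`, `is_allowed`) are assumptions made for this sketch; `issubclass(scraper_cls, next_cls)` is used to mirror the `isinstance(self, next_scraper)` condition in the diff above.

```python
from typing import Dict, Iterator, Optional
from urllib.parse import urljoin, urlsplit


def candidate_hosts(host_name: str) -> Iterator[str]:
    """Yield host_name, then each parent domain that still contains a dot."""
    while host_name and "." in host_name:
        yield host_name
        # Drop the leftmost label: api.foo.example -> foo.example
        _, host_name = host_name.split(".", 1)


class MarleySpoon:
    """Placeholder for a scraper class."""


class OtherSite:
    """Placeholder for an unrelated scraper class."""


# Stand-in for the library's SCRAPERS map; these entries are illustrative only.
SCRAPERS: Dict[str, type] = {
    "marleyspoon.com": MarleySpoon,
    "marleyspoon.de": MarleySpoon,
    "othersite.example": OtherSite,
}


def resolve_scraper(url: str, scrapers: Dict[str, type]) -> Optional[type]:
    """Return the scraper class for the most specific matching host, if any."""
    host = urlsplit(url).hostname or ""
    for candidate in candidate_hosts(host):
        found = scrapers.get(candidate)
        if found is not None:
            return found
    return None


def is_allowed(scraper_cls: type, page_url: str, api_url: str,
               scrapers: Dict[str, type]) -> bool:
    """Allow the request only if the API host resolves to a compatible scraper."""
    next_url = urljoin(page_url, api_url)  # handles relative API URLs
    next_cls = resolve_scraper(next_url, scrapers)
    return next_cls is not None and issubclass(scraper_cls, next_cls)
```

For example, `is_allowed(MarleySpoon, "https://marleyspoon.de/menu/1", "https://api.marleyspoon.com/recipes/1", SCRAPERS)` resolves `api.marleyspoon.com` to the `marleyspoon.com` entry and permits the request, while an API URL on `api.othersite.example` is rejected.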

jayaddison · 2024-05-03T12:41:30Z

I'd like to include this in the next v14 release, probably early next week, unless anyone has concerns about it. It would restrict the HTTP requests that the MarleySpoon scrapers could make, but only by limiting those to domains we've configured MarleySpoon scraper mappings for.

jayaddison added 4 commits April 19, 2024 17:17

MarleySpoon: add precautionary check for unexpected API URLs.

5518ae2

Fixup: linting: remove unused variable.

06e5bf8

Fixup: linting: use isort to re-order imports.

cf9c059

Fixup: linting: apply pyupgrade (py3.8+) to test module.

c25b0e3

jayaddison added 4 commits April 19, 2024 17:38

MarleySpoon: remove use of variable shadowing that introduce a change-in-behaviour.

5f2a6bd

MarleySpoon: tests: rename test case.

39cc788

MarleySpoon: tests: add coverage relative-URL API host case.

ca2154f

MarleySpoon: tests: brevity: rename 'valid_url' to 'url'.

c24cc7b

jayaddison added 7 commits April 30, 2024 13:15

MarleySpoon: adjustment: use is-same-scraper condition to decide whether a request is valid or not.

eb286cb

MarleySpoon: exception handling: include link from raised-exception to originating-exception.

b06eec9

MarleySpoon: fixup: add missing SCRAPERS import (localised; not ideal, but avoids a circular import).

9c94ee9

MarleySpoon: reduce constraint: allow less-precise matches on partial host domain name.

7561de5

MarleySpoon: linting: adjust code to comply with black code style recommendations / requirements.

2a5e003

MarleySpoon: refactor: adjust domain-climbing logic.

ff02a0c

MarleySpoon: cleanup: remove unused import.

1dfd79b

jayaddison mentioned this pull request May 2, 2024

Updates to AmericasTestKitchen scraper #1116

Merged

jayaddison merged commit 94b2617 into main May 6, 2024
18 checks passed

jayaddison deleted the precaution/validate-marleyspoon-api-domains branch May 6, 2024 10:08
