Initial functionality. #4
Conversation
looking good so far!
```python
from duplicate_url_discarder.processor import Processor
...
class Component:  # TODO
```
@BurnzZ I recall we discussed it before, but I don't recall the answer :) Why is this component a middleware, not a request fingerprinter? Could it be a fingerprinter if the learning part is made separate? What are the trade-offs?
We discussed it recently; the idea was that we need to keep the list of canonicalized URLs inside the component to be able to update them with newly learned rules, so it can't be just a fingerprinter. Even if we add the learning part later, we will still need to change the component?
Can't the learning component update the rules in the fingerprinter?
Ah, or is it an issue that the fingerprints are going to be changed during the spider run?
We should also keep in mind the memory requirements. During long crawls, the dupefilter often becomes one of the most memory-heavy components in Scrapy. If we start storing more (e.g. URLs), it can become more of an issue, even without a learning part. There might be some room for optimization, maybe later.
> @BurnzZ I recall we discussed it before, but I don't recall the answer
It was a middleware since we would like to store and preserve the URL values so we can later re-canonicalize them after learning some new rules. Here's a crawling scenario that might paint a better picture:

- Spider requests https://example.com/product/123?ref=cat&node=32
  - Request successfully goes through.
- Spider requests https://example.com/product/456
  - Request successfully goes through.
- Spider requests https://example.com/product/123?ref=listing
  - Request successfully goes through.
  - It notices that it extracted the same product as that of https://example.com/product/123?ref=cat&node=32.
  - It has derived and adds `ref` to the list of URL query parameters to ignore. This triggers an event to re-canonicalize the URLs that were previously visited.
    - Previously Seen URLs (Before):
      - https://example.com/product/123?ref=cat&node=32
      - https://example.com/product/456
      - https://example.com/product/123?ref=listing
    - Previously Seen URLs (After), `ref` removed:
      - https://example.com/product/123?node=32
      - https://example.com/product/456
      - https://example.com/product/123
- Spider requests https://example.com/product/123?node=789&ref=promo
  - URL is canonicalized to https://example.com/product/123?node=789
  - Since this new URL does not match anything from the Previously Seen URLs, the request pushes through. It could be the case that `node` affects something in the results, and thus we need to see if it results in the same product.
    - Note that the original https://example.com/product/123?node=789&ref=promo is used for downloading, not the canonicalized version.
    - However, the canonicalized https://example.com/product/123?node=789 is the one stored in the Previously Seen URLs.
  - After receiving the response, it notices that the same product is extracted as that of the canonicalized Previously Seen URLs.
  - It has derived and adds `node` to the list of URL query parameters to ignore. This triggers an event to re-canonicalize the URLs that were previously visited.
    - Previously Seen URLs (Before):
      - https://example.com/product/123?node=32
      - https://example.com/product/456
      - https://example.com/product/123
      - https://example.com/product/123?node=789
    - Previously Seen URLs (After), `node` removed:
      - https://example.com/product/123
      - https://example.com/product/456
      - https://example.com/product/123 (may be removed since it already exists)
      - https://example.com/product/123 (may be removed since it already exists)
- Spider requests https://example.com/product/123?node=3829&ref=home
  - URL is canonicalized to https://example.com/product/123
  - Since this canonical URL has been seen before, the request is filtered out.
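The flow above can be sketched in plain Python using only the standard library. This is a hypothetical stand-alone sketch; `SeenUrls` and its method names are illustrative, not the actual DUD code:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class SeenUrls:
    """Sketch of a URL store that keeps canonical URLs so they can be
    re-canonicalized when a new query parameter is learned."""

    def __init__(self):
        self.ignored_params = set()
        self.seen = set()

    def canonicalize(self, url):
        scheme, netloc, path, query, frag = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(query) if k not in self.ignored_params]
        return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

    def should_crawl(self, url):
        canonical = self.canonicalize(url)
        if canonical in self.seen:
            return False  # duplicate: filter the request out
        self.seen.add(canonical)
        return True

    def learn_ignored_param(self, param):
        # Learning a new parameter triggers re-canonicalization of every
        # previously seen URL; duplicates collapse naturally in the set.
        self.ignored_params.add(param)
        self.seen = {self.canonicalize(u) for u in self.seen}
```

Walking through the scenario: after `learn_ignored_param("ref")`, a request for `https://example.com/product/123?ref=promo&node=32` canonicalizes to `...?node=32`, which is already in the store, so it is filtered.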
Note that these re-canonicalization events could be costly in terms of compute. In practice, however, all of the URL query parameters worth ignoring are learned early, within less than a thousand requests; there's not much to learn later in the crawl that would warrant re-canonicalizing the previously seen URLs. Still, a tiny chance exists that a re-canonicalization event is triggered when the spider has already made, say, a million requests.
Because of this need for re-canonicalization, we may use the Fingerprinter, but we can't avoid storing the URL values as well.
Lastly, we also want to only consider and store URLs from requests whose meta has `"dud": True` set, rather than enabling it by default, so that we can shave off storing unrelated URLs and avoid re-canonicalizing them as well.
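The opt-in gate described here amounts to a one-line check (an illustrative helper, not the actual DUD interface):

```python
def opted_in(meta: dict) -> bool:
    """Only requests explicitly carrying {"dud": True} in their meta
    are considered for URL storage (hypothetical helper)."""
    return meta.get("dud") is True
```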
```python
if not policy_path:
    raise NotConfigured("No DUD_LOAD_POLICY_PATH set")
self.processor = Processor(policy_path)
self.canonical_urls: Set[str] = set()
```
I wonder if we can estimate the memory usage of this approach somehow. If it's a lot, it may even make sense to have a separate code path when we know the fingerprints are not going to be updated (e.g. learning is not enabled, or learning is finished).
My main worry is that with this implementation there is a non-zero chance zyte-spider-templates RAM usage may blow up above SC free unit limits (or 1 SC unit limit) in reasonably common cases.
This skips requests without the meta key, but where will we use that key? Probably for all normal requests?
As for memory usage I wanted to say it's comparable to the fingerprinter one but then I realized that URLs are often longer than fingerprints.
> This skips requests without the meta key
By the way, I still think it shouldn't :) There was a thread in the original proposal about this. cc @BurnzZ
One way we can save on RAM is to store this on disk, e.g. using https://docs.python.org/3/library/shelve.html. Although this uses a dict-like interface, we can simply use the keys for uniqueness and leave the values empty.
Moreover, as a side note, there's a caveat to using shelve's `.get()` method: it runs in O(n) (I learned this the hard way). Using something like this is faster:
```python
try:
    return data_on_disk[key]
except KeyError:
    return None
```
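A minimal sketch of that on-disk idea, treating a shelve file as a set of seen URLs. The helper names are illustrative; the real component would look different:

```python
import os
import shelve
import tempfile


def open_seen_store(path):
    """Open a shelve file used as a set of seen URLs (sketch)."""
    return shelve.open(path)


def mark_seen(store, url):
    store[url] = None  # only the key matters; the value is a placeholder


def is_seen(store, url):
    # Direct lookup with try/except, as suggested above, instead of .get().
    try:
        store[url]
        return True
    except KeyError:
        return False
```

This trades RAM for disk I/O on every lookup, which may or may not be a good trade for a dupefilter-like component.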
> This skips requests without the meta key
> By the way, I still think it shouldn't :) There was a thread in the original proposal about this
Having an opt-in approach is indeed tedious for the user to set, but it allows narrowing down which types of URLs are stored here, reducing the storage needed.
What do you think about having an approach similar to scrapy-zyte-api's `TRANSPARENT_MODE`, while still allowing `"zyte_api"` to be set in the meta? Users can use a setting that turns everything on, but for zyte-spider-templates we can set DUD manually via the meta for optimal usage.
I'd prefer a solution where the overhead is minimal and it's opt-out :) It seems this is achievable. URL matching looks optimized enough, and I'm optimistic RAM can also be optimized.
Disk storage will trade off RAM for speed; it may not be a good trade here, but I'm not sure.
For URL storage there are also data structures like tries, which can save a lot of memory.
But at first sight, it looks like if we can assume that the fingerprints don't change, and have a separate code path for the case when they can change, it can all be pretty optimal. In that case we may even consider whether the storage can be shared with the dupefilter, but that's a separate idea.
When learning is enabled, optimization looks significantly harder. So, maybe we can make the learning opt-in (maybe per-request), not the whole thing opt-in?
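To illustrate the trie idea: a minimal character trie stores shared prefixes such as `https://example.com/product/` only once. This is a toy sketch; real implementations (e.g. the marisa-trie package) are far more compact:

```python
class UrlTrie:
    """Minimal character trie providing set-like membership for URLs.
    Shared prefixes are stored once, which is where the memory saving
    comes from on large crawls of a single site."""

    def __init__(self):
        self.root = {}

    def add(self, url):
        node = self.root
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-URL marker

    def __contains__(self, url):
        node = self.root
        for ch in url:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node
```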
By the way, I don't think we must have a final optimized implementation in this PR to have it merged.
Addressing it (i.e. evaluating how big the issue is, and making the necessary optimizations) is a blocker for using DUD in zyte-spider-templates by default.
Also, it seems we need to understand whether an optimized version is possible to make a decision on having the component enabled by default for all requests (plus a way to opt out per request?) vs having it opt-in per request.
```python
canonical_url = self.url_canonicalizer.process_url(request.url)
self.crawler.stats.inc_value("duplicate_url_discarder/request/processed")
return self._fallback_request_fingerprinter.fingerprint(
    request.replace(url=canonical_url)
)
```
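The pattern in this snippet, canonicalize first, then delegate to a fallback fingerprinter, can be sketched without Scrapy. The functions below are stand-ins: the real code passes a `Request` object to the fallback fingerprinter rather than a URL string:

```python
from hashlib import sha1
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fallback_fingerprint(url: str) -> bytes:
    # Stand-in for the fallback request fingerprinter.
    return sha1(url.encode()).digest()


def canonicalize(url: str, ignored: set) -> str:
    s = urlsplit(url)
    q = urlencode([(k, v) for k, v in parse_qsl(s.query) if k not in ignored])
    return urlunsplit((s.scheme, s.netloc, s.path, q, s.fragment))


def fingerprint(url: str, ignored: set) -> bytes:
    # Canonicalize, then delegate, so the fallback still controls the
    # actual fingerprint format (as in the snippet above).
    return fallback_fingerprint(canonicalize(url, ignored))
```

Two URLs differing only in an ignored parameter then produce identical fingerprints, which is what the dupefilter relies on.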
We need to check that it works properly when scrapy-zyte-api's fingerprinter is used as a fallback - is request.url even used in this case?
It seems request.url should be used, but the code is tricky; maybe we even should have a test for it.
What test(s) are you thinking of?
Maybe it's overkill. I was thinking about having a spider with the scrapy-zyte-api add-on enabled, and checking that the duplicate filtering works as expected (i.e. both DUD and scrapy-zyte-api logic is respected).
> So the next question is whether we want to copy the complicated fallback selection logic from https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/_request_fingerprinter.py (note that there is separate logic based on settings in https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/addon.py which we also want in our add-on).

Yes and no. I think we should:

Then the add-on will be mandatory. But I think we already decided that it's fine. The fallback fingerprinter will be the one set by e.g. scrapy-zyte-api (if the priorities are correct), the one set by the user in settings.py if no add-ons change it, or the Scrapy default one if nothing else defined it.

I think that's what we currently have.

I'm more than OK with keeping the setting, so that the add-on is not strictly necessary.

If it's fine to add the add-on separately then I think this one is ready.
```json
"order": 100,
"processor": "queryRemoval",
"urlPattern": {
    "include": []
}
```
Is it an example of universal pattern?
Yes
A pattern is universal "if there are no include patterns or they are empty"
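That rule reduces to a one-line check. The dict shape here is assumed from the JSON snippet above:

```python
def is_universal(url_pattern: dict) -> bool:
    # "A pattern is universal if there are no include patterns
    # or they are empty" (rule quoted above; dict shape assumed).
    return not url_pattern.get("include")
```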
Thank you everyone!
TODO: