Initial functionality. #4
Conversation
looking good so far!
```python
from duplicate_url_discarder.processor import Processor
...
class Component:  # TODO
```
@BurnzZ I recall we discussed it before, but I don't recall the answer :) Why is this component a middleware, not a request fingerprinter? Could it be a fingerprinter if the learning part is made separate? What are the trade-offs?
We discussed it recently; the idea was that we need to keep the list of canonicalized URLs inside the component to be able to update them with newly learned rules, so it can't be just a fingerprinter. Even if we add the learning part later, we will still need to change the component?
Can't the learning component update the rules in the fingerprinter?
Ah, or is it an issue that the fingerprints are going to be changed during the spider run?
We should also keep in mind the memory requirements. During long crawls, the dupefilter often becomes one of the most memory-heavy components in Scrapy. If we start storing more (e.g. URLs), it can become more of an issue, even without a learning part. There might be some room for optimization, maybe later.
> @BurnzZ I recall we discussed it before, but I don't recall the answer
It was a middleware since we would like to store and preserve the URL values so we can later re-canonicalize them after learning some new rules. Here's a crawling scenario that might paint a better picture:

- Spider requests https://example.com/product/123?ref=cat&node=32
  - Request successfully goes through.
- Spider requests https://example.com/product/456
  - Request successfully goes through.
- Spider requests https://example.com/product/123?ref=listing
  - Request successfully goes through.
  - It notices that it extracted the same product as that of https://example.com/product/123?ref=cat&node=32.
  - It has derived and adds `ref` to the list of URL query parameters to ignore. This triggers an event to re-canonicalize the URLs that were previously visited.
    - Previously Seen URLs (Before):
      - https://example.com/product/123?ref=cat&node=32
      - https://example.com/product/456
      - https://example.com/product/123?ref=listing
    - Previously Seen URLs (After), `ref` removed:
      - https://example.com/product/123?node=32
      - https://example.com/product/456
      - https://example.com/product/123
- Spider requests https://example.com/product/123?node=789&ref=promo
  - URL is canonicalized to https://example.com/product/123?node=789
  - Since this new URL does not match anything from the Previously Seen URLs, the request pushes through. It could be the case that `node` affects something in the results, and thus we need to see if it results in the same product.
    - Note that the original https://example.com/product/123?node=789&ref=promo is used for downloading, not the canonicalized version.
    - However, the canonicalized https://example.com/product/123?node=789 is the one stored in the Previously Seen URLs.
  - After receiving the response, it notices that the same product is extracted as that of the canonicalized Previously Seen URLs.
  - It has derived and adds `node` to the list of URL query parameters to ignore. This triggers an event to re-canonicalize the URLs that were previously visited.
    - Previously Seen URLs (Before):
      - https://example.com/product/123?node=32
      - https://example.com/product/456
      - https://example.com/product/123
      - https://example.com/product/123?node=789
    - Previously Seen URLs (After), `node` removed:
      - https://example.com/product/123
      - https://example.com/product/456
      - https://example.com/product/123 (may be removed since it already exists)
      - https://example.com/product/123 (may be removed since it already exists)
- Spider requests https://example.com/product/123?node=3829&ref=home
  - URL is canonicalized to https://example.com/product/123
  - Since this canonical URL has been seen before, the request is filtered out.
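The flow above can be sketched in plain Python using only the standard library. This is a hypothetical stand-alone sketch; `SeenUrls` and its method names are illustrative, not the actual DUD code:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class SeenUrls:
    """Sketch of a URL store that keeps canonical URLs so they can be
    re-canonicalized when a new query parameter is learned."""

    def __init__(self):
        self.ignored_params = set()
        self.seen = set()

    def canonicalize(self, url):
        scheme, netloc, path, query, frag = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(query) if k not in self.ignored_params]
        return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

    def should_crawl(self, url):
        canonical = self.canonicalize(url)
        if canonical in self.seen:
            return False  # duplicate: filter the request out
        self.seen.add(canonical)
        return True

    def learn_ignored_param(self, param):
        # Learning a new parameter triggers re-canonicalization of every
        # previously seen URL; duplicates collapse naturally in the set.
        self.ignored_params.add(param)
        self.seen = {self.canonicalize(u) for u in self.seen}
```

Walking through the scenario: after `learn_ignored_param("ref")`, a request for `https://example.com/product/123?ref=promo&node=32` canonicalizes to `...?node=32`, which is already in the store, so it is filtered.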
Note that these re-canonicalization events could be costly in terms of compute. In practice, however, all of the URL query parameters worth ignoring are learned early, within less than a thousand requests; there's not much to learn later in the crawl that would warrant re-canonicalizing the previously seen URLs. Still, a tiny chance exists that a re-canonicalization event is triggered when the spider has already made, say, a million requests.
Because of this need for re-canonicalization, we may use the Fingerprinter, but we can't avoid storing the URL values as well.
Lastly, we also want to only consider and store URLs from requests whose meta has `"dud": True` set, rather than enabling it by default, so that we can shave off storing unrelated URLs and avoid re-canonicalizing them as well.
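The opt-in gate described here amounts to a one-line check (an illustrative helper, not the actual DUD interface):

```python
def opted_in(meta: dict) -> bool:
    """Only requests explicitly carrying {"dud": True} in their meta
    are considered for URL storage (hypothetical helper)."""
    return meta.get("dud") is True
```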
```python
if not policy_path:
    raise NotConfigured("No DUD_LOAD_POLICY_PATH set")
self.processor = Processor(policy_path)
self.canonical_urls: Set[str] = set()
```
I wonder if we can estimate the memory usage of this approach somehow. If it's a lot, it may even make sense to have a separate code path when we know the fingerprints are not going to be updated (e.g. learning is not enabled, or learning is finished).
My main worry is that with this implementation there is a non-zero chance zyte-spider-templates RAM usage may blow up above SC free unit limits (or 1 SC unit limit) in reasonably common cases.
This skips requests without the meta key, but where will we use that key? Probably for all normal requests?
As for memory usage I wanted to say it's comparable to the fingerprinter one but then I realized that URLs are often longer than fingerprints.
> This skips requests without the meta key
By the way, I still think it shouldn't :) There was a thread in the original proposal about this. cc @BurnzZ
One way we can save on RAM is to store this on disk, e.g. using https://docs.python.org/3/library/shelve.html. Although this uses a dict-like interface, we can simply use the keys for uniqueness and leave the values empty.
Moreover, as a side note, there's a caveat to using shelve's `.get()` method: it runs in O(n) (I learned this the hard way). Using something like this is faster:
```python
try:
    return data_on_disk[key]
except KeyError:
    return None
```
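A minimal sketch of that on-disk idea, treating a shelve file as a set of seen URLs. The helper names are illustrative; the real component would look different:

```python
import os
import shelve
import tempfile


def open_seen_store(path):
    """Open a shelve file used as a set of seen URLs (sketch)."""
    return shelve.open(path)


def mark_seen(store, url):
    store[url] = None  # only the key matters; the value is a placeholder


def is_seen(store, url):
    # Direct lookup with try/except, as suggested above, instead of .get().
    try:
        store[url]
        return True
    except KeyError:
        return False
```

This trades RAM for disk I/O on every lookup, which may or may not be a good trade for a dupefilter-like component.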
> This skips requests without the meta key
> By the way, I still think it shouldn't :) There was a thread in the original proposal about this
Having an opt-in approach is indeed tedious for the user to set, but it allows narrowing down which types of URLs are stored here, reducing the storage needed.
What do you think about having an approach similar to scrapy-zyte-api's `TRANSPARENT_MODE`, while still allowing `"zyte_api"` to be set in the meta? Users can use a setting that turns everything on, but for zyte-spider-templates we can set DUD manually via the meta for optimal usage.
I'd prefer a solution where the overhead is minimal and it's opt-out :) It seems this is achievable. URL matching looks optimized enough, and I'm optimistic RAM can also be optimized.
Disk storage will trade off RAM for speed; it may not be a good trade here, but I'm not sure.
For URL storage there are also data structures like tries, which can save a lot of memory.
But at first sight, it looks like if we can assume that the fingerprints don't change, and have a separate code path for the case when they can change, it can all be pretty optimal. In that case we may even consider whether the storage can be shared with the dupefilter, but that's a separate idea.
When learning is enabled, optimization looks significantly harder. So, maybe we can make the learning opt-in (maybe per-request), not the whole thing opt-in?
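To illustrate the trie idea: a minimal character trie stores shared prefixes such as `https://example.com/product/` only once. This is a toy sketch; real implementations (e.g. the marisa-trie package) are far more compact:

```python
class UrlTrie:
    """Minimal character trie providing set-like membership for URLs.
    Shared prefixes are stored once, which is where the memory saving
    comes from on large crawls of a single site."""

    def __init__(self):
        self.root = {}

    def add(self, url):
        node = self.root
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-URL marker

    def __contains__(self, url):
        node = self.root
        for ch in url:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node
```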
By the way, I don't think we must have a final optimized implementation in this PR to have it merged.
Addressing it (i.e. evaluating how big the issue is, and making the necessary optimizations) is a blocker for using DUD in zyte-spider-templates by default.
Also, it seems we need to understand whether an optimized version is possible to make a decision on having the component enabled by default for all requests (plus a way to opt out per request?) vs having it opt-in per request.
```python
canonical_url = self.url_canonicalizer.process_url(request.url)
self.crawler.stats.inc_value("duplicate_url_discarder/request/processed")
return self._fallback_request_fingerprinter.fingerprint(
    request.replace(url=canonical_url)
)
```
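The pattern in this snippet, canonicalize first, then delegate to a fallback fingerprinter, can be sketched without Scrapy. The functions below are stand-ins: the real code passes a `Request` object to the fallback fingerprinter rather than a URL string:

```python
from hashlib import sha1
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fallback_fingerprint(url: str) -> bytes:
    # Stand-in for the fallback request fingerprinter.
    return sha1(url.encode()).digest()


def canonicalize(url: str, ignored: set) -> str:
    s = urlsplit(url)
    q = urlencode([(k, v) for k, v in parse_qsl(s.query) if k not in ignored])
    return urlunsplit((s.scheme, s.netloc, s.path, q, s.fragment))


def fingerprint(url: str, ignored: set) -> bytes:
    # Canonicalize, then delegate, so the fallback still controls the
    # actual fingerprint format (as in the snippet above).
    return fallback_fingerprint(canonicalize(url, ignored))
```

Two URLs differing only in an ignored parameter then produce identical fingerprints, which is what the dupefilter relies on.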
We need to check that it works properly when scrapy-zyte-api's fingerprinter is used as a fallback - is request.url even used in this case?
It seems request.url should be used, but the code is tricky; maybe we even should have a test for it.
What test(s) are you thinking of?
Maybe it's overkill. I was thinking about having a spider with the scrapy-zyte-api add-on enabled, and checking that the duplicate filtering works as expected (i.e. both DUD and scrapy-zyte-api logic is respected).
> So the next question is whether we want to copy the complicated fallback selection logic from https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/_request_fingerprinter.py (note that there is separate logic based on settings in https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/addon.py which we also want in our add-on).

Yes and no. I think we should:

Then the add-on will be mandatory. But I think we already decided that it's fine. The fallback fingerprinter will be the one set by e.g. scrapy-zyte-api (if the priorities are correct), the one set by the user in settings.py if no add-ons change it, or the Scrapy default one if nothing else defined it.

I think that's what we currently have.

I'm more than OK with keeping the setting, so that the add-on is not strictly necessary.

If it's fine to add the add-on separately then I think this one is ready.
```json
"order": 100,
"processor": "queryRemoval",
"urlPattern": {
    "include": []
}
```
Is it an example of universal pattern?
Yes
A pattern is universal "if there are no include patterns or they are empty"
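That rule reduces to a one-line check. The dict shape here is assumed from the JSON snippet above:

```python
def is_universal(url_pattern: dict) -> bool:
    # "A pattern is universal if there are no include patterns
    # or they are empty" (rule quoted above; dict shape assumed).
    return not url_pattern.get("include")
```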
Thank you everyone!
TODO: