
Initial functionality. #4

Merged
34 commits merged into main from add-code on May 9, 2024
Conversation

@wRAR (Member) commented Mar 25, 2024:

TODO:

  • Write tests for loading policy files
  • Decide which Scrapy component to use
  • Write tests involving Scrapy
  • Add basic documentation

@wRAR requested review from kmike, Gallaecio and BurnzZ, March 25, 2024 16:17
@BurnzZ (Member) left a comment:

looking good so far!

Resolved review threads: duplicate_url_discarder/_rule.py, tests/test_policies.py, README.rst
from duplicate_url_discarder.processor import Processor


class Component: # TODO

@kmike commented Apr 7, 2024:

@BurnzZ I recall we discussed it before, but I don't recall the answer :) Why is this component a middleware, not a request fingerprinter? Could it be a fingerprinter if the learning part is made separate? What are the trade-offs?

Comment (Member Author):

We discussed it recently; the idea was that we need to keep the list of canonicalized URLs inside the component to be able to update them with newly learned rules, so it can't be just a fingerprinter. Even if we do the learning part later, we will still need to change the component?

Comment:

Can't the learning component update the rules in the fingerprinter?

Comment:

Ah, or is it an issue that the fingerprints are going to be changed during the spider run?

@kmike commented Apr 7, 2024:

We should also keep in mind the memory requirements. During long crawls, the dupefilter often becomes one of the most memory-heavy components in Scrapy. If we start storing more (e.g. URLs), it can become more of an issue, even without a learning part. There might be some room for optimization, maybe later.

Comment (Member):

> @BurnzZ I recall we discussed it before, but I don't recall the answer

It was a middleware since we would like to store and preserve the URL values so we can later re-canonicalize them after learning some new rules. Here's a crawling scenario that might paint a better picture:

  1. Spider requests https://example.com/product/123?ref=cat&node=32
    • Request successfully goes through.
  2. Spider requests https://example.com/product/456
    • Request successfully goes through.
  3. Spider requests https://example.com/product/123?ref=listing
  4. The component derives that ref can be ignored and adds it to the list of URL query parameters to ignore. This triggers an event to re-canonicalize the previously visited URLs.
  5. Spider requests https://example.com/product/123?node=789&ref=promo
  6. The component derives that node can be ignored and adds it to the list of URL query parameters to ignore. This again triggers a re-canonicalization of the previously visited URLs.
  7. Spider requests https://example.com/product/123?node=3829&ref=home

Note that these re-canonicalization events could be costly in terms of compute. In practice, however, all of the URL query parameters worth ignoring are learned early, often within the first thousand requests. There's not much left to learn later in the crawl that would warrant re-canonicalizing the previously seen URLs, though a tiny chance remains that a re-canonicalization event is triggered when the spider already has, say, a million requests behind it.

Because of this need for re-canonicalization, we may use the fingerprinter, but we can't avoid also storing the URL values.

Lastly, we also want to consider and store only URLs from requests with the meta value "dud": True, rather than enabling this by default, so that we avoid storing unrelated URLs and re-canonicalizing them as well.
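The flow described in the steps above can be sketched roughly as follows. This is an illustrative sketch only: `SeenUrlStore`, `canonicalize`, and `learn_ignored_param` are hypothetical names, not the actual duplicate-url-discarder API.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def canonicalize(url: str, ignored_params: set) -> str:
    """Drop ignored query parameters and sort the rest."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored_params]
    return urlunparse(parts._replace(query=urlencode(sorted(params))))


class SeenUrlStore:
    """Hypothetical store that keeps raw URLs so they can be re-canonicalized."""

    def __init__(self):
        self.ignored_params: set = set()
        self.raw_urls: list = []   # kept so we can re-canonicalize later
        self.canonical: set = set()

    def add(self, url: str) -> bool:
        """Return True if the URL is new after canonicalization."""
        self.raw_urls.append(url)
        canon = canonicalize(url, self.ignored_params)
        if canon in self.canonical:
            return False
        self.canonical.add(canon)
        return True

    def learn_ignored_param(self, param: str) -> None:
        """Learning a new parameter triggers re-canonicalization of all seen URLs."""
        self.ignored_params.add(param)
        self.canonical = {canonicalize(u, self.ignored_params) for u in self.raw_urls}
```

With the crawl scenario above, the third product-123 request is only recognized as a duplicate because the raw URLs were retained and re-canonicalized after `ref` and `node` were learned, which is why a plain fingerprinter is not enough.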

Resolved review threads: tests/test_processor.py, duplicate_url_discarder/processor.py

Resolved review threads: README.rst, tox.ini
if not policy_path:
    raise NotConfigured("No DUD_LOAD_POLICY_PATH set")
self.processor = Processor(policy_path)
self.canonical_urls: Set[str] = set()

Comment:

I wonder if we can estimate the memory usage of this approach somehow. If it's a lot, it may even make sense to have a separate code path when we know the fingerprints are not going to be updated (e.g. learning is not enabled, or learning is finished).
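To put the memory concern in rough numbers, here is an illustrative back-of-the-envelope estimate. The helper and its constants are assumptions for illustration, and the figures are CPython-specific approximations, not a real profile of the component.

```python
import sys


def estimate_set_memory(n_urls: int, avg_url_len: int = 80) -> int:
    """Approximate bytes needed for a Python set of n ASCII URLs."""
    sample = "x" * avg_url_len
    per_string = sys.getsizeof(sample)  # str object header + payload
    # Amortized cost of one hash-table slot, measured on a sample set:
    sample_set = set("u%d" % i for i in range(1024))
    per_slot = sys.getsizeof(sample_set) / len(sample_set)
    return int(n_urls * (per_string + per_slot))
```

On a 64-bit CPython this works out to roughly a couple hundred bytes per 80-character URL, so a million stored URLs lands in the low hundreds of megabytes, which is consistent with the worry about Scrapy Cloud unit limits raised below.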

@kmike commented Apr 15, 2024:

My main worry is that with this implementation there is a non-zero chance zyte-spider-templates RAM usage may blow up above SC free unit limits (or 1 SC unit limit) in reasonably common cases.

Comment (Member Author):

This skips requests without the meta key, but where will we use that key? Probably for all normal requests?

As for memory usage, I wanted to say it's comparable to that of the fingerprinter, but then I realized that URLs are often longer than fingerprints.

Comment:

> This skips requests without the meta key

By the way, I still think it shouldn't :) There was a thread in the original proposal about this. cc @BurnzZ

Comment (Member):

One way we can save on RAM is to store this on disk, e.g. using https://docs.python.org/3/library/shelve.html. Although this uses a dict-like interface, we can simply use the keys for uniqueness and leave the values empty.

Moreover, as a side note, there's a caveat to using shelve's .get() method: it runs in O(n) (I learned this the hard way; see Ref). Using something like this is faster:

try:
    return data_on_disk[key]
except KeyError:
    return None

> This skips requests without the meta key
>
> By the way, I still think it shouldn't :) There was a thread in the original proposal about this

Having an opt-in approach is indeed tedious for the user to set, but it allows narrowing down which types of URLs get stored here, thus reducing the storage needed.

What do you think about having an approach similar to scrapy-zyte-api's TRANSPARENT_MODE, while still allowing "zyte_api" to be set in the meta? Users can use a setting that turns everything on, but for zyte-spider-templates we can set DUD manually via the meta for optimal usage.
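The on-disk idea above can be sketched as follows: a shelve file used purely as a set, with keys carrying the membership, empty values, and lookups done via try/except instead of `.get()` as suggested. The file path and helper name are illustrative.

```python
import os
import shelve
import tempfile


def seen(db, url: str) -> bool:
    """Membership test via plain key lookup, avoiding shelve's .get()."""
    try:
        db[url]
        return True
    except KeyError:
        return False


path = os.path.join(tempfile.mkdtemp(), "seen_urls")
with shelve.open(path) as db:
    db["https://example.com/product/123"] = None  # value unused; key = membership
    hit = seen(db, "https://example.com/product/123")
    miss = seen(db, "https://example.com/other")
```

Whether the disk round-trip per request is an acceptable trade for the RAM savings is exactly the open question in the replies below.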

Comment:

I'd prefer a solution where the overhead is minimal and it's opt-out :) It seems this is achievable. URL matching looks optimized enough, and I'm optimistic RAM can also be optimized.

Disk storage trades speed for RAM; it may not be a good trade here, but I'm not sure.

For URL storage there are also data structures like tries, which can save a lot of memory.

But at first sight, it looks like if we can assume that the fingerprints don't change, and have a separate code path for the case where they can change, it can all be pretty optimal. In that case we may even consider whether the storage can be shared with the dupefilter, but that's a separate idea.

When learning is enabled, optimization looks significantly harder. So maybe we can make learning opt-in (maybe per-request), rather than the whole thing opt-in?
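As a toy illustration of the trie idea: a character trie stores each shared URL prefix once. Note this naive dict-of-dicts sketch only shows the structure; in Python it actually uses more RAM than a set, and the real savings come from compact implementations such as MARISA tries.

```python
class Trie:
    """Toy character trie; shared prefixes are stored once along one path."""

    def __init__(self):
        self.root = {}

    def add(self, url: str) -> None:
        node = self.root
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-url marker

    def __contains__(self, url: str) -> bool:
        node = self.root
        for ch in url:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node
```

Since crawled URLs from one site share long scheme/host/path prefixes, a compact trie can amortize most of that text across entries, which is where the memory savings come from.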

Comment:

By the way, I don't think we must have a final optimized implementation in this PR to have it merged.

Addressing it (i.e. evaluating how big the issue is, and making the necessary optimizations) is a blocker for using DUD in zyte-spider-templates by default.

@kmike commented Apr 16, 2024:

Also, it seems we need to understand whether an optimized version is possible in order to decide between enabling the component by default for all requests (plus a way to opt out per request?) and having it opt-in per request.

Resolved review threads: README.rst
canonical_url = self.url_canonicalizer.process_url(request.url)
self.crawler.stats.inc_value("duplicate_url_discarder/request/processed")
return self._fallback_request_fingerprinter.fingerprint(
    request.replace(url=canonical_url)
)

Comment:

We need to check that it works properly when scrapy-zyte-api's fingerprinter is used as a fallback - is request.url even used in this case?

Comment:

It seems request.url should be used, but the code is tricky; maybe we even should have a test for it.

Comment (Member Author):

What test(s) are you thinking of?

Comment:

Maybe it's overkill. I was thinking about having a spider with the scrapy-zyte-api addon enabled, and checking that duplicate filtering works as expected (i.e. both DUD and scrapy-zyte-api logic is respected).

@wRAR (Member Author) commented May 6, 2024:

So the next question is whether we want to copy the complicated fallback selection logic from https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/_request_fingerprinter.py (note that there is separate logic based on settings in https://github.com/scrapy-plugins/scrapy-zyte-api/blob/main/scrapy_zyte_api/addon.py which we also want in our add-on).

@Gallaecio commented May 6, 2024:

Yes and no. I think we should:

  • Override whatever fingerprinter is configured by the time this add-on runs, and keep the previous value internally, no need for fallback settings.
  • Implement our fingerprinting method here as something like: request = canonicalize_request_url(request); return self.actual_fingerprinter(request).
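The suggested design can be sketched as below. All names are illustrative, not the actual duplicate-url-discarder implementation, and the fake request/fallback classes stand in for Scrapy's real `Request` and fingerprinter objects.

```python
class CanonicalizingFingerprinter:
    """Canonicalize the request URL, then delegate to the previously
    configured fingerprinter, per the suggestion above."""

    def __init__(self, actual_fingerprinter, canonicalize_url):
        self.actual_fingerprinter = actual_fingerprinter  # the overridden one
        self.canonicalize_url = canonicalize_url

    def fingerprint(self, request):
        canonical = self.canonicalize_url(request.url)
        return self.actual_fingerprinter.fingerprint(request.replace(url=canonical))


# Minimal stand-ins for Scrapy's Request and a fallback fingerprinter:
class FakeRequest:
    def __init__(self, url):
        self.url = url

    def replace(self, url):
        return FakeRequest(url)


class FakeFallback:
    def fingerprint(self, request):
        return request.url.encode()


# Two URLs differing only in a stripped query parameter fingerprint the same:
fp = CanonicalizingFingerprinter(FakeFallback(), lambda u: u.split("?")[0])
fp_a = fp.fingerprint(FakeRequest("https://example.com/p?ref=a"))
fp_b = fp.fingerprint(FakeRequest("https://example.com/p?ref=b"))
```

The appeal of this shape is that the wrapper is agnostic to which fingerprinter it wraps, so it composes with scrapy-zyte-api's or Scrapy's default one, as discussed below.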

@wRAR (Member Author) commented May 7, 2024:

> keep the previous value internally, no need for fallback settings.

Then the add-on will be mandatory. But I think we already decided that it's fine.

The fallback fingerprinter will be the one set by e.g. scrapy-zyte-api (if the priorities are correct), the one set by the user in settings.py if no add-ons change it, or the Scrapy default one if nothing else defined it.

@wRAR (Member Author) commented May 7, 2024:

> Implement our fingerprinting method here as something like: request = canonicalize_request_url(request); return self.actual_fingerprinter(request).

I think that's what we currently have.

Resolved review thread: README.rst
@Gallaecio replied:
> keep the previous value internally, no need for fallback settings.
>
> Then the add-on will be mandatory. But I think we already decided that it's fine.
>
> The fallback fingerprinter will be the one set by e.g. scrapy-zyte-api (if the priorities are correct), the one set by the user in settings.py if no add-ons change it, or the Scrapy default one if nothing else defined it.

I’m more than OK with keeping the setting, so that the add-on is not strictly necessary.

Resolved review threads: README.rst, duplicate_url_discarder/processors/query_removal.py, duplicate_url_discarder/url_canonicalizer.py, tests/test_fingerprinter.py
@wRAR (Member Author) commented May 8, 2024:

If it's fine to add the add-on separately then I think this one is ready.

"order": 100,
"processor": "queryRemoval",
"urlPattern": {
    "include": []

Comment:

Is it an example of universal pattern?

Comment (Member Author):

Yes

@wRAR commented May 9, 2024:

A pattern is universal "if there are no include patterns or they are empty".
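Under that definition, a universality check could look like the sketch below; `is_universal` is an illustrative helper, not the library's API, and the rule dict mirrors the JSON excerpt above.

```python
def is_universal(rule: dict) -> bool:
    """A rule's URL pattern is universal when its include list is
    missing or empty (per the definition quoted above)."""
    include = rule.get("urlPattern", {}).get("include")
    return not include


# Mirrors the JSON excerpt discussed in this thread:
universal_rule = {
    "order": 100,
    "processor": "queryRemoval",
    "urlPattern": {"include": []},
}
```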

@wRAR wRAR merged commit 810eb02 into main May 9, 2024
8 checks passed
@wRAR wRAR deleted the add-code branch May 9, 2024 08:41
@wRAR (Member Author) commented May 9, 2024:

Thank you everyone!

@zytedata zytedata deleted a comment from BurnzZ May 14, 2024
@zytedata zytedata deleted a comment from BurnzZ May 14, 2024