Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove _next_expiration field from CookieJar #7790

Open
1 task done
dmoklaf opened this issue Nov 5, 2023 · 17 comments
Open
1 task done

Remove _next_expiration field from CookieJar #7790

dmoklaf opened this issue Nov 5, 2023 · 17 comments

Comments

@dmoklaf
Copy link

dmoklaf commented Nov 5, 2023

Is your feature request related to a problem?

The _next_expiration field seems unused in CookieJar. It is only written to and never read in any part of aiohttp:

self._next_expiration = next_whole_second()

Describe the solution you'd like

Remove the field if it's useless

Describe alternatives you've considered

Maybe that it was intended for something that was never completely coded (e.g.,fast removal of expired cookies).

Related component

Client

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@dmoklaf
Copy link
Author

dmoklaf commented Nov 5, 2023

Indeed this _next_expiration field was used to perform fast removal of expired cookies, but a posterior commits deleted the line of code that was using it:

if self._next_expiration > now:

Since that line of code has been deleted, _next_expiration is useless now (and the cleaning of cookies is slower as it doesn't use it to cut short the clear process).

The commit that deleted this line of code is
#5249 by @derlih

As an alternative to removing _next_expiration entirely, this line of code could be reintroduced. However this is more difficult now, as this commit made the _do_expiration code to be removed, and call instead the clear method, which now serves additional needs, like allowing custom selection of cookies to be cleared through a callback function. To revert this, the old _do_expiration code shall be reinstated, instead of delegating the work to the clear method.

More largely, frequent execution paths (_do_expiration is called very often across the entire API) shall have simple code benefiting from shortcuts (like the _next_expiration field to determine if the expiration process can be skipped) and rare paths (like a clear operation with a custom selection function) could be isolated into dedicated code.

@Dreamsorcerer
Copy link
Member

#7784 is also looking at removing duplicate calls to _do_expiration(). @bdraco Any opinion on whether this is worth reimplementing, or if it becomes a bit redundant after your changes?

@bdraco
Copy link
Member

bdraco commented Nov 6, 2023

Ideally the _next_expiration functionality would be restored.

In Home Assistant we see a lot of time spent in filter_cookies because sessions are shared between integrations which is aggravated by _do_expiration being called multiple times currently. Home Assistant tends to have a lot of cookies built up in the cookie jar which would make the restoration of the _next_expiration functionality quite helpful for performance since most calls to _do_expiration would return almost immediately.

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

Same here - I use aiohttp for a large-scale crawler. It works wonder, excepted the time it spends processing the cookie jar. The code seems complex and, potentially, not optimized for the frequent case (expiring cookies before selecting the set of active cookies to send to a website) but for rare cases (clearing cookies based on a custom logic).

@bdraco
Copy link
Member

bdraco commented Nov 6, 2023

py-spy of a request. Purple area is the time spent filtering cookies

Screenshot 2023-11-06 at 8 08 19 AM

Zoomed in
Screenshot 2023-11-06 at 8 08 57 AM

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

How many cookies were in the jar? I have several thousands (across a couple hundred domains)

@bdraco
Copy link
Member

bdraco commented Nov 6, 2023

I don't have a good way to inspect that as its from a live running instance, but I can clearly say no where near that many. I'd ballpark it to be 50-100 at most. I expect for your use case you probably spend the bulk of the request time in the cookie jar.

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

Yep that's what I see. The cookie jar is not scalable, specifically due to the expiration logic.
I didn't see this through py-spy but when switching the crawler from a DummyCookieJar to the standard CookieJar

@Dreamsorcerer
Copy link
Member

Sounds like reintroducing it is the correct approach then.

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

In fact the entire logic could be switched in favor of a time-ordered expiry queue. That way, there is never any need to loop over every cookie. Only the ones in front of the queue (the ones due to expire soon) need to be checked. The loop is exited as soon as the next iterated cookie from that queue is not expired yet. The heapq standard library provides the priority queue algorithm for free.

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Nov 6, 2023

Well, feel free to come up with a proposal, if it's clearly better without too much complexity, then we can merge it.
Just bear in mind a 3rd PR currently in progress, that affects cookie performance is #7777.

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

Sounds great, I will wait for #7777 to be merged and then, depending on if it resolved this issue or not and the new capabilities it offers, make a proposal.

@Dreamsorcerer
Copy link
Member

It doesn't change the expirations, but will reduce the number of cookies being looked at in filter_cookies(). Mainly just want to make sure you don't end up with a load of merge conflicts.

@dmoklaf
Copy link
Author

dmoklaf commented Nov 6, 2023

I understand, I will find the best path.

@bdraco
Copy link
Member

bdraco commented Nov 12, 2023

In case its help to test against. Below is a basic benchmark (can likely be improved)

import timeit
from http import cookies

from yarl import URL

from aiohttp import CookieJar


def filter_large_cookie_jar():
    """Filter out large cookies from the cookie jar."""
    jar = CookieJar()
    c = cookies.SimpleCookie()
    domain_url = URL("http://maxagetest.com/")
    other_url = URL("http://otherurl.com/")

    for i in range(5000):
        cookie_name = f"max-age-cookie{i}"
        c[cookie_name] = "any"
        c[cookie_name]["max-age"] = 60
        c[cookie_name]["domain"] = "maxagetest.com"
    jar.update_cookies(c, domain_url)
    assert len(jar) == 5000
    assert len(jar.filter_cookies(domain_url)) == 5000
    assert len(jar.filter_cookies(other_url)) == 0

    filter_domain = timeit.timeit(lambda: jar.filter_cookies(domain_url), number=1000)
    filter_other_domain = timeit.timeit(
        lambda: jar.filter_cookies(other_url), number=1000
    )
    print(f"filter_domain: {filter_domain}")
    print(f"filter_other_domain: {filter_other_domain}")


filter_large_cookie_jar()

@dmoklaf
Copy link
Author

dmoklaf commented Nov 16, 2023

This is a great benchmark - will use it and publish sample measures for future reference

@Dreamsorcerer
Copy link
Member

Sounds great, I will wait for #7777 to be merged and then, depending on if it resolved this issue or not and the new capabilities it offers, make a proposal.

PR is merged now, so feel free to propose further changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants