This repository has been archived by the owner on Apr 14, 2022. It is now read-only.

How to track session-related state beyond cookies? #141

Open
njsmith opened this issue Nov 18, 2019 · 7 comments

@njsmith
Member

njsmith commented Nov 18, 2019

@sethmlarson points out that there are a bunch of bits of state that HTTP clients might want to track across otherwise-independent requests:

  • cookies
  • caches
  • altsvc
  • permanent redirects
  • hsts
  • (what else?)

Traditionally HTTP client libs handle cookies, but mostly drop the other stuff on the floor. Maybe we should have a more comprehensive strategy. I think he's thought about this some, so I'll let him fill in more details :-)

@sethmlarson
Contributor

Gonna add TLSv1.3 / QUIC session resumption tickets to the list of items that can be stored.

@sethmlarson
Contributor

TLS session resumption doesn't work for Python on TLS 1.3 so we'd only get to use session resumption for TLS 1.2 (Still good!)
Docs: https://docs.python.org/3/library/ssl.html#ssl.SSLSession
The ticket has a few properties, unfortunately none of them are documented:

id
time
timeout
ticket_lifetime_hint
has_ticket

aioquic also has session tickets in this file, look for class SessionTicket. They have the following properties:

age_add: int
cipher_suite: CipherSuite
not_valid_after: datetime.datetime
not_valid_before: datetime.datetime
resumption_secret: bytes
server_name: str
ticket: bytes

max_early_data_size: Optional[int] = None
other_extensions: List[Tuple[int, bytes]] = field(default_factory=list)

Will need to figure out the parallels between the two to see if they can still be stored in the same group. For now though aioquic is a ways off, I'll focus on only the TLS session tickets.

@njsmith
Member Author

njsmith commented Nov 28, 2019

I realized I was really unclear on when it was safe to re-use session tickets, e.g. if TLS settings changed, so I asked around and got a pretty good answer:

<davidben> Session resumption typically carries over the authentication from the previous connection. So, yeah, if you have two connections with different client cert preferences, they should not share sessions.
<davidben> Other bits of parameter negotiation may also be short-circuited, but it depends on the server. (E.g. OpenSSL will short-circuit cipher suite negotiation but we fixed that in BoringSSL at some point. It interacts badly with ticket renewal and server config changes.)
<njs> davidben: so it sounds like possibly a session ticket cache should just key off of "all TLS-related configuration"?
<davidben> njs: It's probably a bit excessive, but not wrong either.
[...]
<davidben> Right but since you already have to suitably partition your per-request config bits for connection reuse, the simplest thing is to partition the session resumption similarly.
[...]
<njs> yeah, if the answer is "treat session tickets like they're connections" then that's straightforward enough

So it sounds like session tickets probably should get handled at the same level as the connection pooling – which I guess will be a lower level "HTTP transport" object, not a higher-level "session with cookies and stuff" object?
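
A minimal sketch of "treat session tickets like they're connections", assuming the ticket cache is keyed the same way a connection pool would be. `TLSConfigKey` and `SessionTicketCache` are hypothetical names, and the fields on the key are illustrative guesses at "all TLS-related configuration", not a complete list:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TLSConfigKey:
    """Hypothetical, hashable summary of the TLS settings for a request."""
    verify: bool = True
    ca_bundle: Optional[str] = None
    client_cert: Optional[str] = None
    alpn_protocols: Tuple[str, ...] = ("h2", "http/1.1")


class SessionTicketCache:
    """Keeps one opaque ticket per (host, port, TLS config) partition."""

    def __init__(self) -> None:
        self._tickets: Dict[Tuple[str, int, TLSConfigKey], object] = {}

    def put(self, host: str, port: int, tls: TLSConfigKey, ticket: object) -> None:
        self._tickets[(host.lower(), port, tls)] = ticket

    def get(self, host: str, port: int, tls: TLSConfigKey) -> Optional[object]:
        # A request with different TLS settings never sees this ticket.
        return self._tickets.get((host.lower(), port, tls))
```

With the stdlib, the stored object would presumably be the `ssl.SSLSession` read from `SSLSocket.session`, handed back through the `session=` argument of `SSLContext.wrap_socket`.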

I guess optimally, HSTS and Alt-Svc should be handled at this level as well. It doesn't make sense to stop enforcing HSTS just because you switched to a different set of cookies! And it makes sense architecturally too, because the "HTTP transport" is responsible for TLS and protocol negotiation, and that's what tickets/HSTS/Alt-Svc are all about. (Versus cookies, caches, redirects, which are all defined at the level of HTTP's abstract semantics, and don't care about transport details.)

(Or is it better to treat HSTS as an automatic redirect? I.e., if you request http://foo.com and HSTS says that you should actually fetch https://foo.com, then what happens? Does it fail, do you silently upgrade to HTTPS internally, do you upgrade to HTTPS internally + return response metadata as if the user had asked for https:// in the first place, how does it interact with redirect tracking APIs? I guess I need to read the spec or something...)
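
To make one of those options concrete, here is a rough sketch of the "silently upgrade to HTTPS internally" variant, backed by an in-memory store. `MemoryHSTSStore` and its methods are hypothetical names; the header parsing is simplified, and it ignores userinfo and IP-literal hosts. It doesn't answer the question of what metadata the response should report:

```python
import time
from urllib.parse import urlsplit, urlunsplit


class MemoryHSTSStore:
    """Hypothetical in-memory HSTS policy store (lost when the process exits)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._policies = {}  # host -> (expires_at, include_subdomains)

    def record(self, host, header_value):
        """Record a Strict-Transport-Security header received over HTTPS."""
        max_age = None
        include_subdomains = False
        for directive in header_value.split(";"):
            name, _, value = directive.strip().partition("=")
            name = name.strip().lower()
            if name == "max-age":
                try:
                    max_age = int(value.strip().strip('"'))
                except ValueError:
                    return  # malformed header: ignore it entirely
            elif name == "includesubdomains":
                include_subdomains = True
        if max_age is None:
            return
        host = host.lower()
        if max_age == 0:
            self._policies.pop(host, None)  # max-age=0 deletes the policy
        else:
            self._policies[host] = (self._clock() + max_age, include_subdomains)

    def applies_to(self, host):
        """True if the host, or a parent with includeSubDomains, has a live policy."""
        labels = host.lower().split(".")
        now = self._clock()
        for i in range(len(labels)):
            policy = self._policies.get(".".join(labels[i:]))
            if policy is None or policy[0] <= now:
                continue
            if i == 0 or policy[1]:
                return True
        return False

    def upgrade(self, url):
        """Rewrite http:// to https:// when a policy applies; else return url as-is."""
        parts = urlsplit(url)
        if parts.scheme != "http" or not parts.hostname:
            return url
        if not self.applies_to(parts.hostname):
            return url
        netloc = parts.hostname
        if parts.port not in (None, 80):
            netloc = f"{netloc}:{parts.port}"  # keep explicit non-default ports
        return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))
```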

@njsmith
Member Author

njsmith commented Nov 28, 2019

I guess something else to say explicitly here is: re-using a connection is also an example of state that gets shared between otherwise unrelated requests!

Assuming we do go with a design that has a low-level "transport" that handles connection-level stuff + higher-level "sessions" on top, we'll need to document that users who are serious about privacy/anonymization, and who want to make sure servers can't link together different requests, need to use Tor + a different "transport" object for each pseudonymous identity. Simply switching "session" objects won't be enough to stop an attacker from correlating requests.

@sethmlarson
Contributor

There's also this RFC we have to deal with when it comes to TLS session resumption: https://tools.ietf.org/html/rfc8470

Doesn't seem too bad, except we need to signal to the lifecycle somehow that a session is resumed and add the Early-Data: 1 header and respond to the 425 status code (and allow typically unsafe methods to be retransmitted).
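
A rough sketch of just the 425 handling, assuming the transport exposes some per-attempt hook. `send` and `Response` are hypothetical stand-ins, not an existing API, and the sketch deliberately leaves the Early-Data header out, since how a client should use it is a question for the RFC text rather than something to guess at here:

```python
from typing import Callable, NamedTuple


class Response(NamedTuple):
    """Hypothetical minimal response type for this sketch."""
    status: int
    body: bytes = b""


def send_with_early_data(
    send: Callable[[bool], Response], session_resumed: bool
) -> Response:
    """Try early data on a resumed session; retry once on 425 (Too Early).

    `send(use_early_data)` is an assumed callable that performs one attempt.
    """
    use_early_data = session_resumed
    response = send(use_early_data)
    if response.status == 425 and use_early_data:
        # The server refused to process early data: resend the same request
        # after the handshake completes, this time without early data.
        response = send(False)
    return response
```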

@njsmith
Member Author

njsmith commented Dec 1, 2019

[This started as a comment on https://github.com//pull/159 but then I realized that it was really a general discussion of session state rather than anything much to do with that design sketch in particular, so I decided to post it here instead. But for context, that PR has a sketch of a session store with methods like get_cookie, get_altsvc, get_redirect, get_cached_response, etc.]

I guess the biggest architectural question here is whether we want a single uber-session-store that holds all this data, or separate objects to hold different kinds of state.

Above, I suggested that we might want to handle Alt-Svc/HSTS/session-tickets at the connection pool level, rather than at the session level, so that would argue for splitting them off from cookies/redirects/caching.

Also, I think permanent redirects are a special case of a cached response? Based on this SO thread, it sounds like that's how browsers treat them. In particular, if a 301 has explicit cache control headers, then those are respected and can override the normal "permanent" caching. So maybe we just need a response caching API, and can drop the redirect caching part.
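
As a sketch of what "a redirect is just a cached response" could look like, here is a hypothetical helper that decides how long a redirect may be reused. The function name is made up, the Cache-Control parsing is simplified, and treating `no-cache` the same as "don't reuse" is a shortcut (a real cache would revalidate instead):

```python
import math


def redirect_cache_lifetime(status: int, cache_control: str = "") -> float:
    """Return how many seconds a redirect response may be reused from cache.

    0.0 means "do not reuse"; math.inf means "no explicit expiry".
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    if "no-store" in directives or "no-cache" in directives:
        return 0.0
    if "max-age" in directives:
        # Explicit cache control overrides the default "permanent" behavior.
        try:
            return float(max(0, int(directives["max-age"])))
        except ValueError:
            return 0.0
    if status in (301, 308):
        return math.inf  # permanent redirects are cacheable by default
    return 0.0  # temporary redirects need explicit freshness information
```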

Cookies are also somewhat special: all the other bits of state we're talking about here are fundamentally optimizations – they can make things faster or more secure, but if you lose the state then it doesn't really matter; HTTP's semantics are unaffected. But for cookies, this isn't true; they have a massive effect on HTTP semantics. I'm not sure what the consequences of this are.

Okay, let's turn it around and think top-down for a bit. I'm tentatively imagining an architecture like this:

  • There's a low-level "transport" object that you can ask to handle a single completely-specified HTTP transaction. Internally it handles the details like HTTP/1.1 vs HTTP/2, connection pools, etc. There's a default global transport, and most programs will just use that for everything. This is the part that cares about alt-svc and session-tickets, and maybe about HSTS (I don't know enough about HSTS to be certain).

    When I say a "completely specified request" I mean method/headers/body, but also a giant pile of configuration options that are passed in independently for each request; the idea is that the transport should act mostly stateless to the outside. Obviously it does have internal state, like cached connections and so forth, but ideally there are no user-visible configuration settings at the object level; instead that's all passed in on each request.

  • There's a high-level interface hip.request (and its trivial wrappers hip.get, hip.post, etc.), which takes an even bigger pile of configuration.

    One of the config options at this level is transport=..., which defaults to the global transport. hip.request's job is to lower each high-level request into a set of low-level requests on that transport.

    Another piece of configuration at this level is the cookie and cache databases. If you call hip.request directly, then by default the databases are null (reads return empty, writes are discarded).

  • Then there's a Session, which basically just holds a bunch of default kwargs to pass to request, and has its own request, get, etc. methods. Something like:

    class Session:
        async def request(self, ..., **config):
            for key, value in self.session_config.items():
                config.setdefault(key, value)
            return await hip.request(..., **config)

    So we end up with multiple levels of defaults: hip.request has one set of defaults, and then a Session can have its own defaults set that override those, and then if the user passes explicit options to Session.request then those override the Session defaults.

    So that's all very straightforward, but there's a subtlety I haven't mentioned. If you think of Session as being a bag of config overrides, then you'd expect session = hip.Session() to give you an object with no overrides, i.e. session_config would be set to {}, so hip.Session().request(...) should be exactly the same as hip.request(...). But! That's not what folks expect for cookies. So instead we want to do something like:

    class Session:
        def __init__(self, **session_config):
            if "cookiejar" not in session_config:
                session_config["cookiejar"] = MemoryCookieJar()
            self.session_config = session_config
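
Putting the pieces above together, here is a runnable toy version of the layering. Everything in it is a stand-in: `request` plays the role of hip.request but just returns the resolved config instead of doing I/O, the `timeout` option is made up for illustration, and the cookie jars are empty stubs:

```python
import asyncio


class NullCookieJar:
    """Stub for the 'null database': reads return empty, writes are discarded."""


class MemoryCookieJar:
    """Stub for an in-memory jar; a real one would store and scope cookies."""


# Level 1: defaults of the module-level request function (illustrative values).
REQUEST_DEFAULTS = {"cookiejar": NullCookieJar(), "timeout": 30.0}


async def request(method, url, **config):
    """Stand-in for hip.request: fills in defaults and returns the final config."""
    for key, value in REQUEST_DEFAULTS.items():
        config.setdefault(key, value)
    return config


class Session:
    def __init__(self, **session_config):
        # The one deliberate exception to "a Session is just a bag of overrides".
        session_config.setdefault("cookiejar", MemoryCookieJar())
        self.session_config = session_config

    async def request(self, method, url, **config):
        # Level 3 (explicit kwargs) beats level 2 (session) beats level 1.
        for key, value in self.session_config.items():
            config.setdefault(key, value)
        return await request(method, url, **config)
```

Calling `asyncio.run(Session(timeout=5.0).request("GET", url))` in this toy yields a config with the session's timeout and its own `MemoryCookieJar`, while a bare `request(...)` gets the `NullCookieJar`.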

Does that all make sense? I feel like it's one of those things that's conceptually pretty simple but trying to explain it makes it seem super complicated.

So let's consider a few different user scenarios:

  • User doesn't want to think: they just call hip.get and accept all the defaults. In this case I think we want to persist Alt-Svc/HSTS/session tickets in memory, but not on disk, and we don't want to do any persisting of cookies or caching. Rationale:

    • Alt-Svc/session tickets speed things up and are generally semantically transparent, so it's nice to have them persisted. It would be nice to persist them on disk, but as a library that's just not viable to do by default, so we settle for persisting them in memory.
    • HSTS improves security, if and only if you persist it. So again we want to persist it, but we can't persist it on disk by default, so we settle for persisting it in memory.
    • Accidentally leaking cookies between unrelated requests has major security implications, so we should only do that when the user has explicitly linked together two requests somehow.
    • Response caches are big and have complex policy questions, so there's no way to set a good default. We can't put stuff on disk, but we can't guess how much memory the user wants to spend on an in-memory cache either. So we have to disable it unless there's some explicit configuration.
  • User doesn't want to think too hard, but they do need to pass cookies between requests: We handle this by telling them to use hip.Session(), and otherwise accept all the defaults. The behavior is identical to that of hip.request, except now they get in-memory cookie persistence.

    Rationale: Users often need to persist cookies between requests (that's like, the whole point of cookies!), so we want to make this super easy to do. Using a Session object is a familiar way to do that, and it lets us link together related requests, so it solves the big problem that stopped us from persisting cookies inside hip.request. And cookies are generally tiny, so storing them in memory is fine; but we still can't persist them on disk without more information from the user.

  • User doesn't want to deal with on-disk persistence, but they do want to add some caching (because that's the one thing that we haven't already given them a way to do). I guess for this they make a hip.Session, and pass some kind of kwarg to configure an in-memory cache?

  • User is ready to cope with on-disk persistence, and wants the browser-like behavior where you say "here's a directory to stick stuff in, you get to use up to N megabytes" and then the software automatically does something intelligent with that space. This is really complicated, because for this you really have to consider all these different kinds of persistence together! They're linked by wanting to use a single setting to configure all of them, and your cache eviction policy has to somehow enforce a common limit for all of them. (Though I guess a simple approach would be to hard-code some ratios, like "we'll reserve 1% for session tickets, 5% for cookies, ..."?) This might be complex enough that we don't even try for now... though it would be nice if we could leave the door open to supporting it in the future?

  • User is a finicky power user, and wants to individually control everything. I guess any API will work for this, because if we provide a bunch of separate knobs for different features then power users can set those, and if we just provide a single big "use this session store" knob, then power users can define a custom session store that does whatever they want.
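
For the "here's a directory, use up to N megabytes" scenario, the hard-coded-ratios idea might look something like this. The 1% and 5% figures come from the example above; the remaining percentages and all the names are invented purely for illustration:

```python
# Hypothetical split of a single on-disk budget between kinds of state.
DEFAULT_PERCENTAGES = {
    "session_tickets": 1,
    "cookies": 5,
    "hsts": 1,
    "alt_svc": 1,
    "response_cache": 92,
}


def split_budget(total_bytes, percentages=DEFAULT_PERCENTAGES):
    """Divide one byte budget into per-store budgets using fixed percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    budgets = {
        name: total_bytes * pct // 100 for name, pct in percentages.items()
    }
    # Integer division can leave a few bytes over; give them to the biggest store.
    largest = max(percentages, key=percentages.get)
    budgets[largest] += total_bytes - sum(budgets.values())
    return budgets
```

This only covers the static split; a shared eviction policy that lets one store borrow unused space from another is the genuinely hard part and isn't attempted here.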

Hmm. I feel like I understand the problem better, but not like I'm necessarily any closer to answering the original question about what the API should look like :-).

@sethmlarson
Contributor

100% agree that AltSvc, TLSTickets, and HSTS can be handled at the connection manager level. We do sacrifice a bit on keeping HSTS only in memory but not much, and websites that really care about their HSTS will end up on the preload list via hstspreload package anyways. :)

I'm not entirely convinced about having hip.Session.request() call hip.request(); to me it feels like it should be the other way around, i.e.:

def request(method, url, <request options>, <session_options>):
    session = Session(<session options>)
    return session.request(method, url, <request options>)

Maybe we do want to drive home the point that hip.request() is totally legit to use, not just for toying around like previous HTTP clients, as long as you don't need persistence? I don't know if users will ever know?

The session object seems like a good place to hold onto the transport and the cookies/cookiejar and cache stores. If we've narrowed it down to only these two databases at the Session level maybe we have a parameter for each instead of a unified interface :)

A thought on in-memory cache for the "User doesn't want disk but does want caching" use-case: Because it's so easy to configure in-memory (maybe via Session(cache=":memory:")) we can have a policy on the in-memory cache that it doesn't store responses with bodies. That way the storage can be optimized and users will benefit from cached redirects at the least.
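
A sketch of that "no bodies in the in-memory cache" policy, assuming a simple bounded LRU. `BodylessMemoryCache` and its method names are hypothetical, and freshness/expiry checks are left out entirely:

```python
from collections import OrderedDict


class BodylessMemoryCache:
    """Hypothetical in-memory cache that only keeps responses without a body."""

    def __init__(self, max_entries=1024):
        self._entries = OrderedDict()  # (method, url) -> (status, headers)
        self._max_entries = max_entries

    def store(self, method, url, status, headers, body):
        """Store the response if it has no body; return whether it was stored."""
        if body:
            return False  # policy: responses with bodies are never kept
        key = (method.upper(), url)
        self._entries[key] = (status, dict(headers))
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        return True

    def lookup(self, method, url):
        """Return (status, headers) for a cached response, or None."""
        return self._entries.get((method.upper(), url))
```

Since redirects usually carry everything useful in the Location header, this keeps them cacheable while the memory cost per entry stays small.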
