
Add API for Request, SyncRequestData, and AsyncRequestData #151

Closed

Conversation

sethmlarson
Contributor

I started off the API design by removing everything under src/ and replacing it with only *.pyi files. We'll see how far I can push the API design while only using .pyi; if I need to switch to real .py files eventually, that'll be fine, I think.

This is the first commit: it adds the Request model along with the SyncRequestData and AsyncRequestData wrappers, an API that wraps all the types of data we accept for a request (bytes, str, files, file-like objects, iterators) and that will be passed down through the HTTP lifecycle and eventually to the low-level backend stream objects.

I'm noting my thoughts within the docstrings. If you have additional comments, I can add them inline as well.
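To make the shape concrete, here's a rough sketch of what such stubs could look like (hypothetical, not the actual .pyi files from the commits, though the read(nbytes) signature appears in the diff below):

import typing

class Request:
    # Sketch of the request model described above.
    method: str
    url: str
    headers: typing.Mapping[str, str]

class SyncRequestData:
    # Wraps bytes, str, files, file-like objects, and iterators.
    content_length: typing.Optional[int]
    def read(self, nbytes: int) -> bytes: ...

class AsyncRequestData:
    content_length: typing.Optional[int]
    async def read(self, nbytes: int) -> bytes: ...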

@pquentin
Member

@sethmlarson Thank you for this! It's detailed enough for us to think about actual usage, but also small enough for us to give 30% feedback. And the way you prepared the api-design branch makes the PR really easy to read. So, again, thank you!

I initially assumed that this was the API we were going to expose to users, but in fact it's our internal representation of requests after some preprocessing, right?


In #124 @njsmith mentions "low level" and "high level" APIs. I think we should explicitly state whether we're going to have those two APIs or not. I think a single high-level API is enough?

I'm asking this because low-level urllib3 uses body while high-level requests uses data as you're doing here. I think it's best to continue using the requests terminology now that we target the high-level features of requests, so 👍 on the name here.


Regarding request data, we looked in #135 at everything requests supports. I think the end result is either bytes, a file-like object that can be rewound, or just an iterator, and this is exactly what you're supporting in SyncRequestData, so that looks good!

For the async API, I don't know what would be best and will let @njsmith comment. Maybe the situation is similar to that of response data that we explored in #123?


Oh, and by the way: since we've removed all the code in src/urllib3, I think we can remove the Travis and AppVeyor configuration files, as they're currently distracting. (But probably not in this PR.)

@sethmlarson
Contributor Author

There'll probably be quite a few merge conflicts early on since I'm doing these concurrently, so make sure these are squash-merged :)

Review comment on this diff context:

Ref: https://github.com/python-trio/urllib3/issues/135
"""

def read(self, nbytes: int) -> bytes:
sethmlarson (Contributor, Author)

Regarding there only being one interface: I think we should have something like SyncRequestData.sync_read() and AsyncRequestData.async_read(), or maybe we switch to using async for chunk in request.data, which would automatically get transformed by unasync but would forfeit the ability to grab a certain amount of data. I guess that interface is easier to implement anyway?

The reason I think the split would be good is that you'd only have to subclass AsyncRequestData and could technically use that class in both async and synchronous requests. We'd just need good error reporting if input data that is, for example, an async generator gets used in a sync session. I'm guessing that wouldn't happen often, and an error message would fix any ambiguity.
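A sketch of what that error reporting could look like (hypothetical helper name, not from the PR):

import inspect

def _wrap_sync_data(data):
    # Guard for the sync path: reject async-only input with a clear
    # message instead of failing somewhere deep in the HTTP lifecycle.
    if inspect.isasyncgen(data) or hasattr(data, "__aiter__"):
        raise TypeError(
            "Received async-only request data in a synchronous session; "
            "pass bytes, a file-like object, or a sync iterable instead."
        )
    ...  # wrap bytes/str/files/iterators as usual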

sethmlarson (Contributor, Author)

Here's an example of my plan in action:

import typing

StrOrInt = typing.Union[str, int]
FormType = typing.Union[
    typing.Sequence[typing.Tuple[str, StrOrInt]],
    typing.Mapping[str, typing.Union[StrOrInt, typing.Sequence[StrOrInt]]],
]


def _int_to_urlenc() -> typing.Dict[int, bytes]:
    values = {}
    special = {0x2A, 0x2D, 0x2E, 0x5F}  # b"*-._"
    for byte in range(256):
        if (
            (0x61 <= byte <= 0x7A)  # 'a'-'z'
            or (0x41 <= byte <= 0x5A)  # 'A'-'Z'
            or (0x30 <= byte <= 0x39)  # '0'-'9'
            or (byte in special)
        ):  # Keep unreserved ASCII as-is
            values[byte] = bytes((byte,))
        elif byte == 0x20:  # Space -> '+'
            values[byte] = b"+"
        else:  # Percent-encoded
            values[byte] = f"%{byte:02X}".encode()  # zero-padded: 0x0A -> b"%0A"
    return values


INT_TO_URLENC = _int_to_urlenc()


class URLEncoded(AsyncRequestData):  # AsyncRequestData: the wrapper interface proposed in this PR
    """Implements x-www-form-urlencoded as a RequestData object."""
    def __init__(self, form: FormType):
        self._form = form
        self._data: typing.Optional[bytes] = None

    @property
    def content_type(self) -> str:
        return "application/x-www-form-urlencoded"

    @property
    def content_length(self) -> int:
        return len(self._encode_form())

    def is_rewindable(self) -> bool:
        return True

    def __iter__(self) -> typing.Iterator[bytes]:
        yield self._encode_form()

    async def __aiter__(self) -> typing.AsyncIterator[bytes]:
        yield self._encode_form()

    def _encode_form(self) -> bytes:
        if self._data is None:

            def serialize(x: str) -> bytes:
                return b"".join([INT_TO_URLENC[byte] for byte in x.encode("utf-8")])

            output: typing.List[bytes] = []
            for k, vs in (
                self._form.items() if hasattr(self._form, "items") else self._form
            ):
                if isinstance(vs, str) or not hasattr(vs, "__iter__"):
                    vs = (vs,)
                for v in vs:
                    # str(v) handles the int values permitted by FormType
                    output.append(serialize(k) + b"=" + serialize(str(v)))

            self._data = b"&".join(output)

        return self._data

Suddenly this one class can be used for both synchronous and asynchronous requests. We keep is_rewindable (or maybe rename it to is_replayable, or something else), and then we'd signal a "rewind" by calling __aiter__ or __iter__ again.
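For example, a retry could just iterate the same object again (hypothetical usage sketch):

form = URLEncoded({"q": "urllib3", "page": 2})

# First attempt: stream the encoded body.
body = b"".join(form)

# On a retry or redirect, "rewinding" is just iterating again;
# URLEncoded caches the encoded bytes, so the re-read is cheap.
assert b"".join(form) == body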

Review comment on this diff context (docstring describing when request data is rewound):

in the case of a timeout/socket error on an idempotent request, a redirect,
etc. so that the new request can be sent. This works for file-like objects
and bytes / strings (where it's a no-op).
"""
sethmlarson (Contributor, Author)

Should also add a content_type property so that can be passed to the request as a default if none is set.

@njsmith
Member

njsmith commented Dec 1, 2019

What do you think about using something like this as our internal representation for request bodies?

class AsyncRequestData:
    content_length: Optional[int]
    async def data_chunks(self) -> AsyncIterable[bytes]:
        ...

So instead of having a stateful object that we rewind, it acts as a factory for iterators.

(I'm focusing on the async version because I'm assuming the sync version will be identical except for being unasync'ed.)

Random thoughts:

  • Checking the length of a file might be an async operation, but accessing content_length here is synchronous. So this means that when you're wrapping a file into one of these things, you'd have to check the length up front and stash it in the object. That seems fine to me.

  • For objects like files that actually use a stateful rewind-style API underneath, there's a hidden constraint: if you call data_chunks several times, then you can get several "independent" iterator objects, but in fact they're not independent; you have to make sure that you stop using the old iterator once the new one is created. In practice we'd never want to do this anyway, and I guess if we want to be extra-careful we could use some logic to invalidate old iterators when a new one is created.

  • If the underlying data isn't rewindable, I guess in this approach the way we'd handle that is that iter_data would raise an exception if you tried to call it a second time. Is that a good solution? Or are there cases where we really need to check can_rewind ahead of time, because we'll do something different depending on whether we anticipate rewinding being possible?

  • Do we need to have a close method on this thing? If you pass a file object to requests, does it close it when it's done?
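A sketch of the file case under those constraints (hypothetical class; the blocking reads stand in for whatever the real backend's I/O layer would do):

import os
import typing

class FileRequestData:
    def __init__(self, fileobj: typing.BinaryIO):
        self._fileobj = fileobj
        # Check the length up front and stash it so that
        # content_length itself stays synchronous.
        self.content_length: typing.Optional[int] = os.fstat(
            fileobj.fileno()
        ).st_size

    async def data_chunks(self) -> typing.AsyncIterable[bytes]:
        # Stateful rewind-style API underneath: each call seeks back to
        # the start, so any previously created iterator must not be
        # used after this point.
        self._fileobj.seek(0)
        while True:
            chunk = self._fileobj.read(65536)
            if not chunk:
                return
            yield chunk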

@sethmlarson
Contributor Author

What do you think about using something like this as our internal representation for request bodies?
So instead of having a stateful object that we rewind, it acts as a factory for iterators.

I'd rather go with this approach; the only issue I see is how an "unsynced" Session could use a single Async/SyncRequestData. I was proposing the interface with __iter__ and __aiter__ because we'd be able to rely on unasync transforming AsyncRequestData.__aiter__() into SyncRequestData.__iter__(). It would be beneficial not to have two implementations of each custom body type, but I guess we could live with it?

* Checking the length of a file might be an async operation, but accessing `content_length` here is synchronous. So this means that when you're wrapping a file into one of these things, you'd have to check the length up front and stash it in the object. That seems fine to me.

You're on the same track here as me.

* For objects like files that actually use a stateful rewind-style API underneath, there's a hidden constraint: if you call `data_chunks` several times, then you can get several "independent" iterator objects, but in fact they're not independent; you have to make sure that you stop using the old iterator once the new one is created. In practice we'd never want to do this anyway, and I guess if we want to be extra-careful we could use some logic to invalidate old iterators when a new one is created.

Definitely true.

* If the underlying data _isn't_ rewindable, I guess in this approach the way we'd handle that is that `iter_data` would raise an exception if you tried to call it a second time. Is that a good solution? Or are there cases where we really need to check `can_rewind` ahead of time, because we'll do something different depending on whether we anticipate rewinding being possible?

That's true; there's not much we can do here if the underlying data isn't rewindable. We can drop that section of the API.

* Do we need to have a `close` method on this thing? If you pass a file object to requests, does it close it when it's done?

>>> import requests
>>> f = open("setup.py", "rb")
>>> requests.post("https://httpbin.org/anything", data=f)
<Response [200]>
>>> f.closed
False

Looks like no, maybe we don't need a close method either!

@njsmith
Member

njsmith commented Dec 1, 2019

I'd rather go with this approach; the only issue I see is how an "unsynced" Session could use a single Async/SyncRequestData. I was proposing the interface with __iter__ and __aiter__ because we'd be able to rely on unasync transforming AsyncRequestData.__aiter__() into SyncRequestData.__iter__(). It would be beneficial not to have two implementations of each custom body type, but I guess we could live with it?

Hmm, I was imagining we'd have something like:

class RequestData:
    async def data_chunks(self):
        if self._consumed:
            raise ...
        elif self._bytes is not None:
            yield self._bytes
        elif self._sync_iterable is not None:
            self._consumed = True
            for chunk in self._sync_iterable:
                yield chunk
        elif self._async_iterable is not None:
            self._consumed = True
            async for chunk in self._async_iterable:
                yield chunk
        elif self._sync_file is not None:
            ...  # you get the idea

This would live at hip._async.RequestData and be used by hip._async.Session. And then unasync would take care of creating hip._sync.RequestData, and also creating hip._sync.Session that uses it. And I think everything would just work? Maybe I'm not understanding the question :-)
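For illustration, the generated sync counterpart would look roughly like this (a sketch assuming unasync's usual token replacements, e.g. "async def" -> "def" and "async for" -> "for"):

# hip/_sync counterpart, generated by unasync (sketch only)
class RequestData:
    def data_chunks(self):
        if self._consumed:
            raise ...
        elif self._bytes is not None:
            yield self._bytes
        elif self._sync_iterable is not None:
            self._consumed = True
            for chunk in self._sync_iterable:
                yield chunk
        # ...and so on for the remaining cases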

@sethmlarson
Contributor Author

The AsyncRequestData and SyncRequestData interfaces aren't the issue; they can be unasync'ed and will work fine. I'm more worried about having an interface that can be subclassed into things like URLEncodedForm, MultipartFormData, etc., and only having to be subclassed once instead of under _async/. But I guess in the end it's the same amount of code to maintain thanks to unasync, so it shouldn't be a problem?

@njsmith
Member

njsmith commented Dec 1, 2019

Well, I'm one of those weirdos who refuses to use subclassing unless there's absolutely no other option :-). But assuming you mean that this might be an abstract interface that has multiple implementations, then yeah, unasync should just take care of that, I think?

@sethmlarson
Contributor Author

Closed in favor of #185.

@sethmlarson sethmlarson closed this Dec 2, 2019
@sethmlarson sethmlarson deleted the request-api branch December 4, 2019 18:16