
json parameter now accepts a bytes instance for pre-serialized json. #1730

Closed
wants to merge 1 commit

Conversation

@aviramha commented Jul 1, 2021

This is a take on #1352 #717 . I think it offers a pretty good way to provide easy functionality, without global hacks.

@tomchristie (Member)

Appealing in some ways, but also a bit awkwardly nuanced, given that Python code is often a little careless about when bytes vs. str is used.

For example, these two calls look pretty much identical on the surface...

import httpx, json

httpx.post(url, json=json.dumps(data))

import httpx, orjson

httpx.post(url, json=orjson.dumps(data))

But they would actually result in completely different behaviour. (One happens to produce a str, the other bytes.)
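The str/bytes split is easy to verify with a stdlib-only sketch (the bytes case here merely stands in for what a bytes-producing encoder such as orjson.dumps returns):

```python
import json

data = {"id": 1, "name": "example"}

as_str = json.dumps(data)             # stdlib encoder: returns str
as_bytes = json.dumps(data).encode()  # stand-in for orjson.dumps(data), which returns bytes

print(type(as_str))    # <class 'str'>
print(type(as_bytes))  # <class 'bytes'>
```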

@aviramha (Author) commented Jul 2, 2021

It's a good point.
How about creating a new class that inherits from bytes for marking this use case?

class SerializedJSON(bytes):
    pass

httpx.post(url, json=SerializedJSON(json.dumps("asdsad").encode()))

This is a real blocker for us so I'm a bit pushy :p
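A minimal sketch of how such a marker type could be dispatched on (hypothetical: `encode_json_argument` is not httpx API, just an illustration of the branch the request encoder would need):

```python
import json
from typing import Any

class SerializedJSON(bytes):
    """Marker subclass: the payload is already JSON-encoded."""

def encode_json_argument(value: Any) -> bytes:
    # Hypothetical helper: pass pre-serialized payloads through untouched,
    # otherwise fall back to the stdlib encoder.
    if isinstance(value, SerializedJSON):
        return bytes(value)
    return json.dumps(value).encode("utf-8")

body = encode_json_argument(SerializedJSON(b'{"a": 1}'))
```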

@tomchristie (Member)

This is a real blocker for us

Okay, noted.

So... this PR only addresses the request encoding, but wouldn't alter response.json().
Do you need both?

Also, do you have any kind of metrics that have informed this use case? E.g. how large are the JSON documents that you're posting? Do you have a good demo example showing that the serialization time is significant vs. the actual network time it takes to upload?

Typically the network would be the limiting factor anyway, but if we've got good evidence that that's not the case for some reasonable use cases, then that'd be a valid reason to prioritise #1352 / #717.

@aviramha (Author) commented Jul 2, 2021

We have dictionaries sized 100-500 kilobytes. Those take about 10x the time using the stdlib json module, and converting the complex object to a dictionary itself takes a huge amount of time, compared with a serializer that does it on the fly. I think any orjson dataclass benchmark is a good example of the performance gain we get from using it.

@tomchristie (Member) commented Jul 2, 2021

I think any benchmark of dataclass in orjson is a good example for our performance gain from using it.

Very possibly. Although it is relevant to know if the JSON serialisation speed is significant vs. the actual network sending itself. Typically the network speed is the actual limiting factor, and improvements in serialisation speed might look relevant if measured in isolation, but might not necessarily be relevant in the bigger picture.

I'm not asserting that is or isn't the case here, but I haven't seen any definitive evidence either one way or the other.

A good test example might be to measure a large JSON request being posted to the httpbin service, and seeing what impact orjson has there, because that'd be a more meaningful metric?...

(I'm absolutely open to being persuaded here, I'm just pushing back until we've really made the case for it.)

@aviramha (Author) commented Jul 2, 2021

Regarding response.json, this can easily be bypassed by using response.content, unlike the request side, where it creates a very annoying pattern.
My last comment was from my phone, so I didn't give a good enough example.
Take the following:

import json
import orjson
from dataclasses import dataclass, asdict

@dataclass
class A:
    a: str

@dataclass
class B:
    a: A
    b: A
    c: A

b = B(a=A(a="Asd"), b=A(a="asd"), c=A(a="lasd"))

%timeit orjson.dumps(asdict(b))
11.8 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit json.dumps(asdict(b))
15.6 µs ± 154 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit orjson.dumps(b)
479 ns ± 2.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This example shows that it's not about which library serializes a dict faster (although orjson does that better too), but the fact that serialization happens in one operation, rather than conversion followed by serialization.
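On the response side, the bypass mentioned above is a one-liner, since response.content is the raw bytes of the body (stdlib sketch; orjson.loads would be a drop-in replacement, as it accepts bytes directly):

```python
import json

# What httpx exposes via response.content: the raw body bytes.
content = b'{"id": 7, "ok": true}'

data = json.loads(content)      # roughly what response.json() does
# data = orjson.loads(content)  # drop-in third-party alternative

print(data)  # {'id': 7, 'ok': True}
```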

@aviramha (Author) commented Jul 2, 2021

I actually disagree that serialization shouldn't be a factor on its own even if it's insignificant relative to the network limit. If you can cut 30 ms from either side, why wouldn't it matter? Especially when the 30 ms (as an example, of course) is CPU bound rather than IO bound, so you actually save "CPU money"...
In our scenario (microservice communication) the network time is 1-5 ms, but serialization of complex objects requires a function similar to FastAPI's jsonable_encoder, which can take 500 ms on its own. We could probably write a better jsonable_encoder ourselves, but orjson does the conversion and serialization all in one pass, avoiding a lot of round trips between C and Python.

@aviramha (Author) commented Jul 2, 2021

Here's a benchmark of a local httpbin:

from typing import List

import dataclasses

import httpx  # missing from the original snippet
import orjson


@dataclasses.dataclass
class Member:
    id: int
    active: bool


@dataclasses.dataclass
class Object:
    id: int
    name: str
    members: List[Member]


objects_as_dataclass = [
    Object(i, str(i) * 3, [Member(j, True) for j in range(0, 10)])
    for i in range(100000, 102000)
]

%timeit httpx.post("http://localhost:9001/post", content=orjson.dumps(objects_as_dataclass))
247 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit httpx.post("http://localhost:9001/post", json=[dataclasses.asdict(obj) for obj in objects_as_dataclass])
371 ms ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@aviramha (Author) commented Jul 2, 2021

(I'm absolutely open to being persuaded here, I'm just pushing back until we've really made the case for it.)

Just to be clear, I'm entirely up for your scrutiny. You (and the whole Encode team) are doing an amazing job keeping a very high standard across your set of libraries, and that cannot happen without being strict and requiring proven real-life examples rather than niche cases that complicate the codebase. I wish more open source groups were as professional.

@tomchristie (Member) commented Jul 2, 2021

Fantastic, thanks!

Can we also see the following?...

%timeit httpx.post("http://localhost:9001/post", json=[orjson.dumps(dataclasses.asdict(obj)) for obj in objects_as_dataclass])

It's interesting to me if the upshot of this is that the primary motivation here is "orjson can directly serialize dataclass instances, which is fast". (Which is a somewhat different motivation to "orjson is faster than json")

@tomchristie (Member)

I think the sensible outcome of this is likely to be that Client(...) accepts a jsonlib=... argument, and have the Request(...) and Response(...) classes also do the same.

I'm not wild about the global approach in #1352, so if we've got a proven-enough use case here, then let's do this properly, right?

@aviramha (Author) commented Jul 2, 2021

The example you wrote doesn't make sense, so I assume you meant something like this:

%timeit httpx.post("http://localhost:9001/post", content=orjson.dumps([dataclasses.asdict(obj) for obj in objects_as_dataclass]))
346 ms ± 9.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think a jsonlib argument is not sufficient, as different JSON libraries may have different options (for example, we use orjson's numpy options, which can be passed to dumps).
What about accepting json_dumps and json_loads? Maybe a NamedTuple of both functions?
Let me know what you prefer and I'll alter the PR to match.

@aviramha (Author) commented Jul 2, 2021

imo the signature should be something like

def json_encode(json: Any) -> bytes: ...

def json_decode(data: Union[str, bytes]) -> Any: ...
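A sketch of what those two hooks could look like when paired up, with stdlib defaults (the `JSONCodec` name and the pairing are assumptions for illustration, not an agreed API):

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Union

@dataclass
class JSONCodec:
    # Hypothetical container pairing the two proposed hooks.
    encode: Callable[[Any], bytes]
    decode: Callable[[Union[str, bytes]], Any]

def default_encode(obj: Any) -> bytes:
    return json.dumps(obj).encode("utf-8")

stdlib_codec = JSONCodec(encode=default_encode, decode=json.loads)
# An orjson-backed codec would then simply be:
# orjson_codec = JSONCodec(encode=orjson.dumps, decode=orjson.loads)

roundtrip = stdlib_codec.decode(stdlib_codec.encode({"x": 1}))
```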

@tomchristie (Member)

I think a jsonlib argument is not sufficient, as different json libraries might have different options

The docs on https://github.com/encode/httpx/pull/1352/files#diff-0f0c40de44a0841ed5180f16d66212d1cece2a7d4400789007c2f73bf8432622 detail how you'd be able to handle that.

(Tho #1352 proposes a global approach)

@aviramha (Author) commented Jul 5, 2021

I'll be a bit picky and say that it'll force double encoding... but OK 😅

@tomchristie (Member)

So, all of the proposed options still leave me with an awkward, niggly uncomfortableness.

I've also got the seed of some thoughts in #1740, which again is pushing me away from wanting to add extra API in this area.

Right now, the way I'd work with your use case is probably something like this...

class CustomClient(httpx.Client):
    def request(self, *args, **kwargs):
        if 'json' in kwargs:
            kwargs['content'] = orjson.dumps(kwargs.pop('json'))
            kwargs.setdefault('headers', {})
            kwargs['headers']['Content-Type'] = 'application/json'
        return super().request(*args, **kwargs)

It's also interesting, having talked through your requirements here, that the primary benefit orjson gives you is not the faster JSON rendering it provides (which is noticeable, but not the biggest factor), but the fact that it renders JSON directly from dataclass instances. (Which is ~kinda different.)

@bentheiii

I think the sensible outcome of this is likely to be that Client(...) accepts a jsonlib=... argument, and have the Request(...) and Response(...) classes also do the same.

This seems like a good solution, though I think it would be better if the jsonlib argument were split into two arguments: json_serializer and json_deserializer. This is how sqlalchemy, pydantic, and flask handle custom serialization.

@stale stale bot commented Feb 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 19, 2022
@stale stale bot closed this Feb 27, 2022
@tscheburaschka

I know this might be outdated, but I thought I'd add one more use case here.
When switching from requests to httpx for sending time-series data, I realized that the JSON encoder happily allows NaN and infinity as float values (basically by not altering the default value of the allow_nan parameter of stdlib json.dump), in contrast to requests' standard behaviour.
So an easy way to control the JSON encoding options (even without changing the library altogether) would be a convenient feature here. Until now I have had to resort to nasty patching or heavyweight subclassing that adds more stack layers.
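This is reproducible with the stdlib alone: json.dumps emits the non-standard NaN token by default, and raises ValueError only when allow_nan=False is passed (a minimal demonstration):

```python
import json

payload = {"value": float("nan")}

# Default: emits the non-standard NaN token, which is not valid JSON.
print(json.dumps(payload))  # {"value": NaN}

# Opting into strict behaviour raises instead of producing invalid output.
try:
    json.dumps(payload, allow_nan=False)
except ValueError:
    print("rejected out-of-range float")
```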

@Kludex Kludex reopened this Dec 29, 2022
@Kludex Kludex added wontfix and removed wontfix labels Dec 29, 2022
@Kludex Kludex closed this Dec 29, 2022
@zanieb zanieb mentioned this pull request Aug 1, 2023