JSON extra uses orjson instead of ujson #599

ijl · 2019-06-18T12:28:31Z

This is the minimum change to replace ujson. There are
additional calls to json.loads that may be modified.

The extra is renamed 'fast-json' to match 'email' and avoid further
changes.

ijl · 2019-06-18T12:30:55Z

setup.py

@@ -98,7 +98,7 @@ def extra(self):
        'dataclasses>=0.6;python_version<"3.7"'
    ],
    extras_require={
-        'ujson': ['ujson>=1.35'],
+        'json': ['orjson>=2.0.6,<3;platform_python_implementation=="CPython"'],


I haven't tested this on PyPy, but the intent is to allow users to unconditionally install pydantic[json] without an issue.

I haven't tested this on PyPy, but the intent is to allow users to unconditionally install pydantic[json] without an issue.

I think this might be confusing -- it might make someone think that for any json support they need to install pydantic[json]. I would vote to have it as 'orjson': ['orjson>=2.0.6,<3;platform_python_implementation=="CPython"']

Ok. The intent was the target won't change if the implementation does. I've changed it to orjson.

codecov · 2019-06-18T12:32:17Z

Codecov Report

Merging #599 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##           master   #599   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          15     15           
  Lines        2628   2628           
  Branches      516    516           
=====================================
  Hits         2628   2628

samuelcolvin · 2019-06-18T14:18:32Z

Looks good in principal,

Thoughts/questions:

should we make orjson a hard requirement rather than an optional requirement? Are there any environments where it won't work?
could we use orjson for model's .json() method? perhaps with a pretty kwarg which defaults to true and falls back to using standard json?
same for schema_json?

I'm guessing the date & datetime output format might change which could break things for lots of people. Is the performance improvement worth that?

ijl · 2019-06-18T16:35:56Z

orjson can be used to serialize as well (if not pretty-printing) but it requires outputting bytes instead of str and removing **dumps_kwargs (as incompatible between the two). I think these are both generally good changes whether orjson is used or not (bytes as it's a serialized blob and no pass-through kwargs to no leak json's details), but it's obviously breaking.

It looks like there's actually no difference in datetime output. datetime's isoformat output seems the same as RFC 3339 and pydantic's tests pass either way.

In terms of installing by default, the requirement in setup.py should be:

orjson>=2.0.6,<3;platform_python_implementation=="CPython" and platform_machine=="x86_64" and python_version<"3.8"

PyPy, ARM, and 3.8 could fall back to json. I think that would work fine. There are PyPI wheels for Linux, macOS, and Windows for 3.5, 3.6, and 3.7.

samuelcolvin · 2019-06-18T16:43:49Z

All makes sense to me.

Given that orjson doesn't work with some platforms, let's leave it as an optional dependency for now.

We should remove **dumps_kwargs, and instead have pretty=False (sorry for the mistake in my reply above). If pretty=True we should use json.dumps(..., indent=2).

I'm less sure about returning bytes vs. str - I understand there would be a performance improvement to using bytes, specially as in most contexts bytes will be required in the end.

But this would be a breaking change and very unnatural for lots of people who are used to json.dumps. What do other people think? @tiangolo @dmontagu

dmontagu · 2019-06-18T19:33:38Z

I mostly agree with @samuelcolvin, with the following caveats:

I think it might be nice to continue supporting fallback to json.dumps if other dumps_kwargs are used (since that allows things like custom encoders where necessary). But I think it's not a big deal either way.
Whether returning bytes is supported or not, I don't think it should default to returning bytes -- that will be a source of pain for newcomers to the library and anyone with a more limited understanding of the difference in python (i.e., most python-as-a-second-language users). I think it could make sense to have a bytes argument to the .json function which defaults to False though.

ijl · 2019-06-19T08:26:00Z

Ok, well is there anything further that should be done as part of this PR?

samuelcolvin · 2019-06-19T10:11:50Z

Ok, well is there anything further that should be done as part of this PR?

Yes.

On the name of the optional dependency let's use "fast-json" that way:

it clearly demonstrates what it's for
it's not tied to a particular implementation

Let's change model.json() and model.schema_json() to use orjson if possible, as suggested above let's:

stick to returning str
remove dumps_kwargs and instead add pretty=False, if pretty=True we'd use json.dumps(..., indent=2)

tiangolo · 2019-06-20T18:36:22Z

Nice! I think it would be interesting to also support ORJSON.

Agree with you both. With the two concerns/caveats from @dmontagu .

ijl · 2019-06-28T12:09:35Z

I've renamed the extra to fast-json and rebased. I think the discussion has become about expanding the use of faster JSON library throughout the project, which is much larger. I only meant for this pull request to replace the existing usage of ujson with orjson

lgtm-com · 2019-06-28T12:15:28Z

This pull request introduces 1 alert when merging 156b77c into 3cdbbae - view on LGTM.com

new alerts:

1 for Module-level cyclic import

This is the minimum change to replace ujson. There are additional calls to json.loads that may be modified. The extra is renamed 'fast-json' to match 'email' and avoid further changes.

samuelcolvin · 2019-07-02T08:56:34Z

hi, I think it would be best that we fix everything at the same time on the same PR, would you be willing to implement the features requested above?

samuelcolvin · 2019-07-02T09:39:05Z

After further consideration, I have some other concerns about using orjson in pydantic:

The author of orjson, @ijl does not provide his name or any personal details on github, nor does he/she have a massive amount of github activity although the account has been active for years
@ijl privately emailed me asking me to integrate orjson into pydantic, I was surprised and somewhat worried by the hostile response I got when I made his/her request public
orjson is written in rust and distributed as a compiled binary meaning there's no way be sure what the package contains, or what is executed when you import/call it. In theory users could download the source, read the source, compile it themselves with the exact same setup then check the checksum - but I doubt anyone does that.

Individually all these things are fine and understandable but together they concern me.

Let me make it clear: I'm not accusing @ijl of anything, I'm 99% certain that his/her intentions are honourable.

However given all the pointers above, I can't help articles like this coming to mind.

I understand that (mostly through fastAPI) pydantic is used by some fairly big organisations, including Microsoft, making it a target for hackers. I don't want to be (inadvertently) responsible for a security breach, or do something which increases that possibility unduely.

I therefore won't be accepting this PR. Sorry.

ijl · 2019-07-02T10:08:03Z

I'm at a loss. That's an extravagant claim I would never have expected.

pablogamboa · 2019-07-31T08:30:46Z

I think this decision needs to be reviewed. ujson's last release was 3.5 years ago. The fastest and better maintained python library is orjson right now. I think it would be quite beneficial to the community if there was a blessed way to use it on pydantic & FastAPI @samuelcolvin @tiangolo

samuelcolvin · 2019-07-31T08:39:02Z

I'd still be prepared to migrate to orjson but somehow my concerns above would need to be resolved.

@pablogamboa, other than a long delay since last release do you have any specific problems with ujson?

pablogamboa · 2019-07-31T17:21:45Z

@samuelcolvin the biggest problem for me with ujson is that you cannot extend it and it has a so-or-so support for floats (you can see they have a few issues open around it).

orjson with its default hook solves the first issue. And then it has a very good support for floats and numbers. It has its own share of quirks, like the way they serialize utf-8 to bytes rather than escaped ASCII like other libraries, but I can live with them. The speed bump is definitely noticeable when you deal with large payloads!

I think that there's a very interesting synergy between the Rust and Python communities and that orjson is among the first packages that will dominate Python because of Rust's ridiculous performance.

Yes ilj is relatively unknown in the python community, but he's really onto something with orjson.

samuelcolvin · 2019-07-31T17:32:56Z

I 100% agree that orjson looks great, that's why I created #589. I'm also entirely with you on the synergy between rust and python; there was a great talk on this at euro-python this year from a core python maintainer.

If I was starting pydantic from scratch today I'd definitely use orjson.

However there's a higher threshold of confidence required to migrate to a package when it's already being downloaded 200k/mo. Many users don't have time to go through every dependency change in every package update, so they have to outsource that part of security to package maintainers and trust them to make sensible decisions.

I simply can't switch to orjson without some more openness from the maintainer or a (kind of hostile) fork which I'm not prepared to do.

In summary nothing has changed since my comment above, so neither has my decision.

pablogamboa · 2019-07-31T17:45:28Z

Yep understood, I didn't ask you to switch tho! I asked for a blessed way to use it, i.e. make it somehow configurable to the user what json library to use. That'd be an amazing feature. For example Flask's JSONEncoder is great for an easy way to use a custom encoder in the framework (and my biggest grip with ujson!)

samuelcolvin · 2019-07-31T17:47:35Z

makes sense. That sounds like it might be possible. Perhaps using orjson if a specific environment variable is set?

pablogamboa · 2019-07-31T18:01:53Z

An env variable would definitely work, another approach could be an api that allows you to override dumps/loads somehow. This would be a deadly addition! thanks for listening @samuelcolvin

samuelcolvin · 2019-07-31T18:34:57Z

makes more sense, perhaps we could have dump_json and load_json attributes of Config which are then used in parse_json(), json() and schema_json() (and anywhere else we use encode/decode json that I've forgotten about).

ijl mentioned this pull request Jun 18, 2019

switch from ujson to orjson #589

Closed

ijl commented Jun 18, 2019

View reviewed changes

JSON extra uses orjson instead of ujson

17dbb3d

This is the minimum change to replace ujson. There are additional calls to json.loads that may be modified. The extra is renamed 'fast-json' to match 'email' and avoid further changes.

samuelcolvin closed this Jul 2, 2019

ijl deleted the orjson-loads branch July 2, 2019 10:00

samuelcolvin mentioned this pull request Aug 5, 2019

Allow custom json encoder / decoder #714

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON extra uses orjson instead of ujson #599

JSON extra uses orjson instead of ujson #599

ijl commented Jun 18, 2019 •

edited

ijl Jun 18, 2019

dmontagu Jun 18, 2019

ijl Jun 19, 2019

codecov bot commented Jun 18, 2019 •

edited

samuelcolvin commented Jun 18, 2019

ijl commented Jun 18, 2019

samuelcolvin commented Jun 18, 2019

dmontagu commented Jun 18, 2019 •

edited

ijl commented Jun 19, 2019

samuelcolvin commented Jun 19, 2019

tiangolo commented Jun 20, 2019

ijl commented Jun 28, 2019

lgtm-com bot commented Jun 28, 2019

samuelcolvin commented Jul 2, 2019

samuelcolvin commented Jul 2, 2019

ijl commented Jul 2, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

JSON extra uses orjson instead of ujson #599

JSON extra uses orjson instead of ujson #599

Conversation

ijl commented Jun 18, 2019 • edited

ijl Jun 18, 2019

Choose a reason for hiding this comment

dmontagu Jun 18, 2019

Choose a reason for hiding this comment

ijl Jun 19, 2019

Choose a reason for hiding this comment

codecov bot commented Jun 18, 2019 • edited

Codecov Report

samuelcolvin commented Jun 18, 2019

ijl commented Jun 18, 2019

samuelcolvin commented Jun 18, 2019

dmontagu commented Jun 18, 2019 • edited

ijl commented Jun 19, 2019

samuelcolvin commented Jun 19, 2019

tiangolo commented Jun 20, 2019

ijl commented Jun 28, 2019

lgtm-com bot commented Jun 28, 2019

samuelcolvin commented Jul 2, 2019

samuelcolvin commented Jul 2, 2019

ijl commented Jul 2, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

pablogamboa commented Jul 31, 2019

samuelcolvin commented Jul 31, 2019

ijl commented Jun 18, 2019 •

edited

codecov bot commented Jun 18, 2019 •

edited

dmontagu commented Jun 18, 2019 •

edited