Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON extra uses orjson instead of ujson #599

Closed
wants to merge 1 commit into from
Closed

JSON extra uses orjson instead of ujson #599

wants to merge 1 commit into from

Conversation

ijl
Copy link

@ijl ijl commented Jun 18, 2019

This is the minimum change to replace ujson. There are
additional calls to json.loads that may be modified.

The extra is renamed 'fast-json' to match 'email' and avoid further
changes.

#589

@ijl ijl mentioned this pull request Jun 18, 2019
setup.py Outdated
@@ -98,7 +98,7 @@ def extra(self):
'dataclasses>=0.6;python_version<"3.7"'
],
extras_require={
'ujson': ['ujson>=1.35'],
'json': ['orjson>=2.0.6,<3;platform_python_implementation=="CPython"'],
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this on PyPy, but the intent is to allow users to unconditionally install pydantic[json] without an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested this on PyPy, but the intent is to allow users to unconditionally install pydantic[json] without an issue.

I think this might be confusing -- it might make someone think that for any json support they need to install pydantic[json]. I would vote to have it as 'orjson': ['orjson>=2.0.6,<3;platform_python_implementation=="CPython"']

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The intent was the target won't change if the implementation does. I've changed it to orjson.

@codecov
Copy link

codecov bot commented Jun 18, 2019

Codecov Report

Merging #599 into master will not change coverage.
The diff coverage is n/a.

@@          Coverage Diff          @@
##           master   #599   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          15     15           
  Lines        2628   2628           
  Branches      516    516           
=====================================
  Hits         2628   2628

@samuelcolvin
Copy link
Member

Looks good in principal,

Thoughts/questions:

  • should we make orjson a hard requirement rather than an optional requirement? Are there any environments where it won't work?
  • could we use orjson for model's .json() method? perhaps with a pretty kwarg which defaults to true and falls back to using standard json?
  • same for schema_json?

I'm guessing the date & datetime output format might change which could break things for lots of people. Is the performance improvement worth that?

@ijl
Copy link
Author

ijl commented Jun 18, 2019

orjson can be used to serialize as well (if not pretty-printing) but it requires outputting bytes instead of str and removing **dumps_kwargs (as incompatible between the two). I think these are both generally good changes whether orjson is used or not (bytes as it's a serialized blob and no pass-through kwargs to no leak json's details), but it's obviously breaking.

It looks like there's actually no difference in datetime output. datetime's isoformat output seems the same as RFC 3339 and pydantic's tests pass either way.

In terms of installing by default, the requirement in setup.py should be:

orjson>=2.0.6,<3;platform_python_implementation=="CPython" and platform_machine=="x86_64" and python_version<"3.8"

PyPy, ARM, and 3.8 could fall back to json. I think that would work fine. There are PyPI wheels for Linux, macOS, and Windows for 3.5, 3.6, and 3.7.

@samuelcolvin
Copy link
Member

All makes sense to me.

Given that orjson doesn't work with some platforms, let's leave it as an optional dependency for now.

We should remove **dumps_kwargs, and instead have pretty=False (sorry for the mistake in my reply above). If pretty=True we should use json.dumps(..., indent=2).

I'm less sure about returning bytes vs. str - I understand there would be a performance improvement to using bytes, specially as in most contexts bytes will be required in the end.

But this would be a breaking change and very unnatural for lots of people who are used to json.dumps. What do other people think? @tiangolo @dmontagu

@dmontagu
Copy link
Contributor

dmontagu commented Jun 18, 2019

I mostly agree with @samuelcolvin, with the following caveats:

  1. I think it might be nice to continue supporting fallback to json.dumps if other dumps_kwargs are used (since that allows things like custom encoders where necessary). But I think it's not a big deal either way.
  2. Whether returning bytes is supported or not, I don't think it should default to returning bytes -- that will be a source of pain for newcomers to the library and anyone with a more limited understanding of the difference in python (i.e., most python-as-a-second-language users). I think it could make sense to have a bytes argument to the .json function which defaults to False though.

@ijl
Copy link
Author

ijl commented Jun 19, 2019

Ok, well is there anything further that should be done as part of this PR?

@samuelcolvin
Copy link
Member

Ok, well is there anything further that should be done as part of this PR?

Yes.

On the name of the optional dependency let's use "fast-json" that way:

  • it clearly demonstrates what it's for
  • it's not tied to a particular implementation

Let's change model.json() and model.schema_json() to use orjson if possible, as suggested above let's:

  • stick to returning str
  • remove dumps_kwargs and instead add pretty=False, if pretty=True we'd use json.dumps(..., indent=2)

@tiangolo
Copy link
Member

Nice! I think it would be interesting to also support ORJSON.

Agree with you both. With the two concerns/caveats from @dmontagu .

@ijl
Copy link
Author

ijl commented Jun 28, 2019

I've renamed the extra to fast-json and rebased. I think the discussion has become about expanding the use of faster JSON library throughout the project, which is much larger. I only meant for this pull request to replace the existing usage of ujson with orjson

@lgtm-com
Copy link

lgtm-com bot commented Jun 28, 2019

This pull request introduces 1 alert when merging 156b77c into 3cdbbae - view on LGTM.com

new alerts:

  • 1 for Module-level cyclic import

This is the minimum change to replace ujson. There are
additional calls to json.loads that may be modified.

The extra is renamed 'fast-json' to match 'email' and avoid
further changes.
@samuelcolvin
Copy link
Member

hi, I think it would be best that we fix everything at the same time on the same PR, would you be willing to implement the features requested above?

@samuelcolvin
Copy link
Member

After further consideration, I have some other concerns about using orjson in pydantic:

  • The author of orjson, @ijl does not provide his name or any personal details on github, nor does he/she have a massive amount of github activity although the account has been active for years
  • @ijl privately emailed me asking me to integrate orjson into pydantic, I was surprised and somewhat worried by the hostile response I got when I made his/her request public
  • orjson is written in rust and distributed as a compiled binary meaning there's no way be sure what the package contains, or what is executed when you import/call it. In theory users could download the source, read the source, compile it themselves with the exact same setup then check the checksum - but I doubt anyone does that.

Individually all these things are fine and understandable but together they concern me.

Let me make it clear: I'm not accusing @ijl of anything, I'm 99% certain that his/her intentions are honourable.

However given all the pointers above, I can't help articles like this coming to mind.

I understand that (mostly through fastAPI) pydantic is used by some fairly big organisations, including Microsoft, making it a target for hackers. I don't want to be (inadvertently) responsible for a security breach, or do something which increases that possibility unduely.

I therefore won't be accepting this PR. Sorry.

@ijl ijl deleted the orjson-loads branch July 2, 2019 10:00
@ijl
Copy link
Author

ijl commented Jul 2, 2019

I'm at a loss. That's an extravagant claim I would never have expected.

@pablogamboa
Copy link

I think this decision needs to be reviewed. ujson's last release was 3.5 years ago. The fastest and better maintained python library is orjson right now. I think it would be quite beneficial to the community if there was a blessed way to use it on pydantic & FastAPI @samuelcolvin @tiangolo

@samuelcolvin
Copy link
Member

I'd still be prepared to migrate to orjson but somehow my concerns above would need to be resolved.

@pablogamboa, other than a long delay since last release do you have any specific problems with ujson?

@pablogamboa
Copy link

@samuelcolvin the biggest problem for me with ujson is that you cannot extend it and it has a so-or-so support for floats (you can see they have a few issues open around it).

orjson with its default hook solves the first issue. And then it has a very good support for floats and numbers. It has its own share of quirks, like the way they serialize utf-8 to bytes rather than escaped ASCII like other libraries, but I can live with them. The speed bump is definitely noticeable when you deal with large payloads!

I think that there's a very interesting synergy between the Rust and Python communities and that orjson is among the first packages that will dominate Python because of Rust's ridiculous performance.

Yes ilj is relatively unknown in the python community, but he's really onto something with orjson.

@samuelcolvin
Copy link
Member

I 100% agree that orjson looks great, that's why I created #589. I'm also entirely with you on the synergy between rust and python; there was a great talk on this at euro-python this year from a core python maintainer.

If I was starting pydantic from scratch today I'd definitely use orjson.

However there's a higher threshold of confidence required to migrate to a package when it's already being downloaded 200k/mo. Many users don't have time to go through every dependency change in every package update, so they have to outsource that part of security to package maintainers and trust them to make sensible decisions.

I simply can't switch to orjson without some more openness from the maintainer or a (kind of hostile) fork which I'm not prepared to do.

In summary nothing has changed since my comment above, so neither has my decision.

@pablogamboa
Copy link

Yep understood, I didn't ask you to switch tho! I asked for a blessed way to use it, i.e. make it somehow configurable to the user what json library to use. That'd be an amazing feature. For example Flask's JSONEncoder is great for an easy way to use a custom encoder in the framework (and my biggest grip with ujson!)

@samuelcolvin
Copy link
Member

makes sense. That sounds like it might be possible. Perhaps using orjson if a specific environment variable is set?

@pablogamboa
Copy link

An env variable would definitely work, another approach could be an api that allows you to override dumps/loads somehow. This would be a deadly addition! thanks for listening @samuelcolvin

@samuelcolvin
Copy link
Member

makes more sense, perhaps we could have dump_json and load_json attributes of Config which are then used in parse_json(), json() and schema_json() (and anywhere else we use encode/decode json that I've forgotten about).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants