Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement encode/decode hooks that allow to define custom serialization formats #209

Closed
wants to merge 5 commits into from

Conversation

vdmit11
Copy link

@vdmit11 vdmit11 commented Feb 4, 2016

This pull request proposes encoding/decoding hooks, that is,custom Python functions that you may pass to ujson.encode()/ujson.decode() and transform objects on the fly during the encoding/decoding process.

Basically this is a customization feature. Especially useful to override certain behavior already implemented by the ujson code, like datetime serialization or __json__() method handling (see below).

The hook behavior is similar to json.dumps(default=fn) and json.loads(object_hook=fn) semantics, you just pass a function that takes an object and possibly replaces it with another object. But unlike the default hook (that is called at the end) I implemented the pre_encode_hook (that is called before any other transformations are made), and this is done intentionally to allow user to override certain ujson's behavior inside the hook.

See the README.rst and tests.py for usage examples.

Motivation:

  • Customization features are demanded for long time, see Add support for custom encoders #124, Support for custom date serialization? #139.

    People are making their own forks and do minor changes in the C code to customize serialization for various types. That should be solved by extending the Python API.

    Hooks may be a good tradeoff solution (not as good as an object-oriented API where you can override methods and so on, but still ok).

  • The current ujson's implementation translates date and datetime objects into timestamps, and there is no option for switching to another format (like ISO 8601).

    Actually the format is not defined by JSON, so any hardcoded implementation is doomed.

    I guess the pull request Serialise datetime and date objects #201 is related here: it defines another hard-coded format, so it doesn't solve this problem, but still may be useful as a performance optimization for this specific format.

  • The __json__() method also has a problem: ujson expects this method to return a string containing valid JSON, while the usual approach (or at least how I know it) is to return an object like a dictionary from the __json__() method.

    Recently I had noticeable issues because of that when I tried to switch to ujson on one of projects. And the toDict() method is not always the solution, since you don't always return a dictionary. And I think generally the ujson should try to be compatible with already existing code (ideally, a drop-in replacement), so there should be an option for changing the __json__() semantics without changing tons of ORM classes that implement this method.

    And if you change the ujson's behavior to this "usual" approach, you will break the backwards compatibility. Again, there is no standard option here, so it would be better to let people to override this behavior with an arbitrary implementation.

So, now I'm asking someone to review the changes and make some comments.
Especially pay some attention to reference counting, I'm not sure that I handled it correctly.

Thanks!

This commit introduces several hooks for encoding/decoding functions:
 - ujson.encode(pre_encode_hook=func)
 - ujson.decode(object_hook=func)
 - ujson.decode(string_hook=func)

These hooks may be used for defining custom JSON representations for
arbitray object types.

The pre_encode_hook is similar to the standard JSONSerializer.default()
method. But unlike the default() method, the pre_encode_hook() is called
before any other encoding attempts, and thus the hook allows to override
some ujson behavior, like the __json__() method or datetime format.

The object_hook semantics is similar to the standard
json.loads(object_hook). It allows to deserialize JS objects into
specific Python class instances instead of generic dictionaries.

The string_hook is like object_hook, but called for every parsed string,
and thus it allows to deserialize strings into Python objects.
This may be useful for dates and other objects commonly represented
as strings.

This is the very first prototype implementation, so more commits
will follow.
@Jahaja
Copy link
Contributor

Jahaja commented Feb 8, 2016

Hi Dmitry, Thanks for the PR. Have you made pre and post benchmarks when this is not used?

@vdmit11
Copy link
Author

vdmit11 commented Feb 8, 2016

Just made the benchmarks, and got the following results:

Python2/pre:
ujson encode: 3630.95 calls/sec
ujson decode: 26398.61 calls/sec

Python2/post:
ujson encode: 3619.23 calls/sec
ujson decode: 26555.77 calls/sec

Python3/pre:
ujson encode: 3624.32 calls/sec
ujson decode: 25542.67 calls/sec

Python3/post:
ujson encode: 3601.77 calls/sec
ujson decode: 25418.21 calls/sec

Full logs are available here:
benchmark.log.txt
benchmark_results_only.txt

The difference is less than 1%, so I can conclude that there is no significant impact on the performance.

problem: basically the NULL pointer dereference
When an exception happens in a Python function, you get NULL from
PyObject_CallFunctionObjArgs() (instead of a valid PyObject pointer).
The value then is dereferenced and you get in trouble.

So this commit adds the NULL check with corresponding unit tests.

Also it adds few minor fixups:
 - Don't execute internal ujson's API callbacks (callPreEncodeHook() and
   callObjectHook()) when these hooks are not actually passed from the
   Python code. This is a very minor optimization, saves one C function
   call per object.
 - Get rid of the GET_EP() macro. It is used only once and therefore
   is redundant.
The pre_encode_hook() is called for every single JSON value.
This is slow, especially if you have a lot of objects and complicated
hook function.

This performance overhead may be reduced if you don't call the
pre_encode_hook() for primitive types like Number and String
(since usually you don't need to handle these types).

So this commits makes this behavior default. By default you get
good performance, but the hook is not called for primitive types.
and this behavior can be disabled like this:
  ujson.encode(pre_encode_primitive=True)
@mitar
Copy link
Contributor

mitar commented Mar 13, 2016

I think we should just then change __json__ to __ujson__, if you are observing backwards compatibility issues. But having it be able to output JSON string directly is very useful for cases where you do not want to convert GIS JSON into a dict, just so that it can again be serialized.

@mitar
Copy link
Contributor

mitar commented Mar 13, 2016

Or __raw_json__?

@greedo
Copy link

greedo commented Mar 23, 2016

👍 I would really like to use ultrajson and I need these hooks for datetime strings

@mayerwin
Copy link

Excellent!
For some inspiration, I recommend taking a look at how the excellent ServiceStack.Text library (for .NET) implements this through the following hooks:

SerializeFn
DeSerializeFn
RawSerializeFn
RawDeSerializeFn

@marc1n
Copy link

marc1n commented Apr 21, 2016

Why this PR is not merged yet? Is anybody here???
This is crutial PR in order to use ujson in serious projects!
@Jahaja Please merge this PR as soon as possible!

@jd-boyd
Copy link

jd-boyd commented Jul 21, 2016

I note that @Jahaja hasn't done anything publicly on github for 29 days now. It looks like the most recent activity from anyone in the organization was @SneWs from 17 days ago.

@zeitos
Copy link

zeitos commented Dec 23, 2016

Hello?

@vdmit11
Copy link
Author

vdmit11 commented Jan 1, 2017

Hi guys.

I have bad news: these changes should NOT be used, because there is a memory leak in there.

I didn't investigate it properly, but what I know is that recently we discovered a memory leak on a project where ujson (with this custom-hooks-feature) is involved. And basically we disabled ujson completely and the problem gone.

And since ujson was previously tested by a lot of users, and there was no memory leak, then I can conclude that the problem is introduced by me.

So this pull request is buggy. Don't use it in production.

And unfortunately, I'm not going to fix it, because priorities are changed, and we don't use ujson anymore. So I don't have any real project where I could test it propertly (and I don't have time to join new projects).

So guys, sorry, but I'm closing this pull request.

@vdmit11 vdmit11 closed this Jan 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants