implement encode/decode hooks that allow to define custom serialization formats #209

vdmit11 · 2016-02-04T02:29:45Z

This pull request proposes encoding/decoding hooks, that is,custom Python functions that you may pass to ujson.encode()/ujson.decode() and transform objects on the fly during the encoding/decoding process.

Basically this is a customization feature. Especially useful to override certain behavior already implemented by the ujson code, like datetime serialization or __json__() method handling (see below).

The hook behavior is similar to json.dumps(default=fn) and json.loads(object_hook=fn) semantics, you just pass a function that takes an object and possibly replaces it with another object. But unlike the default hook (that is called at the end) I implemented the pre_encode_hook (that is called before any other transformations are made), and this is done intentionally to allow user to override certain ujson's behavior inside the hook.

See the README.rst and tests.py for usage examples.

Motivation:

Customization features are demanded for long time, see Add support for custom encoders #124, Support for custom date serialization? #139.

People are making their own forks and do minor changes in the C code to customize serialization for various types. That should be solved by extending the Python API.

Hooks may be a good tradeoff solution (not as good as an object-oriented API where you can override methods and so on, but still ok).
The current ujson's implementation translates date and datetime objects into timestamps, and there is no option for switching to another format (like ISO 8601).

Actually the format is not defined by JSON, so any hardcoded implementation is doomed.

I guess the pull request Serialise datetime and date objects #201 is related here: it defines another hard-coded format, so it doesn't solve this problem, but still may be useful as a performance optimization for this specific format.
The __json__() method also has a problem: ujson expects this method to return a string containing valid JSON, while the usual approach (or at least how I know it) is to return an object like a dictionary from the __json__() method.

Recently I had noticeable issues because of that when I tried to switch to ujson on one of projects. And the toDict() method is not always the solution, since you don't always return a dictionary. And I think generally the ujson should try to be compatible with already existing code (ideally, a drop-in replacement), so there should be an option for changing the __json__() semantics without changing tons of ORM classes that implement this method.

And if you change the ujson's behavior to this "usual" approach, you will break the backwards compatibility. Again, there is no standard option here, so it would be better to let people to override this behavior with an arbitrary implementation.

So, now I'm asking someone to review the changes and make some comments.
Especially pay some attention to reference counting, I'm not sure that I handled it correctly.

Thanks!

This commit introduces several hooks for encoding/decoding functions: - ujson.encode(pre_encode_hook=func) - ujson.decode(object_hook=func) - ujson.decode(string_hook=func) These hooks may be used for defining custom JSON representations for arbitray object types. The pre_encode_hook is similar to the standard JSONSerializer.default() method. But unlike the default() method, the pre_encode_hook() is called before any other encoding attempts, and thus the hook allows to override some ujson behavior, like the __json__() method or datetime format. The object_hook semantics is similar to the standard json.loads(object_hook). It allows to deserialize JS objects into specific Python class instances instead of generic dictionaries. The string_hook is like object_hook, but called for every parsed string, and thus it allows to deserialize strings into Python objects. This may be useful for dates and other objects commonly represented as strings. This is the very first prototype implementation, so more commits will follow.

Jahaja · 2016-02-08T14:29:10Z

Hi Dmitry, Thanks for the PR. Have you made pre and post benchmarks when this is not used?

vdmit11 · 2016-02-08T21:12:05Z

Just made the benchmarks, and got the following results:

Python2/pre:
ujson encode: 3630.95 calls/sec
ujson decode: 26398.61 calls/sec

Python2/post:
ujson encode: 3619.23 calls/sec
ujson decode: 26555.77 calls/sec

Python3/pre:
ujson encode: 3624.32 calls/sec
ujson decode: 25542.67 calls/sec

Python3/post:
ujson encode: 3601.77 calls/sec
ujson decode: 25418.21 calls/sec

Full logs are available here:
benchmark.log.txt
benchmark_results_only.txt

The difference is less than 1%, so I can conclude that there is no significant impact on the performance.

problem: basically the NULL pointer dereference When an exception happens in a Python function, you get NULL from PyObject_CallFunctionObjArgs() (instead of a valid PyObject pointer). The value then is dereferenced and you get in trouble. So this commit adds the NULL check with corresponding unit tests. Also it adds few minor fixups: - Don't execute internal ujson's API callbacks (callPreEncodeHook() and callObjectHook()) when these hooks are not actually passed from the Python code. This is a very minor optimization, saves one C function call per object. - Get rid of the GET_EP() macro. It is used only once and therefore is redundant.

The pre_encode_hook() is called for every single JSON value. This is slow, especially if you have a lot of objects and complicated hook function. This performance overhead may be reduced if you don't call the pre_encode_hook() for primitive types like Number and String (since usually you don't need to handle these types). So this commits makes this behavior default. By default you get good performance, but the hook is not called for primitive types. and this behavior can be disabled like this: ujson.encode(pre_encode_primitive=True)

mitar · 2016-03-13T07:01:01Z

I think we should just then change __json__ to __ujson__, if you are observing backwards compatibility issues. But having it be able to output JSON string directly is very useful for cases where you do not want to convert GIS JSON into a dict, just so that it can again be serialized.

mitar · 2016-03-13T07:31:08Z

Or __raw_json__?

greedo · 2016-03-23T23:02:34Z

👍 I would really like to use ultrajson and I need these hooks for datetime strings

mayerwin · 2016-03-29T15:09:55Z

Excellent!
For some inspiration, I recommend taking a look at how the excellent ServiceStack.Text library (for .NET) implements this through the following hooks:

SerializeFn
DeSerializeFn
RawSerializeFn
RawDeSerializeFn

marc1n · 2016-04-21T08:32:51Z

Why this PR is not merged yet? Is anybody here???
This is crutial PR in order to use ujson in serious projects!
@Jahaja Please merge this PR as soon as possible!

jd-boyd · 2016-07-21T21:03:49Z

I note that @Jahaja hasn't done anything publicly on github for 29 days now. It looks like the most recent activity from anyone in the organization was @SneWs from 17 days ago.

zeitos · 2016-12-23T17:59:32Z

Hello?

vdmit11 · 2017-01-01T16:05:47Z

Hi guys.

I have bad news: these changes should NOT be used, because there is a memory leak in there.

I didn't investigate it properly, but what I know is that recently we discovered a memory leak on a project where ujson (with this custom-hooks-feature) is involved. And basically we disabled ujson completely and the problem gone.

And since ujson was previously tested by a lot of users, and there was no memory leak, then I can conclude that the problem is introduced by me.

So this pull request is buggy. Don't use it in production.

And unfortunately, I'm not going to fix it, because priorities are changed, and we don't use ujson anymore. So I don't have any real project where I could test it propertly (and I don't have time to join new projects).

So guys, sorry, but I'm closing this pull request.

vdmit11 added 3 commits February 3, 2016 03:20

hooks: add documentation (code comments and sections in README)

370aafc

hooks: remove leading zeros from few numeric literals (fix typos)

f79c9d7

vdmit11 added 2 commits February 10, 2016 16:35

kootenpv mentioned this pull request Jan 1, 2017

A new maintainer has to be appointed. #246

Closed

vdmit11 closed this Jan 1, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement encode/decode hooks that allow to define custom serialization formats #209

implement encode/decode hooks that allow to define custom serialization formats #209

vdmit11 commented Feb 4, 2016

Jahaja commented Feb 8, 2016

vdmit11 commented Feb 8, 2016

mitar commented Mar 13, 2016

mitar commented Mar 13, 2016

greedo commented Mar 23, 2016

mayerwin commented Mar 29, 2016

marc1n commented Apr 21, 2016

jd-boyd commented Jul 21, 2016

zeitos commented Dec 23, 2016

vdmit11 commented Jan 1, 2017

implement encode/decode hooks that allow to define custom serialization formats #209

implement encode/decode hooks that allow to define custom serialization formats #209

Conversation

vdmit11 commented Feb 4, 2016

Jahaja commented Feb 8, 2016

vdmit11 commented Feb 8, 2016

mitar commented Mar 13, 2016

mitar commented Mar 13, 2016

greedo commented Mar 23, 2016

mayerwin commented Mar 29, 2016

marc1n commented Apr 21, 2016

jd-boyd commented Jul 21, 2016

zeitos commented Dec 23, 2016

vdmit11 commented Jan 1, 2017