Serialisation and Dumping #4456

samuelcolvin · 2022-08-31T14:00:23Z

samuelcolvin
Aug 31, 2022
Maintainer

(A note on language, "dump" is used as the verb for the thing we're talking about, and consequently in method names,
"serialisation" is used as the noun, thus "dump" and "serialise" are considered synonyms in this context. Both are arguably wrong since we're often converting one python object into another, but I can't think of better terminology.)

Related discussions

The Improvements to Dumping/Serialization/Export
Section of the V2 blog.
Python JSON-like dictionary return #377
Customizing both decoding and encoding of a type #951
Config json_encoders does not seem to work on builtin types #2531
Other issues and pull requests with the dumping label

Required Features

Below is a list of features I think people want, let me know if I've missed anything significant.

Models and non-models

Most past conversations about this relate primarily to models since they are the primary building block for everything in pydantic V1. But in V2 models are no-longer the "quantum" of all validation and the output type of a validation schema can be virtually anything.

We therefore need to provide ways to customize serialisation of any object as well as continuing to support models.

Customising type serialisation

Including:

builtin types that json.dumps handles by default and therefore doesn't easily allow customisation of
custom pydantic types
custom project types
third-party types (e.g. from other libraries)
nested pydantic models

Customising field serialisation

E.g. a way to define how model.age is serialised as opposed to model.id although they're both ints.

JSON serialisation

JSON is a special case and we want builtin support for creating JSON from models and other objects.

Aliases

standard aliases as we have now
custom alias for dumping

This is pretty simple at dump-time since we can decide on the alias when building the schema.

We should continue to support by_alias.

`exclude_unset`, `exclude_defaults` and `exclude_none`

I guess these have to remain.

`include` and `exclude`

Currently, this is configurable via dict(include=...) and dict(exclude=...) which has some very complex behaviour. (dict() is being renamed to model_dump() in V2 as per Nodel Namespace Cleanup)

It would make the logic much simpler, and probably make execution significantly faster, if we could remove
include and exclude arguments and only allow them to be hard-coded on the model,
but I assume this would cause a revolution. (Also I know how useful customising these when calling model_dump can be.)

We currently also have exclude and include on a Field, I think we should definitely keep exclude.
The precise semantics of when include trumps exclude is inherently complex, it would be good if we could remove
include from Field and only allow it via model_dump.

Output formats

Similar to the blog post, we need the following output formats:

Python objects - models are converted to dicts, nothing else is changed
JSON compatible python - objects are converted to dict, list, str, int, float, bool, None
JSON - output is a JSON string (or bytes)

All these formats should be customisable via functions - obviously 2 and 3 should use the same customisation logic.

Implementation

The plan is to implement as much of this logic as possible in rust within pydantic-core.

The key insight I have after thinking about this for a while is this:

json.dumps, JSON.stringify and friends rely on some variation of isinstance since they don't know anything
about the data they're serialising until they're called, but we know the types of data before we serialise it.
We can therefore prepare the serializers and skip expensive isinstance checks by building a Serializer that
shadows the model it is built to dump.

This approach is also very close to what we do for validation, and thus should allow good symmetry between the validation and serialisation.

There are two possible approaches here:

Extend the Validator trait to also support serialisation.
Create a new Serializer trait and all required implementations

The second approach is probably more code but would provide a cleaner separation of concerns. The real question is whether
all the validators make sense as serializers, and similarly whether there are serializers that don't make sense as validators.

Regardless of which approach we take, the basic idea would be the same: we have default implementations for "python" and "json" and the option to override with a python function.

The rough signature of the trait would be something like (I'm using python here for pseudo-code, but the actual implementation would be in rust)

class Serializer(abc.ABC):
    def __init__(self, customisation_functions: Dict[str, Callable]):
        self.customisation_functions = customisation_functions
        self.plain_python_function = customisation_functions.get('python')
        self.json_function = customisation_functions.get('json')

    def to_plain_python(self, value: Any) -> Any:
        """
        Used by default `model.model_dump() -> Any`, aka ``model.model_dump(format='python') -> Any`
        """
        if self.plain_python_function:
            return self.plain_python_function(value)
        else:
            return self._to_python(value)

    def to_format(self, value: Any, format: str) -> Any:
        """
        Used by default `model.model_dump(format='foobar') -> Any`
        """
        f = self.customisation_functions.get(format)
        if f:
            return f(value)
        else:
            return self._to_python(value)

    def to_json(self, value: Any) -> Any:
        """
        Used by default `model.model_dump_json() -> str` and `model.model_dump(format='json') -> JsonType`
        """
        if self.json_function:
            return self.json_function(value)
        else:
            return self._to_json(value)
    
    @abc.abstractmethod
    def _to_python(self, value: Any) -> Any:
        raise NotImplementedError

    @abc.abstractmethod
    def _to_json(self, value: Any) -> Any:
        raise NotImplementedError

This is slightly naive, in reality there would be some more complexity in returning a rust enum of PyObject or JsonType
from to_json to avoid having to speed up JSON generation.

We would then have two "finalisers" (better name required) which combine serialised fields into either a python
object or JSON.

cc @PrettyWood @tiangolo @hramezani

I have more to say on this, but I need to go for lunch... I'll try to add more soon.

PrettyWood · 2022-09-03T11:29:09Z

PrettyWood
Sep 3, 2022
Collaborator

I'm wondering on the separation of concerns between input and output.
I expect the API to support properly load_alias and dump_alias (names to be defined). So we would also have model_load_json_schema and model_dump_json_schema, which would be very nice for ORMs, FastAPI, ... built on top of pydantic.
For this I'm already not clear how a model should be written for this to be explicit.

The second question after the "what is in the (default) output" is "how is it generated?"
For this I agree that a new Serializer trait makes a lot of sense. It mimics validation while being properly decorrelated from it.
The only issue that I see is if a model is verified and altered afterwards without validation on assignment. A field could have another type as the defined one on the model and serialization would break.

0 replies

tiangolo · 2022-09-03T15:26:26Z

tiangolo
Sep 3, 2022
Collaborator

All this sounds good to me and I don't think I have many strong or useful opinions.

I agree with @PrettyWood that being able to separate aliases for loading and dumping would probably make sense.

The other thing is just that I want to make sure I understood correctly, when talking about "python" and "json", the input is always a Pydantic field/model, right? And the output is dicts or JSON str, correct? I mean, this is talking specifically about those things and not about loading data, right?

1 reply

samuelcolvin Sep 16, 2022
Maintainer Author

yes, here we're talking about serialisation only so the thing we're serialising is always a python object.

mdavis-xyz · 2022-11-01T23:12:03Z

mdavis-xyz
Nov 1, 2022

I would like an easy way to set indent=2 as the default value when calling whatever the new equivalent of .json() is.

Currently the BaseModel config options don't have that as an option. So in my project I currently do:

class BaseModel(pydantic.BaseModel):
    def json(self, exclude_unset=True, exclude_none=True, indent=2, **kwargs):
        return super().json(exclude_unset=exclude_unset, exclude_none=exclude_none, indent=indent, **kwargs)

And then everything else uses that new BaseModel instead of pydantic.BaseModel.

(And there's some other stuff in my BaseModel as well.)

1 reply

samuelcolvin Nov 8, 2022
Maintainer Author

it does, you can just call .json(indent=2) - unrecognised kwargs are forwarded to json.dumps.

See

pydantic/pydantic/main.py

Line 474 in 62f96e7

**dumps_kwargs: Any,

mdavis-xyz · 2022-11-01T23:55:11Z

mdavis-xyz
Nov 1, 2022

One of the main inputs and outputs of my scripts that use pydantic is AWS' DynamoDB no-sql database.

The boto3 SDK only handles Decimal (and int), not float. When I read from dynamodb it gives me Decimal, and pydantic can coerce that into float, which is fantastic.
But if I want to write a pydantic model to dynamodb, I need to convert all floats to Decimal. I also need to convert datetimes to strings. This is not straightforward with pydantic v1.

As I mentioned in my previous comment, I currently subclass BaseModel, and then every other model subclasses that.

class BaseModel(pydantic.BaseModel):
    # decimal=True means convert all floats to Decimal
    # primitives=True means convert enums, uuids, datetimes etc all to str
    def dict(self, decimal=False, primitives=False, exclude_unset=True, exclude_none=True, **kwargs):
        if primitives:
            # just dump to json, then parse back
            # a bit slow, but fairly reliable,
            # since we've already specified json encoders
            d = json.loads(self.json(exclude_unset=exclude_unset, exclude_none=exclude_none, **kwargs))
        else:
            d = super().dict(exclude_unset=exclude_unset, exclude_none=exclude_none, **kwargs)
        if decimal:
            d = float_to_decimal(d)
            
        return d

    class Config:
        use_enum_values=True # not sure if this works recursively
        validate_assignment = True 
        json_encoders = {
            dt.datetime: encode_datetime,
            dt.time: encode_time,
            dt.date: encode_date
            # TODO: add UUID here?
        }


def float_to_decimal(val):
    if isinstance(val, Decimal):
        return val
    elif isinstance(val, float):
        return Decimal(str(val))
            
    elif isinstance(val, list):
        return [float_to_decimal(x) for x in val]
    elif isinstance(val, tuple):
        return tuple(float_to_decimal(x) for x in val)
    elif isinstance(val, dict):
        return {k:float_to_decimal(val[k]) for k in val}
    else:
        return val

It would be great if I could dump my model into a dict that contains Decimals not floats. (And strings not datetimes.)

e.g. mymodel.dict(decimal=True, dt_to_str=True)

1 reply

samuelcolvin Nov 8, 2022
Maintainer Author

yup, this flexibility is a requirement. Many people want to represent decimals as strings in JSON, e.g. see #1004

samuelcolvin · 2022-11-08T12:20:54Z

samuelcolvin
Nov 8, 2022
Maintainer Author

I'm working on this now.

Work is being tracked in #4739.

0 replies

NowanIlfideme · 2023-04-26T19:15:42Z

NowanIlfideme
Apr 26, 2023

It's currently unclear how to apply custom serializers per-type or per-field in models for V2. All I've found in pydantic or pydantic-core source code has been this comment in pydantic.deprecated.json:

# TODO: Add a suggested migration path once there is a way to use custom encoders
@deprecated('custom_pydantic_encoder is deprecated.')
def custom_pydantic_encoder(...): ...

which suggests this is still to-be-done, however as per above #4739 seems to be "done"? If so, what's the equivalent of the mechanism to activate this pseudo-code from above?

def to_format(self, value: Any, format: str) -> Any:

Seriealization is super relevant for https://github.com/NowanIlfideme/pydantic-yaml and https://github.com/NowanIlfideme/pydantic-kedro ; thank you in advance! 😃

0 replies

ErikvdVen · 2023-06-30T10:13:28Z

ErikvdVen
Jun 30, 2023

As discussed over here: #3293 It would be nice to just have to explicitly mention a parent class as type and let Pydantic dynamically find the right child class based on a discriminator. For now this works, but Pylance gives an error as Union visually contains just one argument:

class Context(BaseModel):
    houses: Union[House.get_subclasses()] = Field(discriminator="type")
    
@unique
class HouseType(str, Enum):
    VILLA = "villa"
    CASTLE = "castle"
    VAN = "van"

class House(BaseModel):
    @classmethod
    def get_subclasses(cls):
        return tuple(cls.__subclasses__())

class Villa(House):
    type: Literal[HouseType.VILLA] = HouseType.VILLA

class Castle(House):
    type: Literal[HouseType.CASTLE] = HouseType.CASTLE 
    
class Castle(House):
    type: Literal[HouseType.VAN] = HouseType.VAN

It will make it less messy if there are more subclasses, because you don't have to mention them explicitly in the Union.

7 replies

shakfu Sep 24, 2023

@samuelcolvin Thanks for your reply.

Here is a working example of my issue which is to have subclasses be properly converted during deserialization:

from pydantic import BaseModel

class Box(BaseModel):
    maxclass: str = "newobj"

class Message(Box):
    maxclass: str = "message"

class Int(Box):
    maxclass: str = "int"

class Document(BaseModel):
    boxes: list[Box] = []


b = Box()
m = Message()
i = Int()

d1 = Document(boxes=[b,m,i])

dumped = d1.model_dump()
print('serialized:', dumped)
# serialized: {'boxes': [{'maxclass': 'newobj'}, {'maxclass': 'message'}, {'maxclass': 'int'}]}

d2 = Document.model_validate(dumped)
print('deserialized:', d2)
# deserialized: boxes=[Box(maxclass='newobj'), Box(maxclass='message'), Box(maxclass='int')]
# instead of
# deserialized: boxes=[Box(maxclass='newobj'), Message(maxclass='message'), Int(maxclass='int')]

shakfu Sep 24, 2023

The closest I could get to your example didn't work for me:

from pydantic import BaseModel, Field
from typing import Union, Literal

class Box(BaseModel):
    maxclass: Literal['newobj'] = "newobj"
    @classmethod
    def get_subclasses(cls):
        return tuple(cls.__subclasses__())

class Message(Box):
    maxclass: Literal['message'] = "message"

class Int(Box):
    maxclass: Literal['int'] = "int"

class Document(BaseModel):
    boxes: list[Union[Box.get_subclasses()]] = Field(discriminator="maxclass")


b = Box()
m = Message()
i = Int()

d1 = Document(boxes=[b,m,i])

dumped = d1.model_dump()
print(dumped)

d2 = Document.model_validate(dumped)
print(d2)

which gave this error:

% python3 example2.py
Traceback (most recent call last):
  File "$HOME/example2.py", line 25, in <module>
    d1 = Document(boxes=[b,m,i])
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydantic/main.py", line 165, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for Document
boxes.0.Message
  Input should be a valid dictionary or instance of Message [type=model_type, input_value=Box(maxclass='newobj'), input_type=Box]
    For further information visit https://errors.pydantic.dev/2.3/v/model_type
boxes.0.Int
  Input should be a valid dictionary or instance of Int [type=model_type, input_value=Box(maxclass='newobj'), input_type=Box]
    For further information visit https://errors.pydantic.dev/2.3/v/model_type

shakfu Oct 7, 2023

@samuelcolvin

Sorry to bring this up again, but is there any guidance on the above issue?

shakfu Oct 8, 2023

It's ok, I finally figured it out.

Here's the pydantic2 variant using the example in disccusion-3293:

from pydantic import BaseModel, Field
from typing import Dict, Literal, Union
from typing_extensions import Annotated

class MainClass(BaseModel):
    x: int
    type: Literal['Main'] = 'Main'
    
    @classmethod
    def get_subclasses(cls):
        return tuple(cls.__subclasses__())

class SubclassA(MainClass):
    a: int
    type: Literal['A'] = 'A'

class SubclassB(MainClass):
    b: int
    type: Literal['B'] = 'B'

class MyCatalog(BaseModel):
    stuff: Dict[str, Annotated[Union[MainClass.get_subclasses()], Field(discriminator='type')]]

if __name__=='__main__':

    cat = MyCatalog(stuff={})
    cat.stuff['item1'] = SubclassA(x=0, a=1)
    cat.stuff['item2'] = SubclassB(x=0, b=1)

    exported = cat.model_dump_json()
    imported = MyCatalog.model_validate_json(exported)

    print(cat)
    print(cat.model_dump())
    print(cat.model_dump_json())
    print(imported)

Running this gives the following output:

% python3 demo3.py
stuff={'item1': SubclassA(x=0, type='A', a=1), 'item2': SubclassB(x=0, type='B', b=1)}
{'stuff': {'item1': {'x': 0, 'type': 'A', 'a': 1}, 'item2': {'x': 0, 'type': 'B', 'b': 1}}}
{"stuff":{"item1":{"x":0,"type":"A","a":1},"item2":{"x":0,"type":"B","b":1}}}
stuff={'item1': SubclassA(x=0, type='A', a=1), 'item2': SubclassB(x=0, type='B', b=1)}

ErikvdVen Oct 24, 2023

@shakfu I've updated the code. My mistake, used some code from another example. But good to see you have figured it out already.

finswimmer · 2023-07-07T10:04:33Z

finswimmer
Jul 7, 2023

Hey,

not sure if its the right place here, or if I should open my own discussion.

In v1 the model.json(...) has an attribute encoder. I'm using it the manipulate the output only in tests. Namely I want to output the value of an SecretStr in plain text. There seems to no equivalent for it for model_dump_json() or model_dump() right now, is it?

6 replies

samuelcolvin Jul 31, 2023
Maintainer Author

We now support json_encoders.

More generally you can subclass SecretStr, I'm on my mobile right now so can't give an example right now.

Lawouach Jul 31, 2023

Hey thanks @samuelcolvin for the swift reply. Right-o, I'll dig deeper then! Cheers

Lawouach Jul 31, 2023

Interestingly, the doc makes reference to field_serializer for that use case. That looks more aligned to pydantic v2 to me so perhaps the better path.

https://docs.pydantic.dev/latest/usage/types/secrets/

MPKonst Oct 19, 2023

I also echo this question. Moreover, while the field_serializer approach is nice when we want a SecretStr to always serialise to plain text, I think that's the less likely scenario. What seems more likely to me is that most of the time we'd want a SecretStr to be protected and only sometimes, when we know exactly where the serialised settings are going, we might want to have it in plain text.

So it would make sense to me to add a flag to .model_dump e.g. my_settings.model_dump(get_secret_values=True)

pmbrull Feb 10, 2024

In case it's helpful, I wanted to define a subclass of SecretStr and had it show the secret values during model_dump_json.

The goal is to control the serialization at the type itself, rather than having to define a field_serializer for every model that would use the custom type.

It ended up looking like:

from pydantic import SecretStr, PlainSerializer

class _EncryptedStr(SecretStr):
    """My Custom implementeation"""
    ...

EncryptedStr = Annotated[
    _EncryptedStr,
    PlainSerializer(lambda secret: secret.get_secret_value())
]

Happy to read better alternatives. Thanks!

hahn-th · 2024-04-26T18:50:49Z

hahn-th
Apr 26, 2024

Hi there, i faced nearly the same problem. I am using a discriminator-function with tags and want to build the union list dynamically. After playing around a bit i found a quite nice solution which i want to share with you

I created a type registry:

def singleton(cls):
    instances = {}
    def getinstance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]
    return getinstance


class TypeRegistry(ABC):
    
    def __init__(self, registry_name: str, logger):
        self._registry_name = registry_name
        self._logger = logger
        self._type_map = {}

    def register(self, enum_value, registered_type):
        """Registers a device"""
        if enum_value in self._type_map:
            self._logger.debug(f"For enum-value {enum_value} the type {type(registered_type)} will be updated (Registry: {self._registry_name}).")
        else:
            self._logger.debug(f"For enum-value {enum_value} the type {type(registered_type)} will be registered (Registry: {self._registry_name}).")
        
        self._type_map[enum_value] = registered_type

    def type_is_registered(self, enum_value) -> bool:
        """Check, if a type has been registered."""
        return (enum_value in self._type_map)

    def get_class_for_type(self, enum_value) -> type:
        """Get the registered type for an enum-value."""
        if enum_value not in self._type_map:
            self._logger.warning(f"Requested type {enum_value} has not been registered yet (Registry: {self._registry_name}).")
            return None
        
        return self._type_map[enum_value]
    
    def get_type_map(self) -> dict:
        """Return  a copy of the type map."""
        return self._type_map.copy()


@singleton
class DeviceTypeRegistry(TypeRegistry):
    
    def __init__(self, ):
        super().__init__("DeviceRegistry", logging.getLogger("DeviceTypeRegistry"))

and a mapper, which maps the TypeRegistry into a tuple:

class TypeRegistryMapper:

    @classmethod
    def as_tuple(self, type_registry: TypeRegistry):
        ret = []
        for type_name, type_class in type_registry.get_type_map().items():
            ret.append(Annotated[type_class, Tag(type_name)])

        return tuple(ret)

The TypeRegistry is filled by a decorator:

def DeviceReg(value):
    """Add type to TypeRegistry"""
    def decorator(cls):

        registry = DeviceTypeRegistry()
        registry.register(value, cls)        

        return cls
    return decorator

Discriminator Function to choose Tags. The value 'type' comes from json:

def device_discriminator(v: Any) -> str:
    device_type = v.get("type")
    reg = DeviceTypeRegistry()
    if reg.type_is_registered(device_type):
        return device_type
    
    return "DEVICE_BASE"

Sample Model:

class Client(BaseModel):
    homeId: str = ""
    id: str = ""
    label: str = ""
    clientType: str = ""

class ModelDeviceBase(BaseModel):
    pass

@DeviceReg("DEVICE_BASE")
class Device(ModelDeviceBase):
    availableFirmwareVersion: str = ""
    connectionType: str = ""
    firmwareVersion: str = ""
    firmwareVersionInteger: int = ""
    type: str

@DeviceReg("RAIN_SENSOR")
class RainSensor(Device):
    type: str

@DeviceReg("DIN_RAIL_SWITCH")
class DinRailSensor(Device):
    type: str    

class Base(BaseModel):
    clients: dict[str, Client] = {}
    devices: dict[
        str,
        Annotated[
            Union[
                TypeRegistryMapper.as_tuple(DeviceTypeRegistry())
            ],
            Discriminator(device_discriminator),
        ],
    ]

The RegistryToUnionMapper creates the annotated tag list.

So i just have to add a new class with the @DeviceReg decorator and it is automatically registered.

I love it! Maybe it may helps someone else.

0 replies

Serialisation and Dumping #4456

samuelcolvin Aug 31, 2022 Maintainer

Related discussions

Required Features

Models and non-models

Customising type serialisation

Customising field serialisation

JSON serialisation

Aliases

exclude_unset, exclude_defaults and exclude_none

include and exclude

Output formats

Implementation

Replies: 9 comments · 16 replies

PrettyWood Sep 3, 2022 Collaborator

tiangolo Sep 3, 2022 Collaborator

samuelcolvin Sep 16, 2022 Maintainer Author

samuelcolvin Nov 8, 2022 Maintainer Author

samuelcolvin Nov 8, 2022 Maintainer Author

samuelcolvin Nov 8, 2022 Maintainer Author

samuelcolvin Jul 31, 2023 Maintainer Author

samuelcolvin
Aug 31, 2022
Maintainer

`exclude_unset`, `exclude_defaults` and `exclude_none`

`include` and `exclude`

Replies: 9 comments 16 replies

PrettyWood
Sep 3, 2022
Collaborator

tiangolo
Sep 3, 2022
Collaborator

samuelcolvin Sep 16, 2022
Maintainer Author

samuelcolvin Nov 8, 2022
Maintainer Author

samuelcolvin Nov 8, 2022
Maintainer Author

samuelcolvin
Nov 8, 2022
Maintainer Author

samuelcolvin Jul 31, 2023
Maintainer Author