
WhyThereIsNoStandardMessageObject

haberman edited this page Aug 3, 2011 · 4 revisions

There are two possible models for exposing protobuf message objects to high-level languages. Using Python as an example, the two options are:

  1. Store the protobuf data directly in Python objects for the target VM/interpreter, using Python’s built-in memory-management and threading semantics. upb’s parser creates/modifies Python objects, but Python handles refcounting and freeing.
  2. Define a standard message object (in C) that includes memory management and threading semantics. The Python objects then wrap this standard message type, and must do so in a way that respects this common set of memory management and threading semantics. The Python object is only a single pointer to the C message type. When the Python object is collected, it unref’s the C message object (which must have a separate refcounting or garbage collection scheme of its own).

(1) is simpler and more “native” from a Python perspective. It makes the Python protobuf objects behave more like regular Python objects. The main downsides are that messages cannot be shared across languages (you have to serialize/deserialize first) and that any C++/SWIG code that wants to create or manipulate these messages has to use Python APIs to do so.

(2) ends up being a lot more complicated and less “native.” Worse, it requires you to define a memory management and threading model that can interoperate with lots of different interpreters/VMs both gracefully and efficiently. You also end up having two parallel trees of message instances: the instances in C and the high-level language instances that wrap them.
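As a rough illustration of model (2), the sketch below shows a reference-counted C message and a wrapper that holds nothing but a pointer to it. Every name here (`msg_t`, `wrapper_t`, `msg_ref`, `msg_unref`, `wrapper_collect`) is hypothetical and is not upb's API. The wrapper stands in for what an interpreter-level object would hold, and `live_msgs` exists only to make the lifetime visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared C message with its own reference count. */
typedef struct {
  int refcount;  /* managed by the C layer, not by any interpreter */
  int value;     /* stand-in for real field storage */
} msg_t;

/* Hypothetical language-side wrapper: a single pointer to the C message. */
typedef struct {
  msg_t *msg;
} wrapper_t;

static int live_msgs = 0;  /* demonstration only: number of unfreed messages */

/* Allocates a message; the caller owns the initial reference. */
msg_t *msg_new(int value) {
  msg_t *m = malloc(sizeof(*m));
  assert(m != NULL);
  m->refcount = 1;
  m->value = value;
  live_msgs++;
  return m;
}

void msg_ref(msg_t *m) { m->refcount++; }

/* Drops one reference and frees the message when none remain. */
void msg_unref(msg_t *m) {
  if (--m->refcount == 0) {
    live_msgs--;
    free(m);
  }
}

/* Wrapping a message takes a new reference on it. */
wrapper_t wrapper_new(msg_t *m) {
  msg_ref(m);
  wrapper_t w = { m };
  return w;
}

/* What the interpreter would call when it collects the wrapper object. */
void wrapper_collect(wrapper_t *w) {
  msg_unref(w->msg);
  w->msg = NULL;
}
```

Even in this toy form there are two lifetimes to reason about: the wrapper's, which the interpreter controls, and the message's, which the C refcount controls. This sketch also ignores threads entirely; a real shared message type would need to decide whether the count is atomic.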

upb previously did (2), but now favors (1). (2) is still possible — someone could still define a standard message type that could be shared across languages if it was worth the trouble, but for the moment it doesn’t appear to be worth it.

Here are the considerations that go into defining a “one size fits all” C message object:

self-describing?
Given a message, should we be able to determine its type? This capability is extremely useful, but costs one extra pointer per message and per array. In many cases this pointer will be wasted because the interpreter/VM already stores the type of each object, and we can use this instead of storing the type inside the upb message.
const / mutable?
Should the message structure be mutable? Should there also be a set of thread-safe const accessors? Should the message start out as mutable but allow you to freeze it?
memory-management?
How are messages allocated? How is it decided who they belong to? When are they freed? Can they be shared between different components, using something like reference counting? There are many trade-offs here. Reference counting costs an extra integer per message or array, plus runtime costs to update and test the count. It is also susceptible to circular references (which users should never create, but it’s hardly acceptable to leak memory when they do). Mark-and-sweep garbage collection costs program pauses and cache thrashing.
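The circular-reference problem can be shown with a minimal sketch. The `node_t` type and its helpers below are hypothetical stand-ins for a refcounted message with one submessage field; they are not taken from upb. `live_nodes` is only there to make the leak observable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted message with a single submessage field. */
typedef struct node {
  int refcount;
  struct node *child;  /* owning reference to a submessage, or NULL */
} node_t;

static int live_nodes = 0;  /* demonstration only: number of unfreed nodes */

/* Allocates a node; the caller owns the initial reference. */
node_t *node_new(void) {
  node_t *n = malloc(sizeof(*n));
  assert(n != NULL);
  n->refcount = 1;
  n->child = NULL;
  live_nodes++;
  return n;
}

/* Drops one reference; on reaching zero, releases the child and frees. */
void node_unref(node_t *n) {
  if (n == NULL) return;
  if (--n->refcount == 0) {
    node_unref(n->child);
    live_nodes--;
    free(n);
  }
}

/* Makes `child` a submessage of `parent`; the parent takes a reference. */
void node_set_child(node_t *parent, node_t *child) {
  child->refcount++;          /* ref the new child before dropping the old */
  node_unref(parent->child);
  parent->child = child;
}
```

If two nodes are set as each other's child and the caller then drops both of its own references, each count stays at 1 and neither node is ever freed. Pure reference counting cannot recover from this without cycle detection or some convention for breaking the cycle by hand.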
thread-safety?
Reading from an immutable message is thread-safe, but what if you want to mutate a potentially shared message? Should each message carry its own lock? That implies yet more per-message space overhead, and time overhead also. In most cases we expect locking to be handled at a higher level, but this leads us to:
eager or lazy parsing?
Protocol Buffers have the nice property that they can be parsed lazily. In other words, when someone calls “parse” on a string, an implementation can do no parsing up-front and instead defer the work until specific fields are referenced. This can be a significant efficiency win, but it leads to two major problems: (1) with lazy parsing, even reads of a message are not thread-safe, and (2) parse errors can go undetected until a field is referenced, which is a very inconvenient and unexpected time to handle them.
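Both problems show up even in a toy version. The sketch below assumes a "message" that is nothing more than one base-128 varint, the integer encoding used by the protobuf wire format. The type `lazy_msg_t` and the functions `lazy_parse` and `lazy_get` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lazily parsed message holding a single varint field. */
typedef struct {
  const uint8_t *buf;
  size_t len;
  bool decoded;    /* written by the first read, so reads mutate state */
  bool ok;
  uint64_t value;
} lazy_msg_t;

/* "Parsing" only records the input, so it cannot report malformed data. */
lazy_msg_t lazy_parse(const uint8_t *buf, size_t len) {
  lazy_msg_t m = { buf, len, false, false, 0 };
  return m;
}

/* Decodes a base-128 varint: 7 payload bits per byte, low group first,
 * high bit set on every byte except the last. */
static bool decode_varint(const uint8_t *buf, size_t len, uint64_t *out) {
  uint64_t result = 0;
  for (size_t i = 0; i < len && i < 10; i++) {
    result |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
    if ((buf[i] & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;  /* truncated, or longer than 10 bytes */
}

/* First access does the real parse and caches the outcome. */
bool lazy_get(lazy_msg_t *m, uint64_t *out) {
  assert(m != NULL && out != NULL);
  if (!m->decoded) {
    m->ok = decode_varint(m->buf, m->len, &m->value);
    m->decoded = true;
  }
  if (m->ok) *out = m->value;
  return m->ok;
}
```

`lazy_get` looks like a read but writes `decoded`, `ok`, and `value`, so two threads calling it on the same message race unless something synchronizes them. And `lazy_parse` accepts a truncated buffer without complaint; the failure only surfaces later, at whichever call site happens to touch the field first.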
referencing original string data?
Another useful optimization is to make string fields reference the original protobuf data instead of copying it. This saves malloc() and memcpy() costs in the parsing critical path. But then the question arises of how to properly track that reference and deal with the scenario that the original owner of that data decides to free the string. It’s also possible for this optimization to be quite wasteful in terms of memory footprint if the amount of string data being referenced is small in comparison to the total size of the protobuf.
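The two strategies can be contrasted in a few lines. In this sketch `strview_t`, `field_alias`, and `field_copy` are hypothetical names, and the field's offset and length are assumed to have been found by a parser already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical non-owning view of a string field. */
typedef struct {
  const char *data;  /* points into the original serialized buffer */
  size_t len;
} strview_t;

/* Zero-copy: alias the bytes in place. No malloc() or memcpy(), but the
 * view is only valid while the whole input buffer stays alive. */
strview_t field_alias(const char *buf, size_t off, size_t len) {
  strview_t s = { buf + off, len };
  return s;
}

/* Copying: the field owns its bytes and the input buffer can be freed
 * right after parsing. The caller must free() the result. */
char *field_copy(const char *buf, size_t off, size_t len) {
  char *p = malloc(len + 1);
  assert(p != NULL);
  memcpy(p, buf + off, len);
  p[len] = '\0';
  return p;
}
```

With `field_alias`, someone has to guarantee that the buffer outlives every view into it, which brings back the ownership questions from the memory-management item. It also means that a five-byte string field keeps the entire input buffer alive, however large that buffer is.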
sharing between languages?
By spending one extra pointer per message, you can let different dynamic language implementations share an in-memory message: the pointer refers to a linked list of the dynamic language objects that reference that message. But this is yet another per-message cost.
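To make the cumulative cost concrete, here is a sketch of the header that a message would carry if every feature above were switched on. The layout and all names (`full_header_t`, `msgdef`, `lang_wrapper`, `header_overhead`) are assumptions for illustration, not upb's actual structures, and the lock is reduced to a single 32-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct msgdef;        /* hypothetical: schema/type information */
struct lang_wrapper;  /* hypothetical: node in a list of language objects */

/* Hypothetical per-message header with every feature enabled. */
typedef struct {
  const struct msgdef *type;      /* self-describing: pointer to the type */
  uint32_t refcount;              /* memory management: shared ownership */
  uint32_t lock;                  /* thread-safety: stand-in lock word */
  struct lang_wrapper *wrappers;  /* cross-language sharing: wrapper list */
} full_header_t;

/* Bytes spent on headers alone for a tree of `num_messages` messages. */
size_t header_overhead(size_t num_messages) {
  assert(num_messages <= SIZE_MAX / sizeof(full_header_t));
  return num_messages * sizeof(full_header_t);
}
```

On a typical 64-bit ABI this header comes to 24 bytes per message before any field data is stored. A host language that already tracks type and ownership for its own objects pays that cost a second time, which is a large part of why a one-size-fits-all object is hard to justify.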