MessageSemantics

haberman edited this page Jan 1, 2012 · 6 revisions

The ownership and default semantics of protobuf messages have some subtle corner cases. The two key considerations to reconcile are:

  1. We want to be able to read deeply nested fields (e.g. foo.bar.baz) without first having to test for message presence at every level (e.g. if (foo.has_bar() && foo.bar().has_baz())).
  2. When serializing a message, we don't want to serialize empty submessages just because we read a default value out of that submessage.

Scalar fields

The semantics for scalar fields (numbers, bools, strings) are simple: if you just read a field's default value but never set it, the value is considered unset and will not be serialized.

  // C++ example:
  MyMessage msg;
  int32_t x = msg.myfield();  // Returns default.
  msg.has_myfield();          // Returns false; will not be serialized.

  msg.set_myfield(5);
  msg.has_myfield();          // Returns true; will be serialized.

  msg.clear_myfield();
  msg.has_myfield();          // Returns false; will not be serialized.

The semantics for a dynamic language like Python are almost identical:

  # Python example:
  msg = MyMessage()
  x = msg.myfield
  msg.HasField("myfield")   # Returns false; will not be serialized.

  msg.myfield = 5
  msg.HasField("myfield")   # Returns true; will be serialized.

  msg.ClearField("myfield")
  msg.HasField("myfield")   # Returns false; will not be serialized.
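
One way to get these semantics is to store only the fields that were explicitly set and fall back to the default on reads. The following is a minimal sketch under that assumption, not how any real binding is implemented; ScalarMessage, DEFAULTS, and serialized_fields are hypothetical names invented for illustration.

```python
class ScalarMessage:
    """Toy message: stores only explicitly set fields; reads fall back to defaults."""

    # Hypothetical schema: field name -> default value.
    DEFAULTS = {"myfield": 0, "name": ""}

    def __init__(self):
        # Bypass our own __setattr__, which only accepts schema fields.
        object.__setattr__(self, "_values", {})

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for schema fields.
        if name not in self.DEFAULTS:
            raise AttributeError(name)
        # Reading never records anything, so presence is unaffected.
        return self._values.get(name, self.DEFAULTS[name])

    def __setattr__(self, name, value):
        if name not in self.DEFAULTS:
            raise AttributeError(name)
        self._values[name] = value

    def HasField(self, name):
        return name in self._values

    def ClearField(self, name):
        self._values.pop(name, None)

    def serialized_fields(self):
        # Stand-in for a real encoder: only explicitly set fields are emitted.
        return dict(self._values)
```

Because presence is just membership in _values, a read cannot accidentally cause a field to be serialized.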

Submessage fields

Submessage fields are more complicated because we want to be able to inspect deep messages without causing any implicitly-created submessages to be serialized. There is also the issue of submessage ownership; languages without garbage collection like C++ often create an ownership model where submessages are owned by the parent message:

  // C++ example:
  MyMessage msg;
  msg.bar().baz();  // Returns default value; msg.bar() is const.
  msg.has_bar();    // Returns false; msg.bar will not be serialized.

  msg.mutable_bar()->set_baz(5);
  msg.has_bar();    // Returns true; msg.bar will be serialized.

  // C++ has direct ownership of submessages, so you can't assign
  // submessage instances.
  msg.set_bar(MyBarMessage());  // XXX does not exist

This ownership model doesn't fit dynamic languages so well. The mutable_ business in C++ isn't a good match for dynamic language conventions where "const" containers are generally not used.

  x = foo.bar.baz
  foo.HasField("bar")   # Returns false; we only inspected it, so it won't be serialized.

  # Python users expect to be able to say this:
  foo.bar.baz = 5
  foo.HasField("bar")   # Returns true because we set a field of the submessage.

  # It would be non-idiomatic and annoying if the design was like C++.
  # This is *not* how the Python bindings actually work.
  foo.bar.baz = 5   # Returns ERROR (hypothetically), foo.bar is immutable.
  foo.mutable_bar.baz = 5
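
The read-versus-write distinction above can be obtained by creating the submessage lazily on read and marking it present in its parent only when something is written to it. The sketch below assumes that approach; Node, SCALARS, SUBMESSAGES, and _mark_modified are hypothetical names, and the schema (every message has a scalar baz and a submessage bar) is made up so the example stays short.

```python
class Node:
    """Toy message whose submessages are created lazily on read and
    become present in the parent only after a write."""

    SCALARS = {"baz": 0}     # hypothetical scalar fields with defaults
    SUBMESSAGES = ("bar",)   # hypothetical submessage fields

    def __init__(self, parent=None, name=None):
        d = self.__dict__    # write directly; __setattr__ is overridden below
        d["_parent"], d["_name"] = parent, name
        d["_values"] = {}      # explicitly set scalars
        d["_children"] = {}    # submessage objects handed out so far
        d["_present"] = set()  # submessages that have been written to

    def __getattr__(self, name):
        if name in self.SCALARS:
            return self._values.get(name, self.SCALARS[name])
        if name in self.SUBMESSAGES:
            # Reading creates (or returns) the child without marking it present.
            child = self._children.get(name)
            if child is None:
                child = self._children[name] = Node(self, name)
            return child
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name not in self.SCALARS:
            raise AttributeError(name)
        self._values[name] = value
        self._mark_modified()

    def _mark_modified(self):
        # Propagate the write upward so every ancestor sees this path as present.
        node = self
        while node._parent is not None:
            node._parent._present.add(node._name)
            node = node._parent

    def HasField(self, name):
        if name in self.SUBMESSAGES:
            return name in self._present
        return name in self._values
```

With this sketch, x = foo.bar.baz leaves foo.HasField("bar") false, while foo.bar.baz = 5 makes it true. Note that the child keeps a back-reference to its parent, which is exactly the kind of ownership link that makes reparenting tricky.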

One other thing that dynamic language users expect is that they can "reparent" messages at will.

  bar = Bar()
  msg = MyMessage()
  msg.bar = bar   # Should we allow this?

Should we allow this kind of reparenting or not? There are pros and cons. The pros are convenience and efficiency, as well as composability:

  # If I'm composing a message, reparenting lets me compose the sub-parts in a more
  # functional style.
  msg.bar = MakeBar()

  # If I can't reparent, the above looks more like:
  FillInBar(msg.bar)

  # If I've obtained a Bar from some other data source, I can make it part of
  # another message without having to copy.
  msg.bar = ParseBar()

On the other hand, allowing reparenting opens some cans of worms:

  # If I can reparent, I can create cycles, which must be detected as an error
  # at serialization time (which would have a potentially significant cost).
  # It could be useful to create such cycles in some cases, but since they
  # aren't serializable it might be better to disallow them.
  msg.msg = msg

  x = msg.foo.bar  # Read only, won't serialize msg.foo.
  foo = msg.foo
  foo.bar = 5      # Write to foo; now msg.foo will be serialized. Is this unexpected?

  x = msg2.foo.bar  # Read only, won't serialize msg2.foo.
  # Should msg3.foo be serialized, since it was explicitly assigned?
  # Or should msg3.foo *not* be serialized, since msg2.foo was
  # not written to?
  msg3.foo = msg2.foo  
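
To make the cost of the cycle case concrete, here is a rough sketch of the check a serializer would need if arbitrary reparenting were allowed. Msg and check_acyclic are hypothetical names; the message is modeled as a plain dict of fields rather than any real binding.

```python
class Msg:
    """Toy message that allows arbitrary submessage assignment (reparenting)."""

    def __init__(self, **fields):
        self.fields = dict(fields)


def check_acyclic(msg, _path=None):
    """Raise ValueError if msg can reach itself through submessage fields."""
    path = set() if _path is None else _path
    if id(msg) in path:
        raise ValueError("message graph contains a cycle")
    path.add(id(msg))
    for value in msg.fields.values():
        if isinstance(value, Msg):
            check_acyclic(value, path)
    # Leaving this message: a submessage shared by two parents is not a cycle.
    path.discard(id(msg))
```

The check has to visit every reachable submessage before any bytes are written, and it would run on every serialization, which is the overhead mentioned above.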

Another issue: if an implicitly created submessage has a field set and that field is later cleared, should the submessage still be serialized?

  msg = MyMessage()
  msg.foo.bar = 5            # The write will cause foo to be serialized.
  msg.foo.ClearField("bar")  # Now should foo be serialized?
  msg.foo.Clear()            # How about now?

Semantics of submessages after Clear()

In C++, Clear() semantically deletes all submessages, so keeping references to them after Clear() is illegal. (In practice the submessages are often cached for performance, but this is just an implementation detail and cannot be relied on.)

  // C++ examples:
  MyMessage msg;
  msg.ParseFromString(...);
  const MyFooMessage& foo = msg.foo();
  msg.Clear();  // Referencing foo is illegal now and may crash!

In garbage-collected languages, however, this behavior is not acceptable; client code should not be able to crash the interpreter/runtime by simply misusing an API.

  # Python example:
  msg = MyMessage()
  msg.ParseFromString(...)
  foo = msg.foo
  msg.Clear()
  func(foo.bar)  # foo must still be a valid reference; must not crash!

Because of this, it's easiest to let the interpreter/runtime's own GC collect submessage objects instead of deleting them explicitly in Clear() or in the parent's destructor or finalizer.

However, there is still the question of whether we can reuse submessage objects as an optimization, like C++ does. For reference-counted interpreters/runtimes we can safely perform this optimization, because we can know whether there are any other references to our cached object.

  # Python example:
  msg = MyMessage()
  msg.ParseFromString(...)
  msg.Clear()
  # It's safe for msg to cache any submessages that were created,
  # because no other reference to them was taken (their refcount
  # will be 1).

  msg.ParseFromString(...)
  # This increases msg.foo's refcount by 1:
  foo = msg.foo
  msg.Clear()
  # It is NOT safe to cache/reuse msg.foo, because we have a
  # separate reference on foo (its refcount is 2).  If we
  # cached it then a subsequent call to msg.ParseFromString()
  # would change foo's value which is unexpected.
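
As a rough illustration of the refcount test, the sketch below uses CPython's sys.getrefcount to decide in Clear() whether the cached submessage can be reset in place or must be replaced. It is CPython-specific and only a sketch: Sub, Parent, _slots, and _BASELINE are hypothetical names, and because the absolute count includes temporary references, the code compares against a baseline measured on an object known to be unshared instead of hard-coding a number.

```python
import sys


class Sub:
    def __init__(self):
        self.bar = 0


class Parent:
    """Toy parent that recycles its submessage in Clear() only when unshared."""

    def __init__(self):
        self._slots = {"foo": Sub()}

    @property
    def foo(self):
        return self._slots["foo"]

    def _refcount(self):
        # The count includes temporaries created by this call itself, so it is
        # only meaningful relative to the calibrated baseline below.
        return sys.getrefcount(self._slots["foo"])

    def Clear(self):
        if self._refcount() > _BASELINE:
            # Someone else still holds the old object: leave it to the GC
            # and start over with a fresh submessage.
            self._slots["foo"] = Sub()
        else:
            # Only we hold it: safe to reset and reuse it in place.
            self._slots["foo"].bar = 0


# Calibrate once with a submessage that is known to have no outside references.
_BASELINE = Parent()._refcount()
```

In the shared case the caller's foo keeps its old value after Clear(), and the parent simply gets a new object; in the unshared case the same object is reused and no allocation is needed.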

However, in interpreters/runtimes that do not use reference counting, it is not safe to perform this optimization because there is no way to know whether another reference has been taken on the submessage.

  -- Hypothetical Lua example:
  msg = MyMessage()
  msg:ParseFromString()
  -- Another reference has been taken on msg.foo, but there is
  -- no way to detect this!
  foo = msg.foo
  msg:Clear()
  -- Unsafe to cache any submessage objects, because we don't
  -- know if references on them have been taken or not!