
OBSOLETE ReadOnlyMessage

haberman edited this page May 15, 2011 · 1 revision

This page is obsolete, and kept around only for historical interest.

The read-only message layer defines an in-memory representation for protocol buffers. This in-memory representation is very much like a C struct, in that all of its members live at a fixed offset from the beginning of the struct. Messages can contain values, arrays, and references to other messages.

If the event-based parser is like SAX, this layer is like an immutable DOM (for a mutable DOM, go one level higher).

Suppose we have the protocol buffer:

message Person {
  optional string name = 1;
  optional int32 age = 2;
  repeated Person children = 3;
}

Here is how that looks in memory:

TODO: the above diagram needs to show:

  • msgdef pointer
  • gptr
  • set bits
  • array and string size member

And here is the corresponding structure in C:

/* Defined in upb_string.h */
struct upb_string {
  char *ptr;
  uint32_t byte_len;
};

/* Emitted by the upb compiler (upbc). */
struct Person {
  union {
    uint8_t bytes[1];
    struct {
      bool name:1;      /* = 1, optional. */
      bool age:1;       /* = 2, optional. */
      bool children:1;  /* = 3, repeated. */
    } has;
  } set_flags;
  struct upb_string* name;
  int32_t age;
  UPB_MSG_ARRAY(Person)* children;
};

Several points are worth noting:

  • Messages have a pointer to their definition, making them self-describing. With only a pointer to a message, you can always get to the message’s definition.
  • Submessages are by reference, not by value. Notice that submessages are just pointers to other messages. This is necessary in the general case (messages can be self-recursive or mutually recursive, making a by-value implementation impossible). But more importantly, it makes it possible to implement the by-reference semantics found in many common programming languages like Python, Ruby, Lua, etc. In other words, when you say mybar = foo.bar, modifying mybar modifies foo.bar also.
  • Strings are delimited, not NUL-terminated. This makes it possible to have strings that reference the middle of existing protocol buffer data, without having to overwrite that data with terminating NUL characters.
  • Each message carries with it “set bits” describing whether each field is set or not. Protocol Buffers distinguish between unset fields (which automatically take the value of their default) and fields that have an explicitly set value.
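
The last two points above (delimited strings and set bits) can be sketched in a few lines of C. This is only an illustration under stated assumptions, not upb's actual code: `str_view`, `view_of`, `view_equals`, `person_view`, and `person_age` are hypothetical names, and the default age is assumed to be 0, the protobuf default for an int32 field with no explicit default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for upb_string: a pointer plus a byte length. */
struct str_view {
  const char *ptr;
  uint32_t byte_len;
};

/* Point at len bytes starting at offset inside buf. Nothing is copied and
 * no terminating NUL is written, so buf is left untouched. */
static struct str_view view_of(const char *buf, uint32_t offset, uint32_t len) {
  struct str_view s = { buf + offset, len };
  return s;
}

/* Compare a delimited string against a NUL-terminated C string. */
static bool view_equals(struct str_view s, const char *cstr) {
  size_t n = strlen(cstr);
  return n == s.byte_len && memcmp(s.ptr, cstr, n) == 0;
}

/* Hypothetical cut-down message: one set bit per field, then the values. */
struct person_view {
  struct {
    bool name:1;
    bool age:1;
  } has;
  struct str_view name;
  int32_t age;
};

#define PERSON_AGE_DEFAULT 0 /* assumed default for an unset int32 field */

/* Reading an unset field yields the default, whatever the slot contains. */
static int32_t person_age(const struct person_view *p) {
  assert(p != NULL);
  return p->has.age ? p->age : PERSON_AGE_DEFAULT;
}
```

For example, given the encoded bytes `0a 03 'B' 'o' 'b' 10 2a` (name = "Bob", age = 42), `view_of(wire, 2, 3)` yields the name while the age bytes that follow it stay intact.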

Why are these messages immutable?

You might wonder, given the previous definition, what about these messages is immutable (read-only). What makes them immutable is that we have no way of tracking (at this level) the ownership of submessages, arrays, or strings. Without this kind of memory management, we cannot support assigning to a string field, because what would we do with the string that was already there? You might be tempted to say “free it, of course”, but what if that string was also being referenced by a different protobuf?

Without some memory management scheme, we cannot support mutations to string, array, or submessage members. Mutating simple scalar data is fine and safe, even at this level.
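
To make the ownership problem concrete, here is a minimal sketch of one memory management scheme a higher layer could use: reference counting. This page does not say which scheme the mutable layer actually uses, so treat the choice as an assumption; `rc_string`, `rc_person`, and the functions below are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted string; not part of the read-only layer. */
struct rc_string {
  uint32_t refcount;
  uint32_t byte_len;
  char *ptr;
};

struct rc_person {
  struct rc_string *name; /* by reference, possibly shared */
  int32_t age;
};

/* Copy len bytes into a new string that starts with one reference. */
static struct rc_string *rc_string_new(const char *data, uint32_t len) {
  struct rc_string *s = malloc(sizeof *s);
  if (s == NULL) return NULL;
  s->ptr = malloc(len ? len : 1);
  if (s->ptr == NULL) {
    free(s);
    return NULL;
  }
  if (len) memcpy(s->ptr, data, len);
  s->refcount = 1;
  s->byte_len = len;
  return s;
}

static void rc_string_ref(struct rc_string *s) { s->refcount++; }

/* Drop one reference. Returns true if it was the last one and the string
 * was freed. */
static bool rc_string_unref(struct rc_string *s) {
  assert(s->refcount > 0);
  if (--s->refcount > 0) return false;
  free(s->ptr);
  free(s);
  return true;
}

/* Assign a new name: take a reference on the new string first, then release
 * the old one. Returns true if the old string was freed. */
static bool rc_person_set_name(struct rc_person *p, struct rc_string *name) {
  struct rc_string *old = p->name;
  rc_string_ref(name);
  p->name = name;
  return old != NULL && rc_string_unref(old);
}
```

With the count in place, replacing a name frees the old string only when no other message still references it, which is exactly the information the read-only layer does not track.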

So what can I do with an immutable message?

With this API alone, you can:

  • parse a protobuf into a read-only message.
  • serialize a read-only message to a protobuf.
  • traverse the read-only message and inspect its values.

In other words, this API is just right if all you want to do is parse and inspect a protobuf. Other functionality is also defined at this layer, such as an equality test and a function for dumping to the protobuf text format.
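
As a rough sketch of what traversing and inspecting a read-only message might look like, the code below walks a Person tree without modifying it. The struct layout is a simplified guess: `person_node` and `person_array` are hypothetical stand-ins for the generated `Person` struct and for `UPB_MSG_ARRAY(Person)`, whose real definitions are not shown on this page, and an unset age is assumed to read as 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct person_node;

/* Hypothetical array type: element pointers plus a size member. */
struct person_array {
  struct person_node **elements;
  uint32_t len;
};

/* Hypothetical, simplified message layout (the name field is omitted). */
struct person_node {
  struct {
    bool name:1;
    bool age:1;
    bool children:1;
  } has;
  int32_t age;
  struct person_array *children; /* submessages are held by reference */
};

/* Count this person plus all descendants, skipping unset child arrays. */
static uint32_t person_count(const struct person_node *p) {
  assert(p != NULL);
  uint32_t n = 1;
  if (p->has.children && p->children != NULL) {
    for (uint32_t i = 0; i < p->children->len; i++)
      n += person_count(p->children->elements[i]);
  }
  return n;
}

/* Largest age in the tree; an unset age counts as the assumed default 0. */
static int32_t person_max_age(const struct person_node *p) {
  assert(p != NULL);
  int32_t max = p->has.age ? p->age : 0;
  if (p->has.children && p->children != NULL) {
    for (uint32_t i = 0; i < p->children->len; i++) {
      int32_t child_max = person_max_age(p->children->elements[i]);
      if (child_max > max) max = child_max;
    }
  }
  return max;
}
```

Both functions take `const` pointers and only read fields and follow references, so they stay within what this layer supports.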