What would a "FlatBuffers2" binary format look like? #5875

Open
aardappel opened this issue Apr 26, 2020 · 135 comments
Labels
not-stale Explicitly exempt marking this stale

Comments

@aardappel
Collaborator

aardappel commented Apr 26, 2020

The FlatBuffers binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15 or so language implementations, parsers, etc. that would not be easy to redo.

So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that if a new format were to ever happen, would be nice to address. I realized I never made such a list. It would be nice to at least fantasize what such a format could look like :)

Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.

  1. Remove all padding. Modern CPUs can access unaligned data at normal speed. This would shrink the format somewhat, and encourage other variable size things. If anyone ever needs padding to be compatible with a C struct, explicit padding can always be added, or it can be an opt-in feature.
  2. Make unions into a single field (and vectors of them into a single value). Also make the type part 16-bit while we're at it, so a union is always a 6-byte struct.
  3. Remove the 2nd field of the vtable. This field stores the table size, but it is never used in any implementations. This was intended for a streaming API that never happened.
  4. Allow different size vtable offsets. Currently they're always 16-bit, but for small tables 8-bit would be feasible. Since we code-gen vtable access, this would come at no cost. Use with care of course, because once you choose this smaller size you can't undo it when your table grows.
  5. Allow inline vtables when we determined they're unlikely to be shared. Saves an offset.
  6. Allow inline strings, vectors (and maybe scalars), meaning a vtable offset would refer directly to the string, rather than to the string offset. Saves the offset. Of course this puts more pressure on the vtable offset size, so use with care. Similarly, could even do inline scalars of all small scalar types. Of course this makes it more likely that vtables are unequal, so this is a tradeoff; it would work well with inline vtables.
  7. Remove 0-termination of strings. Only C/C++ care for this, and C++ has been moving toward string_view recently, and both have been using size_t arguments for a long time rather than relying on strlen. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.
  8. Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32. Good for small strings. Combine all the string optimisations above together, and the string "a" goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this is very inflexible and special purpose, but gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.
  9. Construct the buffer forwards (rather than backwards like currently all implementations). This simplifies a lot of code and buffer management. Unsigned child offsets would now always point downwards in memory. Downside: must now detect table fields pointing to the table itself.
  10. Always have a file_identifier, and make it the first thing in the buffer? Always have a length field as well?
  11. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298
  12. For a buffer that has entirely un-shared vtables (see 5), it now becomes more feasible to allow in-place mutation of more complex things. This is definitely a complex/contentious feature, but I think if we ever re-booted the format this should be designed in from the start if possible.
  13. Deeply integrated FlexBuffers, basically allowing any field to cheaply be a FlexBuffers value such that it effectively becomes FlatBuffers's "dynamic type". Sharing of strings across such values rather than being an isolated nested buffer.
  14. Nested vectors. Not strictly a breaking change, but a new format would probably want to have them from the start.
  15. Built-in LEBs (variable sized integers) as an optional varint type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.
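The byte counts quoted in item 8 can be checked with a little compile-time arithmetic. This is only a sketch: `StringCost`, `TotalBytes`, `kCurrent` and `kCompact` are names invented for illustration, and the "compact" layout is the hypothetical one described in the list, not an existing format.

```cpp
#include <cstddef>

// Per-string overhead, split the same way as in item 8 of the list.
struct StringCost {
  std::size_t vtable_entry;  // bytes used by the vtable slot
  std::size_t offset;        // bytes used by the offset to the string
  std::size_t size_field;    // bytes used by the length prefix
  std::size_t terminator;    // bytes used by the trailing 0
};

// Total bytes needed for a string field holding `len` characters.
constexpr std::size_t TotalBytes(StringCost c, std::size_t len) {
  return c.vtable_entry + c.offset + c.size_field + len + c.terminator;
}

// Today: 16-bit vtable slot, 32-bit offset, 32-bit size, 0-terminated.
constexpr StringCost kCurrent{2, 4, 4, 1};
// Hypothetical: 8-bit vtable slot, inline string, 8-bit size, no terminator.
constexpr StringCost kCompact{1, 0, 1, 0};

static_assert(TotalBytes(kCurrent, 1) == 12, "string \"a\" today");
static_assert(TotalBytes(kCompact, 1) == 3, "string \"a\" with all options");
```

Note that the fixed overhead is 11 bytes versus 2 bytes regardless of string length, so the relative saving shrinks quickly as strings get longer.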

@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional

@krojew
Contributor

krojew commented Apr 26, 2020

Some quick thoughts:

Ad 1. We would need to (potentially) deal with compiler/target platform problems when dealing with unaligned access. Is adding padding such a big problem at the moment?

Ad 4. This sounds like a compatibility problem while migrating schemas.

@aardappel
Collaborator Author

@krojew

  1. Unaligned access is not a problem. In the most basic case you make every scalar load go through memcpy, which every compiler (tested with clang, gcc and vs) optimizes into a single memory load, but one that is guaranteed to work unaligned. Padding is not a huge problem, but not having it enables a lot of other features (see the rest of the list) which would normally be pointless since 32-bit alignment is so prevalent.

  4. Not sure what you mean. You would opt-in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.
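For readers unfamiliar with the memcpy approach to unaligned loads, a minimal sketch follows. `ReadUnaligned` is a name made up here and is not part of the FlatBuffers API; the sketch also reads in host byte order, whereas a real reader would additionally handle endianness.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Loads a scalar from an address with no alignment guarantee.
// Dereferencing a misaligned T* is undefined behaviour in C++, while
// memcpy into a local is always valid; optimizing compilers typically
// turn this into a single load instruction.
template <typename T>
T ReadUnaligned(const void* p) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only trivially copyable scalars can be read this way");
  T value;
  std::memcpy(&value, p, sizeof(T));
  return value;
}
```

Usage would look like `ReadUnaligned<std::uint32_t>(buf + 1)`, where `buf + 1` is deliberately not 4-byte aligned.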

@mikkelfj
Contributor

mikkelfj commented Apr 26, 2020

random thoughts:

Overall, I like many of the suggestions, but there are too many optional and variable parameters in the proposal which would make things slow. There is a lot more branching going on, and code complexity.

detractors to Wouter's original comment:

  • Ad. 4. variable size vtables make it slow.
  • Ad 5. inline vtables not a good idea - would require slow extra check always, or at the very least it would have to be specific to the table type.
  • Ad 6. Optional offsets are slow. It would need to be a separate type. It's just too complex.
  • Ad 11. 64-bit offsets optional - also an extra check that costs speed, but I do think we need 64-bit offsets as a separate buffer type.
  • Ad 7. removal of 0 termination in strings: I'm not sure what that buys us. It saves a single byte if we no longer have padding requirements, that is all. I'm all in favor in string views when possible - use similar in C, but there is no standard, and it would force a memory copy in many C and C++ API calls, for example file names. I don't like that strings are different in FlatBuffers, but not having a 0 is worse.
  • Ad 12. mutations: these are inherently unsafe - because verifiers would be expensive if they should check that this is safe - so I wouldn't design the format around that. I would make it simpler to copy a buffer or part of a buffer without knowing its type.
  • Ad 13. FlexBuffers - I'm not a big fan. I see it as a complication. I'd rather have strong JSON integration.
  • Ad 10. In the current FB format you cannot know if there is a file identifier, so in that sense it is good to always require it, but I think it is not used very often in practice and many different tables in the schema might be using the same file identifier. The type hash that FlatCC introduced to work around that was never broadly adopted. Human specified identifiers are too easy to conflict in 4 bytes. So I think it would be better to remove it entirely. We should also think about always having a length or size field, but allow it to be 0 if unknown, e.g. while streaming - but the size of the field is open - is it 32 bits, variable length, or what. If variable length it cannot easily be updated after streaming.
  • Ad. 14 Nested vectors are good.

my inclusive comments:

  • Ad 1. I think we can remove padding - it causes a lot of complexity and it is not necessary on most platforms. On C, a flag could mark a platform is unaligned unfriendly - there already is - and then accessors would read differently - it already does in some cases.
  • need a 16-bit union type
  • ability to copy tables without knowing their type - requires some vtable annotation. Would allow some generated code to be library code, and allow some gateway processing without knowing the full schema.
  • drop nested tables - they are unreasonably complex to get right, although absence of padding would simplify this.
  • (Ad 8.) We could use a varint format for some fields. The QUIC protocol adopted an unsigned big endian encoding where bits 6 and 7 of the first byte encode a length of 1, 2, 4, or 8 bytes. That works very well for size fields. For flatbuffer offsets this could also work if the type was signed, but I am afraid it would slow things down significantly.
  • (Ad 9.) ability to stream buffers while being written - always use signed offsets, also better for some languages that dislike unsigned types, especially if 64 bit - see StreamBuffers: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#streambuffers
  • Support for mixins beyond: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#mixins - This would require support in the new format to avoid having subtypes always being remote tables instead of inline - also requires some thinking - but last I looked at it, it would be good.
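To make the QUIC-style varint mentioned in the list concrete, here is a small decoder sketch. In QUIC (RFC 9000) the two most significant bits of the first byte select a total length of 1, 2, 4, or 8 bytes, and the remaining bits hold the value in big-endian order. `DecodeQuicVarint` is a hypothetical helper name, and the caller is assumed to guarantee that enough bytes are readable.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one QUIC-style variable-length integer starting at `p`.
// `*length` receives the number of bytes consumed (1, 2, 4 or 8).
// Assumption: the caller has already checked that the bytes exist.
inline std::uint64_t DecodeQuicVarint(const std::uint8_t* p,
                                      std::size_t* length) {
  // Top two bits of the first byte are log2 of the encoded length.
  const std::size_t n = std::size_t{1} << (p[0] >> 6);
  // The low six bits of the first byte are the most significant payload bits.
  std::uint64_t value = p[0] & 0x3Fu;
  for (std::size_t i = 1; i < n; ++i) {
    value = (value << 8) | p[i];
  }
  *length = n;
  return value;
}
```

Because the length is known after reading one byte, a size field encoded this way needs a single branch-free shift to find the payload, which is part of why it suits length prefixes.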

I really would like to remove nested buffers - it has caused me so much headache and they haven't been properly implemented elsewhere. We could have a packed buffer format where multiple buffers can be embedded in the same file or memory block and referenced by some identifier without actually storing the buffer inside the other buffer. It could be an 8-byte random identifier. Buffers could also be given a random initial identifier in this format. Just a thought. This means how buffers are stacked or packed is not critical as long as they can be located.

@mikkelfj
Contributor

We also need a proper NULL type for representing database data.

@AustinSchuh
Contributor

Somehow I had persuaded myself that 9) doesn't need a format change and that it could be done by just building from the other end of the buffer and allocating space for the full message and vtable when creating the object. It would require a pretty massive API change though. Maybe I'm just hopeful.

My understanding of the premise of flatbuffers (and Cap'n'Proto, which we evaluated before picking flatbuffers) is that compression algorithms are wonderful, so don't burden the format with being clever and compact. Protobufs attempt that, and end up requiring a separate serialization/deserialization step. Sharing vtables goes against that.

Most of the rest of my feedback is at the API level. Happy to give it if there is interest.

@maxburke
Contributor

  • One of the selection factors that resulted in us picking flatbuffers over other formats like protobufs was that flatbuffers doesn't use varints. If they pop up in v2, perhaps it could be an opt-in? Or maybe an attribute applied to fields?

  • I agree with @AustinSchuh about compression; I think flatbuffers' niche is that by default they are very efficient at runtime, even if they trade off space for that efficiency. In our use we transport flatbuffers over the wire in http that's gzipped/deflated/brotli'd, and on disk we persist them squashed with zstandard, so encoding tricks I think wouldn't really buy us much, but would hurt in our application use.

  • Please-oh-please add size prefixing by default.

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

@krojew
Contributor

krojew commented Apr 27, 2020

4. Not sure what you mean. You would opt-in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.

If that's an explicit opt-in, then it seems fine.

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.

@AustinSchuh
Contributor

AustinSchuh commented Apr 27, 2020

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.

We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user. The offset for the scalar fields in the vtable is 0 if they aren't populated. See CheckField<> and GetField<> (for how it handles defaults) for the gory implementation details. (I've got a patch which implements has_ it in C++ if there is upstream interest)
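The mechanism described here (a vtable slot of 0 meaning "not populated") can be sketched without the real library. The code below is not the actual `CheckField<>`/`GetField<>` implementation; `TryGetScalar` and `Load` are hypothetical names, and for brevity the sketch reads in host byte order, which matches the on-wire little-endian layout only on little-endian machines.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-order load that is safe for any alignment.
template <typename T>
T Load(const std::uint8_t* p) {
  T v;
  std::memcpy(&v, p, sizeof(T));
  return v;
}

// Returns true and writes *out if the scalar field is present.
// `table` points at the start of a table, whose first 4 bytes hold the
// signed distance back to its vtable. `vt_slot` is the byte position of
// the field's entry inside the vtable (4 for the first field, 6 for the
// second, and so on).
template <typename T>
bool TryGetScalar(const std::uint8_t* table, std::uint16_t vt_slot, T* out) {
  const std::uint8_t* vtable = table - Load<std::int32_t>(table);
  const std::uint16_t vtable_size = Load<std::uint16_t>(vtable);
  // Writer used an older schema that does not know this field.
  if (vt_slot + sizeof(std::uint16_t) > vtable_size) return false;
  const std::uint16_t field_offset = Load<std::uint16_t>(vtable + vt_slot);
  // Slot value 0 means the field was not written.
  if (field_offset == 0) return false;
  *out = Load<T>(table + field_offset);
  return true;
}
```

With an accessor of this shape, a stored 0 and an absent field are distinguishable, which is the property the generated `has_`-style accessor would expose.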

Protobuf started out with required being the default and concluded that it was a bad idea. protocolbuffers/protobuf#2497 is a small part of the discussion.

@krojew
Contributor

krojew commented Apr 27, 2020

We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user.

If something isn't publicly available, it doesn't exist from the user perspective. We should expose such information.

Additionally, I have an impression we're reaching the classic dilemma of speed vs size. So the first question we need to answer is - which one do we favor? For me, FB was always about performance, so if we need to keep padding or sacrifice some internal improvements for the sake of being fast, I would personally stick with that. If some improvement doesn't impact performance, then let's consider it.

@rw
Collaborator

rw commented Apr 27, 2020

  4. Allow different size vtable offsets. [...] you can't undo it when your table grows.

Maybe a varint?

@rw
Collaborator

rw commented Apr 27, 2020

Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload? I know that that gets into "framing format" territory, but providing a set of feature flags at the beginning of a table (or at the beginning of an entire buffer) could allow a lot of flexibility at little cost.

It would probably just be an optional initial table at the beginning, containing metadata.
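As a rough illustration of the feature-flag idea, a reader could compare the flags a buffer declares against the set it implements. Everything in this sketch is hypothetical: the flag names, their bit positions, and the one-byte width are invented here and were not proposed in the thread.

```cpp
#include <cstdint>

// Hypothetical per-buffer feature flags stored in a header byte.
enum FormatFlags : std::uint8_t {
  kNoPadding = 1u << 0,            // scalars are stored unaligned
  kNarrowVtableOffsets = 1u << 1,  // some vtables use 8-bit offsets
  kOffsets64 = 1u << 2,            // some offsets are 64-bit
  kNoStringTerminator = 1u << 3,   // strings carry no trailing 0
};

// True if the buffer declares the given feature.
inline bool HasFlag(std::uint8_t header, FormatFlags flag) {
  return (header & flag) != 0;
}

// A reader can accept a buffer only if every declared feature is one it
// implements; any unknown bit means the layout may not be understood.
inline bool CanRead(std::uint8_t header, std::uint8_t supported) {
  return (header & ~supported) == 0;
}
```

A check like `CanRead` runs once per buffer, so it would not add per-field cost, but the accessors would still need to be generated for whichever combination the flags select.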

@mzaks
Contributor

mzaks commented Apr 27, 2020

My 5 cents.

  • removing padding, is generally a good thing, I am a bit concerned with possible crashes. In FlatBuffersSwift padding is optional and I had issues with A5 chips (iPad2 for example) crashing on me. Could be that it can be mitigated with better library code though.
  • smaller vTable pointers, yes could be also a flag in the vTable size value. We don't need all 16bits to represent the number of fields, one bit can be spared to indicate if the relative offsets are 1 or 2 bytes long
  • remove 0 termination in strings. Yes please :). I read the concerns from @mikkelfj, but honestly I think 0 termination is just wrong, specifically as the string is utf8 encoded and can have 0 values. So relying on 0 byte to be end of the string is dangerous anyways. As a compromise we could have a special cstring type. This is what languages like Swift and Rust have. The native string representation is utf8, but for fast interop with C, there is zero terminated cstring.
  • speaking of strings, I would suggest to represent the length of the string with a varint format. Be it VLQ, FLIT, or something else. This would be a big win for short strings. FLIT is supposed to be faster than VLQ, but I am concerned about how the linked C implementation ignores possible alignment issues.
  • 64-bit offsets, yes please and could we please allow cycles now :). I am ok if it is under a configuration flag and users need to explicitly opt in, sign it in blood on the schema definition. But being able to represent full graphs and not just DAGs is big, as it opens up possibilities other formats can not allow. Specifically for object API. As object graphs can have cycles and we can encode them in FlatBuffers, just one to one. FlatBuffersSwift does it already and I am happy to help bring it to any other language. It would start with flatc identifying tables in schemas which have recursive (transitive recursive) definitions.
  • I have also ideas regarding possible, breaking / non breaking features for FlexBuffers, not sure if it is the right place to write them though.
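For reference, a varint length prefix could look like the sketch below. It uses unsigned LEB128 (the scheme named in item 15 of the opening post) rather than VLQ or FLIT, purely because it is the simplest to show; `EncodeUleb128` and `DecodeUleb128` are illustrative names, and the decoder assumes well-formed input with enough readable bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Writes `value` as unsigned LEB128 into `out` (needs up to 10 bytes)
// and returns the number of bytes written.
inline std::size_t EncodeUleb128(std::uint64_t value, std::uint8_t* out) {
  std::size_t n = 0;
  do {
    std::uint8_t byte = static_cast<std::uint8_t>(value & 0x7Fu);
    value >>= 7;
    if (value != 0) byte |= 0x80u;  // continuation bit: more bytes follow
    out[n++] = byte;
  } while (value != 0);
  return n;
}

// Reads one unsigned LEB128 value from `p`; `*length` receives the
// number of bytes consumed. Assumes the input is well formed.
inline std::uint64_t DecodeUleb128(const std::uint8_t* p,
                                   std::size_t* length) {
  std::uint64_t value = 0;
  std::size_t n = 0;
  unsigned shift = 0;
  std::uint8_t byte;
  do {
    byte = p[n++];
    value |= static_cast<std::uint64_t>(byte & 0x7Fu) << shift;
    shift += 7;
  } while ((byte & 0x80u) != 0 && shift < 64);
  *length = n;
  return value;
}
```

Any string shorter than 128 bytes gets a one-byte length this way, at the cost of a data-dependent loop on every read, which is the speed concern raised elsewhere in the thread.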

@mustiikhalil
Collaborator

My personal view on the following:

  9. Construct the buffer forwards (rather than backwards like currently all implementations). This simplifies a lot of code and buffer management.
    Really like it, at least for swift, we would be able to handle stuff much smoother than the current implementation.

@maxburke

I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

@krojew

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.

This is an amazing idea, actually. Which made me think if we can actually remove the vtables completely, since all the fields are going to be required, or optional as mentioned above can be presented as a Bool. This would allow us to remove the vtable, and use the Generated table that's already predefined.

example:
when trying to encode the monster object, the following happens
4, 0, 0, 0, 0, 8, 0, 12, 0, 16, 20, 0, 24, 0, 28, 32, 36, 40, 44, 48, 0, 0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56, 0, 60, 64, 68, 0, 0, 0, 72, 0, 76, 0, 0, 80 -> Monster.name, offset.
4, 0, 0, 0, 0, 8, 0, 12, 0, 16, 20, 0, 24, 0, 28, 32, 36, 40, 44, 48, 0, 0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56, 0, 60, 64, 68, 0, 0, 0, 72, 0, 76, 0, 0, 80 -> Monster.name, this will be removed and a reference will be created to the old encoded table 245, 245, 250, 0.
Now all we have to do is relate it to the predefined table we have, where we can simply reference the number of the table instead.

@mzaks
Contributor

mzaks commented Apr 27, 2020

Regarding the optional / required table field status mentioned by @maxburke. I think changing current behaviour is a bad idea in regards to backward and forward compatibility. Marking fields as non optional is a "lie" we make for API convenience. There are no guarantees that a field is present, as we have no control over who created the buffer, given your system is not a closed one. When we define a field as required, we commit to not being able to deprecate this field. We also can't set a new field as required unless we can guarantee that no older clients, which don't know this field exists, will send buffers to newer clients which require this field to exist.

Speaking of forwards and backwards compatibility, enums is another blind spot. Specifically the way enums are converted to JSON. I think in order to grant proper backwards and forwards compatibility enums need to be always represented as numbers, even though it looks nicer as text in JSON. But that breaks with new cases and also if a case has to be renamed.

@mzaks
Contributor

mzaks commented Apr 27, 2020

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags if a scalar is present or not, as we need to right now if 0 is a legitimate value, is quite a frustrating workaround, given non-scalars have this built-in.

There are two options how it can be solved in current FlatBuffers implementation.

  1. You disable default values and check if the value in vTable is 0 (I think most libraries have code for that)
  2. You introduce a struct which wraps the scalar value. As structs can't have default values and are zero cost if they have only one field, you get optional scalar value "for free".

I guess for the new version this kind of feature could be addressed in a more straight forward way, by being able to define a table scalar field as optional in the schema directly.

@krojew
Contributor

krojew commented Apr 27, 2020

Switching to required by default does not break future compatibility any more than the current attribute. On the other hand we gain possible optimizations and more intuitive schemas. Today some fields are truly optional and require annotating with an attribute to make required, while others cannot be annotated as such and it's up to the users to guess if they're optional or not.

@mzaks
Contributor

mzaks commented Apr 27, 2020

Switching to required by default does not break future compatibility any more than the current attribute.

There are two options how one can design default behaviour:

  1. Convenience first. Enabling power users to be as "lazy" as possible.
  2. Safety first. Enabling novices and people new to the topic, not to shoot themselves in the foot in the long run, just because they did not understand all the details.

Setting optional per default is going with safety first approach. You can add required keyword to the schema any time if you decide that it is good for you. Switching from required to optional however is potentially dangerous in the long run.

On the other hand we gain possible optimizations and more intuitive schemas.

Optimizations from technical (performance) perspective? I don't see it, but please I am open for suggestions?

Today some fields are truly optional and require annotating with an attribute to make required, while others cannot be annotated as such and it's up to the users to guess if they're optional or not.

When in doubt always check for null. Specifically if you use FlatBuffers for communication. If I know your server expects a required field I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.

@krojew
Contributor

krojew commented Apr 27, 2020

Setting optional per default is going with safety first approach. You can add required keyword to the schema any time if you decide that it is good for you. Switching from required to optional however is potentially dangerous in the long run.

This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.
Also, experience shows people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required. I believe switching to required by default will be the intuitive way to go. We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.

Optimizations from technical (performance) perspective? I don't see it, but please I am open for suggestions?

Required things can always be stored inline. This is quite good for performance.

When in doubt always check for null. Specifically if you use FlatBuffers for communication. If I know your server expects a required field I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.

Scalar types don't have a notion of null. That's another thing I would love to see - take a look at Option in Rust. Also, verifying a buffer is a separate subject, so let's not mix it in.

Side note, is anyone working on a verifier for Rust, like in cfb?

@mzaks
Contributor

mzaks commented Apr 27, 2020

This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.

Ok, let's go with anecdotal 😀. In 2015 I worked on a city builder game where we stored user progress in FlatBuffers. The game was developed in Unity3D, but the town map itself was an isometric view. So a building position could be identified with a Position table which carries x and y as grid coordinates. In 2016 the game designers introduced hills on the map, so a Position carrying just x and y was not sufficient any more. We had to introduce a z field. We introduced it and all worked perfectly smoothly. Imagine we had introduced z as a required field. What would happen? Probably nothing in development, as in development you mostly start from a fresh state, or have an admin tool to populate the city automatically. But once we shipped the change, every new version of the game would crash on start. Why? Because the stored game state has Position without a z field, and the z field is now required. This would be a small disaster: it was a mobile game, and deploying a fix on iOS can take days. So none of the existing players could play the game for days. You would have a short-term impact of revenues going down and a possible long-term impact of people installing another game of the same genre and abandoning your game altogether.

What I am trying to visualise with this colourful example is that with required being the default, the evolution of a schema becomes a minefield, which only people with experience will be able to avoid. To be honest with you, when we switched from Position(x, y) to Position(x, y, z) we would have stepped on the mine as well. It was our luck that we were protected by the sensible default behaviour of FlatBuffers. Because you are absolutely correct:

... people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required

Also regarding this:

We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.

I am not sure why removing required altogether is not the goal? Protocol Buffers version 3 did it and I personally would vote for doing it in FlatBuffers too. 😉

Anyways, I think I wrote enough. The decision lies with @aardappel.

@krojew
Contributor

krojew commented Apr 27, 2020

@mzaks I think there's a misunderstanding somewhere, because your example is quite wrong in this case. First of all, adding a new field to a table (required or not) will not break existing code (assuming the usual schema evolution guidelines). Second - there's domain data known to be required and that gets marked as required, which is perfectly ok. Arguing that everything should be optional is over-defensive and will only lead to frustrations of not having the ability to express the data model properly in messages (with the additional burden of fact-checking every piece of data). Third - saying that someone did something is not a valid technical argument :)

That being said, my opinion on optional data for a future protocol, if it ever happens, is:

  • Make required the default.
  • Add proper optional support for scalar types.

@stewartmiles
Contributor

I agree strongly with @rw that using a format specifier as part of the header would definitely be a great way to evolve the format.

Lots of great comments on this thread, here's my thoughts to add to the mix:

  1. I really don't like the idea of removing all padding. One of the major benefits of FlatBuffers is the ability to read into memory and use, this removes this possibility for many architectures. If this functionality is preserved with a flag for those odd CPUs (i.e most) that require natural alignment to be efficient then I guess that's ok.
  2. Moving unions into a single field sounds good. 6-bytes vs. 8-bytes .. why though? If a developer is looking for a compact format - FlatBuffers definitely trades size for speed - then perhaps they could try something else?
  3. Removing the size field of vtable seems reasonable.
  4. Since this is code-gen'd sounds reasonable though more complicated when inspecting the format.
  5. Inline vtables SG assuming this can be handled at codegen time.
  6. Inline strings / vectors SG assuming this can be handled at codegen time.
  7. Removing 0-termination of strings seems like a recipe for things blowing up in special ways, especially since I would assume most C/C++ code still assumes zero terminated strings. It again seems like a trade-off between size / speed, without a zero terminate string C/C++ code potentially has to copy which seems bad. Perhaps for cases where the developer knows up front they could use a flag to strip the terminator and the C/C++ code would generate accessors that potentially copy if a caller needs a zero terminated string (yuck).
  8. 8-bit and 16-bit size fields sound ok assuming the conditions are evaluated at code gen time. Again, feels like a size vs. speed trade-off. Perhaps consider making this sort of thing optional?
  9. Constructing the buffer forwards would certainly be more intuitive, though why did you construct backwards again? Per 26a3073 (wow, that was a long code review a long time ago): "The current implementation constructs these buffers backwards, since that significantly reduces the amount of bookkeeping and simplifies the construction API." Is that no longer the case?
  10. Definitely always include the file identifier, a couple of bytes always being present allows the format to evolve. I vaguely remember a discussion about this :)
  11. Supporting 64-bit offsets from the start seems like a reasonable trade-off for flexibility.
  12. Anything that makes mutation easier would be great. Even in the case with shared v-tables - I know the implementation is more complex because you're basically doing copy-on-write - it would be great to more efficiently handle mutation and packing of a FlatBuffers so that use cases of other well used RPC formats can be replaced ;)
  13. Flexbuffers integration would be great to mix schema-less data into the API. Would be far better than implementing schemaless data in Flatbuffers which ends up being a common pattern.
  14. Nested vectors could be neat, not critical though, it's a tiny bit of math to convert a flat to nested vector.
  15. Built-in variable sized integers again sound like a space vs. speed trade-off, if fields are explicitly marked as variable sized. Before implementing a feature like this it's probably worth figuring out what the size savings and time cost would be for functionality vs. laying other compression schemes on top of FlatBuffers.
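The "accessor that copies" fallback from item 7 is small enough to sketch. `RawString` and `ToCStringCopy` are hypothetical names standing in for whatever a generated accessor would return for a length-prefixed string without a terminator; they are not part of any existing FlatBuffers API.

```cpp
#include <cstdint>
#include <string>

// Hypothetical view of a length-prefixed string with no trailing 0.
struct RawString {
  const char* data;    // first character, not null-terminated
  std::uint32_t size;  // number of bytes
};

// Makes an owning copy whose c_str() is guaranteed to be 0-terminated,
// for passing to C APIs that rely on the terminator. This allocates
// for strings too long for the small-string optimization, which is
// the cost the size/speed trade-off refers to.
inline std::string ToCStringCopy(RawString s) {
  return std::string(s.data, s.size);
}
```

A call such as `std::fopen(ToCStringCopy(name).c_str(), "rb")` would be safe because the temporary lives until the end of the full expression; code that takes pointer plus length would use `data` and `size` directly and never pay for the copy.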

@adsharma

adsharma commented Apr 27, 2020

Support for indexed tables where a flatbuffer table is annotated with "key" and "value" annotations. This sounded like a niche use case - but I believe it's an important one and likely to be occupied by something else if not addressed.

@krojew
Contributor

krojew commented Apr 27, 2020

I think 0-terminated strings are not an issue. Removing C-style string getters would solve it.

@vglavnyy
Contributor

  1. Remove all padding.

It is possible if bit_cast<T> is used wherever C++ code accesses scalars. That would put the C++ implementation on an equal footing with other languages that can't use a reinterpret cast.
It would be better to assume that alignment (padding) of a field may be a random number in the range [0..N] (not a power of 2). Internal implementation should not depend on a user-defined alignment that can be specified for every field in a schema.
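
A minimal sketch of such an alignment-independent scalar read, assuming the buffer holds values in host byte order. The helper name `read_scalar` is hypothetical. It uses `std::memcpy`, which is the usual way to get bit_cast-like behavior from a byte pointer (C++20 `std::bit_cast` converts between objects of equal size rather than reading through a pointer).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read a scalar of type T from any byte address,
// regardless of alignment. Compilers typically lower this to a plain load
// on CPUs that support unaligned access.
template <typename T>
T read_scalar(const unsigned char* p) {
  T value;
  std::memcpy(&value, p, sizeof(T));  // no alignment requirement on p
  return value;
}
```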

  1. Allow different size vtable offsets.
  2. Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32.

For vtable offsets and types, ASN1 or a variable-length encoding can be used. It may be a good idea to make the length of an fbs-message aligned to 4 or 8 so data can be pre-fetched without extra checking.
This is a simple 1:2, 1:4, or 1:8 decoder:

  auto x = load_uint32(bytes); // it can be uint16 or uint64
  auto offset = ((0u - (x & 1u)) & x | (x & 0xFFu) ) >> 1u; // [0x00-0x7F] or [0-0x7fffffff]
  bool is_1 = !(x & 1u);
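
The 1:4 case of the snippet above can be written out as a self-contained function. This is only a sketch: `Decoded` and `decode_1_or_4` are hypothetical names, the word is assembled as little-endian, and it assumes at least four readable bytes at the pointer (which is what padding the message length would guarantee).

```cpp
#include <cstdint>

// Hypothetical result type: the decoded value and how many bytes it used.
struct Decoded {
  uint32_t value;
  unsigned size;  // 1 or 4
};

// Low bit 0: the value is the upper 7 bits of the first byte.
// Low bit 1: the value is the upper 31 bits of the 32-bit word.
inline Decoded decode_1_or_4(const unsigned char* bytes) {
  // Assemble a little-endian 32-bit word; always reads four bytes.
  const uint32_t x = static_cast<uint32_t>(bytes[0]) |
                     (static_cast<uint32_t>(bytes[1]) << 8) |
                     (static_cast<uint32_t>(bytes[2]) << 16) |
                     (static_cast<uint32_t>(bytes[3]) << 24);
  const uint32_t wide = x & 1u;
  const uint32_t mask = 0u - wide;  // all ones if wide, otherwise zero
  const uint32_t value = ((mask & x) | (x & 0xFFu)) >> 1u;  // branch-free
  return {value, wide ? 4u : 1u};
}
```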

FlexBuffer

Fast FlexBuffer with writing to pre-allocated memory (without any memory allocations) can be useful as a fast logging core.

@aardappel
Collaborator Author

@mikkelfj not sure why you refer to some of these as "slow": like I said, these are all intended to be codegen-ed (or become a template/macro arg) so will all be maximally fast, certainly no slower than any existing feature. None of these features are intended to have dynamic checks.

Not sure what you mean by strong JSON integration. JSON is text and not random access, so serves an entirely different use case than FlexBuffers. I am not sure what it would even mean to use JSON in this case, other than to store a string.

A new format could mean a new way to do file_identifier, I'd be open to that. Could be variable length.

"drop nested tables" .. what does that mean?

@aardappel
Collaborator Author

@AustinSchuh forward building would be a big format change, since now children typically come before parents, and unsigned child offsets would be flipped to always point downwards in memory instead. Retaining the feature that these offsets always point one way and never can form a cycle would be good, I think.

A lot of people use FlatBuffers in cases where sparse / random access, or use with mmap are important, and those don't work with an external compression pass. I personally think there's a lot of value storing things more compactly in memory that is directly useable. Or at least, that is what FlatBuffers specializes in. None of these more compressed representations should be any slower (in fact, faster) than the current representation, they mostly just complicate codegen.

@aardappel
Collaborator Author

@maxburke the varints would be very much optional, as they are definitely slower to read. They would be a type, so you'd explicitly write either int (as right now) or varint for a field.
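
For a sense of the read cost, here is a sketch of one common varint scheme (unsigned LEB128, 7 payload bits per byte). The thread does not specify an encoding, so the choice of LEB128 and the names `put_varint` and `get_varint` are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append v as unsigned LEB128: 7 bits per byte, high bit set on all but the last.
void put_varint(std::vector<unsigned char>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<unsigned char>(v | 0x80));  // low 7 bits + continue flag
    v >>= 7;
  }
  out.push_back(static_cast<unsigned char>(v));
}

// Decode from p[0..n). Returns the number of bytes consumed,
// or 0 if the input is truncated or longer than 10 bytes.
std::size_t get_varint(const unsigned char* p, std::size_t n, uint64_t& v) {
  v = 0;
  for (std::size_t i = 0; i < n && i < 10; ++i) {
    v |= static_cast<uint64_t>(p[i] & 0x7F) << (7 * i);
    if (!(p[i] & 0x80)) return i + 1;  // last byte reached
  }
  return 0;
}
```

The loop with a data-dependent exit is the part that makes reads slower than a fixed-width load.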

See above about compression and efficiency.

Default required would be problematic for an evolving format. Protobuf came to the opposite conclusion after many years of experience, and FlatBuffers went along with that.

@aardappel
Collaborator Author

aardappel commented Apr 27, 2020

@AustinSchuh

The offset for the scalar fields in the vtable is 0 if they aren't populated

Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.

I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.
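
A simplified sketch of the "blind" scalar access being described, assuming the current layout: a table starts with a signed 32-bit offset back to its vtable, and the vtable holds two 16-bit sizes followed by one 16-bit offset per field. The names `load` and `get_field` are hypothetical, and the sketch reads in host byte order (the real format is little-endian).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: alignment-independent load in host byte order.
template <typename T>
T load(const unsigned char* p) {
  T v;
  std::memcpy(&v, p, sizeof v);
  return v;
}

// vtable: [u16 vtable_bytes][u16 table_bytes][u16 field offsets...]
// table:  [i32 soffset to vtable][field data...]
template <typename T>
T get_field(const unsigned char* table, std::size_t slot, T def) {
  const unsigned char* vtable = table - load<int32_t>(table);
  const std::size_t entry = 4 + 2 * slot;  // skip the two size fields
  if (entry >= load<uint16_t>(vtable)) return def;  // field unknown to the writer
  const uint16_t off = load<uint16_t>(vtable + entry);
  return off ? load<T>(table + off) : def;  // 0 means "equal to default"
}
```

Both "never written" and "written as the default" end up on the same `def` path, which is why the zero offset cannot serve as a presence test.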

@aardappel
Collaborator Author

aardappel commented Apr 27, 2020

@rw

Maybe a varint?

That would make the distinction dynamic, and costly. Besides, we need to index into this table, which makes varints useless unless all the same size. The idea is that this is a static, codegen feature, associated with a particular type of table.

Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?

No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.

@mustiikhalil
Collaborator

Regarding the UUIDs, I meant that at least the generated code could call a function to verify whether it's a UUID, since we can still pass garbage strings as a UUID. It would be nice for the library to validate it beforehand.

I would love to see 1 too, but I am not sure if that would work normally with Swift.

@mikkelfj
Contributor

mikkelfj commented Jan 5, 2021

I'd rather avoid anything except very core types in the FlatBuffer format, otherwise it never ends. We are just discussing how to add type aliases with attributes (#5597). These can be used to define new types with associated meaning, for example:

attribute uuid_format;

type uuid = [uint8:16] (uuid_format:4);

The above can only be used in structs. I think it would be nice to allow this in fields as well, and that could happen in a v2 format, but we don't have to define all kinds of specialized types. There are also GPS coordinates, 170 different timestamp formats, and so on. I'm happy with UTF-8 but that is about as advanced as I'd like to get.

@mikkelfj
Contributor

mikkelfj commented Jan 5, 2021

As to priorities, I'll keep this in mind, but it requires a bit more than an afterthought to put forward, and the plate is full atm.

Forward writing is a must. I am going to do them one way or the other at some point.

One thing I think has not been listed, which I think is important:
Currently default values, null values and optional values are all over the place. I'd like the schema to be much more well - congruent - or something, similar to SQL semantics. We can then map that to languages that have pointers (efficiently, not theoretically) rather than the other way around.

As to 64-bits. Important, but takes up way too much space unless done very carefully. Maybe there needs to be some flags early in the file to indicate the offset scale. Generated code does not have to support all variants, but should fail on unsupported ones.

@mikkelfj
Contributor

mikkelfj commented Jan 5, 2021

I'm a bit torn on padding. But I think the use cases where alignment matters are better handled by copying the data, otherwise you just end up copying the entire buffer from some I/O device.

If we forego padding, the code becomes MUCH simpler and therefore also faster. When combined with forward writing (StreamBuffers), this allows for efficient streaming rewrites, for example from JSON to FlatBuffers in network sized chunks.

Another feature I'm not sure has been given much thought is the ability to clone parts of a buffer without full metadata knowledge - that is a lot of code generation for clone logic that shouldn't really be necessary. If the vtable has information about in-place vs offset this should be possible. This might counteract the suggestion to remove table sizes (which are otherwise pretty useless except for verification, but verification might be an important argument too?).

And if we add flags to buffer header, we could indicate if the buffer is a DAG, a Tree, or a general graph (even if we generally disallow these, there are arguments for them when representation is more important than JSON translation and verification, but we do want to know what they are up front).

@vglavnyy
Contributor

vglavnyy commented Jan 6, 2021

But I think the uses cases where alignment matters are better handled by copying the data, otherwise you just end up copying the entire buffer from some I/O device.

Flatbuffers has two main compound types: struct and table. We can only remove padding bytes from tables.
Every declared field or variable in C/C++ has alignment requirements. So the current implementation of C++ flatbuffers::struct is impossible without the use of padding inside a buffer memory. Also, every struct must be aligned to the first field inside the struct or to force_alignment: N boundary.
Note: The attribute force_alignment: N can be applied to struct declaration or to vector fields inside a table (see Parser::ParseVector for details). It should be taken into account.
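
A small illustration of why a struct carries internal padding in C++. The `Sample` struct is hypothetical, and the concrete numbers in the comments assume a typical ABI where `uint32_t` has 4-byte alignment.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct with mixed field sizes.
struct Sample {
  uint8_t tag;     // offset 0, followed by 3 padding bytes on typical ABIs
  uint32_t id;     // offset 4, must sit on a 4-byte boundary
  uint16_t flags;  // offset 8, followed by 2 bytes of tail padding
};

// The struct inherits the strictest alignment of its members.
static_assert(alignof(Sample) == alignof(uint32_t),
              "struct alignment follows its most-aligned field");

// Bytes of padding the compiler inserted (5 on typical ABIs: 12 - 7).
constexpr std::size_t kPaddingBytes =
    sizeof(Sample) - (sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint16_t));
```

Removing those bytes from the wire format would mean generated accessors can no longer overlay a native struct on the buffer memory.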

@mikkelfj
Contributor

mikkelfj commented Jan 6, 2021

@vglavnyy absolutely agree. Structs must remain padded internally as is, with zero-filled padding where practically possible (it is hard to guarantee in C but easy to achieve in practice). Also, force_align is a useful facility here.
While struct padding does make the schema compiler more complex, it is easy for the code generator because all offsets are precomputed and do not depend on how the buffer is generated.

The big question is if offset to a struct from buffer start must also be aligned. That is more tricky to achieve but C/C++ compilers would expect that. But it only works when the buffer is also aligned. I'm not really sure what to say here.

Another thing is that if the buffer ends up having some alignment, I would like to see that in the header of the buffer, so you can check whether the standard 8-byte alignment from malloc is enough or you need to do more. This isn't entirely a schema thing since a possible 64-byte cacheline aligned array might never actually be added to the buffer.

@CasperN
Collaborator

CasperN commented Jan 6, 2021

On verified UUIDs, I think a more general solution is to implement a mechanism for custom validation (e.g. plugins to code generation) and release a "common types" schema that uses that mechanism (much like Google.Protobuf.WellKnownTypes).

@adsharma

adsharma commented Jan 6, 2021 via email

@victorstewart

I support 9 RE #6188

something else very powerful would be the ability to construct nested buffers inline in a single meta operation.

almost all of my buffer usage is nested, and i shed outer layers and pass around nested objects throughout my application server as i process messages. currently I have to construct each buffer independently, and then copy it into the next one... sometimes I have 3 or 4 layers of this, aka 3 or 4 copies.

i've flirted with the idea of writing this myself (specifically expanding the simple single level sequential custom application serialization protocol i use for some of my networking). so i'd definitely be interested in joining a collaboration.

@mikkelfj
Contributor

@victorstewart FlatCC supports creating nested buffers while the parent buffer is being constructed. This ensures correct alignment regardless of the nested buffer content. I hardly ever use nested buffers though.

@GregBowyer

I am late to this party (and if this exists then I would love to know!)

Supporting Lexicographic ordering on structs for key demarked things would be good. We use flatbuffers for storing things in K/V stores but typically encode the keys external to flatbuffers.

My dream would be

// Struct could also be Key ?
struct SomeKey {
   part1: uint32 lexicographic; // or big_endian ...?
   part2: uint32 lexicographic;
}

table Payload {
  name: xxx;
  age: xxx;
  address: xxx;
  ...
}

table Customers {
  key: SomeKey key;
  val: Payload;
}

// Maybe this could exist for splitting things such that flatbuffers can have both parts?
key_type SomeKey;
root_type Customers;
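
What the hypothetical `lexicographic` / `big_endian` annotation would buy can be sketched as follows: if each key part is stored most-significant byte first, a plain byte-wise comparison (as done by most K/V stores) matches numeric ordering. `encode_key` and `key_less` are illustrative names, not an existing API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using KeyBytes = std::array<unsigned char, 8>;

// Encode both parts big-endian so byte order equals numeric order.
KeyBytes encode_key(uint32_t part1, uint32_t part2) {
  KeyBytes out{};
  const uint32_t parts[2] = {part1, part2};
  for (std::size_t p = 0; p < 2; ++p) {
    for (std::size_t i = 0; i < 4; ++i) {
      // Most significant byte first.
      out[p * 4 + i] = static_cast<unsigned char>(parts[p] >> (24 - 8 * i));
    }
  }
  return out;
}

// Byte-wise comparison, as a K/V store with bytewise key ordering would do.
bool key_less(const KeyBytes& a, const KeyBytes& b) {
  return std::memcmp(a.data(), b.data(), a.size()) < 0;
}
```

With the little-endian encoding FlatBuffers uses today, 256 would sort before 255 under the same byte-wise comparison, which is why keys tend to be encoded outside the buffer.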

@hassila
Contributor

hassila commented Apr 25, 2022

Even later: 1, 4, 8, 9, 10 (definitely always frame with a length field for streaming, more important than file id IMHO).
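
A minimal sketch of length-field framing for streaming, assuming a 4-byte little-endian length prefix per message. The names `append_frame` and `read_frame` and the prefix width are assumptions for illustration, not a proposal for the actual encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Append one message to the stream as [u32 little-endian length][payload].
void append_frame(Bytes& stream, const Bytes& payload) {
  const uint32_t n = static_cast<uint32_t>(payload.size());
  for (int i = 0; i < 4; ++i) {
    stream.push_back(static_cast<unsigned char>(n >> (8 * i)));
  }
  stream.insert(stream.end(), payload.begin(), payload.end());
}

// If a complete frame starts at pos, copy it out, advance pos, return true.
// Returns false when more bytes are needed (e.g. a partial network read).
bool read_frame(const Bytes& stream, std::size_t& pos, Bytes& payload) {
  if (pos > stream.size() || stream.size() - pos < 4) return false;
  uint32_t n = 0;
  for (int i = 0; i < 4; ++i) {
    n |= static_cast<uint32_t>(stream[pos + i]) << (8 * i);
  }
  if (stream.size() - pos - 4 < n) return false;  // payload not complete yet
  const auto first = stream.begin() + static_cast<std::ptrdiff_t>(pos + 4);
  payload.assign(first, first + static_cast<std::ptrdiff_t>(n));
  pos += 4 + static_cast<std::size_t>(n);
  return true;
}
```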

@aardappel
Collaborator Author

Adding to item 15) in the original list (varints): It would make even more sense to have a dedicated "vector of varint", since that would make multiple varints even cheaper than multiple fields of them. The size field would be a varint itself, and you could optionally (or even by default) have these vectors be inline in the table for super-compact small vectors. They'd be accessed by iterators instead of indexing.

Similarly, if you're going to have inline strings (item 8), then having the size field be varint (instead of having to declare 8/16/32) would make a lot of sense.

@github-actions

github-actions bot commented Mar 4, 2023

This issue is stale because it has been open 6 months with no activity. Please comment or label not-stale, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Mar 4, 2023
@github-actions

This issue was automatically closed due to no activity for 6 months plus the 14 day notice period.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 18, 2023
@dbaileychess dbaileychess reopened this May 9, 2023
@dbaileychess dbaileychess added not-stale Explicitly exempt marking this stale and removed stale labels May 9, 2023
@vharron

vharron commented May 17, 2023

If Java doesn't have a buffer type that can be indexed with a 64-bit index, then it can never load buffers >2GB presumably. However, even such smaller buffers can still contain 64-bit indices, since they can be safely converted to 32-bit when used as index. How do Java people deal with large buffers?

For Java, is it possible to create an array with 64-bit entries as a set of multiple sequential 32-bit arrays? The getElement(long idx) function would just shift and mask idx into an array number and an offset into that array.
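
The shift-and-mask arithmetic is language independent, so here is a sketch of it in C++ (the language used elsewhere in this thread) rather than Java. `ChunkedBytes` and its members are hypothetical names; the chunk size is a template parameter only so the idea can be tried with small chunks.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical container: one logical 64-bit-indexed byte array stored as
// several fixed-size chunks of 2^ChunkBits bytes each.
template <unsigned ChunkBits>
class ChunkedBytes {
 public:
  explicit ChunkedBytes(uint64_t size)
      : chunks_((size + kChunkSize - 1) >> ChunkBits,
                std::vector<unsigned char>(kChunkSize)) {}

  // High bits select the chunk, low bits select the byte within it.
  static uint64_t chunk_of(uint64_t idx) { return idx >> ChunkBits; }
  static uint64_t offset_of(uint64_t idx) { return idx & (kChunkSize - 1); }

  unsigned char& at(uint64_t idx) {
    return chunks_[chunk_of(idx)][offset_of(idx)];
  }

 private:
  static constexpr uint64_t kChunkSize = uint64_t{1} << ChunkBits;
  std::vector<std::vector<unsigned char>> chunks_;
};
```

This only covers single-element access; a multi-byte value that straddles two chunks would still need to be reassembled byte by byte.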

@vharron

vharron commented May 17, 2023

Remove the 2nd field of the vtable. This field stores the table size, but it is never used in any implementations. This was intended for a streaming API that never happened.

How much space does this take? If trivial, should it be left in to leave the door open for read streaming?

Remove 0-termination of strings. Only C/C++ care for this, and C++ has been moving toward string_view recently, and both have been using size_t arguments for a long time rather than relying on strlen. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.

I think it's okay to make an opinionated stand on this. People should use string_view and if they aren't, they can copy the string_view into a null terminated string.
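
A sketch of what that looks like on the reading side, assuming strings stay length-prefixed with a 32-bit size (as in the current format) but lose the terminator. `view_string` is a hypothetical helper, it reads the length in host byte order, and it requires C++17 for `std::string_view`.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical accessor: p points at [u32 length][bytes...], no terminator.
std::string_view view_string(const unsigned char* p) {
  uint32_t len;
  std::memcpy(&len, p, sizeof len);  // host byte order, alignment-independent
  return std::string_view(reinterpret_cast<const char*>(p + sizeof len), len);
}

// For a legacy C API that needs 0-termination, make an owning copy;
// std::string guarantees that c_str() is null terminated.
std::string to_c_compatible(std::string_view s) { return std::string(s); }
```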

@aardappel
Collaborator Author

For Java, Is it possible to create an array with 64 bit entries as a set of multiple sequential 32 bit arrays

That sounds incredibly slow, since anything larger (like a string) would have to be lifted out element by element into a secondary buffer before it can even be converted to a string, since anything can straddle a buffer boundary. All code we currently have that can work with the underlying array would stop working or need copies.

How much space does this take? If trivial, should it be left in to leave the door open for read streaming?

This is a single 16-bit quantity in the vtable which is currently still in there in all implementations, and typically set to 0. So yes, it could still be used for some future functionality, maybe.
