Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive structures of primitive codecs #174

Open
ConsciousCode opened this issue May 14, 2020 · 4 comments
Open

Recursive structures of primitive codecs #174

ConsciousCode opened this issue May 14, 2020 · 4 comments

Comments

@ConsciousCode
Copy link

ConsciousCode commented May 14, 2020

Instead of simply enumerating every possible codec, why don't we define an exhaustive set of primitives and composing operations which can be assigned an id for reduced size? At present they're hardly self-describing, an implementation has to already know how to decode a codec.

In essence, the current multicodec table would be given an additional column representing decomposition into more primitive types, and codes assigned to more primitive types and compositional operations. This way, an implementation that doesn't know a higher-order codec can fetch an up-to-date table and know how to decompose it. A "primitive" is defined as a row that lacks a decomposition column. Initially, all codecs are primitives, but as decomposition encoding improves, they can be redefined as higher-order without affecting older implementations. Older implementations already have logic defined for decomposing former primitives, while newer implementations either define similar special decoders or defer to the decomposition.

For some vague examples of what this might look like:

name tag code composition description
compact composition ? byte_order:{be, le} bit_order:{be, le} bits:int Bit-for-bit encoding of an integer of arbitrary size
be-uint8 memory ? compact(be, be, 8) Big-endian byte order, big-endian bit order, 8-bit unsigned integer
ones composition ? bits:int offset:int $encodes:int $value:(2**bits - offset) One's complement signed integer
ieee-754-binary32 memory ? sign:bit exponent:ones(be, 8, 127) fraction:compact(be, be, 22) IEEE 32-bit floating point

The most primitive type would be a bit, and compositions would emerge from operations which consume those bits. Since the multicodec table is already filled with some highly compositional types under this scheme, maybe it'd be better to create a new project?

@Stebalien
Copy link
Member

What you're describing here is a general-purpose type language. I'd love that doesn't really obviate the need for the codec table in general. In addition to the structural information, the codecs give semantic information.

For example, the "sha256" codec doesn't tell you what the structure of the data is, it tells you that the data is a sha256 digest.

@ConsciousCode
Copy link
Author

ConsciousCode commented May 15, 2020

Semantic information can be done via an equivalent of C's typedef. I'd be kind of surprised if no one's made such a language, surely it has plenty of utility? I'm finding it hard to google for

@Stebalien
Copy link
Member

Stebalien commented May 15, 2020

That's basically what this table is. It's the type declarations without the type definitions. The codes are effectively "short names" for the type names.

You're right, there are already languages like this. For semantic information, there's an entire field of research (search for semantic web). For types, you'd need a language to describe a mapping between arbitrary serialized objects to structured data. Google points to https://en.wikipedia.org/wiki/Data_Format_Description_Language.

@warpfork
Copy link
Contributor

warpfork commented May 15, 2020

+1 to Stebalien's comments.

My favorite tincture when I get runaway thoughts in this area is to read about Gödel's Constructable Universes -- specifically the fact that they're not unique -- and remind myself that Gödel numberings are therefore unavoidable. If one decomposes "primitives" far enough, one always reaches a depth where further increases in generality mean proportional decreases in meaning itself. One reaches a Gödel numbering.

Multicodecs are just jumping to the point and making the numbering at a level that's practical and useful to us in the level of the world where we're worried about distinguishing, well, codecs. The more we decompose it, the longer of a Church encoding we get; and while that's a fascinating exercise, it doesn't help do much...

...until the point one turns around and uses that to build an interpreter. But then, ah, there we have it: we've reached the point where it's effectively inventing a new language. (And signing on to all the joys of designing, building, maintaining, and growing an ecosystem around that language/interpreter!) And if we do that...

At some point, someone might come along and observe that there's nothing particularly special about this encoding we've made versus some other Church encoding someone else has made. Because both encodings are effectively Gödelian Constructable Universes, that observation would be correct! Neither is clearly "purer" than the other. At the same time, if we each did the work particularly effectively, the Shannon entropy of the encoding should be maximum, so each encoding is indistinguishable from any pattern matching on its contents! And so...

Then to distinguish them, you'd need...

A magic number. Prepended somewhere.


... you see how it recurses :) At some point, one simply has to choose a place to stand one's ground.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@Stebalien @warpfork @ConsciousCode and others