Possible IETF (RFC) standardization document #8

DonaldTsang · 2019-06-15T08:59:50Z

https://github.com/json5/json5 is currently proposing to add base64 support to its superset of JSON.
They would like to develop an RFC, and I would like to discuss safe encoding with them.
Would it be possible to have an RFC proposal accompany this repo, or is that unnecessary?
json5/json5#190

kstenerud · 2019-06-15T09:02:51Z

Sure thing. Standardization is always good :)

d3x0r · 2019-06-16T08:13:09Z

What would the base64 represent, and how?

kstenerud · 2019-06-16T09:28:20Z

It would represent any binary blob of data.

The differences between this and other base64 representations are:

- The alphabet uses standard alphanumeric characters (a-zA-Z0-9), and uses - and _ as the extra characters, which are safe in every known major text format, and for names in all modern filesystems.
- The alphabet uses the same ordering as the UTF-8/ASCII representations of the characters, so the encoded data sorts in the same order as the decoded data.
- Whitespace is supported at any point in the encoded data.
- There is no padding, because it isn't needed (resulting in a smaller encoded size).
- This spec includes a variant with a length header for use when there's no clear delimiter present (this won't be the case in JSON, so it doesn't matter to the JSON spec).
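
A rough sketch of the first four points, in Python. This is not the reference implementation: the alphabet is simply the 64 listed characters sorted by ASCII value, and the big-endian 6-bit packing is an assumption made for illustration. The names `ALPHABET`, `INDEX`, `encode` and `decode` are hypothetical.

```python
# Assumed alphabet: '-', 0-9, A-Z, '_', a-z, i.e. the 64 characters in ASCII order.
ALPHABET = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Pack bytes into 6-bit groups, most significant bits first, no padding."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    out = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 6:
            nbits -= 6
            out.append(ALPHABET[(bits >> nbits) & 0x3F])
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Final partial group: fill the low bits with zeros, emit no padding character.
        out.append(ALPHABET[(bits << (6 - nbits)) & 0x3F])
    return "".join(out)


def decode(text: str) -> bytes:
    """Inverse of encode(); whitespace is skipped wherever it appears."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        if ch.isspace():
            continue
        bits = (bits << 6) | INDEX[ch]
        nbits += 6
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    # Any leftover bits are the zero fill added by encode() and are dropped.
    return bytes(out)
```

Because the alphabet is in ASCII order and the bits are packed most significant first, sorting the encoded strings gives the same order as sorting the raw byte strings, under the assumptions above.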

BTW, you may also want to encourage adoption of the safe85 spec, since it's also safe for all modern text formats, and encodes data to a smaller size.

They may also find some inspiration from https://github.com/kstenerud/concise-encoding/blob/master/cte-specification.md

d3x0r · 2019-06-16T11:41:09Z

Many of the differences you list are not really differences; they are notable points, I suppose.

So your order is '-', '0-9', 'A-Z', '_', 'a-z'... something that no other base64 encoder resembles.

And ascii85 puts all symbols at the end, which defeats the same-sort-order property.

... I was going to mention that my decoder supports all the combinations of these...

| 62 | 63 | usage |
|---|---|---|
| `+` | `/` | Base64 encoding (first listed on Wikipedia) |
| `$` | `_` | what I use... is JS identifier compatible (unlike `-`) |
| `.` | `,` | using `.` for filenames, and Base64 encoding for IMAP mailbox names (`,`) |
| `-` | | (part of url safename, the `_` being listed above) |

But then none of that really applies, since the whole map would have to change.

Safe85...
Looks like a lot of math for a 6% saving (1.33 vs 1.25 expansion); if the packet is long enough to benefit from that, it could also just be gzipped.
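
For reference, the arithmetic behind those two ratios can be checked in a few lines of Python. This assumes the usual groupings (base64 emits 4 characters per 3 bytes, a base85 scheme emits 5 characters per 4 bytes); the `expansion` helper is a hypothetical name used only here.

```python
from fractions import Fraction


def expansion(chars_out: int, bytes_in: int) -> Fraction:
    """Encoded characters produced per input byte."""
    return Fraction(chars_out, bytes_in)


base64_ratio = expansion(4, 3)  # 4 characters per 3 bytes, about 1.33
base85_ratio = expansion(5, 4)  # 5 characters per 4 bytes, exactly 1.25

# Relative saving of base85 over base64 for the same payload.
savings = 1 - base85_ratio / base64_ratio  # 1/16, i.e. 6.25%

print(float(base64_ratio), float(base85_ratio), float(savings))
```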

(Weak argument, just something I leveraged.) Base64 character pairs can be used in a lookup table of 4096 entries, which is a nice round number... but can also be used directly for a wide hash index.

I don't see how you can fit 'whitespace anywhere' with 'no padding'... I suppose you're embedding it in a string? So I wouldn't know whether it was a string or binary?

kstenerud · 2019-06-16T14:02:15Z

If you have delimiters already, you don't need padding. If there are no delimiters, you need the length prefix variant (which guarantees truncation detection, something padding can't do).
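
To illustrate the truncation point, here is a small Python sketch. It uses the standard library's ordinary base64 rather than this spec's alphabet, and the `<decimal length>:<payload>` framing is a made-up stand-in, not the length header the spec actually defines. `frame` and `unframe` are hypothetical names.

```python
import base64


def frame(data: bytes) -> str:
    """Hypothetical framing: '<decimal byte count>:<base64 payload>'."""
    return f"{len(data)}:{base64.b64encode(data).decode('ascii')}"


def unframe(text: str) -> bytes:
    """Decode a framed message, rejecting it if the payload is shorter than declared."""
    length_str, sep, payload = text.partition(":")
    if sep != ":":
        raise ValueError("missing length header separator")
    data = base64.b64decode(payload)
    if len(data) != int(length_str):
        raise ValueError("payload length does not match header (truncated?)")
    return data


data = b"ABCDEF"
plain = base64.b64encode(data).decode("ascii")

# Padded base64 cut at a 4-character boundary still decodes cleanly,
# so the receiver cannot tell that half the data is missing.
silently_short = base64.b64decode(plain[:4])

# The same cut applied to the framed message is detected.
try:
    unframe(frame(data)[:-4])
    detected = False
except ValueError:
    detected = True
```

The design point is only that a declared length lets the receiver compare what arrived against what was promised, which padding characters alone cannot do.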

Anything higher than base64 will require more processing power due to the math. But then again, once you get to high enough throughput requirements, you'd be better off going for a binary format like https://github.com/kstenerud/concise-encoding/blob/master/cbe-specification.md which doesn't need any of this trickery.

It comes down to how much weight is given to human readability on the wire vs processing cost vs bandwidth cost. Everything is a compromise.

d3x0r · 2019-06-16T16:07:46Z

Re CBE: that puts a lot of magic numbers into the encoding, and you might as well use something like protobufs or BSON :)
Encoding bytes into UTF-8 codepoints is effectively 1.5, and at best (if you supported extended 42 bit encoding) it could approach 1.33, which is where base64 starts...
I hadn't really considered a base like 85 (which is 5 * 17) before, because it seemed the wasted space would be more than the saved space, resulting in no gain... 1.25 is compelling, and there's certainly the ability to do long math mods and divs (and it would probably cost less than an extra pass of gzip).

kstenerud · 2019-06-16T16:30:13Z

Magic numbers are important; that's how you differentiate the data types efficiently. Protobufs solves a different problem than BSON/JSON/CBE. It doesn't include type data in the encoding, which means that you can only decode if you have an exact copy of the schema. And BSON is too bulky and wasteful, unfortunately.

The UTF-8 codepoint-based encodings are designed to get around the Twitter character-length limitation. They're not actually smaller. They can't get more efficient in byte length than the byte-oriented encodings like base64 and 85 and 90.

d3x0r · 2019-06-16T17:19:13Z

Exact integer size is fairly irrelevant; it is subject to implementation by the receiving platform/interpreter/environment...

Int and float are about the only two categories. These are easily captured by [0-9.-E+] and are identifiable by themselves; add [:TZ] and you have distinguishable dates (a format of data that is often usable as a type itself).
Strings are easy to denote: ""
Objects and arrays of other values: {} []
And, well, you get the idea...

And yes, [ and { are just 'magic numbers', but they don't require an accompanying document; they can instead be interpreted using common programming knowledge.

(All sorts of distinguishable data, without 'magic numbers'... err, I lie: 'ab' is a magic number for array buffer, 'u8', 'i8', ... 'f32', etc... but that's part of a higher level than the syntax.)

Though really this is all divergent from 'binary data transport across text transports'.
