Which of the multibase formats preserves lexicographic order? #124

CMCDragonkai · 2021-10-21T03:22:39Z

I've come across a situation requiring the ability to base-encode but preserve the lexicographic-order of the input bytes.

That is the order of the base encoding alphabet should be in the same order as those bytes that are mapped to the base encoding.

An example of this library is https://github.com/deanlandolt/base64-lex which is a base64 that preserves lexicographic order.

Which base encodings in multibase support preservation of lexicographic order?

rvagg · 2021-10-21T04:36:17Z

I don't believe we have such a base encoding in multibase, see https://github.com/multiformats/multibase for the current list. I can imagine a use for this (and can imagine how Dean is using it too!) so it might be worth getting an entry in to the multibase table (the linked repo, see multibase.csv). It might take a bit of discussion with others over there as we haven't added a new one in a while and it doesn't get that much attention. It would also be good if you had some working code that worked as a js-multiformats multibase codec (maybe in this repo, although it could also just be an external package that we link in the README here like we do with codecs and hashers). Unfortunately Dean's code is using Buffer which we try and avoid around here, but it may be pretty straightforward to either reuse the base64 encoder here and do the mapping as a second step, or insert an adjusted dictionary to coding step(s). I'm not sure but there's some test fixtures to work with at least.

CMCDragonkai · 2021-10-21T04:43:19Z

This article ~~https://github.com/deanlandolt/base64-lex~~ https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding explains how to make base64 preserve lexicographic order. From the description it seems base58btc is already in lexicographic order. However I'm not sure what the deal is with padding and endianness.

CMCDragonkai · 2021-10-21T04:43:41Z

Sorry I meant this article: https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding

rvagg · 2021-10-21T04:43:59Z

Looking at bit deeper, maybe we're already covered? Have a look at the dictionaries used for the bases in this repo and see which are properly ordered. https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js suggests that base58btc (but not base58flickr) might preserve lexicographic order. It doesn't look like any of the base64's are but the base32's seem to be. I'm not sure about the cross-byte interactions, however, maybe that rules out 58 and you'd have to go with 32?

rvagg · 2021-10-21T04:45:33Z

Right, so to expand on the above point, I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

rvagg · 2021-10-21T04:50:04Z

tbh, I don't even know if that same problem persists for base32 variants; you're going to have to do some experimentation I think

CMCDragonkai · 2021-10-21T04:54:54Z

Can you expand on what you mean by this?

I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

I imagine hex encoding should preserve lexicographic order. But I guess this also depends on what is lexicographic order? My understanding is that the order of a string should be based on the comparison on each individual byte in the string. Since bytes are just 8 bit numbers, then if you have [0, 255] that should be less than [1, 255].

In JS, it says string sort is based on utf-16 code units. I'm not sure if that is equivalent to my understanding.

rvagg · 2021-10-21T05:55:51Z

Yeah, actually I'm not at all sure how to make this work properly, with all of the bases (that are not base-256) you're dividing up bytes into pieces and having to span across bytes as you slice. Even for base64 you're slicing the bytes into 6-bit pieces, so you get the first 6 bits of a byte, then the remaining 2 and 4 of the next, then the remaining 4 of the next and 2 at the start of the next, then the remaining 6, etc. etc. So how do you retain proper sorting when you're splitting up bytes like that? Maybe I'm not fully grokking what base64-lex is doing but I find it difficult to imagine. Base-16/hex is splitting bytes in half and presenting 2 characters for each byte, I guess that might retain sorting if you did it right. Then it all comes down to endianness which is just a mess!
This is definitely something worth experimenting with, make random strings, encode them and see if they retain the sorting order you want.

CMCDragonkai · 2021-10-21T09:31:18Z

I did tests on 100,000 Ids generated by base58btc, they were all correctly sorted.

Whereas less than 100 ids generated with base64 and I got a sort order error. So I'm guessing base58btc is suffiicent.

Did my tests here: MatrixAI/js-id#8

rvagg · 2021-10-22T02:20:09Z

Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.

CMCDragonkai · 2021-10-22T03:57:00Z

Does it work in the same principle that hex encoding works. For hex, every byte can be presented by 2 hex digits. So `0 - F` is half of a byte, then `0 - F` is the next half. Hex encoding's bit-order also make sense. So for `FF`, the left most hex is the most significant 4 bits, and the right hex is the least significant 4 bits. Thus the ordering is correct for `01` all the way to `FF`. Hex numbering is also 0-1 then A-F, and that is also correct in terms of lexicographic order. For a string of multiple bytes, each byte gets encoded into 2 hex digits (if padded), or just 1 or 2 hex digits. These should be equivalent from a sorting perspective. Anyway for `[123, 124]` that becomes `7b7c` and `[124, 125]` is `7c7d`. Imagine we wanted to compare each other for sorting. Assuming lexicographic sort is based on each character (which in this case it is not precisely the case, because JS uses utf-16), then we are comparing each character by character. ``` 7b7c 7c7d ``` The `7b7c` would be considered to come before `7c7d`, simply at index 1: `b < c`. In base58btc the exact relationship between bytes to digits is not as exact as hex.

…

On 22/10/2021 1:20 pm, Rod Vagg wrote: Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#124 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAE4OHM4PJ44Q2KP7OJW4ITUIDC6HANCNFSM5GNCQ6OA>.

CMCDragonkai · 2021-10-22T04:00:50Z

I realise that I don't understand base58btc internals, I can do a test with 100,000 random strings, and sort them, then convert them to base58btc and then sort it again and compare for a sanity check.

CMCDragonkai · 2021-10-22T06:58:23Z

Ok my tests show that base58btc doesn't preserve lexicographic order. It did in my js-id tests, but that's probably due to the timestamp portion at the beginning of the data.

import type { Codec } from 'multiformats/bases/base';

import crypto from 'crypto';
import { bases } from 'multiformats/basics';

function randomBytes(size: number): Uint8Array {
  return crypto.randomBytes(size);
}

type MultibaseFormats = keyof typeof bases;

const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
  const codec = bases[k];
  basesByPrefix[codec.prefix] = codec;
}

function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
  const codec = bases[format];
  return codec.encode(id);
}

function fromMultibase(idString: string): Uint8Array | undefined {
  const prefix = idString[0];
  const codec = basesByPrefix[prefix];
  if (codec == null) {
    return;
  }
  const buffer = codec.decode(idString);
  return buffer;
}

const originalList: Array<Uint8Array> = [];

const total = 100000;

let count = total;
while (count) {
  originalList.push(randomBytes(36));
  count--;
}

originalList.sort(Buffer.compare);
const encodedList = originalList.map(
  (bs) => toMultibase(bs, 'base58btc')
);

const encodedList_ = encodedList.slice();
encodedList_.sort();

// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order

for (let i = 0; i < total; i++) {
  if (encodedList[i] !== encodedList_[i]) {
    console.log('Does not match on:', i);
    console.log('original order', encodedList[i]);
    console.log('encoded order', encodedList_[i]);
    break;
  }
}

const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
  // @ts-ignore
  if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
    console.log('bug in the code');
    break;
  }
}

Changing it to base16 does preserve order.

CMCDragonkai · 2021-10-22T07:08:30Z

Looking at the errors and trialing them out:

base2 - no error
base8 - no error
base16 - no error
base16upper - no error
base32hex - no error
base32hexupper - no error
base32hexpad - no error
base32hexpadupper  - no error

base10 - error
base32 - error
base32upper - error
base32pad - error
base32padupper - error
base32z - error
base36 - error
base58btc - error
base58flickr - error
base64 - error

According to rfc4648, the base32hex does preserve sort order. https://datatracker.ietf.org/doc/html/rfc4648#section-7

So there you have it, the conclusion. Pretty much

CMCDragonkai · 2021-10-22T07:13:44Z

JS's binary string form also preserves sort order #122.

CMCDragonkai · 2021-10-22T07:39:27Z

I suggest that we should add in a base64 variant that preserves order since base32 is quite a low base. And probably change over identity encoding to binary string form mentioned in #122.

This was referenced Oct 21, 2021

WIP: Integrating Automatic Indexing into DB MatrixAI/js-db#2

Closed

Not all base encodings maintain lexicographic-order MatrixAI/js-id#7

Closed

CMCDragonkai mentioned this issue Oct 22, 2021

Canonicalise node ID representation MatrixAI/Polykey#261

Closed

CMCDragonkai mentioned this issue Oct 22, 2021

Add in a sortable base64 encoding from multibase MatrixAI/js-id#9

Open

CMCDragonkai mentioned this issue Dec 14, 2021

Supporting trusted seed nodes (nodes domain refactoring, static sources for seed nodes) MatrixAI/Polykey#289

Merged

26 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Which of the multibase formats preserves lexicographic order? #124

Which of the multibase formats preserves lexicographic order? #124

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021 •

edited

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 21, 2021

rvagg commented Oct 21, 2021

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021 •

edited

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021 via email •

edited

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

Which of the multibase formats preserves lexicographic order? #124

Which of the multibase formats preserves lexicographic order? #124

Comments

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021 • edited

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 21, 2021

rvagg commented Oct 21, 2021

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021 • edited

rvagg commented Oct 21, 2021

CMCDragonkai commented Oct 21, 2021

rvagg commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021 via email • edited

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 22, 2021

CMCDragonkai commented Oct 21, 2021 •

edited

CMCDragonkai commented Oct 21, 2021 •

edited

CMCDragonkai commented Oct 22, 2021 via email •

edited