Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which of the multibase formats preserves lexicographic order? #124

Open
CMCDragonkai opened this issue Oct 21, 2021 · 16 comments
Open

Which of the multibase formats preserves lexicographic order? #124

CMCDragonkai opened this issue Oct 21, 2021 · 16 comments

Comments

@CMCDragonkai
Copy link

I've come across a situation requiring the ability to base-encode but preserve the lexicographic-order of the input bytes.

That is the order of the base encoding alphabet should be in the same order as those bytes that are mapped to the base encoding.

An example of this library is https://github.com/deanlandolt/base64-lex which is a base64 that preserves lexicographic order.

Which base encodings in multibase support preservation of lexicographic order?

@rvagg
Copy link
Member

rvagg commented Oct 21, 2021

I don't believe we have such a base encoding in multibase, see https://github.com/multiformats/multibase for the current list. I can imagine a use for this (and can imagine how Dean is using it too!) so it might be worth getting an entry in to the multibase table (the linked repo, see multibase.csv). It might take a bit of discussion with others over there as we haven't added a new one in a while and it doesn't get that much attention. It would also be good if you had some working code that worked as a js-multiformats multibase codec (maybe in this repo, although it could also just be an external package that we link in the README here like we do with codecs and hashers). Unfortunately Dean's code is using Buffer which we try and avoid around here, but it may be pretty straightforward to either reuse the base64 encoder here and do the mapping as a second step, or insert an adjusted dictionary to coding step(s). I'm not sure but there's some test fixtures to work with at least.

@CMCDragonkai
Copy link
Author

CMCDragonkai commented Oct 21, 2021

This article https://github.com/deanlandolt/base64-lex https://www.codeproject.com/Articles/5165340/Sortable-Base64-Encoding explains how to make base64 preserve lexicographic order. From the description it seems base58btc is already in lexicographic order. However I'm not sure what the deal is with padding and endianness.

@CMCDragonkai
Copy link
Author

@rvagg
Copy link
Member

rvagg commented Oct 21, 2021

Looking at bit deeper, maybe we're already covered? Have a look at the dictionaries used for the bases in this repo and see which are properly ordered. https://github.com/multiformats/js-multiformats/blob/master/src/bases/base58.js suggests that base58btc (but not base58flickr) might preserve lexicographic order. It doesn't look like any of the base64's are but the base32's seem to be. I'm not sure about the cross-byte interactions, however, maybe that rules out 58 and you'd have to go with 32?

@rvagg
Copy link
Member

rvagg commented Oct 21, 2021

Right, so to expand on the above point, I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

@rvagg
Copy link
Member

rvagg commented Oct 21, 2021

tbh, I don't even know if that same problem persists for base32 variants; you're going to have to do some experimentation I think

@CMCDragonkai
Copy link
Author

CMCDragonkai commented Oct 21, 2021

Can you expand on what you mean by this?

I think with base58btc you're slicing up the bits across byte boundaries. Padding might be a problem, but I think the bigger problem is the loss of whole bytes being properly represented.

I imagine hex encoding should preserve lexicographic order. But I guess this also depends on what is lexicographic order? My understanding is that the order of a string should be based on the comparison on each individual byte in the string. Since bytes are just 8 bit numbers, then if you have [0, 255] that should be less than [1, 255].

In JS, it says string sort is based on utf-16 code units. I'm not sure if that is equivalent to my understanding.

@rvagg
Copy link
Member

rvagg commented Oct 21, 2021

Yeah, actually I'm not at all sure how to make this work properly, with all of the bases (that are not base-256) you're dividing up bytes into pieces and having to span across bytes as you slice. Even for base64 you're slicing the bytes into 6-bit pieces, so you get the first 6 bits of a byte, then the remaining 2 and 4 of the next, then the remaining 4 of the next and 2 at the start of the next, then the remaining 6, etc. etc. So how do you retain proper sorting when you're splitting up bytes like that? Maybe I'm not fully grokking what base64-lex is doing but I find it difficult to imagine. Base-16/hex is splitting bytes in half and presenting 2 characters for each byte, I guess that might retain sorting if you did it right. Then it all comes down to endianness which is just a mess!
This is definitely something worth experimenting with, make random strings, encode them and see if they retain the sorting order you want.

@CMCDragonkai
Copy link
Author

I did tests on 100,000 Ids generated by base58btc, they were all correctly sorted.

Whereas less than 100 ids generated with base64 and I got a sort order error. So I'm guessing base58btc is suffiicent.

Did my tests here: MatrixAI/js-id#8

@rvagg
Copy link
Member

rvagg commented Oct 22, 2021

Well that's fascinating! What can we say about the limitations of these tests? Am I correct that these IDs you're testing are quite short, is this likely to scale to arbitrary lengths? This might be interesting for others who come looking for something similar.

@CMCDragonkai
Copy link
Author

CMCDragonkai commented Oct 22, 2021 via email

@CMCDragonkai
Copy link
Author

I realise that I don't understand base58btc internals, I can do a test with 100,000 random strings, and sort them, then convert them to base58btc and then sort it again and compare for a sanity check.

@CMCDragonkai
Copy link
Author

Ok my tests show that base58btc doesn't preserve lexicographic order. It did in my js-id tests, but that's probably due to the timestamp portion at the beginning of the data.

import type { Codec } from 'multiformats/bases/base';

import crypto from 'crypto';
import { bases } from 'multiformats/basics';

function randomBytes(size: number): Uint8Array {
  return crypto.randomBytes(size);
}

type MultibaseFormats = keyof typeof bases;

const basesByPrefix: Record<string, Codec<string, string>> = {};
for (const k in bases) {
  const codec = bases[k];
  basesByPrefix[codec.prefix] = codec;
}

function toMultibase(id: Uint8Array, format: MultibaseFormats): string {
  const codec = bases[format];
  return codec.encode(id);
}

function fromMultibase(idString: string): Uint8Array | undefined {
  const prefix = idString[0];
  const codec = basesByPrefix[prefix];
  if (codec == null) {
    return;
  }
  const buffer = codec.decode(idString);
  return buffer;
}

const originalList: Array<Uint8Array> = [];

const total = 100000;

let count = total;
while (count) {
  originalList.push(randomBytes(36));
  count--;
}

originalList.sort(Buffer.compare);
const encodedList = originalList.map(
  (bs) => toMultibase(bs, 'base58btc')
);

const encodedList_ = encodedList.slice();
encodedList_.sort();

// encodedList is the same order as originalList
// if base58btc preserves lexicographic-order
// then encodedList_ would be the same order

for (let i = 0; i < total; i++) {
  if (encodedList[i] !== encodedList_[i]) {
    console.log('Does not match on:', i);
    console.log('original order', encodedList[i]);
    console.log('encoded order', encodedList_[i]);
    break;
  }
}

const decodedList = encodedList.map(fromMultibase);
for (let i = 0; i < total; i++) {
  // @ts-ignore
  if (!originalList[i].equals(Buffer.from(decodedList[i]))) {
    console.log('bug in the code');
    break;
  }
}

Changing it to base16 does preserve order.

@CMCDragonkai
Copy link
Author

Looking at the errors and trialing them out:

base2 - no error
base8 - no error
base16 - no error
base16upper - no error
base32hex - no error
base32hexupper - no error
base32hexpad - no error
base32hexpadupper  - no error

base10 - error
base32 - error
base32upper - error
base32pad - error
base32padupper - error
base32z - error
base36 - error
base58btc - error
base58flickr - error
base64 - error

According to rfc4648, the base32hex does preserve sort order. https://datatracker.ietf.org/doc/html/rfc4648#section-7

So there you have it, the conclusion. Pretty much

@CMCDragonkai
Copy link
Author

JS's binary string form also preserves sort order #122.

@CMCDragonkai
Copy link
Author

I suggest that we should add in a base64 variant that preserves order since base32 is quite a low base. And probably change over identity encoding to binary string form mentioned in #122.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants