Identity encoding decoding doesn't produce the same data #122

CMCDragonkai · 2021-10-17T02:53:04Z

I'm not sure if identity encoding is meant to be used like this, but I noticed that after decoding, you don't get the same data:

import { bases } from 'multiformats/basics';

const codec = bases['identity'];

const u = new Uint8Array([
    6, 22, 184, 240, 237, 178,
  112,  0, 150, 137, 182,  54,
  220,  1, 217, 221
]);

const s = codec.encode(u);

const u_ = codec.decode(s);

console.log(u_);

/*
Uint8Array(36) [
    6,  22, 239, 191, 189, 239, 191, 189,
  239, 191, 189, 239, 191, 189, 112,   0,
  239, 191, 189, 239, 191, 189, 239, 191,
  189,  54, 239, 191, 189,   1, 239, 191,
  189, 239, 191, 189
]
*/

rvagg · 2021-10-18T02:52:20Z

You're hitting a limitation of how JavaScript converts between bytes and strings. Some bytes simply won't survive a bytes->string->bytes round-trip. The built-in assumption is that bytes being decoded as UTF-8 really are valid UTF-8, unlike some languages, such as Go, which can do []byte(string([]byte(...))) without loss (i.e. their strings can hold non-UTF-8 bytes).

To illustrate, take your 3rd byte, 184, which isn't valid UTF-8 on its own (note how the first 2 bytes are preserved in the round-trip):

> new TextDecoder().decode(new Uint8Array([184]))
'�'
> new TextDecoder().decode(new Uint8Array([184])).charCodeAt(0)
65533
> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([184])))
Uint8Array(3) [ 239, 191, 189 ]

So you can see that invalid UTF-8 bytes get converted to U+FFFD (65533), the replacement character, which encodes to the 3-byte sequence you see repeated in your resulting array: 239, 191, 189. Every time you see these, you can assume that a non-UTF-8 byte got lost in translation.
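A quick way to check whether a given byte array is affected is to run the round-trip and compare. This is only a sketch: `survivesUtf8RoundTrip` is a hypothetical helper name, not part of multiformats, and it relies only on the standard `TextDecoder` and `TextEncoder` globals.

```javascript
// Hypothetical helper (not a multiformats API): reports whether bytes
// come back unchanged after a UTF-8 decode followed by a UTF-8 encode.
function survivesUtf8RoundTrip(bytes) {
  const text = new TextDecoder().decode(bytes); // invalid bytes become U+FFFD
  const back = new TextEncoder().encode(text);  // U+FFFD becomes 239, 191, 189
  return back.length === bytes.length && back.every((b, i) => b === bytes[i]);
}

// Bytes that are valid UTF-8 (all below 128 here) survive.
const ok = survivesUtf8RoundTrip(new Uint8Array([6, 22, 112, 0]));

// A lone 184 is not valid UTF-8, so it is replaced and the round trip is lossy.
const lossy = survivesUtf8RoundTrip(new Uint8Array([6, 22, 184]));

console.log(ok, lossy);
```

If you would rather fail loudly than compare afterwards, `new TextDecoder('utf-8', { fatal: true })` throws a `TypeError` on invalid input instead of substituting U+FFFD.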

The identity multibase doesn't have much choice here. It's only safe to use with bytes that JavaScript can properly convert to strings; otherwise, use a multibase that maps bytes to a fixed set of characters to avoid this problem (which is one of the points of using base encoding!).
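To show why a base encoding sidesteps the problem, here is a minimal hex sketch using the bytes from the original report. `toHex` and `fromHex` are hypothetical stand-ins written for illustration; they are not the multiformats codecs and they add no multibase prefix.

```javascript
// Hypothetical helpers, standing in for a real multibase codec.
// Every byte maps to two ASCII characters, so any string handling is safe.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

function fromHex(hex) {
  // Split into 2-character pairs and parse each pair as a base-16 byte.
  const pairs = hex.match(/../g) ?? [];
  return Uint8Array.from(pairs, (pair) => parseInt(pair, 16));
}

const original = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0,
  150, 137, 182, 54, 220, 1, 217, 221
]);

const hex = toHex(original);
const restored = fromHex(hex);

// Compare lengths and contents to confirm nothing was lost.
const same =
  restored.length === original.length &&
  restored.every((b, i) => b === original[i]);
console.log(hex, same);
```

The trade-off is size: hex doubles the length, while denser alphabets such as base32 or base64 cost less.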

I hope that helps explain the situation, even if it probably doesn't give you an easy solution.

CMCDragonkai · 2021-10-18T04:21:14Z

I used codePointAt to convert to JS binary strings and back. https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary

Maybe that can be used instead?

rvagg · 2021-10-18T06:44:27Z

Hm, that might not be a bad idea since codepoint addressing is now standard across runtimes.

CMCDragonkai · 2021-10-18T06:48:56Z

Yeah, I used it for the example above and compared it against multibase to see if there were any differences.

https://github.com/MatrixAI/js-id/blob/4ea34f2b50e8f259576fc2f8bb9f80d9a167e1a1/src/utils.ts#L75-L85

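function toString(id: Uint8Array): string {
  // Each byte becomes one UTF-16 code unit in the range 0-255,
  // so nothing gets replaced with U+FFFD.
  return String.fromCharCode(...id);
}

function fromString(idString: string): Id | undefined {
  // Reads back exactly 16 code units; assumes idString came from
  // toString on a 16-byte id.
  const id = IdInternal.create(16);
  for (let i = 0; i < 16; i++) {
    id[i] = idString.charCodeAt(i);
  }
  return id;
}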

And it worked whereas multibase failed.
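For reference, the same idea can be written without the fixed length of 16. This is a sketch under my own assumptions: `bytesToBinaryString` and `binaryStringToBytes` are hypothetical names, and rejecting code units above 255 is a choice made here, not something the linked code does.

```javascript
// Hypothetical helpers generalising the binary-string approach above.
function bytesToBinaryString(bytes) {
  let s = '';
  // Appending one code unit at a time avoids the argument-count limit
  // that String.fromCharCode(...bytes) can hit on very large arrays.
  for (const b of bytes) s += String.fromCharCode(b);
  return s;
}

function binaryStringToBytes(s) {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    // Assumption: treat anything above 255 as an error rather than
    // silently truncating it to a byte.
    if (code > 0xff) throw new RangeError('not a binary string at index ' + i);
    out[i] = code;
  }
  return out;
}

const input = new Uint8Array([6, 22, 184, 240, 237, 178, 112, 0]);
const output = binaryStringToBytes(bytesToBinaryString(input));
console.log(output.every((b, i) => b === input[i]));
```

Note that the resulting string is only lossless while it stays a JavaScript string. Writing it out as UTF-8 turns every byte above 127 into two bytes, so it is not the same bytes on the wire.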

CMCDragonkai mentioned this issue Oct 17, 2021

Changing to Id derivative of Uint8Array in order to provide operator overloading MatrixAI/js-id#6

Merged

9 tasks

rvagg closed this as completed Oct 18, 2021

rvagg reopened this Oct 18, 2021

lidel mentioned this issue Oct 21, 2021

Avoid data loss by switching toString/fromString charCodes APIs? achingbrain/uint8arrays#27

Open

CMCDragonkai mentioned this issue Oct 22, 2021

Which of the multibase formats preserves lexicographic order? #124

Open
