Identity encoding decoding doesn't produce the same data #122

Open
CMCDragonkai opened this issue Oct 17, 2021 · 4 comments

@CMCDragonkai

CMCDragonkai commented Oct 17, 2021

I'm not sure if identity encoding is meant to be used like this, but I noticed that after decoding, you don't get the same data:

import { bases } from 'multiformats/basics';

const codec = bases['identity'];

const u = new Uint8Array([
    6, 22, 184, 240, 237, 178,
  112,  0, 150, 137, 182,  54,
  220,  1, 217, 221
]);

const s = codec.encode(u);

const u_ = codec.decode(s);

console.log(u_);

/*
Uint8Array(36) [
    6,  22, 239, 191, 189, 239, 191, 189,
  239, 191, 189, 239, 191, 189, 112,   0,
  239, 191, 189, 239, 191, 189, 239, 191,
  189,  54, 239, 191, 189,   1, 239, 191,
  189, 239, 191, 189
]
*/
@rvagg
Member

rvagg commented Oct 18, 2021

You're hitting a limitation of JavaScript's UTF-8 handling. Some bytes simply won't survive a bytes->string->bytes round-trip in JavaScript. The built-in assumption is that decoding bytes as UTF-8 yields actual UTF-8 characters, unlike some languages, such as Go, which can do []byte(string([]byte(...))) without loss (i.e. their strings can hold non-UTF-8 bytes).

To illustrate, take your 3rd byte, 184, which isn't valid UTF-8 on its own (note how the first 2 bytes are present in the round-trip):

> new TextDecoder().decode(new Uint8Array([184]))
'�'
> new TextDecoder().decode(new Uint8Array([184])).charCodeAt(0)
65533
> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([184])))
Uint8Array(3) [ 239, 191, 189 ]

So you can see that invalid UTF-8 bytes get converted to U+FFFD (the replacement character, i.e. 65533), which encodes to the 3-byte sequence you see repeated in your resulting array: 239, 191, 189. Every time you see these three bytes, you can assume a non-UTF-8 byte got lost in translation.
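
To make the loss easy to detect up front, here is a minimal sketch using only the built-in `TextDecoder` and `TextEncoder`. The helper names (`lossyRoundTrip`, `survivesUtf8RoundTrip`, `decodeStrict`) are made up for illustration and are not part of multiformats.

```javascript
// Hypothetical helper: the same default (non-fatal) decode + encode
// round-trip as in the REPL session above.
function lossyRoundTrip(bytes) {
  return new TextEncoder().encode(new TextDecoder().decode(bytes));
}

// Hypothetical helper: true only if the round-trip returns identical bytes.
function survivesUtf8RoundTrip(bytes) {
  const out = lossyRoundTrip(bytes);
  return out.length === bytes.length && out.every((b, i) => b === bytes[i]);
}

// Hypothetical helper: with `fatal: true`, decode() throws a TypeError on
// invalid UTF-8 instead of silently substituting U+FFFD.
function decodeStrict(bytes) {
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

console.log(survivesUtf8RoundTrip(new Uint8Array([6, 22, 112, 0]))); // true
console.log(survivesUtf8RoundTrip(new Uint8Array([184]))); // false
```

Checking the actual round-trip, rather than only validating the input, also catches the case where the decoder drops a leading byte order mark.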

The identity multibase doesn't have much choice here: it's only safe to use with bytes that JavaScript can properly convert to strings. Otherwise, use a multibase that maps bytes to a safe set of characters to avoid this problem (which is one of the points of using base encoding!).
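
As a rough illustration of why a base encoding sidesteps the problem, here is a hand-rolled hex codec. It is only a sketch: `toHex` and `fromHex` are hypothetical helpers, not the multiformats API, and real multibase strings also carry a prefix character that this omits.

```javascript
// Hypothetical helper: every byte becomes two ASCII characters (0-9, a-f),
// so the resulting string never contains anything UTF-8 could mangle.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Hypothetical helper: the inverse mapping, rejecting malformed input.
function fromHex(hex) {
  if (hex.length % 2 !== 0 || /[^0-9a-f]/i.test(hex)) {
    throw new Error('invalid hex string');
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// The bytes from the original report.
const idBytes = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0,
  150, 137, 182, 54, 220, 1, 217, 221,
]);
const hex = toHex(idBytes);
console.log(hex); // 0616b8f0edb270009689b636dc01d9dd
console.log(fromHex(hex)); // the same 16 bytes as idBytes
```

The cost is size: the string is twice as long as the byte array, which is the usual trade-off of a base encoding against the identity encoding.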

I hope that helps explain the situation, even if it probably doesn't give you an easy solution.

@rvagg rvagg closed this as completed Oct 18, 2021
@CMCDragonkai
Author

I used codePointAt to convert to JS binary strings and back. https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary

Maybe that can be used instead?

@rvagg
Member

rvagg commented Oct 18, 2021

Hm, that might not be a bad idea since codepoint addressing is now standard across runtimes.

@rvagg rvagg reopened this Oct 18, 2021
@CMCDragonkai
Author

CMCDragonkai commented Oct 18, 2021

Yeah, I used it for the above example and compared it to multibase to see if there were any differences.

https://github.com/MatrixAI/js-id/blob/4ea34f2b50e8f259576fc2f8bb9f80d9a167e1a1/src/utils.ts#L75-L85

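// Encode the bytes as a "binary string": one UTF-16 code unit (0-255) per byte.
function toString(id: Uint8Array): string {
  return String.fromCharCode(...id);
}

// Decode a binary string back into a 16-byte Id.
// Id and IdInternal come from the linked js-id source and are not shown here.
function fromString(idString: string): Id | undefined {
  const id = IdInternal.create(16);
  for (let i = 0; i < 16; i++) {
    id[i] = idString.charCodeAt(i);
  }
  return id;
}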

And it worked whereas multibase failed.
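
For reference, the same idea can be sketched for byte arrays of any length in plain JavaScript. The names `bytesToBinaryString` and `binaryStringToBytes` are hypothetical, and this is not how multiformats implements its identity codec. The chunk size is an arbitrary choice to avoid spreading a very large array into a single call.

```javascript
// Hypothetical helper: one UTF-16 code unit (0-255) per byte, no UTF-8 decoding.
function bytesToBinaryString(bytes) {
  const CHUNK = 0x8000; // arbitrary chunk size; keeps the argument count per call small
  let out = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    out += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return out;
}

// Hypothetical helper: the inverse; rejects code units that do not fit in a byte.
function binaryStringToBytes(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError(`character at index ${i} is not a single byte`);
    }
    bytes[i] = code;
  }
  return bytes;
}

const original = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0,
  150, 137, 182, 54, 220, 1, 217, 221,
]);
const restored = binaryStringToBytes(bytesToBinaryString(original));
console.log(restored.length); // 16
```

Such a string round-trips in memory, but it is not valid text in general, so it would still be altered if it were later encoded as UTF-8 (for example, when written to a file or sent over the network as text).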
