Base64 / Binary loader output is different between esbuild.build and esbuild.transform #2424

Closed
jasoncabot opened this issue Jul 30, 2022 · 5 comments

Comments

@jasoncabot

Sample Reproduction:

I wrote a very simple (1 byte) showcase of the problem here with reproduction steps:

https://github.com/jasoncabot/esbuild-issue

Problem

When using esbuild to transform code, binary strings are treated as UTF-8. When an invalid byte sequence is detected, it is replaced with the UTF-8 replacement character EF BF BD, which corrupts the output.

In my reproduction I use the base64 loader to transform the byte C4. I would expect the result to be a base64-encoded string that decodes to C4, but instead it is a string that decodes to EF BF BD. The same thing happens with the binary loader, which is where I discovered it; the base64 case was just simpler to showcase.

This is only an issue with transform; esbuild.build behaves as I would expect and preserves the binary data.
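The corruption described above can be reproduced without esbuild. The following is a minimal Node.js sketch, assuming the input string was produced by decoding the raw byte as UTF-8 (the default for Buffer's toString()); the variable names are illustrative only and do not come from the reproduction repository.

```javascript
// A lone 0xC4 byte is not valid UTF-8: it announces a two-byte sequence
// but no continuation byte follows.
const raw = Buffer.from([0xc4]);

// toString() decodes as UTF-8 by default and substitutes U+FFFD
// for the invalid sequence, so the original byte is lost here.
const asText = raw.toString();

// Re-encoding the string yields the UTF-8 form of U+FFFD, not the original byte.
const roundTripped = Buffer.from(asText, 'utf8');

console.log(raw.toString('hex'));          // c4
console.log(roundTripped.toString('hex')); // efbfbd
console.log(raw.toString('base64'));          // xA==
console.log(roundTripped.toString('base64')); // 77+9
```

If this assumption holds, the byte is already replaced before the string reaches any loader.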

Expected

I would expect base64 or binary data to be treated as a raw stream of bytes, and esbuild.transform to behave consistently with esbuild.build.

@evanw
Owner

evanw commented Jul 30, 2022

This doesn't seem like a bug to me. The transform API in JavaScript currently only takes a string:

export declare function transform(input: string, options?: TransformOptions): Promise<TransformResult>;

All strings in JavaScript are UTF-16 (or UCS-2 depending on how you look at it). Strings aren't intended for storing binary data. I'd expect something like this to happen if you put invalid Unicode characters into a string.

So this issue really seems like a feature request for the transform API to be able to consume binary data, which I can understand. I can extend the transform API to take either a string or a Uint8Array.

@jasoncabot
Author

I understand what you're saying about strings in JavaScript being UTF-16. However, "strings" that are loaded from outside the system under your control (e.g. from a file, as in my example) can be invalid when interpreted as a JavaScript string.

fs.readFileSync(binaryFile).toString() will give you an (invalid) string of bytes that you can pass around. This is similar to how you can pipe invalid characters into esbuild with something like echo -n -e '\xC4' | esbuild --loader=base64, which gives the result I would expect: module.exports = "xA==";

I'm not suggesting that it's a bug, just confusing behaviour, as I'm calling transform(string, {loader: "binary"}) and the string is not treated as an opaque string of bytes in the same way that build does.

Is there a way to do something like if loader == base64 || binary then skip trying to interpret the string as a valid JavaScript string in the transform bit of code?

@evanw
Owner

evanw commented Jul 31, 2022

Both the transform and build APIs behave the same way regarding JavaScript strings:

console.log((await esbuild.transform('\xC4',{ loader: 'base64' })).code)
console.log((await esbuild.build({ stdin: { contents: '\xC4', loader: 'base64' }, write: false })).outputFiles[0].text)

This prints the following:

module.exports = "w4Q=";
module.exports = "w4Q=";
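The "w4Q=" output follows from how a JavaScript string is turned into bytes. The sketch below uses only Node's Buffer and assumes the string is encoded as UTF-8 before being base64-encoded, which is consistent with the output shown; it makes no claim about esbuild's internals.

```javascript
// In a JavaScript string literal, '\xC4' is the single code unit U+00C4,
// not the raw byte 0xC4.
const text = '\xC4';
console.log(text.length);                      // 1
console.log(text.charCodeAt(0).toString(16));  // c4

// Encoding U+00C4 as UTF-8 takes two bytes: C3 84.
const utf8Bytes = Buffer.from(text, 'utf8');
console.log(utf8Bytes.toString('hex'));    // c384
console.log(utf8Bytes.toString('base64')); // w4Q=

// For comparison, the 'latin1' encoding maps each code unit below 0x100
// to one byte, which recovers the single byte C4.
const latin1Bytes = Buffer.from(text, 'latin1');
console.log(latin1Bytes.toString('hex'));    // c4
console.log(latin1Bytes.toString('base64')); // xA==
```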

@evanw evanw closed this as completed in e23e181 Jul 31, 2022
@jasoncabot
Author

Thanks for continuing to look into it. I'm sure it's me trying to use esbuild wrong, but I just want to be sure. I do see a difference depending on whether I use:

esbuild.build(..., entryPoints: [<file>]) or esbuild.build(..., { stdin: { contents: <stringReadFromFile> }})

I'm not sure the above comparisons are totally valid, because they return "w4Q=", which decodes to C3 84, and ideally it would still be C4. That said, I can understand that JS is just representing the (invalid) raw byte as UTF-16 when the string is created.

In my particular case I was running into this in unit tests using esbuild-jest

Just to show the difference I mentioned - you can try reading a raw byte from a file on disk, so try this:

# In your shell:
echo -n -e '\xC4' > test.bin

# JS
const fs = require('fs');
const esbuild = require('esbuild');
console.log((esbuild.transformSync(fs.readFileSync("test.bin").toString(), { loader: 'base64' })).code)
console.log(esbuild.buildSync({ entryPoints: ["test.bin"], loader: { ".bin": "base64" }, write: false }).outputFiles[0].text);

Gives the following output

module.exports = "77+9";
module.exports = "xA==";

@evanw
Owner

evanw commented Jul 31, 2022

Yes, this is the difference between textual input and binary input. I have added the ability to pass binary data as input which will let you do what you want (by omitting .toString() in your example). It will be released with the next version of esbuild.
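To illustrate why omitting .toString() keeps the data intact: fs.readFileSync without an encoding returns a Buffer, which is a Uint8Array subclass, so the bytes never go through text decoding. The sketch below only uses Node built-ins; the temporary file path is hypothetical, and the commented-out esbuild call is an assumption about how the call would look in a release that includes this change.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write the single byte C4 to a temporary file (hypothetical location).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'esbuild-bin-'));
const file = path.join(dir, 'test.bin');
fs.writeFileSync(file, Buffer.from([0xc4]));

// Without an encoding argument, readFileSync returns raw bytes.
const bytes = fs.readFileSync(file);
console.log(bytes instanceof Uint8Array); // true
console.log(bytes.toString('base64'));    // xA==

// Assumed usage once binary input is supported (not executed here):
// esbuild.transformSync(bytes, { loader: 'base64' })
```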
