Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extreme memory-intensive SHA256 calculation #3186

Closed
sebiniemann opened this issue Oct 23, 2019 · 11 comments
Closed

Extreme memory-intensive SHA256 calculation #3186

sebiniemann opened this issue Oct 23, 2019 · 11 comments

Comments

@sebiniemann
Copy link
Contributor

sebiniemann commented Oct 23, 2019

  • Rollup Version: 1.25.2
  • Operating System (or Browser): Windows
  • Node Version: 10.16.0

How Do We Reproduce?

Emit a very huge asset (250mb on my system -- a mp4 file for my use-case) without a final file name (adding a name option, instead of fileName)

this.emitFile({
  type: 'asset',
  name: 'something',
  source: hugeBufferOrVeryLongString,
});

Expected Behavior

Everything builds fine.

Actual Behavior

A few GB of memory are allocated (even increasing the NodeJS limit to 8 GB was not enough), until it crashes with an

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

error.

Proposed solution

The reason for this seems to be located in the generateAssetFileName, especially

const hash = sha256();
hash.update(emittedName);
hash.update(':');
hash.update(source);
return hash.digest('hex').substr(0, 8);

The hash.js/lib/hash/sha/256 library used to generate the hash seems to be very memory-intensive.

Using crypto instead (I assume NodeJS modules are fine, as fs is also used) resolvs this problem perfectly, without any notable memory increase:

- import sha256 from 'hash.js/lib/hash/sha/256';
+ import crypto from 'crypto';

...

- const hash = sha256();
+ const hash = crypto.createHash('sha256');
@sebiniemann
Copy link
Contributor Author

After investigating it a little further, the reason behind this huge memory allocation can be found in the following lines (from hash.js):

https://github.com/indutny/hash.js/blob/fcd35edcc4e30145be4b7a737477ad54e83afc25/lib/hash/utils.js#L58-L59

At first I suspected a missing pre-allocation to be the reason, but even the final array size is more than my system could handle.

For example, the size of my video buffer was 229878635. Creating just this array in a fresh NodeJS console:

Array.from({ length: 229878635 }, () => 0)

already resulted in a

FATAL ERROR: invalid table size Allocation failed - JavaScript heap out of memory

for me.

@lukastaegert
Copy link
Member

Thanks for finding this. The reason for not using "crypto" is that we do not have crypto available in the browser build.

fs is also used

In the browser build, it is replaced by a fallback that throws an error when used. Nevertheless, we need the hashes e.g. for the REPL, and I know there are other uses of Rollup in the browser. Also I vaguely remember that crypto performance may not have been the best, but this would require benchmarking. But maybe we should shop for another library. https://github.com/asmcrypto/asmcrypto.js seems to be very fast and from anecdotal evidence should be capable of handling files in that order of magnitude, but we should check.

@lukastaegert
Copy link
Member

So in my experiment, it was able to Hash a 1GB file without me having to increase memory, which could be a good sign.

@sebiniemann
Copy link
Contributor Author

Thanks for the very fast reply and your time to look into this issue. Awesome 😃

asmcrypto.js seems to work fine. I rewrote the hashing inside rollup on my machine and could see a similar memory footprint as when I used crypto or the Web Crypto API. Meaning no noticeable increase during the hashing.

I will try a few larger files (tested it with my 250 MB video just now) to be sure about this and could provide a pull request with the necessary changes to use asmcrypto.js instead of hash.js :)

@lukastaegert
Copy link
Member

Ah, I did not consider the Web Crypto API. This could also be a nice solution for the browser build. Would be nice to know how they compare performance-wise. If asmcrypto.js is not noticeably faster, we could also go for a combination of both, though of course a bundled hashing algorithm has the advantage that the browser build would still work in Node. In any case, PR welcome!

@tivac
Copy link
Contributor

tivac commented Oct 24, 2019

Could the crypto lib not be swapped out for the browser build? It's what I do for the repl on m-css.com and it works pretty well in practice.

@lukastaegert
Copy link
Member

Yes exactly. We could swap crypto for WebCrypto just as we swap fs and path. The question is just how they stack up.

@sebiniemann
Copy link
Contributor Author

sebiniemann commented Oct 24, 2019

Memory-wise, I didn't see much of a difference between asmcrypto.js, crypto and the Web Crypto API. Performance-wise however, it became very clear:

asmcrypto.js

100 MB: took 816.19 ms
200 MB: took 1497.09 ms
300 MB: took 2272.74 ms
400 MB: took 3114.04 ms
500 MB: took 4057.19 ms
600 MB: took 4526.54 ms
700 MB: took 5204.03 ms
800 MB: took 5924.95 ms
900 MB: took 6635.69 ms
1000 MB: took 7365.82 ms

crypto

100 MB: took 249.60 ms
200 MB: took 490.93 ms
300 MB: took 736.04 ms
400 MB: took 983.08 ms
500 MB: took 1275.39 ms
600 MB: took 1588.52 ms
700 MB: took 1743.36 ms
800 MB: took 1951.50 ms
900 MB: took 2244.30 ms
1000 MB: took 2564.22 ms

Web Crypto API

100 MB: took 314.15 ms
200 MB: took 626.33 ms
300 MB: took 915.76 ms
400 MB: took 1228.02 ms
500 MB: took 1539.00 ms
600 MB: took 1862.53 ms
700 MB: took 2141.59 ms
800 MB: took 2454.08 ms
900 MB: took 2804.14 ms
1000 MB: took 3091.66 ms

A mixed crypto (NodeJS) and Web Crypto API (browser) solution seems to be the best one.

I can provide you a pull request tomorrow. if I saw it correctly, it should contain the following:

  • Replacing hash.js inside src/utils/FileEmitter.ts with crypto
  • Replacing hash.js inside src/Chunk.ts with crypto
  • Adding a browser-ready/Web Crypto API based crypto wrapper inside browser/crypto.ts.
  • Marking crypto as external inside rollup.config.js and adding if (~id.indexOf('crypto.ts')) return fs.readFileSync('browser/crypto.ts', 'utf-8');
  • Removing typings/Chunk.d.ts
  • Removing hash.js from the LICENSE.md and package.json
  • Running npm install to update the package-lock.json

@lukastaegert
Copy link
Member

Wow, these numbers look impressive, and we have a clear winner, thanks so much! Also your TODO list looks spot on.

Only point: You do not need to remove hash.js from the license file, it should automatically correct itself when you do a full rebuild of Rollup without the dependency.

@sebiniemann
Copy link
Contributor Author

sebiniemann commented Oct 26, 2019

While finishing the Web Crypto API based crypto wrapper, I stumbled over a problem. The digest function inside SubtleCrypto is async, while most of rollup is synchronous. Rewriting rollup for this seems to be an overkill.

I would therefore suggest to use asmcrypto.js instead for the browser-ready wrapper and crypto (due to its superior performance) for NodeJS builds.

My Web Crypto API based wrapper looked like this:

function createHash (algorithm) {
  return new Hasher(algorithm);
}

class Hasher {
  constructor(algorithm) {
    switch(algorithm) {
      case 'sha256':
        this.algorithm = 'SHA-256';
        break;
      default:
        throw new Error(`Unsupported algorithm: '${algorithm}'.`);
    }

    this.buffer = new Uint8Array(0);
  }

  update(data) {
    let dataBuffer = (typeof data === 'string' ? new TextEncoder('utf-8').encode(data) : data);

    const appendedBuffer = new Uint8Array(this.buffer.length + dataBuffer.length);
    appendedBuffer.set(this.buffer, 0)
    appendedBuffer.set(dataBuffer, this.buffer.length)

    this.buffer = appendedBuffer;

    return this;
  }

  async digest(encoding) {
    let hash = await crypto.subtle.digest(this.algorithm, this.buffer);

    switch(encoding) {
      case 'hex':
          hash = Array.from(new Uint8Array(hash))
            .map(x => x.toString(16).padStart(2, '0'))
            .join('');
        break;
      default:
        throw new Error(`Unsupported encoding: '${encoding}'.`);
    }

    return hash;
  }
}

My asmcrypto.js based wrapper looks like this:

import * as asmCrypto from 'asmcrypto.js';

function createHash(algorithm: string) {
	return new Hasher(algorithm);
}

class Hasher {
	hasher: asmCrypto.Sha256;

	constructor(algorithm: string) {
		switch (algorithm) {
			case 'sha256':
				this.hasher = new asmCrypto.Sha256();
				break;
			default:
				throw new Error(`Unsupported algorithm: '${algorithm}'.`);
		}
	}

	digest(encoding: string) {
		this.hasher.finish();
		// The type of '.result' is always 'Uint8Array' after calling 'finish()'. Before that, it could also be 'null'. Casting
		// is necessary to avoid type errors when encoding the hash.
		const hash = this.hasher.result as Uint8Array;

		let encoded: string;
		switch (encoding) {
			case 'hex':
				encoded = asmCrypto.bytes_to_hex(hash);
				break;
			default:
				throw new Error(`Unsupported encoding: '${encoding}'.`);
		}

		return encoded;
	}

	update(data: string | Buffer) {
		if (typeof data === 'string') {
			this.hasher.process(new Uint8Array(data.length).map((_, i) => data.charCodeAt(i)));
		} else {
			this.hasher.process(data);
		}

		return this;
	}
}

export { createHash };

@sebiniemann
Copy link
Contributor Author

Fixed by #3194

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants