Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: deduplicate assets with buffer source #4712

Merged

Conversation

patak-dev
Copy link
Contributor

@patak-dev patak-dev commented Nov 10, 2022

This PR contains:

  • bugfix
  • feature
  • refactor
  • documentation
  • other

Are tests included?

  • yes (bugfixes and features will not be merged without tests)
  • no

Breaking Changes?

  • yes (breaking changes will not be merged unless absolutely necessary)
  • no

List any relevant issue numbers:

resolves #4711

Description

See description at #4711

Some questions:

  • Is this feature ok in Rollup or should we keep it in Vite? If we do this in Vite we will always pay for two hashing passes instead of one and increase the overall complexity, so I think it makes sense to live here.
  • Is an option needed? There is a performance penalty for users when not having [hash] in assetFileNames, so I assume this should be optional.
  • I don't think deduplicateBinaryAssets is a good name here, I added it to start the discussion. It could also be an enum: 'only-strings' | 'best-try' (only deduplicate if there is a [hash], not implemented here) | 'always'

@patak-dev patak-dev marked this pull request as ready for review November 10, 2022 07:37
@patak-dev
Copy link
Contributor Author

Note: I would need some help with wiring up the new sample test

@lukastaegert
Copy link
Member

Thanks a lot! Will have a look as soon as I find time.

@lukastaegert
Copy link
Member

Honestly I do not think we need an option and should just leave it on by default. I do not think anyone would turn this option off as I still expect the performance overhead to be small compared to overall build time and as you said, usually you will want to use the hash anyway.

@codecov
Copy link

codecov bot commented Nov 10, 2022

Codecov Report

Merging #4712 (c8484d6) into master (ea36d8b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #4712   +/-   ##
=======================================
  Coverage   99.02%   99.02%           
=======================================
  Files         215      215           
  Lines        7597     7599    +2     
  Branches     2106     2104    -2     
=======================================
+ Hits         7523     7525    +2     
  Misses         24       24           
  Partials       50       50           
Impacted Files Coverage Δ
src/utils/FileEmitter.ts 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@patak-dev
Copy link
Contributor Author

Honestly I do not think we need an option and should just leave it on by default. I do not think anyone would turn this option off as I still expect the performance overhead to be small compared to overall build time and as you said, usually you will want to use the hash anyway.

Yes, I'm with you here too. And maybe the potential to break someone is small enough to add this in a minor. The only problem could be that a new asset is deduplicated that wasn't before, but I don't expect anyone to be relying on this.

I can remove the option and simplify the code if you would like to go in that direction.

@lukastaegert
Copy link
Member

break someone is small enough to add this in a minor

Definitely, especially since we are already duplicating string assets.

I can remove the option and simplify the code if you would like to go in that direction

Yes, I would. There are already complaints that we have too many options.

@patak-dev
Copy link
Contributor Author

The option has been removed and the code is now a lot easier to follow.

We could avoid the need to pregenerate the file name to find the hash if the type of source is string, but if we do so, we can't deduplicate an asset that was emitted first with a string source and later as a Buffer source. So partitioning the Map into string or Buffer goes against consistency IMO, and it is better to pay for the extra hashing every time emitFile is called.

@lukastaegert I still would need help with the new test assertions, although now the new paths are being tested by several other tests too.

Copy link
Member

@lukastaegert lukastaegert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks so much! Just one minor comment about when the hashes are generated, and a suggestion for the test.

src/utils/FileEmitter.ts Outdated Show resolved Hide resolved
@@ -0,0 +1,18 @@
const source = Buffer.from('abc');

module.exports = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it, this test does not really have any assertions and a function test may not be the best fit for asset deduplication as it does not verify _expected. Why not just extend the existing chunking-form test as it was before Rollup 3: https://github.com/rollup/rollup/blob/69ff4181e701a0fe0026d0ba147f31bc86beffa8/test/chunking-form/samples/emit-file/deduplicate-assets/_config.js
We had a test for buffer deduplication, so it should just be green again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also include cross Buffer/string deduplication as well in the test by emitting each asset a third time using the other format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice! I missed the deduplicate-assets test before. Extended it in rollup/rollup@9dbffe1 (#4712) and also added a different buffer source for completeness.

I also didn't know that Rollup 2 was already deduplicating binary sources. So then there is even less of a reason for the option if this was already how things worked before. Is my understanding correct that cross-buffer/string deduplication didn't happen in Rollup 2 though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct that cross-buffer/string deduplication didn't happen in Rollup 2 though?

Yes, that is indeed a new feature!

@patak-dev patak-dev changed the title feat: deduplicate binary assets option feat: deduplicate assets with buffer source Nov 11, 2022
);
sourceHash ??= getSourceHash(source);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought: Why not just const sourceHash = getSourceHash(source) at the top? But then I noticed that we allow specifying a hash size. Now this raises a problem, though an extremely edge case one: If for some reason with a dynamic asset file name function two assets with the same source are given file names with different hash length, they are not recognized as identical.
Going even deeper, why would we even be calling the assetFileNames function twice in such a situation? What we actually want is that we first calculate the hash, then compare the hash, and only if it does not match we call the function for the name. For this to work reliably, the first hash needs to have the maximum hash length (e.g. by passing Infinity as the size, which means we do not slice). Then during the actual file name generation we can slice the hash to the required size. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the contortions in the code were because of trying to reuse the dynamic hash size.

Your idea is great, implemented it in refactor: precompute hash then slice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot easier to directly check the final changelog, as the code is now a lot closer to the original https://github.com/rollup/rollup/pull/4712/files

Copy link
Member

@lukastaegert lukastaegert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@lukastaegert lukastaegert enabled auto-merge (squash) November 12, 2022 05:04
@lukastaegert lukastaegert merged commit 842ff0d into rollup:master Nov 12, 2022
@rollup-bot
Copy link
Collaborator

This PR has been released as part of rollup@3.3.0. You can test it via npm install rollup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplicate assets when using Buffer content in emitFile
3 participants