Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple passes on tree shaking #3200

Closed
tjenkinson opened this issue Oct 28, 2019 · 13 comments
Closed

Multiple passes on tree shaking #3200

tjenkinson opened this issue Oct 28, 2019 · 13 comments

Comments

@tjenkinson
Copy link
Member

Feature Use Case

Tree shaking works great, but sometimes more things could be removed if more passes were made.

E.g
repl

const a = true;
var b = true;

if (!a) {
	b = false;
}

if (!b) {
	console.log('Can be removed');
}

becomes

'use strict';

var b = true;

if (!b) {
	console.log('Can be removed');
}

If there was another iteration then the if (!b) { statement would be removed.

I understand currently it isn't removed because of the b = false, but this would be missing in the second pass.

Note this is a simplified example, but we do have some logic in our app that would benefit from this.

Feature Proposal

Run multiple passes until no more changes are made, or figure out a way of detecting if reassignments to a variable are in a branch that would be removed.

Happy to investigate this and open a PR!

@tjenkinson
Copy link
Member Author

I think maybe this shouldn't be happening if the variable is in a dead code block, given the reassign would never happen.

this.isReassigned = true;

deoptimizePath() seems to be getting called from a bind() in IfStatement.

Not sure how to walk up the tree from here though and determine of the code would be included.

@lukastaegert
Copy link
Member

The problem is indeed that these deoptimizations happen during bind and not during the actual tree-shaking when it is clearer if the variable is part of included code. This is mostly because of performance reasons to avoid having repeated checks or deoptimizations during the multiple treeshaking passes. At the moment, we can rely on the fact that we know all that is necessary about variable values during the actual treeshaking. I am aware of the scenario you describe, but I am not sure, how common it actually is.

@lukastaegert
Copy link
Member

But maybe it would be worthwhile to challenge this, especially if you are willing to do some digging. Worst thing you would learn a little about how the algorithm works 😉. So a little background how treeshaking works:

  • First we initialise each node. Most importantly this will create the scopes and add all declarations as Variables to the scopes.
  • Once this is done, we bind each node. First and foremost, this will associate all non-declaration identifiers with the correct variables from the previous step. Additionally, it will also reassign (“deoptimise“) various variables that we cannot track in the analysis, e.g. because they are reassigned at some point. This is what we want to move to the next phase.
  • The actual treeshaking starts from the top of the AST, asking each statement if it shouldBeIncluded. By default this is the case if a statement hasEffects or if it was included by a previous pass. One thing you might try is to do the deoptimisation the first time hasEffects is called on an assignment.

@tjenkinson
Copy link
Member Author

@lukastaegert thanks! I'm interested in learning more about rollup so will take a look when I get time

@tjenkinson
Copy link
Member Author

tjenkinson commented Oct 30, 2019

Ok so adding

	shouldBeIncluded(context: InclusionContext): boolean {
		this.left.deoptimizePath(EMPTY_PATH);
		this.right.deoptimizePath(UNKNOWN_PATH);
		return super.shouldBeIncluded(context);
	}

and removing the bind() override in AssignmentExpression looks like it works! (ignoring test failures)

Is this what you were thinking? I guess then the same kind of thing should be done in all the other src/ast/nodes?

@lukastaegert
Copy link
Member

Yes, but it would be nice to also track via a flag if this has happened so that it only happens on the first pass as deoptimizations are not necessarily cheap (and depending on the code, there can be 20 passes or more). Now there are some points to consider:

  • It is not GUARANTEED that shouldBeIncluded is actually called for a statement. Another way is that if the assignment happens in a try-statement. By default, everything inside a try-statement is always included without checking for side-effects. This case can be detected by checking if include is called with a truthy value for includeChildren.
  • The question is if deoptimizations will be picked up correctly by expressions that cache values, e.g. if-statements. For instance this should work correctly:
let value = false;
try {
  value = true;
} catch {}
if (value) {
  console.log('must be included');
}

I hope there are no other cases that are forgotten. The problem with the current approach is that it creates some "hidden" dependencies between some parts of the code, but maybe this is manageable for the moment.

The deoptimizations in other bind statements could also be investigated.

@tjenkinson
Copy link
Member Author

@lukastaegert I opened a draft PR #3212 which fails 1 test (deoptimize-member-expressions)

I'm still pretty confused with how it works though and have a lot of questions. Don't suppose there's some docs anywhere on include(), shouldBeIncluded(), hasEffects() IncludeContext, HasEffectsContext etc? Or could you recommend any guides/blogs that would help with some of the concepts? I'd really like to be able to understand more what's going on.

There's code like this for example in LocalVariable

	hasEffectsWhenAccessedAtPath(path: ObjectPath, context: HasEffectsContext) {
		if (path.length === 0) return false;
		if (this.isReassigned || path.length > MAX_PATH_DEPTH) return true;
		const trackedExpressions = context.accessed.getEntities(path);
		if (trackedExpressions.has(this)) return false;
		trackedExpressions.add(this);
		return (this.init && this.init.hasEffectsWhenAccessedAtPath(path, context)) as boolean;
	}

Why do we always return false after the first call when trackedExpressions.has(this)? What does trackedExpressions even mean??

@tjenkinson
Copy link
Member Author

I found https://github.com/estree/estree and https://astexplorer.net/ which helped with some things. I’m thinking about creating a dev getting started guide or something similar with links after all this

Maybe another approach would be to register all the assignment nodes with LocalVariable during bind(). Then during the treeshaking run getLiteralValueAtPath() on a LocalVariable can check hasEffects/shouldBeIncluded() on all the assignments. If this is all false then it knows it can return the init value.

AssignmentPattern would then have the ability to say if it would be included by checking shouldBeIncluded up the parents. (Still not really sure what the difference is between shouldBeIncluded and hasEffects, other than maybe the result of shouldBeIncluded is cached and hasEffects isn’t)

@lukastaegert
Copy link
Member

I'm still pretty confused with how it works though and have a lot of questions. Don't suppose there's some docs anywhere on include(), shouldBeIncluded(), hasEffects() IncludeContext, HasEffectsContext etc

Unfortunately not. The reason is that these are internal implementation details that are often subject to change, and any kind of documentation could be outdated, if not outright wrong in half a year. Therefore the goal is to make the names as descriptive as possible so that they tell you what they do. If you as an outsider have suggestions for better names or you find things misleading, I am very open to improvement suggestions here, even if it means changing a name in 50 files!

To figure out what a specific piece of code does, I recommend changing or removing the code in question and see what tests turn red. This may not REALLY tell you what the mechanics are, but it will tell you which problem was solved by the code.

Thus, test coverage is CRUCIAL in this project, and I very much recommend you write tests for everything new you implement early on, maybe even test-first. At least for me, tests are also a great debugging tool as they allow you to easily run Rollup on a given piece of code while finding out what is going on. I am saying this because you already implemented something new without a test :) Also, tests are important for documenting what your PR is trying to achieve.

For some time now I wanted to write a TESTS.md documentation for how to write tests but never get around to it. For your purpose, two kinds of tests are important:

function tests bundle and run a piece of code and are great for testing regressions, in your case situations where important code is dropped. They can also test for warnings and errors. I would always write one when during development of a feature, you come across edge cases that do not work with a previous version of the feature.

form tests are basically snapshot tests of the code and are the only way to test how successful tree-shaking is. By default, they will bundle the code for every format but there is no reason to do so if nothing format specific such as external imports and exports are involved. In that case, they can be turned into single-format tests by placing a files _expected.js next to the _config.js file (instead of the usual _expected folder).

Why do we always return false after the first call when trackedExpressions.has(this)? What does trackedExpressions even mean?

trackedExpressions is taken from context.accessed and is necessary to prevent infinite recursions/stack overflow bugs. It works like this:

  • Whenever the logic tests if a statement shouldBeIncluded, a new HasEffectsContext is created (see
    return this.included || (!context.brokenFlow && this.hasEffects(createHasEffectsContext()));
    )
  • This context tracks various changing properties of the current control flow, e.g. if we are behind a break/return/throw (context.brokenFlow), if we are checking a loop for side-effects from the outside, in which case break statements inside the loop should not count as side-effects (context.ignore.breaks, similar for the other ignore properties), and makes sure that we check each AST node for a certain kind of side-effect at most once (context.accessed, context.assigned, context.called, context.instantiated).

Here is an example issue that is caught by this:

var foo = foo || {};
console.log(foo.x);

When checking if accessing foo.x has a side-effect, it would check the definition of foo and therefore check if (foo || {}) has a side-effect when accessing .x. Which would again check foo.x for side-effects and thus become recursive infinite loop. There is a whole class of similar issues with various tests around this.

Why do we always return false after the first call

The reason is that if this was already checked for side-effects and we are STILL searching for side-effects, then the first call must have returned false and we can do an early return.

I found https://github.com/estree/estree and https://astexplorer.net/ which helped with some things. I’m thinking about creating a dev getting started guide or something similar with links after all this

That would be REALLY awesome! Another pro-tip: Each AST Node has a custom .toString method that will print the original code associated with the Node, which can be very helpful with debugging (i.e. if you are adding log statements to certain AST nodes, by logging this.toString() you can also find out which precise Node you are looking at at the moment).

Maybe another approach would be to register all the assignment nodes with LocalVariable during bind(). Then during the treeshaking run getLiteralValueAtPath() on a LocalVariable can check hasEffects/shouldBeIncluded() on all the assignments. If this is all false then it knows it can return the init value.

Not sure this is a good idea for various reasons

  • In an earlier version of the algorithm, I tracked ALL assignments separately, instead of the simple .isReassigned approach for more than one value. This KILLED performance for large code-bases, especially when there were variables that had many values assigned to them. In some situations, bundling time turned from a minute to hours.
  • hasEffects/shouldBeIncluded is not relevant unless the Node is actually part of a "live" code path. You only know that by actually tracking live code paths, which is what we do by moving the check into hasEffects or shouldBeIncluded.

AssignmentPattern would then have the ability to say if it would be included by checking shouldBeIncluded up the parents

Checking parents is a bad pattern (at least in that code base) that should be avoided as much as possible.

  • It is not following the actual control flow
  • It is creating two-way dependencies between parents and children, i.e. a parent is no longer in full control over what a child knows about its surroundings, which will make future transformations, e.g. inlining a node somewhere else in the tree, MUCH harder.

Thus the idea is to practice a "one-way-data-flow" pattern as much as possible, not for efficiency's but for complexity's sake. This is also why the NodeRenderOptions were included to pass additional information from parent to child. There already is one exception to the rule–when a (potentially hoisted) declaration is included, then all parents receive the included flag, because I really could not figure out a better way to do this. This also marks a big difference between shouldBeIncluded and hasEffects–stuff can be included by different means, in which case it will never be checked for effects.

I will leave some more comments at your PR, by now I also figured out why one test is failing, because you actually uncovered a small bug.

@tjenkinson
Copy link
Member Author

Thanks for the super detailed explanation and the comments on the PR!

The reason is that if this was already checked for side-effects and we are STILL searching for side-effects, then the first call must have returned false and we can do an early return.

Took a while but yeh I get this now :D

I started drafting some docs here, which from what you said before might be a bit detailed, but it's helping me for now anyway

One thing I'm trying to understand now is whether it's possible for deoptimizeCache() to be called on a DeoptimizableEntity after a hasEffects() or include() call where this.testValue might already have been used

Searching the code currently it seems like this might not be possible, but moving the deoptimize call into shouldBeIncluded() or hasEffects() in the other PR might cause this?

In an IfStatement for example this could cause the includeKnownTest() to run when it should have been includeUnknownTest()

@lukastaegert
Copy link
Member

lukastaegert commented Nov 8, 2019

The reason is that if this was already checked for side-effects and we are STILL searching for side-effects, then the first call must have returned false and we can do an early return

Actually my explanation was somewhat incomplete. There are two cases:

  • we already did this check and it returned false: In this case, it is just a performance optimization
  • we are still finding out if this property access has side-effects and have reached the same expression again: In this case, it does not make sense to check it again as this check cannot reveal any new result, but it will also become an infinite loop. This is the main reason we do this.

I started drafting some docs

Very nice! I have looked through it, some points I came across:

Graph: Loads the plugins

I think this is misleading. It creates the PluginDriver, which is basically a plugin manager. The plugins themselves are already intantiated and working when they are passed to Rollup by the user, it is just an adapter to unify and simplify accessing the plugins by Rollup.

generation stage

This name choice is probably not ideal as generation in Rollup usually refers to whatever happens when .generate is called on a bundle returned by rollup.rollup (i.e. when an output is rendered). Chunking as such is still part of the build stage. So maybe Chunk generation stage or Chunking stage. BTW I hope that at some point we can actually move this stage to the generate phase. It does not take a lot of time, but this would allow having different chunks per output, or some outputs with inlined dynamic imports.

Otherwise really nice so far, great work.

might be a bit detailed

Probably a little, but we'll see.

One thing I'm trying to understand now is whether it's possible for deoptimizeCache() to be called on a DeoptimizableEntity after a hasEffects() or include() call where this.testValue might already have been used

Actually this can happen when the call is moved, but this should not a problem. Consider this example:

let testValue = true;

let reassignTestValue = () => {};
reassignTestValue = () => testValue = false; // A

reassignTestValue(); // B

if (testValue) { // C
  console.log(1);
} else {
  console.log(2);
}

Here is what should happen:

  • In line (A), Rollup internally marks reassignTestValue as "unknown" (Note that the variable itself is unused at that point and not yet included).
  • That way in line (B), it will not check the declaration for side-effects but just assume that the call has side-effects and needs to be retained. Thus hasEffects is not called on the AssignmentExpression testValue = false during this tree-shaking pass.
  • However, Rollup will add the included flag to testValue
  • Thus during the first tree-shaking pass, it will assume that testValue is true in line C.
  • However as during this pass, a lot of new variables were added, this triggers a new pass (this is the relevant code that triggers new passes:
    this.graph.needsTreeshakingPass = true;
    )
  • During the second pass, Rollup sees that reassignTestValue is included. Therefore, it will look at the function body and check each statement for side-effects to see which are relevant. Thus at this stage, testValue will be deoptimized and therefore deoptimizeCache will be called on the IfStatement.
  • Then when the second pass reaches the IfStatement, the other branch is included.

Note that every statement that was previously included is included again for each new pass. The reason is that since new variables were included, expressions that previously did not have side-effects may now have side-effects, and for instance the body of an included function could receive additional included statements during the next pass. AssignmentExpressions are the most important example here, but of course this extends to calls to functions that contain assignments etc. Thus on the first pass includeKnownTest is called while on the second pass, it calls includeUnknownTest, so all is fine.

@tjenkinson
Copy link
Member Author

Closing this now given the pr is merged :)

I haven’t had time to read/process all this yet so I might come back with more questions in the future

@JSerFeng
Copy link

JSerFeng commented Apr 16, 2022

Thanks for the super detailed explanation and the comments on the PR!

The reason is that if this was already checked for side-effects and we are STILL searching for side-effects, then the first call must have returned false and we can do an early return.

Took a while but yeh I get this now :D

I started drafting some docs here, which from what you said before might be a bit detailed, but it's helping me for now anyway

One thing I'm trying to understand now is whether it's possible for deoptimizeCache() to be called on a DeoptimizableEntity after a hasEffects() or include() call where this.testValue might already have been used

Searching the code currently it seems like this might not be possible, but moving the deoptimize call into shouldBeIncluded() or hasEffects() in the other PR might cause this?

In an IfStatement for example this could cause the includeKnownTest() to run when it should have been includeUnknownTest()

Thanks from 2022, finding ways to learn more about Rollup internal😃, love that dev guide

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants