Step generation is done sequentially, making providers that block in Diff extremely slow #12351

Open
Tracked by #11598
aschleck opened this issue Mar 4, 2023 · 9 comments · May be fixed by #15026 or #13261
Labels
area/engine Pulumi engine customer/feedback Feedback from customers impact/performance Something is slower than expected kind/enhancement Improvements or new features

Comments


aschleck commented Mar 4, 2023

What happened?

(as discussed in https://pulumi-community.slack.com/archives/C84L4E3N1/p1677886059484479)

pulumi-kubernetes with 489 resources takes 3 minutes to run preview (without any refreshing). This feels sad.

Under profiling, we see that many RegisterResource calls are started at the same time, but appear to execute sequentially:
[screenshot: execution trace showing RegisterResource spans starting together but running one after another]

After digging in, I believe the problem is that pulumi-kubernetes makes API calls in its Diff method: https://github.com/pulumi/pulumi-kubernetes/blob/master/provider/pkg/provider/provider.go#L1559 . The step generation logic in deployment_executor.go loops through events one at a time (`case event := <-incomingEvents:`) and then processes each with a blocking call (`steps, res = ex.stepGen.GenerateSteps(e)`). Note that the resulting step chain is executed in parallel, but the problem here is that Diff is called as part of generating the chain.
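
To make the shape of the problem concrete, here is a heavily simplified sketch of that loop; the function signatures are made up for illustration and are not the actual deployment_executor code:

```go
// Heavily simplified sketch, not the real engine code: a single event loop
// that calls the step generator synchronously serializes every provider Diff
// behind the previous one.
func runEventLoop(
	ctx context.Context,
	incomingEvents <-chan SourceEvent,
	generate func(SourceEvent) ([]Step, error), // stands in for stepGen.GenerateSteps
	execute func([]Step), // stands in for stepExec.ExecuteSerial
) error {
	for {
		select {
		case event := <-incomingEvents:
			// generate eventually calls provider.Diff, which can block on
			// network I/O (e.g. Kubernetes API calls) for this one resource...
			steps, err := generate(event)
			if err != nil {
				return err
			}
			execute(steps)
			// ...and only after it returns is the next event read at all.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

Every blocking Diff therefore adds its full latency to the preview's wall-clock time, even though the step executor itself runs chains in parallel.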

After changing step generation to be entirely non-blocking, we suddenly see faster execution (3 minutes -> 30 seconds) and a more heterogeneous execution trace:
[screenshot: execution trace showing RegisterResource spans overlapping and completing much sooner]

Expected Behavior

I expected Pulumi preview to be fast.

Steps to reproduce

Have a project with many pulumi-kubernetes resources, then generate a trace while running preview.
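
(For anyone trying to reproduce this: a trace can be captured with the CLI's tracing flag, e.g. `pulumi preview --tracing=file:./preview.trace`.)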

Output of pulumi about

CLI
Version      3.55.0
Go Version   go1.20
Go Compiler  gc

Plugins
NAME           VERSION
kubernetes     3.23.1
python         unknown

Host
OS       ubuntu
Version  22.04
Arch     x86_64

Additional context

I have a hacky unsafe patch that I made to verify that making step generation parallel makes a huge difference. The part I know is wrong is that errors are hidden, and instead of Pulumi terminating on an error it just hangs. I think it is likely unsafe in other ways (one guess: it crashes sometimes with fatal error: concurrent map read and map write.)

diff --git a/pkg/resource/deploy/deployment_executor.go b/pkg/resource/deploy/deployment_executor.go
index cf6b3738d..1337ffcbf 100644
--- a/pkg/resource/deploy/deployment_executor.go
+++ b/pkg/resource/deploy/deployment_executor.go
@@ -19,6 +19,7 @@ import (
 	"errors"
 	"fmt"
 	"strings"
+	"sync"

 	"github.com/pulumi/pulumi/pkg/v3/resource/deploy/providers"
 	"github.com/pulumi/pulumi/pkg/v3/resource/graph"
@@ -38,6 +39,7 @@ type deploymentExecutor struct {

 	stepGen  *stepGenerator // step generator owned by this deployment
 	stepExec *stepExecutor  // step executor owned by this deployment
+	workers  sync.WaitGroup
 }

 // checkTargets validates that all the targets passed in refer to existing resources.  Diagnostics
@@ -271,6 +273,7 @@ func (ex *deploymentExecutor) Execute(callerCtx context.Context, opts Options, p
 		}
 	}()

+	ex.workers.Wait()
 	ex.stepExec.WaitForCompletion()
 	logging.V(4).Infof("deploymentExecutor.Execute(...): step executor has completed")

@@ -411,25 +414,34 @@ func (ex *deploymentExecutor) performDeletes(
 func (ex *deploymentExecutor) handleSingleEvent(event SourceEvent) result.Result {
 	contract.Requiref(event != nil, "event", "must not be nil")

-	var steps []Step
-	var res result.Result
 	switch e := event.(type) {
-	case RegisterResourceEvent:
-		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received RegisterResourceEvent")
-		steps, res = ex.stepGen.GenerateSteps(e)
-	case ReadResourceEvent:
-		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received ReadResourceEvent")
-		steps, res = ex.stepGen.GenerateReadSteps(e)
 	case RegisterResourceOutputsEvent:
 		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received register resource outputs")
 		return ex.stepExec.ExecuteRegisterResourceOutputs(e)
 	}

-	if res != nil {
-		return res
-	}
+	ex.workers.Add(1)
+	go func() {
+		defer ex.workers.Done()
+
+		var steps []Step
+		var res result.Result
+		switch e := event.(type) {
+		case RegisterResourceEvent:
+			logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received RegisterResourceEvent")
+			steps, res = ex.stepGen.GenerateSteps(e)
+		case ReadResourceEvent:
+			logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received ReadResourceEvent")
+			steps, res = ex.stepGen.GenerateReadSteps(e)
+		}
+
+		if res != nil {
+			return
+		}
+
+		ex.stepExec.ExecuteSerial(steps)
+	}()

-	ex.stepExec.ExecuteSerial(steps)
 	return nil
 }

@@ -453,6 +465,7 @@ func (ex *deploymentExecutor) importResources(
 		preview:    preview,
 	}
 	res := importer.importResources(ctx)
+	ex.workers.Wait()
 	stepExec.SignalCompletion()
 	stepExec.WaitForCompletion()

@@ -503,6 +516,7 @@ func (ex *deploymentExecutor) refresh(callerCtx context.Context, opts Options, p
 	ctx, cancel := context.WithCancel(callerCtx)
 	stepExec := newStepExecutor(ctx, cancel, ex.deployment, opts, preview, true)
 	stepExec.ExecuteParallel(steps)
+	ex.workers.Wait()
 	stepExec.SignalCompletion()
 	stepExec.WaitForCompletion()

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@aschleck aschleck added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Mar 4, 2023
@justinvp justinvp added impact/performance Something is slower than expected area/engine Pulumi engine and removed needs-triage Needs attention from the triage team labels Mar 7, 2023

justinvp commented Mar 7, 2023

Thanks for opening an issue and for the analysis, @aschleck! I'm going to discuss with the team to see what we could do to improve this.

aschleck (Author) commented:

Thank you @justinvp for taking a look at this! Have you folks had any opportunity to discuss what a solution might look like? I'm considering taking a stab at fixing it (since this is shockingly impactful) but I wouldn't want to spend time trying approaches that the team knows will fail.


gj commented Mar 23, 2023

Hey @justinvp, any update from the team? I believe this is contributing to extremely slow preview runs for a stack with 1,312 resources.

RobbieMcKinstry (Contributor) commented:

Hello! I spoke with Justin and the team late last week.

@aschleck, we would accept a community PR for this feature! Speaking for myself, I'd love to see a community contribution, especially one this impactful. 😄

However, I just want to issue a word of caution. This change is at the very heart of the engine. When we make changes in this area, we tread very gingerly because any change here will impact every user of Pulumi. We expect the bar to be very high on a PR for this change, and it might take some significant back-and-forth to get it merged. I also received feedback that the team expects the change to be difficult, difficult enough that it's outside of the realm of what we would normally expect to see from community contributions.

I don't say any of this to dissuade you, of course! 😸 Like I said, we're happy to receive contributions, and I'd love to see a performance improvement like this come in! We just wanted to set expectations ahead of time. :-) I'd hate for you to have a bad experience as a contributor if you submitted a PR and felt like we were stringing you along!

RobbieMcKinstry (Contributor) commented:

FWIW, I think your approach using a WaitGroup is my first instinct as well. I think your assessment of the situation is great (though others know this section of the codebase a lot better than I do!)

The part I know is wrong is that errors are hidden, and instead of Pulumi terminating on an error it just hangs.

Yeah, that makes sense. My gut says you can return the errors over a channel, and close the channel after the wait group completes. I think it might be a little more complicated than that, because you'll have to aggregate the errors into a slice. So as not to block on a full channel from a Diffing goroutine, you'll also need to read from the channel concurrently.
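
Roughly the shape I have in mind; this is only a sketch with made-up names (generateAll, generateOne), not the engine's actual API:

```go
// Sketch only (requires "sync"): fan out step generation, aggregate any errors
// into a slice, and make sure senders can never block on the error channel.
func generateAll(events []SourceEvent, generateOne func(SourceEvent) error) []error {
	errs := make(chan error)
	done := make(chan struct{})
	var collected []error

	// A single reader drains the channel concurrently, so a goroutine that is
	// busy in Diff never blocks on a full channel when it reports an error.
	go func() {
		for err := range errs {
			collected = append(collected, err)
		}
		close(done)
	}()

	var workers sync.WaitGroup
	for _, e := range events {
		e := e
		workers.Add(1)
		go func() {
			defer workers.Done()
			if err := generateOne(e); err != nil {
				errs <- err
			}
		}()
	}

	workers.Wait() // all senders have returned...
	close(errs)    // ...so the channel can be closed safely,
	<-done         // ...and the reader has finished aggregating.
	return collected
}
```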

I think it is likely unsafe in other ways (one guess: it crashes sometimes with fatal error: concurrent map read and map write.)

We've seen that error before! We typically enable race detection with the -race flag and hunt down the problem. I've seen the issue occur during diff calculation, when calculating olds and news.
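
To sketch the shape of the fix for that class of failure (the field names here are made up, not the actual stepGenerator internals): once GenerateSteps can run on several goroutines at once, the generator's shared bookkeeping maps need a lock around every read and write.

```go
// Made-up fields for illustration: guard shared maps once step generation
// becomes concurrent (alternatively, switch to sync.Map).
type stepGenerator struct {
	mu   sync.Mutex
	seen map[resource.URN]bool
}

func (sg *stepGenerator) markSeen(urn resource.URN) {
	sg.mu.Lock()
	defer sg.mu.Unlock()
	sg.seen[urn] = true
}

func (sg *stepGenerator) hasSeen(urn resource.URN) bool {
	sg.mu.Lock()
	defer sg.mu.Unlock()
	return sg.seen[urn]
}
```

Running the tests with go test -race ./... will usually flush out whichever map gets hit first.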

@mikhailshilkov mikhailshilkov added the customer/feedback Feedback from customers label Jun 14, 2023
@dixler dixler self-assigned this Jun 23, 2023
@dixler dixler added this to the 0.90 milestone Jun 23, 2023
@dixler dixler linked a pull request Jun 23, 2023 that will close this issue
@dixler dixler removed this from the 0.90 milestone Jun 23, 2023
@Frassle Frassle added kind/enhancement Improvements or new features and removed kind/bug Some behavior is incorrect or out of spec labels Jun 28, 2023
aschleck (Author) commented:

I'm sorry for never responding, Robbie; I really appreciated your insight and just haven't had time to think about this until today. And a major thank you to dixler for getting a PR out in the meantime!

As I understand it, with pulumi/pulumi-kubernetes@6e329f3 my particular problem is solved (a preview now takes 15 seconds instead of 3 minutes). I defer to you folks on how you want to move forward. Please let me know if it'd be helpful for me to do further profiling and see if there's anything else I can find.

RobbieMcKinstry (Contributor) commented:

Hi @aschleck! Good to hear from you! I'm glad my last two messages didn't deter you from contributing; I was worried I was coming off too strong :)

Super cool! I'm glad Levi and Ramone were able to land such an impactful performance fix.

I'm no longer an OSS maintainer with Pulumi, so I'll also have to defer judgment. My professional opinion is to leave this issue open: it identifies a valuable optimization, and it's plausible that other providers (frankly, all providers) would benefit from this change. It might be open for a bit, since it's a gnarly one, but it's worthwhile to keep around as part of #11598.


Frassle commented Aug 16, 2023

I'm no longer an OSS maintainer with Pulumi, so I'll also have to defer judgment. My professional opinion is to leave this issue open: it identifies a valuable optimization, and it's plausible that other providers (frankly, all providers) would benefit from this change.

I agree, this will be hard but is worth looking into.


raelix commented Dec 5, 2023

Hey @justinvp, I saw this ticket and tested some of the interesting suggestions provided above. I can confirm that calling handleSingleEvent in a separate goroutine improves Pulumi's performance by at least 6x: on a stack with ~2k resources, a run completes in 6 minutes instead of 30.
Frankly, this could be (at least for me, and maybe for anyone handling dozens of resources) a deciding factor in going to production with Pulumi.
