Step generation is done sequentially, making providers that block in Diff extremely slow #12351

Open
Tracked by #11598
aschleck opened this issue Mar 4, 2023 · 9 comments · May be fixed by #15026 or #13261
Labels
area/engine Pulumi engine customer/feedback Feedback from customers impact/performance Something is slower than expected kind/enhancement Improvements or new features

Comments


aschleck commented Mar 4, 2023

What happened?

(as discussed in https://pulumi-community.slack.com/archives/C84L4E3N1/p1677886059484479)

pulumi-kubernetes with 489 resources takes 3 minutes to run preview (without any refreshing). This feels sad.

Under profiling, we see that many RegisterResource calls are started at the same time, but appear to execute sequentially:
[screenshot: execution trace showing RegisterResource spans starting together but running one after another]

After digging in, I believe the problem is that pulumi-kubernetes makes API calls in its Diff method: https://github.com/pulumi/pulumi-kubernetes/blob/master/provider/pkg/provider/provider.go#L1559 . The step generation logic in deployment_executor.go loops through events one at a time (`case event := <-incomingEvents:`) and then processes each with a blocking call (`steps, res = ex.stepGen.GenerateSteps(e)`). Note that the resulting step chain is executed in parallel, but the problem here is that Diff is called as part of generating the chain.
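
To make the shape of the problem concrete, here is a heavily simplified sketch of that loop; the function signatures are made up for illustration and are not the actual deployment_executor code:

```go
// Heavily simplified sketch, not the real engine code: a single event loop
// that calls the step generator synchronously serializes every provider Diff
// behind the previous one.
func runEventLoop(
	ctx context.Context,
	incomingEvents <-chan SourceEvent,
	generate func(SourceEvent) ([]Step, error), // stands in for stepGen.GenerateSteps
	execute func([]Step), // stands in for stepExec.ExecuteSerial
) error {
	for {
		select {
		case event := <-incomingEvents:
			// generate eventually calls provider.Diff, which can block on
			// network I/O (e.g. Kubernetes API calls) for this one resource...
			steps, err := generate(event)
			if err != nil {
				return err
			}
			execute(steps)
			// ...and only after it returns is the next event read at all.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

Every blocking Diff therefore adds its full latency to the preview's wall-clock time, even though the step executor itself runs chains in parallel.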

After changing step generation to be entirely non-blocking, we suddenly see faster execution (3 minutes -> 30 seconds) and a more heterogeneous execution trace:
[screenshot: execution trace showing RegisterResource spans overlapping and completing much sooner]

Expected Behavior

I expected Pulumi preview to be fast.

Steps to reproduce

Have a project with many pulumi-kubernetes resources, then generate a trace while running preview.
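
(For anyone trying to reproduce this: a trace can be captured with the CLI's tracing flag, e.g. `pulumi preview --tracing=file:./preview.trace`.)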

Output of pulumi about

CLI
Version      3.55.0
Go Version   go1.20
Go Compiler  gc

Plugins
NAME           VERSION
kubernetes     3.23.1
python         unknown

Host
OS       ubuntu
Version  22.04
Arch     x86_64

Additional context

I have a hacky unsafe patch that I made to verify that making step generation parallel makes a huge difference. The part I know is wrong is that errors are hidden, and instead of Pulumi terminating on an error it just hangs. I think it is likely unsafe in other ways (one guess: it crashes sometimes with fatal error: concurrent map read and map write.)

diff --git a/pkg/resource/deploy/deployment_executor.go b/pkg/resource/deploy/deployment_executor.go
index cf6b3738d..1337ffcbf 100644
--- a/pkg/resource/deploy/deployment_executor.go
+++ b/pkg/resource/deploy/deployment_executor.go
@@ -19,6 +19,7 @@ import (
 	"errors"
 	"fmt"
 	"strings"
+	"sync"

 	"github.com/pulumi/pulumi/pkg/v3/resource/deploy/providers"
 	"github.com/pulumi/pulumi/pkg/v3/resource/graph"
@@ -38,6 +39,7 @@ type deploymentExecutor struct {

 	stepGen  *stepGenerator // step generator owned by this deployment
 	stepExec *stepExecutor  // step executor owned by this deployment
+	workers  sync.WaitGroup
 }

 // checkTargets validates that all the targets passed in refer to existing resources.  Diagnostics
@@ -271,6 +273,7 @@ func (ex *deploymentExecutor) Execute(callerCtx context.Context, opts Options, p
 		}
 	}()

+	ex.workers.Wait()
 	ex.stepExec.WaitForCompletion()
 	logging.V(4).Infof("deploymentExecutor.Execute(...): step executor has completed")

@@ -411,25 +414,34 @@ func (ex *deploymentExecutor) performDeletes(
 func (ex *deploymentExecutor) handleSingleEvent(event SourceEvent) result.Result {
 	contract.Requiref(event != nil, "event", "must not be nil")

-	var steps []Step
-	var res result.Result
 	switch e := event.(type) {
-	case RegisterResourceEvent:
-		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received RegisterResourceEvent")
-		steps, res = ex.stepGen.GenerateSteps(e)
-	case ReadResourceEvent:
-		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received ReadResourceEvent")
-		steps, res = ex.stepGen.GenerateReadSteps(e)
 	case RegisterResourceOutputsEvent:
 		logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received register resource outputs")
 		return ex.stepExec.ExecuteRegisterResourceOutputs(e)
 	}

-	if res != nil {
-		return res
-	}
+	ex.workers.Add(1)
+	go func() {
+		defer ex.workers.Done()
+
+		var steps []Step
+		var res result.Result
+		switch e := event.(type) {
+		case RegisterResourceEvent:
+			logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received RegisterResourceEvent")
+			steps, res = ex.stepGen.GenerateSteps(e)
+		case ReadResourceEvent:
+			logging.V(4).Infof("deploymentExecutor.handleSingleEvent(...): received ReadResourceEvent")
+			steps, res = ex.stepGen.GenerateReadSteps(e)
+		}
+
+		if res != nil {
+			return
+		}
+
+		ex.stepExec.ExecuteSerial(steps)
+	}()

-	ex.stepExec.ExecuteSerial(steps)
 	return nil
 }

@@ -453,6 +465,7 @@ func (ex *deploymentExecutor) importResources(
 		preview:    preview,
 	}
 	res := importer.importResources(ctx)
+	ex.workers.Wait()
 	stepExec.SignalCompletion()
 	stepExec.WaitForCompletion()

@@ -503,6 +516,7 @@ func (ex *deploymentExecutor) refresh(callerCtx context.Context, opts Options, p
 	ctx, cancel := context.WithCancel(callerCtx)
 	stepExec := newStepExecutor(ctx, cancel, ex.deployment, opts, preview, true)
 	stepExec.ExecuteParallel(steps)
+	ex.workers.Wait()
 	stepExec.SignalCompletion()
 	stepExec.WaitForCompletion()

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@aschleck aschleck added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Mar 4, 2023
@justinvp justinvp added impact/performance Something is slower than expected area/engine Pulumi engine and removed needs-triage Needs attention from the triage team labels Mar 7, 2023

justinvp commented Mar 7, 2023

Thanks for opening an issue and for the analysis, @aschleck! I'm going to discuss with the team to see what we could do to improve this.

aschleck (Author) commented:

Thank you @justinvp for taking a look at this! Have you folks had any opportunity to discuss what a solution might look like? I'm considering taking a stab at fixing it (since this is shockingly impactful) but I wouldn't want to spend time trying approaches that the team knows will fail.


gj commented Mar 23, 2023

Hey @justinvp, any update from the team? I believe this is contributing to extremely slow preview runs for a stack with 1,312 resources.

RobbieMcKinstry (Contributor) commented:

Hello! I spoke with Justin and the team late last week.

@aschleck, we would accept a community PR for this feature! Speaking for myself, I'd love to see a community contribution, especially one this impactful. 😄

However, I just want to issue a word of caution. This change is at the very heart of the engine. When we make changes in this area, we tread very gingerly because any change here will impact every user of Pulumi. We expect the bar to be very high on a PR for this change, and it might take some significant back-and-forth to get it merged. I also received feedback that the team expects the change to be difficult, difficult enough that it's outside of the realm of what we would normally expect to see from community contributions.

I don't say any of this to dissuade you, of course! 😸 Like I said, we're happy to receive contributions, and I'd love to see a performance improvement like this come in! We just wanted to set expectations ahead of time. :-) I'd hate for you to have a bad experience as a contributor if you submitted a PR and felt like we were stringing you along!

RobbieMcKinstry (Contributor) commented:

FWIW, I think your approach using a WaitGroup is my first instinct as well. I think your assessment of the situation is great (though others know this section of the codebase a lot better than I do!)

The part I know is wrong is that errors are hidden, and instead of Pulumi terminating on an error it just hangs.

Yeah, that makes sense. My gut says you can return the errors over a channel, and close the channel after the wait group completes. I think it might be a little more complicated than that, because you'll have to aggregate the errors into a slice. So as not to block on a full channel from a Diffing goroutine, you'll also need to read from the channel concurrently.
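
Roughly the shape I have in mind; this is only a sketch with made-up names (generateAll, generateOne), not the engine's actual API:

```go
// Sketch only (requires "sync"): fan out step generation, aggregate any errors
// into a slice, and make sure senders can never block on the error channel.
func generateAll(events []SourceEvent, generateOne func(SourceEvent) error) []error {
	errs := make(chan error)
	done := make(chan struct{})
	var collected []error

	// A single reader drains the channel concurrently, so a goroutine that is
	// busy in Diff never blocks on a full channel when it reports an error.
	go func() {
		for err := range errs {
			collected = append(collected, err)
		}
		close(done)
	}()

	var workers sync.WaitGroup
	for _, e := range events {
		e := e
		workers.Add(1)
		go func() {
			defer workers.Done()
			if err := generateOne(e); err != nil {
				errs <- err
			}
		}()
	}

	workers.Wait() // all senders have returned...
	close(errs)    // ...so the channel can be closed safely,
	<-done         // ...and the reader has finished aggregating.
	return collected
}
```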

I think it is likely unsafe in other ways (one guess: it crashes sometimes with fatal error: concurrent map read and map write.)

We've seen that error before! We typically enable race detection with the -race flag and hunt down the problem. I've seen the issue occur during diff calculation, when calculating olds and news.
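
To sketch the shape of the fix for that class of failure (the field names here are made up, not the actual stepGenerator internals): once GenerateSteps can run on several goroutines at once, the generator's shared bookkeeping maps need a lock around every read and write.

```go
// Made-up fields for illustration: guard shared maps once step generation
// becomes concurrent (alternatively, switch to sync.Map).
type stepGenerator struct {
	mu   sync.Mutex
	seen map[resource.URN]bool
}

func (sg *stepGenerator) markSeen(urn resource.URN) {
	sg.mu.Lock()
	defer sg.mu.Unlock()
	sg.seen[urn] = true
}

func (sg *stepGenerator) hasSeen(urn resource.URN) bool {
	sg.mu.Lock()
	defer sg.mu.Unlock()
	return sg.seen[urn]
}
```

Running the tests with go test -race ./... will usually flush out whichever map gets hit first.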

@mikhailshilkov mikhailshilkov added the customer/feedback Feedback from customers label Jun 14, 2023
@dixler dixler self-assigned this Jun 23, 2023
@dixler dixler added this to the 0.90 milestone Jun 23, 2023
@dixler dixler linked a pull request Jun 23, 2023 that will close this issue
@dixler dixler removed this from the 0.90 milestone Jun 23, 2023
@Frassle Frassle added kind/enhancement Improvements or new features and removed kind/bug Some behavior is incorrect or out of spec labels Jun 28, 2023
aschleck (Author) commented:

I'm sorry for never responding, Robbie; I really appreciated your insight and just haven't had time to think about this until today. And a major thank you to dixler for getting a PR out in the meantime!

As I understand it, with pulumi/pulumi-kubernetes@6e329f3 my particular problem is solved (a preview now takes 15 seconds instead of 3 minutes). I defer to you folks on how you want to move forward. Please let me know if it'd be helpful for me to do further profiling and see if there's anything else I can find.

RobbieMcKinstry (Contributor) commented:

Hi @aschleck! Good to hear from you! I'm glad my last two messages didn't deter you from contributing; I was worried I was coming off too strong :)

Super cool! I'm glad Levi and Ramone were able to land such an impactful performance fix.

I'm no longer an OSS maintainer with Pulumi, so I'll also have to defer judgment. My professional opinion is to leave this issue open: it identifies a valuable optimization, and it's plausible that other providers (frankly, all providers) would benefit from this change. It might be open for a bit, since it's a gnarly one, but it's worthwhile to keep around as part of #11598.


Frassle commented Aug 16, 2023

I'm no longer an OSS maintainer with Pulumi, so I'll also have to defer judgment. My professional opinion is to leave this issue open: it identifies a valuable optimization, and it's plausible that other providers (frankly, all providers) would benefit from this change.

I agree, this will be hard but is worth looking into.


raelix commented Dec 5, 2023

Hey @justinvp, I saw this ticket and tested some of the interesting suggestions provided above. I can confirm that calling handleSingleEvent in a separate goroutine improves Pulumi's performance by at least 6x: on a stack with ~2k resources, a run completes in 6 minutes instead of 30.
Frankly, this could be (at least for me, and maybe for anyone handling dozens of resources) a deciding factor in going to production with Pulumi.
