Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(run): provides RunWithGracefulShutdown #564

Closed
wants to merge 1 commit into from

Conversation

liufuyang
Copy link
Contributor

@liufuyang liufuyang commented Nov 16, 2023

So to allow users easily wire up a clean-up function that can be called BEFORE the root context is done.

closes #563


So with typical cloudrunner code initiated like this:

Before:

        err := cloudrunner.Run(func(ctx context.Context) error {
		grpcServer := cloudrunner.NewGRPCServer(ctx)
		cleanup := func() {
			logger := cloudrunner.Logger(ctx)
			select {
			case <-ctx.Done():
				logger.Warn("Root context is already done, cannot use client anymore.")
			default:
				logger.Info("Root context is still live.")
			}
		}

		defer cleanup()
		cloudrunner.Logger(ctx).Info("hello world")

		return cloudrunner.ListenGRPC(ctx, grpcServer)
	})

Output when cancel the process:

2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run.go:173	up and running	{"config": {"cloudrunner": {"PORT": 8080, "K_SERVICE": "___Test_RunWithGracefulShutdown_helloWorld_ctx_cancel_should_before_clean_up_function_call2_in_go_einride_tech_cloudrunner.test", "K_REVISION": "", "K_CONFIGURATION": "", "CLOUD_RUN_JOB": "", "CLOUD_RUN_EXECUTION": "", "CLOUD_RUN_TASK_INDEX": 0, "CLOUD_RUN_TASK_ATTEMPT": 0, "CLOUD_RUN_TASK_COUNT": 0, "GOOGLE_CLOUD_PROJECT": "", "RUNTIME_SERVICEACCOUNT": "", "SERVICE_VERSION": "", "LOGGER_DEVELOPMENT": true, "LOGGER_LEVEL": "debug", "LOGGER_REPORTERRORS": false, "PROFILER_ENABLED": false, "PROFILER_MUTEXPROFILING": false, "PROFILER_ALLOCFORCEGC": true, "TRACEEXPORTER_ENABLED": false, "TRACEEXPORTER_TIMEOUT": "10s", "TRACEEXPORTER_SAMPLEPROBABILITY": 0.01, "METRICEXPORTER_ENABLED": false, "METRICEXPORTER_INTERVAL": "1m0s", "METRICEXPORTER_RUNTIMEINSTRUMENTATION": false, "METRICEXPORTER_HOSTINSTRUMENTATION": false, "METRICEXPORTER_OPENCENSUSPRODUCER": false, "SERVER_TIMEOUT": "4m50s", "CLIENT_TIMEOUT": "10s", "CLIENT_RETRY_ENABLED": true, "CLIENT_RETRY_INITIALBACKOFF": "200ms", "CLIENT_RETRY_MAXBACKOFF": "1m0s", "CLIENT_RETRY_MAXATTEMPTS": 5, "CLIENT_RETRY_BACKOFFMULTIPLIER": 2, "CLIENT_RETRY_RETRYABLESTATUSCODES": ["Unavailable", "Unknown"], "REQUESTLOGGER_MESSAGESIZELIMIT": 0, "REQUESTLOGGER_STATUSTOLEVEL": null}}, "resource": {"telemetry.sdk.language": "go", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.20.0"}, "buildInfo": {"mainPath": "", "goVersion": "go1.21.3", "buildSettings": {}}}
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run_test.go:117	hello world
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/run.go:220	watching for termination signals
2023-11-16T16:35:50.491+0100	info	cloudrunner-go/grpcserver.go:74	gRPCServer listening	{"address": ":8080"}
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:227	got signal:	{"signal": "interrupt"}
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:229	cloudrunner.Shutdown is not used. Canceling root context directly. Call RunWithGracefulShutdown(...) to enable graceful shutdown if preferred.
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/grpcserver.go:71	gRPCServer shutting down
2023-11-16T16:35:54.624+0100	warn	cloudrunner-go/run_test.go:110	Root context is already done, cannot use client anymore.
2023-11-16T16:35:54.624+0100	info	cloudrunner-go/run.go:202	goodbye

Note

Noticing it says Root context is already done, cannot use client anymore.


After:

        err := cloudrunner.RunWithGracefulShutdown(func(ctx context.Context, shutdown *cloudrunner.Shutdown) error {
		grpcServer := cloudrunner.NewGRPCServer(ctx)
		cleanup := func() {
			logger := cloudrunner.Logger(ctx)
			select {
			case <-ctx.Done():
				logger.Warn("Root context is already done, cannot use client anymore.")
			default:
				logger.Info("Root context is still live.")
			}
		}

		shutdown.RegisterCancelFunc(cleanup)
		cloudrunner.Logger(ctx).Info("hello world")

		return cloudrunner.ListenGRPC(ctx, grpcServer)
	})

Output when cancel the process:

2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run.go:173	up and running	{"config": {"cloudrunner": {"PORT": 8080, "K_SERVICE": "___Test_RunWithGracefulShutdown_helloWorld_ctx_cancel_should_before_clean_up_function_call2_in_go_einride_tech_cloudrunner.test", "K_REVISION": "", "K_CONFIGURATION": "", "CLOUD_RUN_JOB": "", "CLOUD_RUN_EXECUTION": "", "CLOUD_RUN_TASK_INDEX": 0, "CLOUD_RUN_TASK_ATTEMPT": 0, "CLOUD_RUN_TASK_COUNT": 0, "GOOGLE_CLOUD_PROJECT": "", "RUNTIME_SERVICEACCOUNT": "", "SERVICE_VERSION": "", "LOGGER_DEVELOPMENT": true, "LOGGER_LEVEL": "debug", "LOGGER_REPORTERRORS": false, "PROFILER_ENABLED": false, "PROFILER_MUTEXPROFILING": false, "PROFILER_ALLOCFORCEGC": true, "TRACEEXPORTER_ENABLED": false, "TRACEEXPORTER_TIMEOUT": "10s", "TRACEEXPORTER_SAMPLEPROBABILITY": 0.01, "METRICEXPORTER_ENABLED": false, "METRICEXPORTER_INTERVAL": "1m0s", "METRICEXPORTER_RUNTIMEINSTRUMENTATION": false, "METRICEXPORTER_HOSTINSTRUMENTATION": false, "METRICEXPORTER_OPENCENSUSPRODUCER": false, "SERVER_TIMEOUT": "4m50s", "CLIENT_TIMEOUT": "10s", "CLIENT_RETRY_ENABLED": true, "CLIENT_RETRY_INITIALBACKOFF": "200ms", "CLIENT_RETRY_MAXBACKOFF": "1m0s", "CLIENT_RETRY_MAXATTEMPTS": 5, "CLIENT_RETRY_BACKOFFMULTIPLIER": 2, "CLIENT_RETRY_RETRYABLESTATUSCODES": ["Unavailable", "Unknown"], "REQUESTLOGGER_MESSAGESIZELIMIT": 0, "REQUESTLOGGER_STATUSTOLEVEL": null}}, "resource": {"telemetry.sdk.language": "go", "telemetry.sdk.name": "opentelemetry", "telemetry.sdk.version": "1.20.0"}, "buildInfo": {"mainPath": "", "goVersion": "go1.21.3", "buildSettings": {}}}
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run.go:220	watching for termination signals
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/run_test.go:117	hello world
2023-11-16T16:33:04.269+0100	info	cloudrunner-go/grpcserver.go:74	gRPCServer listening	{"address": ":8080"}
2023-11-16T16:33:43.245+0100	info	cloudrunner-go/run.go:227	got signal:	{"signal": "interrupt"}
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:237	graceful shutdown has begun
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run_test.go:112	Root context is still live.
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:241	Shutdown.shutdownFunc() has finished, meaning we will shutdown cleanly
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:247	exiting by canceling root context due to shutdown signal
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/grpcserver.go:71	gRPCServer shutting down
2023-11-16T16:33:43.247+0100	info	cloudrunner-go/run.go:202	goodbye

Note

Noticing it now says Root context is still live.

@liufuyang liufuyang requested a review from a team as a code owner November 16, 2023 15:37
@liufuyang liufuyang force-pushed the ensure-graceful-shutdown branch 4 times, most recently from f8a9f4a to f66f10b Compare November 16, 2023 15:58
func Run(fn func(context.Context) error, options ...Option) (err error) {
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
Copy link
Contributor Author

@liufuyang liufuyang Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old implementation makes a hard bond between the root context and SIGTERM, which means when a SIGTERM is received, the ctx will be canceled directly and user has no control of it.

New implementation breaks the bond, see trapShutdownSignal() for details. Taking some ideas from this post. https://medium.com/over-engineering/graceful-shutdown-with-go-http-servers-and-kubernetes-rolling-updates-6697e7db17cf

// (default as 30 seconds) above this value to make sure Kubernetes can wait for enough time for graceful shutdown.
// More info see here:
// https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
const gracefulShutdownMaxGracePeriod = time.Second * 10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 10 seconds here as in our preferred CloudRun environment the terminationGracePeriodSeconds seems not configurable and a fixed 10s is specified in the CloudRun doc.

https://cloud.google.com/run/docs/container-contract#instance-shutdown

Perhaps I should put this in the comment as well.

@liufuyang liufuyang requested a review from odsod November 17, 2023 09:03
@liufuyang liufuyang changed the title feat(run): provides RunWithGracefulShutdownHook feat(run): provides RunWithGracefulShutdown Nov 19, 2023
So to allow user easily wire up a clean up function
that can be called BEFORE the root context is done.
@quoral
Copy link
Contributor

quoral commented Jan 5, 2024

Hey @liufuyang!

We've recently run into similar issues that this PR attempts to solve, and we've merged a very preliminary solution in #585. It does not solve the use-case for some things that you are aiming to solve in this PR however, like pull-based PubSub cloudrun instances.

There's work planned to implement those fixes down the line as well, so I'll keep you in the loop on how that goes with an upcoming ADR where we can discuss if that solution approaches a solution for your usecase. For now as I think we're gonna take that implementation in a different direction than these proposed changes, I'll close this PR and we can move the future discussions over to #563.

Thanks again!

@quoral quoral closed this Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better graceful shutdown
2 participants