Scheduled rollout #727

Athosone · 2022-03-08T19:29:00Z

Feature description

This feature lets you restrict deployments to a scheduled period.

A period is composed of a cron and a duration.

Together, these two properties define a window during which the application can be deployed.

The schedule can be set on the cluster, on the bundle, or on both.

If schedules are set on both the cluster and the bundle then the bundle schedule takes priority over the cluster schedule.

We chose to prioritise the bundle over the cluster as we think that the user may want to specify exceptions to the global rule of a cluster schedule.
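The precedence rule above could be sketched in Go roughly as follows. This is only an illustration and not the code from this PR: the `Schedule` type and the `effectiveSchedule` function are hypothetical names, and only the `cron` and `duration` properties come from the description.

```go
package main

import "fmt"

// Schedule mirrors the two properties described above.
// It is a hypothetical stand-in, not Fleet's actual API type.
type Schedule struct {
	Cron     string // e.g. "0 16 * * *"
	Duration string // e.g. "1h"
}

// effectiveSchedule returns the schedule that applies to a deployment:
// the bundle schedule when one is set, otherwise the cluster schedule.
// A nil result means that no schedule is configured at all.
func effectiveSchedule(bundle, cluster *Schedule) *Schedule {
	if bundle != nil {
		return bundle
	}
	return cluster
}

func main() {
	cluster := &Schedule{Cron: "0 16 * * *", Duration: "1h"}
	bundle := &Schedule{Cron: "0 2 * * 6", Duration: "4h"}

	fmt.Println(effectiveSchedule(bundle, cluster).Cron) // bundle schedule is used
	fmt.Println(effectiveSchedule(nil, cluster).Cron)    // falls back to the cluster schedule
	fmt.Println(effectiveSchedule(nil, nil) == nil)      // nothing configured
}
```

With this shape, a bundle can carve out an exception (here, a Saturday night window) while every other bundle follows the cluster-wide rule.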

Usage Example

fleet.yaml

defaultNamespace: default
namespace: default

schedule:
  cron: "0 16 * * *"
  duration: "1h"

cluster

apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: c-kd88w
  namespace: fleet-default
spec:
  paused: false
  deploymentSchedule:
    cron: "0 16 * * *"
    duration: "1h"

Motivation for this PR

We noticed there was an existing PR #450 for this feature related to #383, but saw that there was no activity on it.

Also, two implementation details in that PR were not right in our opinion:

If the agent was triggered during the schedule window, the deployment was skipped and scheduled for the next window. If your windows were far apart (days, weeks, months), it would take a while for the bundle to be installed.
If the bundle had already been scheduled (https://github.com/rancher/fleet/pull/450/files#diff-8b83e3ec81ef037af3e68026d07535637ca1d84af565be9bcda91135c4b80714R111), the nextRun was evaluated from the stored ScheduledAt instead of being computed from the cron and duration. As a result, the bundle stayed scheduled for that nextRun even if the cron and duration no longer matched the ScheduledAt.
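To illustrate the behaviour we are aiming for, here is a minimal Go sketch that recomputes the window on every evaluation and reports it as open when the current time falls inside it. It is not the code from this PR. To stay self-contained it does not parse cron expressions: a daily start time (hour and minute) stands in for a cron such as "0 16 * * *", and the names `lastDailyStart`, `windowOpen` and `at` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lastDailyStart returns the most recent daily activation at hour:minute
// that is not after now. It stands in for "previous cron activation".
func lastDailyStart(now time.Time, hour, minute int) time.Time {
	start := time.Date(now.Year(), now.Month(), now.Day(), hour, minute, 0, 0, now.Location())
	if start.After(now) {
		start = start.AddDate(0, 0, -1)
	}
	return start
}

// windowOpen reports whether now lies in [start, start+duration), where
// start is recomputed from the schedule each time rather than read from
// a previously stored timestamp.
func windowOpen(now time.Time, hour, minute int, duration string) (bool, error) {
	d, err := time.ParseDuration(duration)
	if err != nil {
		return false, err
	}
	start := lastDailyStart(now, hour, minute)
	return now.Before(start.Add(d)), nil
}

// at builds a UTC time on an arbitrary fixed day, for the example only.
func at(hour, minute int) time.Time {
	return time.Date(2022, time.March, 8, hour, minute, 0, 0, time.UTC)
}

func main() {
	for _, now := range []time.Time{at(15, 59), at(16, 0), at(16, 30), at(17, 0)} {
		open, err := windowOpen(now, 16, 0, "1h")
		fmt.Println(now.Format("15:04"), open, err)
	}
}
```

Because the window is derived from the schedule at evaluation time, an agent triggered at 16:30 deploys right away instead of waiting for the next day, and a changed cron or duration takes effect on the next evaluation.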

Co-authored-by: Christian Artin <gravufo@gmail.com>

gravufo · 2022-03-08T19:32:08Z

@ibrokethecloud @nickgerace FYI, tagging you because you were involved in the previous PR.

Co-authored-by: Christian Artin <gravufo@gmail.com>

richard-cox · 2022-03-22T08:58:50Z

I've raised this with the Fleet team, they're aware and will start working through community PRs soon.

SheilaghM · 2022-03-23T21:31:20Z

Tagging @ibrokethecloud - will you please review this, as it brings up two instances not covered in yours, and advise whether yours should be closed in favor of this one?

gravufo · 2022-05-06T14:26:10Z

@SheilaghM Any news? We are now in May and we still have no feedback after 2 months.

ibrokethecloud · 2022-05-09T04:33:07Z

Hi @Athosone thanks a lot for your PR.
There are a few minor tweaks needed which should make the agent more efficient in scheduling the changes.

@SheilaghM can we please have some clarity on whether the cluster-scoped schedule supersedes the bundle-scoped one, or are we happy to have the bundle schedule take priority?

Athosone · 2022-05-09T13:40:00Z

Hi :)!

Sure, no problem. Let me know and I'll fix it!

Athosone · 2022-05-16T13:51:56Z

Hey :)!

Did you have the time to think about this PR?

@ibrokethecloud
@SheilaghM

SheilaghM · 2022-05-25T15:44:02Z

@Athosone - We are still discussing the Fleet use cases with Product Management. We will act on this one way or another as soon as we have clarity.

manno · 2022-06-14T12:16:17Z

docs/gitrepo-structure.md

@@ -53,6 +53,17 @@ defaultNamespace: default
 # Default: ""
 namespace: default

+# Specify a deployment schedule to deploy bundle during that window.
+# In this example we will deploy bundles every monday from 4pm to 5pm.
+# If a helm timeout is specified in the helm structure below, it will be considered in the schedule evaluation.


I'm not too familiar with the Helm code. I'm a bit concerned this will not be obvious to users; maybe we need to log that the "job will not be executed, because its timeout is too large for the schedule window".

But is it really necessary to take the Helm timeout into account? As far as I understand, Helm tasks are never cancelled, so we rely on the Helm library to honor the timeout value. If the timeout is 0, wait is not used; are we certain Helm will finish immediately in that case? I feel like it must be possible to make it take very long, maybe by using hooks or lookups.
In any case, the actual Helm deployment going on in the cluster might take longer. So even when you take the timeout into consideration, there is no guarantee the cluster is in a ready state after the schedule window?

Athosone · 2022-06-14

Hi!

The way we approached it: if, as a user, I don't specify a timeout, it means I don't care how long the installation takes (and Helm defaults it to 5 minutes).
It could also mean that you know there are no hooks or lookups and that it will be installed instantly. It would not be accurate to reserve a five-minute block in that case.

On the other hand, a user that wants to have some kind of mechanism to prevent an installation from taking too much time, could specify a timeout in order for helm to cancel the installation (and maybe rollback if atomic is specified).

As you said, we cannot guarantee the cluster will be in a ready state after the schedule window; it is a best-effort approach.

I tested it with a simple chart including hooks. The post-install hook includes a job that sleeps for 350 seconds.

I launched the install by using:

helm install httpbin .

The result is:

Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition

the job:

apiVersion: batch/v1
kind: Job
metadata:
  name: sleep
  annotations:
    helm.sh/hook: post-install
spec:
  template:
    spec:
      containers:
        - name: sleep
          image: busybox:1.28
          command: ["sh", "-c", "echo Hello ! && sleep 350"]
      restartPolicy: Never
  backoffLimit: 4

manno · 2022-06-14

Hm, okay. Best effort is fine with me.

However, I was looking at the helm code:

fleet/pkg/helmdeployer/deployer.go

Lines 328 to 333 in 5a141a6

	u.Timeout = timeout
	u.DryRun = dryRun
	u.PostRenderer = pr
	if u.Timeout > 0 {
		u.Wait = true
	}

and I think the timeout defaults to 0 seconds. The helm package then uses that timeout only to wait for hooks and interprets 0 as forever (ContextWithOptionalTimeout).

So, if you tested your hook chart without a timeout, I think it would run for 350s.

Athosone · 2022-06-14

Indeed, I see what you mean: with --timeout=0, Helm runs until the installation completes. It looks like Fleet implicitly sets 0 instead of Helm's default of 300 seconds, so the installation will wait indefinitely if the user does not specify a timeout.
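To make the discussion above concrete, here is a minimal Go sketch of how a timeout could be weighed against the time left in the window, including the log message suggested earlier in this thread. It is an assumption about one possible shape, not the code from this PR: `checkTimeout` and `minutes` are hypothetical names, and a zero timeout is treated as "reserve nothing", which matches the best-effort approach described here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// checkTimeout returns an error when a deployment should not start:
// either the window is already closed, or the configured timeout is
// longer than what remains of the window.
// A zero timeout reserves no time, so it only requires an open window.
func checkTimeout(remaining, timeout time.Duration) error {
	if remaining <= 0 {
		return errors.New("schedule window is closed")
	}
	if timeout > remaining {
		return fmt.Errorf("job will not be executed, because its timeout (%s) is too large for the schedule window (%s remaining)", timeout, remaining)
	}
	return nil
}

// minutes is a small helper for the example below.
func minutes(n int) time.Duration {
	return time.Duration(n) * time.Minute
}

func main() {
	fmt.Println(checkTimeout(minutes(30), minutes(5))) // fits in the window
	fmt.Println(checkTimeout(minutes(3), minutes(5)))  // timeout exceeds the remaining time
	fmt.Println(checkTimeout(minutes(3), 0))           // no timeout set: nothing is reserved
}
```

Note that this only decides whether to start. As discussed, with a timeout of 0 the installation itself may still run past the end of the window.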

CiraciNicolo · 2024-05-02T11:45:52Z

Any news on this?

Athosone · 2024-05-02T18:08:12Z

We could rebase it and try to push it again if the Fleet team is still interested in the contribution.

Athosone and others added 3 commits March 8, 2022 10:31

Update API and generate CRDs

d0b711d

Co-authored-by: Christian Artin <gravufo@gmail.com>

Implement scheduled rollout

1693e0d

Co-authored-by: Christian Artin <gravufo@gmail.com>

Added documentation

d652282

Co-authored-by: Christian Artin <gravufo@gmail.com>

Fix CI issue

86d28e7

Co-authored-by: Christian Artin <gravufo@gmail.com>

Athosone force-pushed the scheduled-rollout branch from f659912 to 86d28e7 Compare March 8, 2022 20:09

Removed cast following update of the cron library

3272424

SheilaghM requested review from prachidamle and ibrokethecloud March 23, 2022 21:05

SheilaghM added the area/fleet label Mar 23, 2022

SheilaghM added the priority/2 label Mar 23, 2022

gravufo mentioned this pull request May 6, 2022

5/6 Engineering Office Hours rancher/rancher#37540

Closed

luthermonson added team/area3 [zube]: To Triage labels May 6, 2022

manno reviewed Jun 14, 2022

manno mentioned this pull request Jun 17, 2022

Document timeout Option for Bundle #801

Open

zube bot removed the team/area3 label Jul 5, 2022

zube bot added the team/fleet label Jul 26, 2022

kkaempf added kind/enhancement area/scheduler labels Dec 6, 2022

kkaempf mentioned this pull request Dec 7, 2022

[Sticky] Sprint Goals 🏑 #1162

Open

3 tasks

kkaempf added the status/later label Dec 7, 2022

kkaempf removed the priority/2 label May 25, 2023

ibrokethecloud removed their request for review June 7, 2023 23:08

manno removed the [zube]: To Triage label Apr 3, 2024

kkaempf removed team/fleet labels Apr 10, 2024


