Add the ability to automate and schedule backups #553
base: main
Conversation
VitessBackupSchedule: add the ability to automate backups
Rest LGTM
---
apiVersion: scheduling.k8s.io/v1
description: The vitess-operator control plane.
globalDefault: false
kind: PriorityClass
metadata:
  name: vitess-operator-control-plane
value: 5000
---
apiVersion: scheduling.k8s.io/v1
description: Vitess components (vttablet, vtgate, vtctld, etcd)
globalDefault: false
kind: PriorityClass
metadata:
  name: vitess
value: 1000
I don't think we want to delete these?
We are not removing this, just moving it earlier in the file. I am using the raw output of kustomize, which I think is what we should do to avoid conflicts in the future.
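For instance, regenerating the file from the kustomize output rather than editing it by hand would look roughly like this (the paths below are hypothetical, not taken from this PR):

```sh
# Regenerate the combined manifest from the kustomize sources (hypothetical paths).
kustomize build deploy/ > deploy/operator.yaml
```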
---
apiVersion: scheduling.k8s.io/v1
description: The vitess-operator control plane.
globalDefault: false
kind: PriorityClass
metadata:
  name: vitess-operator-control-plane
value: 5000
---
apiVersion: scheduling.k8s.io/v1
description: Vitess components (vttablet, vtgate, vtctld, etcd)
globalDefault: false
kind: PriorityClass
metadata:
  name: vitess
value: 1000
Same for this
Same as #553 (comment)
var watchResources = []client.Object{
	&kbatch.Job{},
}
This variable is only used once, and that in a for loop, even though it only has one value. Why don't we unfurl that value and use it directly?
I have just re-used the same pattern we use throughout the codebase. In other places, even if there is a single element, we use it this way; it makes it easy to understand all the resources created by a given controller when reading the top of the file.
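For context, the pattern usually looks something like this in the controller setup (a minimal sketch assuming the pre-0.15 controller-runtime watch API; the exact wiring in this PR may differ):

```go
// Sketch of how watchResources is typically consumed when the controller is registered.
c, err := controller.New("vitessbackupschedule-controller", mgr, controller.Options{Reconciler: r})
if err != nil {
	return err
}

// Watch for changes to the primary resource.
if err := c.Watch(&source.Kind{Type: &planetscalev2.VitessBackupSchedule{}}, &handler.EnqueueRequestForObject{}); err != nil {
	return err
}

// Watch every secondary resource listed at the top of the file, enqueueing the owning schedule.
for _, resource := range watchResources {
	err := c.Watch(&source.Kind{Type: resource}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &planetscalev2.VitessBackupSchedule{},
	})
	if err != nil {
		return err
	}
}
```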
	return job, nil
}

func (r *ReconcileVitessBackupsSchedule) createJobPod(ctx context.Context, vbsc *planetscalev2.VitessBackupSchedule, name string) (pod corev1.PodSpec, err error) {
have we taken the vtctld action timeout into account?
I do not think so, what are you referring to?
vtctldclient has a default command timeout of 1 hour, which is controlled by the --action-timeout flag. We will almost certainly want to increase that.
or make it a flag, like extraFlags, on the backup schedule
I think this is solved as people can set --action-timeout by adding it to the extraFlags field of VitessBackupScheduleStrategy. The flags set in extraFlags will be passed down to vtctldclient.
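For illustration, a strategy that overrides the timeout might look roughly like this (the --action-timeout flag is real, but the exact extraFlags key format shown here is an assumption):

```yaml
strategies:
  - name: BackupShard
    keyspace: "commerce"
    shard: "-"
    extraFlags:
      # Passed down to vtctldclient; overrides its default 1h command timeout.
      action-timeout: "4h"
```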
might be worth adding a note about that in the release notes. i expect it will be a common issue people run into.
fixed via 9739a94
	// This field will be ignored if we have picked the strategy BackupShard.
	// +optional
	// +kubebuilder:example="zone1-0000000102"
	TabletAlias string `json:"tabletAlias"`
i kinda question the value of a cron-scheduled Backup:
- it's not uncommon to delete a PVC/Pod because it's in bad shape, and having to update backup schedules anytime that happens will be annoying.
- an image/config rollout may switch a tablet from replica to primary, which means the next time a backup schedule runs it will take the primary offline.
i would say it's not worth adding this strategy with these two potential footguns, unless it is a hotly requested feature.
That's a good point. I think it is safe to drop this strategy, especially if I add what you are mentioning in #553 (comment). Having two strategies, BackupShard and BackupCluster, would give enough flexibility for most use cases. I am thinking we can even add a third BackupKeyspace strategy, which is a good in-between solution between BackupShard and BackupCluster.
Sounds good. BackupShard is a higher level request that can be intuitively and correctly executed over time as the cluster state changes.
See comment #553 (comment), which explains the new strategies.
	// This field will be ignored if we have picked the strategy BackupTablet.
	// +optional
	// +kubebuilder:example="commerce/-"
	KeyspaceShard string `json:"keyspaceShard,omitempty"`
If a shard only has 1 replica, then the BackupShard strategy will mean all @replica queries will fail. It may be worth having an option like MinHealthyReplicas for that strategy.
The @replica queries will fail because the tablet will not be able to serve queries while being backed up? In that case, yeah, it would make sense to add such a parameter for the user to configure, with a default of 2 for instance.
Good point. At least when not using online backup methods like xtrabackup.
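If such an option were added, it might be expressed along these lines (a purely hypothetical field from the discussion above, not part of this PR):

```yaml
strategies:
  - name: BackupShard
    keyspace: "commerce"
    shard: "-"
    # Hypothetical knob: skip the backup if taking a replica offline would leave
    # fewer than this many healthy replicas serving @replica queries.
    minHealthyReplicas: 2
```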
This is looking good! I only had some relatively minor comments. I'll come back and do another pass once you've made whatever changes you like based on the comments from myself and Max.
Thanks!
Strategy defines how we are going to take a backup. If you want to take several backups within the same schedule you can add more items to the Strategy list. Each VitessBackupScheduleStrategy will be executed by the same Kubernetes Job. This is useful if, for instance, you have one schedule and you want to take a backup of all shards in a keyspace and don't want to re-create a second schedule. All the VitessBackupScheduleStrategy are concatenated into a single shell command that is executed when the Job's container starts.
I'm not sure what this means / looks like. An example somewhere would be nice IMO.
There is already an example in the CRD:
// +kubebuilder:example="-"
// +kubebuilder:example="commerce"
In YAML it looks like this, which is what people can find in the examples/tests of the repo:
strategies:
  - name: BackupShard
    keyspace: "commerce"
    shard: "-"
another thought: might be nice to give users a way to assign annotations, and one or more affinity selection options, to the backup runner pods. that way they can influence scheduling and eviction.
for example, users might not want backup runner pods running on the same nodes as vttablet pods, and they might not want the backup runner pods to get evicted by an unrelated pod after they've been running for a long time.
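As a rough illustration of that suggestion, the schedule could expose pod metadata and affinity knobs roughly like this (the annotations/affinity field names and the vttablet label selector shown here are assumptions, not the final API):

```yaml
schedules:
  - name: "commerce"
    schedule: "0 3 * * *"
    # Hypothetical pass-through fields applied to the backup runner pods.
    annotations:
      backup-owner: "dba-team"
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  planetscale.com/component: vttablet
```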
In commit bc74ab4, I have applied one of the most important suggestions discussed above, which is to remove the BackupTablet strategy and add the two new strategies, BackupKeyspace and BackupCluster. They can be used like this:

# BackupKeyspace
strategies:
  - name: BackupKeyspace
    cluster: "example"
    keyspace: "customer"

# BackupCluster
strategies:
  - name: BackupCluster
    cluster: "example"

Meanwhile, the Jobs created for these strategies end up with the following Args:

# BackupKeyspace
Args:
  /bin/sh
  -c
  /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-

# BackupCluster
Args:
  /bin/sh
  -c
  /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard commerce/- && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-
	// Cluster defines on which cluster you want to take the backup.
	// This field is mandatory regardless of the chosen strategy.
	Cluster string `json:"cluster"`
i'm not sure i follow why this is necessary. my mental model is that a user defines []VitessBackupScheduleTemplate on the ClusterBackupSpec, and so implicitly each VitessBackupScheduleStrategy will be associated with the cluster where ClusterBackupSpec is defined.
That's a good point @maxenglander, it is pretty useless. I ended up removing that field from VitessBackupScheduleStrategy and adding it to VitessBackupScheduleSpec. The VitessCluster controller will come and fill that new field when it creates a new VitessBackupSchedule object; that way VitessBackupSchedule is still able to select existing components given their cluster names, to avoid fetching wrong data in the event where we have multiple VitessClusters running in our K8S cluster.
See b30aa09 for the change.
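A rough sketch of the resulting shape, based on the description above (the surrounding struct contents are assumed, not copied from the PR):

```go
// VitessBackupScheduleSpec is the spec of the standalone VitessBackupSchedule object.
type VitessBackupScheduleSpec struct {
	// VitessBackupScheduleTemplate holds the user-provided part copied from the VitessCluster spec.
	VitessBackupScheduleTemplate `json:",inline"`

	// Cluster is filled in by the VitessCluster controller when it creates the
	// VitessBackupSchedule object, so the schedule controller can select components
	// belonging to the right cluster when several VitessClusters share a namespace.
	Cluster string `json:"cluster"`
}
```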
In e6946fb I have added affinity and annotations.
@@ -40,6 +40,8 @@ const (
 	TabletPoolNameLabel = LabelPrefix + "/" + "pool-name"
 	// TabletIndexLabel is the key for identifying the index of a Vitess tablet within its pool.
 	TabletIndexLabel = LabelPrefix + "/" + "tablet-index"
+	// BackupScheduleLabel is the key for identifying to which VitessBackupSchedule a Job belongs.
+	BackupScheduleLabel = LabelPrefix + "/" + "backup-schedule"
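For example, the label makes it straightforward for the schedule controller to list its own Jobs (a minimal sketch; the package aliases and exact call site are assumptions):

```go
// List all Jobs created by this VitessBackupSchedule using the new label.
jobs := &kbatch.JobList{}
listOpts := []client.ListOption{
	client.InNamespace(vbsc.Namespace),
	client.MatchingLabels{planetscalev2.BackupScheduleLabel: vbsc.Name},
}
if err := r.client.List(ctx, jobs, listOpts...); err != nil {
	return err
}
```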
Could this also be accomplished with an owner reference?
That might be a little neater, since if you delete a backup schedule, you'd probably also want to delete any currently running pods.
The VitessBackupSchedule is already the owner of its own jobs, which makes it an owner of the pods that are created by the jobs. When deleting a backup schedule, all the associated pods and jobs are also removed.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
example-vbsc-every-minute-94d12263-1717100280-9cscg 0/1 ContainerCreating 0 27s
...
$ kubectl get vitessbackupschedule
NAME AGE
example-vbsc-every-minute-94d12263 90s
$ kubectl delete vitessbackupschedule example-vbsc-every-minute-94d12263
vitessbackupschedule.planetscale.com "example-vbsc-every-minute-94d12263" deleted
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
example-vbsc-every-minute-94d12263-1717100280-9cscg 1/1 Terminating 1 (31s ago) 67s
...
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
example-90089e05-vitessbackupstorage-subcontroller 1/1 Running 0 2m30s
example-commerce-x-x-vtbackup-init-c6db73c9 0/1 Error 2 (26s ago) 2m26s
example-commerce-x-x-zone1-vtorc-c13ef6ff-5855667dbc-d5dcl 1/1 Running 1 (73s ago) 2m27s
example-etcd-faf13de3-1 1/1 Running 0 2m30s
example-etcd-faf13de3-2 1/1 Running 0 2m30s
example-etcd-faf13de3-3 1/1 Running 0 2m30s
example-vttablet-zone1-0790125915-4e37d9d5 3/3 Running 0 2m27s
example-vttablet-zone1-2469782763-bfadd780 3/3 Running 0 2m27s
example-vttablet-zone1-2548885007-46a852d0 3/3 Running 0 2m27s
example-zone1-vtctld-1d4dcad0-646b8c9c77-vl69s 1/1 Running 1 (73s ago) 2m29s
example-zone1-vtgate-bc6cde92-569bbcf4df-lslph 1/1 Running 1 (72s ago) 2m27s
vitess-operator-6f87b84f88-6wjlt 1/1 Running 0 2m33s
Could this also be accomplished with an owner reference?
I am not 100% sure of what you mean by that.
cool, this is what i meant by owner reference 👍
The VitessBackupSchedule is already the owner of its own jobs,
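For reference, that ownership chain is typically established when the Job object is built, along these lines (a sketch assuming controller-runtime's controllerutil helper; the PR may wire it differently):

```go
// controllerutil is "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil".
// Make the VitessBackupSchedule the controlling owner of each Job it creates, so
// deleting the schedule garbage-collects the Jobs (and, transitively, their Pods).
if err := controllerutil.SetControllerReference(vbsc, job, r.scheme); err != nil {
	return nil, err
}
```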
		return err
	}
	if jobStartTime.Add(time.Minute * time.Duration(timeout)).Before(time.Now()) {
		if err := r.client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
seems like a good thing to have a metric for
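A minimal sketch of what such a metric could look like with the Prometheus client (the metric name, label, and registration path are illustrative, not from this PR):

```go
package vitessbackupschedule

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// timedOutBackupJobs counts Jobs deleted because they exceeded the allowed runtime.
var timedOutBackupJobs = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "vitess_backup_schedule_timed_out_jobs_total",
		Help: "Number of scheduled backup Jobs deleted for exceeding their timeout.",
	},
	[]string{"schedule"},
)

func init() {
	// Expose the counter on the operator's /metrics endpoint.
	metrics.Registry.MustRegister(timedOutBackupJobs)
}

// ...and increment it right after the Delete call above:
// timedOutBackupJobs.WithLabelValues(vbsc.Name).Inc()
```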
			if shardIndex > 0 || ksIndex > 0 {
				cmd.WriteString(" && ")
			}
			createVtctldClientCommand(&cmd, vtctldclientServerArg, strategy.ExtraFlags, ks.name, shard)
am i reading this right that it will be taking a backup of each keyspace and shard in sequence? that doesn't seem ideal to me because if each shard takes an hour to back up, and there are 32 shards, then the backup of the first shard and the last shard will be more than a day apart.
i think it would be better if there were at least the option of BackupCluster and BackupKeyspace to back up all keyspaces and shards in parallel.
might be better to limit this PR to only support BackupShard for now, and add support for the other options after more consideration into how to implement BackupKeyspace and BackupCluster.
Let's do that, remove those two strategies as part of this PR and I will work on a subsequent PR to add them back with a better approach. This PR is getting lengthy already.
Fixed via 70ba063
IMO BackupAllShardsInKeyspace and BackupAllShardsInCluster are better names. It may seem nitty, but I think it's important as it reflects what it actually is: independent backups of the shards. i.e. it is NOT a single consistent backup of the keyspace or cluster at any physical or logical point in time.
I ended up removing Keyspace and Cluster strategies in this PR as it will require a bigger refactoring. I am keeping that in mind for when we add them though.
Description

This Pull Request adds a new CRD called VitessBackupSchedule. Its main goal is to automate and schedule backups of Vitess, taking backups of the Vitess cluster at regular intervals based on a given cron schedule and Strategy. This new CRD is managed by the VitessCluster: like most other components of the vitess-operator, the VitessCluster controller is responsible for the whole lifecycle (creation, update, deletion) of the VitessBackupSchedule object in the cluster. Inside the VitessCluster it is possible to define several VitessBackupSchedules as a list, allowing for multiple concurrent backup schedules.

Among other things, the VitessBackupSchedule object is responsible for creating Kubernetes Jobs at the desired time, based on the user-defined schedule. It also keeps track of older jobs and deletes them if they are too old, according to the user-defined parameters (successfulJobsHistoryLimit & failedJobsHistoryLimit). The jobs created by the VitessBackupSchedule object use the vtctld Docker image and execute a shell command that is generated based on the user-defined strategies. The end user can define as many backup strategies per schedule as they want; each of them mirrors what vtctldclient is able to do. The Backup and BackupShard commands are available, and a map of extra flags enables the user to pass as many flags as they want to vtctldclient.

A new end-to-end test is added to our BuildKite pipeline as part of this Pull Request to test the proper behavior of this new CRD.
Related PRs

- operator.yaml and add schedule backup example: vitessio/vitess#15969

Demonstration
For this demonstration I have set up a Vitess cluster by following the steps in the getting started guide, until the very last step where we must apply the 306_down_shard_0.yaml file. My cluster is then composed of 2 keyspaces: customer with 2 shards, and commerce unsharded. I then modify the 306... yaml file to contain the new backup schedule, as seen in the snippet right below. We want to create two schedules, one for each keyspace. The keyspace customer will have two backup strategies: one for each shard.

Once the cluster is stable and all tablets are serving and ready, I re-apply my yaml file with the backup configuration:
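A sketch of what that backup block could look like, reusing the strategy fields shown earlier in this thread (the placement under the cluster's backup section, the schedule names, and the cron expressions are assumptions for illustration):

```yaml
backup:
  schedules:
    - name: "commerce"
      schedule: "*/2 * * * *"
      strategies:
        - name: BackupShard
          keyspace: "commerce"
          shard: "-"
    - name: "customer"
      schedule: "*/2 * * * *"
      strategies:
        - name: BackupShard
          keyspace: "customer"
          shard: "-80"
        - name: BackupShard
          keyspace: "customer"
          shard: "80-"
```

The modified file is then re-applied with kubectl apply -f 306_down_shard_0.yaml.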
Immediately I can check that the new VitessBackupSchedule objects have been created.
VitessBackupSchedule
are running. After about 2 minutes, we can see four pods, two for each schedule. The pods are marked asCompleted
as they finished their job.Now let's check our backup:
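One way to do that, assuming the operator registers a VitessBackup object for each completed backup (the backup storage subcontroller shown in the pod listing above manages those), or alternatively by asking vtctldclient directly:

```sh
# List the backups the operator knows about (assumes the VitessBackup CRD is used).
kubectl get vitessbackup

# Or query vtctld for a specific shard's backups.
/vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 GetBackups commerce/-
```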