Zeebe can backup its data to an external storage without downtime and restore from it #9606

Closed
53 of 58 tasks
deepthidevaki opened this issue Jun 24, 2022 · 3 comments
Labels
kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues kind/feature Categorizes an issue or PR as a feature, i.e. new behavior

Comments

@deepthidevaki
Contributor

deepthidevaki commented Jun 24, 2022

As a user (operator), I can take a backup of a Zeebe cluster to an external storage and restore from a backup.

We will implement ZEP: Take backup of zeebe cluster without downtime.

Goal:

  • Zeebe will expose an API by which an operator can take and manage backups.
  • Zeebe will provide a way to restore from a given backup.

Additional info

Task breakdown

To discuss and decide

  • Decide gateway endpoint and API
  • Decide which type of storage to support for backup

StreamProcessor

Snapshotting

Backup management

Communication

Broker

Gateway

Restore

Test

  • Add integration test for backup and restore #10387
  • Write tests for different concurrency scenarios
    • Simulate different concurrency scenarios to take backup. Verify that after restore the inconsistent states due to deployment or message correlation do not exist.
  • Test failure scenarios
    • Test scenarios where a leader change/restart happens while taking backup.

Documentation

  • Docs for pause/resume exporting operation
  • Docs for backup APIs

Metrics

Out of scope

  • Providing a client for the backup api
  • Supporting multiple types of backup storage
@deepthidevaki deepthidevaki added kind/feature Categorizes an issue or PR as a feature, i.e. new behavior kind/epic Categorizes an issue as an umbrella issue (e.g. OKR) which references other, smaller issues team/distributed labels Jun 24, 2022
@deepthidevaki deepthidevaki self-assigned this Jun 30, 2022
@deepthidevaki
Contributor Author

deepthidevaki commented Sep 22, 2022

As we are nearing the code freeze for the 8.1.0 release, we have left some issues for the next quarter. The following issues and improvements will not make it into 8.1.0, but we should tackle them sooner rather than later.

  • If the given ID refers to a backup that is known to have failed, the take backup command should return an error immediately, so the user knows to retry with a higher ID. To allow this, we may have to let the backup manager, rather than the StreamProcessor, send the response to the take backup request. This would also let us add more information to the take backup response, such as the backup status if the backup already exists.
  • Add more details to the existing take backup and get status API responses. The OpenAPI spec describes our desired API.
  • Implement api to list all backups
  • Implement delete backup api in gateway #10209
  • Change restore process (to be discussed).
    • Add api to restore and monitor progress of restore in gateway
    • Users can restore the cluster from a specific backup as follows:
      1. Start a Zeebe cluster with a clean state.
      2. List all completed backups and choose which backup to restore from.
      3. Send POST restore/backupId to the gateway.
      4. Monitor the progress of the restore: failed | inprogress | completed.
  • Make PartitionStatus actuator available in gateway. This will be useful to monitor status of pause/resume exporting
  • Allow retaining more than one snapshot #10206
  • Add Grafana panels for the backup metrics (feat: add backup metrics to grafana dashboard #10521)
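
To make two of the items above concrete (rejecting a take backup request for an ID that is known to have failed, and the proposed restore flow), here is a minimal in-memory sketch. It is not Zeebe code and does not reflect the actual gateway API, which is still to be discussed: `FakeGateway`, `BackupRejected`, `restore_latest`, and all method names are hypothetical, and a backup is assumed to complete instantly.

```python
from dataclasses import dataclass, field
from enum import Enum


class BackupStatus(Enum):
    IN_PROGRESS = "inprogress"
    COMPLETED = "completed"
    FAILED = "failed"


class BackupRejected(Exception):
    """Raised when a take-backup request cannot be accepted."""


@dataclass
class FakeGateway:
    """In-memory stand-in for the gateway; every method name is hypothetical."""

    backups: dict[int, BackupStatus] = field(default_factory=dict)
    restore_polls_left: int = 2  # polls that report "inprogress" before finishing

    def take_backup(self, backup_id: int) -> BackupStatus:
        known = self.backups.get(backup_id)
        if known is BackupStatus.FAILED:
            # Reject right away so the caller knows to retry with a higher ID.
            raise BackupRejected(
                f"backup {backup_id} already failed; "
                f"retry with an ID > {max(self.backups)}"
            )
        if known is not None:
            return known  # report the status of the existing backup
        # Assumption for this sketch: the backup succeeds instantly.
        self.backups[backup_id] = BackupStatus.COMPLETED
        return BackupStatus.COMPLETED

    def list_completed_backups(self) -> list[int]:
        return sorted(
            i for i, s in self.backups.items() if s is BackupStatus.COMPLETED
        )

    def start_restore(self, backup_id: int) -> None:
        # Stands in for "POST restore/backupId".
        if self.backups.get(backup_id) is not BackupStatus.COMPLETED:
            raise ValueError(f"backup {backup_id} is not a completed backup")

    def restore_status(self) -> str:
        if self.restore_polls_left > 0:
            self.restore_polls_left -= 1
            return "inprogress"
        return "completed"


def restore_latest(gateway: FakeGateway, max_polls: int = 10) -> tuple[int, str]:
    """Pick the newest completed backup, start a restore, poll until it ends."""
    completed = gateway.list_completed_backups()
    if not completed:
        raise ValueError("no completed backup to restore from")
    backup_id = completed[-1]
    gateway.start_restore(backup_id)
    for _ in range(max_polls):
        status = gateway.restore_status()
        if status != "inprogress":
            return backup_id, status
    return backup_id, "inprogress"
```

With backups `{1: COMPLETED, 2: FAILED}`, calling `take_backup(2)` raises `BackupRejected`, while `take_backup(3)` succeeds and `restore_latest` then restores from backup 3. Choosing the newest completed backup is only one possible policy; an operator may well want to pick a specific ID instead.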

@oleschoenburg
Member

oleschoenburg commented Sep 29, 2022

Future improvements to the S3 backup store implementation:

  • Transparent backup compression
  • Verify checksums to ensure that backups are not modified or corrupted
  • Restructure object layout for more efficient listing of backups
  • Ensure compatibility with advanced S3 features such as object locking and versioning
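
As a rough illustration of the first two points (transparent compression and checksum verification), the following sketch compresses a backup file before upload and verifies its content after download. It is not the S3 backup store's implementation; gzip and SHA-256 are assumptions chosen for the example, and the function names are hypothetical.

```python
import gzip
import hashlib


def pack_backup_file(data: bytes) -> tuple[bytes, str]:
    """Compress a backup file; return (compressed bytes, SHA-256 of the original)."""
    checksum = hashlib.sha256(data).hexdigest()
    return gzip.compress(data), checksum


def unpack_backup_file(compressed: bytes, expected_checksum: str) -> bytes:
    """Decompress a stored object and refuse it if the content has changed."""
    data = gzip.decompress(compressed)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_checksum:
        raise ValueError(
            f"checksum mismatch: expected {expected_checksum}, got {actual}"
        )
    return data
```

The checksum is computed over the uncompressed content, so it stays valid even if the compression format changes later. It would have to be stored alongside the object, for example in the backup's metadata.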

@deepthidevaki
Contributor Author

Closing this. A new issue will be created to keep track of the pending tasks and improvements.


3 participants