-
Notifications
You must be signed in to change notification settings - Fork 556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Pause exporting endpoint in gateway #9630
Comments
@deepthidevaki One concern I have with this is that there are edge cases where new brokers join the cluster or brokers lose data & restart. In those cases the pause might have succeeded but a new broker will not know about this and continue exporting. I'm wondering if we need a new way to persist the pause state that survives data loss. |
Those are valid points. But let's keep it simple now. New brokers joining is currently not a supported feature. We could also treat data loss as a special case and ignore for it now. Users can always add additional steps to cancel backup process while any of the above cases occurs. Also I'm hoping that we can get rid of the need to pause eventually and be able to take backups without pausing. So it may not be worth the effort to make it perfect. We could always revisit the approach if it is absolutely required. |
There are two cases that are problematic:
I currently see no way to prevent the second case. |
Case 1 must be handled. We can decide what is the semantics of failure. The simplest would be to leave the cluster in a partial state - some brokers have paused the exporting, while others not and let users deal with it. If we want to make it "atomic", it would require more work or a different approach. I would argue that we can ignore case 2 (data loss) now. |
I sort of disagree here. For the user (either human or some backup service) it's probably difficult to tell whether data loss has happened. Although I suppose we could query the exported position after pause and before resuming. If the positions have changed, the backup can be considered invalid. |
The data loss shouldn't be treated like other failures like node-restart or network partition. We can guarantee recovery only in some cases. For example:- any number of nodes can be restarted and the cluster will recover without user intervention. The worse thing that can happen is system is not available for a short time. But, we can guarantee the cluster can recover from data loss only if data loss happens in < quorum replicas. So far this must be ensure by the user and will be following a special process (eg;- as part of maintenance replacing hardware). In that sense the "user" must be aware of when a data loss happens. Even if this process is automated, it wouldn't be difficult to add necessary action to abort/retry ongoing backups. |
That said, if you have some idea how we can handle data loss scenario, we could discuss it. But as I said, I'm not sure if it is worth the effort.
|
First thing that came to mind was to move the pause flag to |
This would be lost after the data loss as well, no? |
Yes, although eventually it'd become part of the snapshot & thus be available again after replication. But since it's not part of the log, it's not enough 😞. We had this idea floating around where exporters are also For now I'll follow your opinion and simply leave everything as is - data loss must be detected & handled by the user. |
I've found one more issue that doesn't involve data loss: Not all brokers are members for all partitions. After a restart, the set of partitions that a broker is a member of can change. That means that we need to pause processing "globally", on a per broker basis and can't re-use the existing pause flag which can only be set for partitions that the broker is a member of. |
When would this happen? Isn't the configuration static? |
Oh, you are right! I thought it might change dynamically but |
With that sorted, I think I have a plan how to implement this:
We could then extend the actuator to make retrying more efficient:
|
Sounds good 👍
What would be in the response to the user? |
I thought we could start with just success/failure via the HTTP status code, let's say 200 OK for success and 503 or 504 for failure. As an extension:
I haven't thought about a json schema for those yet. |
9910: feat: broker requests can address specific brokers r=oleschoenburg a=oleschoenburg ## Description Broker requests can optionally provide a broker id to indicate that this request should be sent to a specific broker, regardless of whether this broker is leader for the partition or not. ## Related issues <!-- Which issues are closed by this PR or are related --> relates to #9630 9911: Await until no more marked-for-deletion segments are present r=npepinpe a=npepinpe ## Description This PR allows for a slight delay for segment deletion to occur. The test checks that we close all readers by waiting for segments to be truly deleted; if a reader is left opened, then a "*-deleted" file will be present. Since deletion is asynchronous, we should be a bit lenient and wait for a bit of time. I also took the opportunity to simplify a bit the verification of the test, improving the error message on failure I think. ## Related issues closes #9685 9913: deps(.github): bump slackapi/slack-github-action from 1.19.0 to 1.21.0 r=oleschoenburg a=dependabot[bot] Bumps [slackapi/slack-github-action](https://github.com/slackapi/slack-github-action) from 1.19.0 to 1.21.0. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/slackapi/slack-github-action/releases">slackapi/slack-github-action's releases</a>.</em></p> <blockquote> <h2>Slack Send V1.21.0</h2> <h2>What's Changed</h2> <ul> <li>updated to 1.21.0, fixed update-ts by <a href="https://github.com/stevengill"><code>`@stevengill</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/110">slackapi/slack-github-action#110</a></li> </ul> <p><strong>Full Changelog</strong>: <a href="https://github.com/slackapi/slack-github-action/compare/v1.20.0...v1.21.0">https://github.com/slackapi/slack-github-action/compare/v1.20.0...v1.21.0</a></p> <h2>Slack Send v1.20.0</h2> <h2>What's Changed</h2> <ul> <li>Bump <code>`@actions/github</code>` from 5.0.1 to 5.0.3 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/94">slackapi/slack-github-action#94</a></li> <li>Bump <code>`@vercel/ncc</code>` from 0.33.4 to 0.34.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/98">slackapi/slack-github-action#98</a></li> <li>Bump eslint-plugin-jsdoc from 39.2.9 to 39.3.2 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/96">slackapi/slack-github-action#96</a></li> <li>Bump eslint from 8.14.0 to 8.16.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/99">slackapi/slack-github-action#99</a></li> <li>Bump <code>`@actions/core</code>` from 1.7.0 to 1.8.2 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/100">slackapi/slack-github-action#100</a></li> <li>Bump mocha from 9.2.2 to 10.0.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/97">slackapi/slack-github-action#97</a></li> <li>Bump sinon from 13.0.2 to 14.0.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/95">slackapi/slack-github-action#95</a></li> <li>README.md: Fix typo by <a href="https://github.com/cclauss"><code>`@cclauss</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/102">slackapi/slack-github-action#102</a></li> <li>Bump <code>`@actions/core</code>` from 1.8.2 to 1.9.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/105">slackapi/slack-github-action#105</a></li> <li>Bump eslint from 8.16.0 to 8.18.0 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/106">slackapi/slack-github-action#106</a></li> <li>Bump eslint-plugin-jsdoc from 39.3.2 to 39.3.3 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/107">slackapi/slack-github-action#107</a></li> <li>Bump <code>`@slack/web-api</code>` from 6.7.1 to 6.7.2 by <a href="https://github.com/dependabot"><code>`@dependabot</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/108">slackapi/slack-github-action#108</a></li> <li>Add support to update messages <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/75">#75</a> by <a href="https://github.com/kuboon"><code>`@kuboon</code></a>` in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/109">slackapi/slack-github-action#109</a></li> </ul> <h2>New Contributors</h2> <ul> <li><a href="https://github.com/cclauss"><code>`@cclauss</code></a>` made their first contribution in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/102">slackapi/slack-github-action#102</a></li> <li><a href="https://github.com/kuboon"><code>`@kuboon</code></a>` made their first contribution in <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/pull/109">slackapi/slack-github-action#109</a></li> </ul> <p><strong>Full Changelog</strong>: <a href="https://github.com/slackapi/slack-github-action/compare/v1.19.0...v1.20.0">https://github.com/slackapi/slack-github-action/compare/v1.19.0...v1.20.0</a></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/slackapi/slack-github-action/commit/936158bbe252e9a6062e793ea4609642c966e302"><code>936158b</code></a> Automatic compilation</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/c1472d49d7e64ec13e48b5c157571fff3f06998e"><code>c1472d4</code></a> updated to 1.21.0, fixed update-ts (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/110">#110</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/3a354251e498618617a633aa3f9ab290ec9af847"><code>3a35425</code></a> updated to version 1.20.0</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/18a00ad4b3af3ebddb1643a887416468c908eb42"><code>18a00ad</code></a> Add support to update messages <a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/75">#75</a> (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/109">#109</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/05390e63d05f0c6bf1df4e61025724c46e785a66"><code>05390e6</code></a> Bump <code>`@slack/web-api</code>` from 6.7.1 to 6.7.2 (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/108">#108</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/e3dd3aedd3fd10ac3702a4a049787022b9d38d05"><code>e3dd3ae</code></a> Bump eslint-plugin-jsdoc from 39.3.2 to 39.3.3 (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/107">#107</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/8ff81266566f71b66f982dedefc9a295601e94e6"><code>8ff8126</code></a> Bump eslint from 8.16.0 to 8.18.0 (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/106">#106</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/5253b6aa4bc935036a31c7efcd03656a4f4eb731"><code>5253b6a</code></a> Bump <code>`@actions/core</code>` from 1.8.2 to 1.9.0 (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/105">#105</a>)</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/54ab0a40d2dce881a507b3044df99291e75ff013"><code>54ab0a4</code></a> README.md: Fix typo</li> <li><a href="https://github.com/slackapi/slack-github-action/commit/0f34a3c3a952c279307c91ae7c9c3e181a1eb3a1"><code>0f34a3c</code></a> Bump sinon from 13.0.2 to 14.0.0 (<a href="https://github-redirect.dependabot.com/slackapi/slack-github-action/issues/95">#95</a>)</li> <li>Additional commits viewable in <a href="https://github.com/slackapi/slack-github-action/compare/v1.19.0...v1.21.0">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=slackapi/slack-github-action&package-manager=github_actions&previous-version=1.19.0&new-version=1.21.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> 9914: deps(go): bump google.golang.org/protobuf from 1.28.0 to 1.28.1 in /clients/go r=oleschoenburg a=dependabot[bot] Bumps [google.golang.org/protobuf](https://github.com/protocolbuffers/protobuf-go) from 1.28.0 to 1.28.1. <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/6875c3d7242d1a3db910ce8a504f124cb840c23a"><code>6875c3d</code></a> all: release v1.28.1</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/881da6ece0696714305aee39afcac17b6abe8e00"><code>881da6e</code></a> all: Add prebuild binaries for arm64</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/2a74a0e82391d1fc344067bc6bfc4647cd8ae7d9"><code>2a74a0e</code></a> A+C: delete AUTHORS and CONTRIBUTORS</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/de9682ad1656cd1ee6808865f7a42f5a5d79eb73"><code>de9682a</code></a> internal/impl: improve MessageInfo.New performance</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/b0a944684d78a6b7c1f2bb6864fa910385d5b97a"><code>b0a9446</code></a> all: reformat with go1.19 gofmt</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/c1bbc5d45b2b8a056496c3db183b24e7f9a22ff1"><code>c1bbc5d</code></a> all: make integration test work on darwin/arm64</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/5f429f7042ff6743d1cd695281802b8c9b8d9395"><code>5f429f7</code></a> proto: fix compilation failure in tests</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/fc44d00d5a99973a6deb0863d6aaa7446c016066"><code>fc44d00</code></a> proto: use reflect.Ptr for backward compatibility</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/380c339ec12310c04cd1c399669ab0bce57b6cd6"><code>380c339</code></a> proto: short-circuit Equal when inputs are identical</li> <li><a href="https://github.com/protocolbuffers/protobuf-go/commit/784c4825545540dc41a1dc85715d3251903bc8ce"><code>784c482</code></a> all: remove shorthand import aliases</li> <li>Additional commits viewable in <a href="https://github.com/protocolbuffers/protobuf-go/compare/v1.28.0...v1.28.1">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=google.golang.org/protobuf&package-manager=go_modules&previous-version=1.28.0&new-version=1.28.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting ``@dependabot` rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - ``@dependabot` rebase` will rebase this PR - ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it - ``@dependabot` merge` will merge this PR after your CI passes on it - ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it - ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging - ``@dependabot` reopen` will reopen this PR if it is closed - ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Nicolas Pepin-Perreault <nicolas.pepin-perreault@camunda.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
9926: New `AdminRequest` type to pause exporting r=oleschoenburg a=oleschoenburg ## Description Adds a new request type, `PAUSE_EXPORTING` and adds an optional broker id for all admin requests. ## Related issues <!-- Which issues are closed by this PR or are related --> relates to #9630 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
9945: Broker can handle admin request to pause exporting r=oleschoenburg a=oleschoenburg ## Description This started as a small PR to just add support for `PAUSE_EXPORTING` in `AdminApiRequestHandler`. However, there were a couple of changes necessary to implement this properly: * Single partition admin access. `AdminAccess`'s scope can now be narrowed to a specific partition. * `AdminApiService` is now started after the partition manager has started. This allows for the api service to immediately have access to `AdminAccess`. * Api request handlers can be asynchronous. I've also included a somewhat large refactoring of the existing admin api tests. ## Related issues <!-- Which issues are closed by this PR or are related --> relates to #9630 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
9945: Broker can handle admin request to pause exporting r=oleschoenburg a=oleschoenburg ## Description This started as a small PR to just add support for `PAUSE_EXPORTING` in `AdminApiRequestHandler`. However, there were a couple of changes necessary to implement this properly: * Single partition admin access. `AdminAccess`'s scope can now be narrowed to a specific partition. * `AdminApiService` is now started after the partition manager has started. This allows for the api service to immediately have access to `AdminAccess`. * Api request handlers can be asynchronous. I've also included a somewhat large refactoring of the existing admin api tests. ## Related issues <!-- Which issues are closed by this PR or are related --> relates to #9630 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
I'm shelving this until we triaged #9996, maybe writing the actual actuator would be easier that way 🙏 |
10409: feat(backup): add metrics for checkpoints r=deepthidevaki a=deepthidevaki ## Description Added following metrics: - number of checkpoint records processed per partition. Can be filtered by `result = created` or `result = ignored`. - last checkpoint id and last checkpoint position per partition. It is updated in both leader and followers. PS:- This PR does not add panels to grafana dashboard. ## Related issues closes #10385 10410: feat: gateway endpoint to pause exporting r=oleschoenburg a=oleschoenburg Adds an actuator `/exporting` that supports a `pause` operation. On `POST /actuator/exporting/pause`, the current topology is checked for completeness (all brokers available, all partitions have the expected number of members). If this fails, exporting is not paused and the request returns status code 500. If the topology is complete, the actuator sends pause requests to all `(broker, partition)` pairs. Only if all requests succeed we return status code 204, otherwise 500. This also fixes a bug where pausing exporting would fail if no exporters are configured. closes #9630 10415: fix(ci): use bash shell for maven run r=megglos a=megglos ## Description Ensures a fail fast behavior as this includes [github setting](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#exit-codes-and-error-action-preference) `set -eo pipefail` which makes sure failures in a pipe chain are not lost and any maven test failure causes the build to fail. ## Related issues closes #10260 Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com> Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
We must pause exporting while taking backup. See doc .
Currently pause exporting is exposed in broker . But we would like to expose this in the gateway so that there is a single point of entry for all cluster management operations. See RebalancingService as reference.
Goal:
Decide the request and response format before working on this issue.
The text was updated successfully, but these errors were encountered: