Reject deployment distribute command if deployment already distributed #9912

Open
korthout opened this issue Jul 28, 2022 · 5 comments
Labels
area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) component/engine component/zeebe Related to the Zeebe component/team kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog

Comments

@korthout
Member

korthout commented Jul 28, 2022

Description

This is a follow-up on

This bug was patched with:

However, we'd like to implement a more comprehensive solution, by rejecting the Deployment DISTRIBUTE command if the deployment was already distributed to that partition. This would allow us to:

  • reap more benefits from the consistency check by avoiding the use of upsert
  • clarify why it's being rejected (ALREADY_EXISTS)

A special case exists where the partition's knowledge of that distributed deployment differs from what is received in this command. This would be an unexpected situation, and we need to decide how to deal with it. Some potential ideas are:

  • consider it a corrupted data scenario: mark the partition as dead
  • try to remedy the data: overwrite the existing deployment with the new data

My preference would be the safest option: marking the partition as dead due to corrupted data.
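
To make the two options concrete, here is a minimal sketch. Every name in it (`MismatchHandlingSketch`, `MismatchPolicy`, `Outcome`, the in-memory map, and the byte-array payload) is hypothetical and made up for illustration; none of it is the actual engine code or state API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in the engine.
final class MismatchHandlingSketch {

  enum MismatchPolicy { MARK_DEAD, OVERWRITE }

  enum Outcome { STORED, REJECTED_ALREADY_EXISTS, PARTITION_MARKED_DEAD, OVERWRITTEN }

  // Stand-in for the partition's state: deployment key -> stored deployment payload.
  private final Map<Long, byte[]> deployments = new HashMap<>();
  private final MismatchPolicy policy;
  private boolean dead = false;

  MismatchHandlingSketch(final MismatchPolicy policy) {
    this.policy = policy;
  }

  Outcome onDistribute(final long deploymentKey, final byte[] payload) {
    final byte[] existing = deployments.get(deploymentKey);
    if (existing == null) {
      // First time this partition sees the deployment: store it.
      deployments.put(deploymentKey, payload.clone());
      return Outcome.STORED;
    }
    if (Arrays.equals(existing, payload)) {
      // Same key, same content: a plain duplicate, so reject it.
      return Outcome.REJECTED_ALREADY_EXISTS;
    }
    // Special case: same key, different content.
    if (policy == MismatchPolicy.MARK_DEAD) {
      // Option 1: treat it as corrupted data and stop processing.
      dead = true;
      return Outcome.PARTITION_MARKED_DEAD;
    }
    // Option 2: try to remedy the data by overwriting it.
    deployments.put(deploymentKey, payload.clone());
    return Outcome.OVERWRITTEN;
  }

  boolean isDead() {
    return dead;
  }
}
```

With `MARK_DEAD`, a mismatching payload leaves the stored data untouched and flags the partition; with `OVERWRITE`, the stored payload is replaced and processing continues.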

@korthout korthout added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. scope/broker Marks an issue or PR to appear in the broker section of the changelog area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) labels Jul 28, 2022
@menski
Contributor

menski commented Aug 1, 2022

@korthout and @saig0 will discuss what would be our preferred solution

@npepinpe
Member

npepinpe commented Aug 3, 2022

From the ZDT point of view, I'd like to take the opportunity to bring back the idea of flow control for inter-partition communication. We could go through the command API, or reuse similar logic, as that's currently the only way an overloaded node could tell another node to stop sending it so many requests (as occurred during the incident).

@korthout
Member Author

korthout commented Aug 9, 2022

@saig0 and I discussed it. We think rejecting the command on the partition is valuable. However, we don't have strong guarantees on inter-partition communication at this time, so it doesn't make sense to spend time on the special case. A simple lookup (is there an entry for the key?) is enough; there's no need to check the value.
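
Roughly, the key-only check could look like the sketch below. The class, the enum, and the in-memory set are hypothetical stand-ins for illustration, not the engine's actual processor or state classes; only the `ALREADY_EXISTS` rejection reason comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only: not the engine's processor or state API.
final class DistributeCommandSketch {

  enum Result { DISTRIBUTED, REJECTED_ALREADY_EXISTS }

  // Stand-in for the partition's state: keys of deployments already distributed here.
  private final Set<Long> distributedDeploymentKeys = new HashSet<>();

  Result onDistribute(final long deploymentKey) {
    // Key-only lookup: the stored deployment content is deliberately not compared.
    if (distributedDeploymentKeys.contains(deploymentKey)) {
      return Result.REJECTED_ALREADY_EXISTS;
    }
    // Plain insert instead of upsert; the check above guarantees the key is new.
    distributedDeploymentKeys.add(deploymentKey);
    return Result.DISTRIBUTED;
  }
}
```

A second DISTRIBUTE command for the same key is rejected regardless of its content, which keeps the check cheap and leaves the mismatch case out of scope.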

@npepinpe the flow control of inter-partition communication is a separate issue. There's a chance we'll encounter it with #9946, but it might make sense to create a new issue for it if we want to discuss it in more detail.

@npepinpe
Member

npepinpe commented Aug 9, 2022

We discussed it in our team meeting and I will write an issue for that anyway. I just haven't gotten around to doing it 😅

@Zelldon
Member

Zelldon commented Jan 3, 2023

@npepinpe did you create that issue? If so, please link it here :)


Never mind, found it: #10087

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024