Job of cancelled instance can be activated if an error was thrown on it #8588
Comments
For those who may be curious about the test script but don't want to download it, here is the original Ruby test script:

```ruby
require "zeebe/client"

ZEEBE_ADDRESS = "localhost:26500"
CREDENTIALS = :this_channel_is_insecure
TEST_BPMN_PATH = File.join(__dir__, "one_task.bpmn")
TEST_BPMN_PROCESS_ID = "one_task"
TEST_TASK_TYPE = "do_something"
JOB_TIMEOUT_MS = 1000
REQUEST_TIMEOUT_MS = 10000
LOOPS = 50
BUG_THRESHOLD = 5

include ::Zeebe::Client::GatewayProtocol

gateway = Gateway::Stub.new(ZEEBE_ADDRESS, CREDENTIALS)

# deploy our test process:
puts "deploying process definition from #{TEST_BPMN_PATH}"
deploy_response = gateway.deploy_process(DeployProcessRequest.new(
  processes: [
    ProcessRequestObject.new(
      name: File.basename(TEST_BPMN_PATH),
      definition: File.read(TEST_BPMN_PATH)
    )
  ]
))
process_definition_key =
  deploy_response.processes.find { |meta| meta["bpmnProcessId"] == TEST_BPMN_PROCESS_ID }["processDefinitionKey"]
puts "deployed process `#{TEST_BPMN_PROCESS_ID}` with key #{process_definition_key}"

# to reproduce the bug, we create some process instances, activate their jobs, throw errors on
# those jobs, then cancel the instances and start over:
instances_created = []
instances_cancelled = []
bug_count = 0

LOOPS.times do
  break if bug_count >= BUG_THRESHOLD # break the loop early if the bug has occurred 5 or more times

  # create a process instance:
  create_instance_response = gateway.create_process_instance(
    CreateProcessInstanceRequest.new(bpmnProcessId: TEST_BPMN_PROCESS_ID, version: -1)
  )
  instance_key = create_instance_response["processInstanceKey"]
  puts "created instance #{instance_key}"
  instances_created << instance_key

  # activate some jobs, from one or more process instances:
  activate_jobs_stream = gateway.activate_jobs(
    ActivateJobsRequest.new(
      type: TEST_TASK_TYPE,
      worker: "#{__FILE__}-#{Process.pid}",
      maxJobsToActivate: LOOPS,
      timeout: JOB_TIMEOUT_MS,
      requestTimeout: REQUEST_TIMEOUT_MS,
    )
  )
  activate_jobs_stream.each do |activated_job_response|
    activated_job_response.jobs.each do |activated_job|
      job_key = activated_job["key"]
      job_instance_key = activated_job["processInstanceKey"]
      # check the process instance key of each job to see whether it belongs to
      # a process instance that we know was previously cancelled:
      if instances_cancelled.include?(job_instance_key)
        puts "[!!! BUG !!!] activated job #{job_key} for cancelled process instance #{job_instance_key} [!!! BUG !!!]"
        bug_count += 1
      elsif instances_created.include?(job_instance_key)
        puts "activated job #{job_key} for process instance #{job_instance_key}"
        # throw an error on this job and make sure we get a successful response back from Zeebe:
        throw_error_response = gateway.throw_error(
          ThrowErrorRequest.new(
            jobKey: job_key,
            errorCode: "FOO",
            errorMessage: "foobar"
          )
        )
        if throw_error_response.is_a?(ThrowErrorResponse)
          puts "threw error on job #{job_key}"
        else
          raise "this should be impossible, the gateway should raise a GRPC error if it fails"
        end
      else
        puts "[???] activated job #{job_key} for unknown process instance #{job_instance_key} [???]"
      end
    end
  end

  # once the activate jobs call completes, cancel the current process instance, and
  # make sure we get a successful response back from Zeebe before recording that we
  # cancelled this instance:
  cancel_instance_response = gateway.cancel_process_instance(
    CancelProcessInstanceRequest.new(processInstanceKey: instance_key)
  )
  if cancel_instance_response.is_a?(CancelProcessInstanceResponse)
    puts "cancelled instance #{instance_key}"
    instances_cancelled << instance_key
  else
    raise "this should be impossible, the gateway should raise a GRPC error if it fails"
  end

  sleep 0.5
end
```
I was able to reproduce it locally, even just running the code once (without a loop). I was also able to reproduce it through zbctl:
This activated the job and returned:
I then checked the state using zdb:

Then I looked at the logs, again using zdb. This showed that, during the termination of the service task, the unhandled-error incident is resolved. That made me think about this logic, because I remember that the job key is important for the incident-resolution logic. What happens is that the job is made activatable again when the incident is resolved. However, we would expect a Cancel Job command while terminating that service task, but this is missing. We only delete the job from the log after a Job Cancelled (or Completed) event is written. Log produced using zdb:
Looking at this, I think this is a pretty old bug that became worse since 1.1.0. The Job Cancel command should always be written during the termination of a service task, but the special incident logic for jobs has already removed the reference to the job from the service task, so it doesn't have any job to cancel. In the past that just meant the job stayed in the database even though the instance was gone, but since #6000 this job is made activatable again.

Fixing the problem

Second, we'll also need to consider whether the incident should really be resolved when terminating the service task. Perhaps the incident is merely

Third, this makes me think about how much logic we have in event appliers. Why does the Incident Resolved event lead to the job's state being changed? Wouldn't it be better to write a Job Suspended event when the throw error is being processed? We shouldn't make assumptions about what to do when applying the events; instead, the events that are written should tell us what happened. Of course, this part is not necessary to resolve the issue, but I think it might improve things.
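The interaction described above can be sketched with a minimal Ruby state-machine model. This is purely illustrative (the class and method names are assumptions, not Zeebe's actual Java source): the incident-resolution applier makes the job activatable again, and because instance termination resolves the incident without a following Job Cancel command, the orphaned job can be handed to a worker.

```ruby
# Illustrative model of the buggy job lifecycle (names are hypothetical).
class JobState
  attr_reader :status

  def initialize
    @status = :activatable
  end

  def throw_error!
    # an uncaught error leads to an UNHANDLED_ERROR_EVENT incident
    @status = :failed_with_incident
  end

  def resolve_incident!
    # the event applier re-activates the job as a side effect
    @status = :activatable
  end

  def cancel!
    # terminal state: a cancelled job is a deleted job
    @status = :cancelled
  end

  def activatable?
    @status == :activatable
  end
end

job = JobState.new
job.throw_error!

# Buggy termination path: the incident is resolved, but no Job Cancel command
# follows, because the element instance no longer references the job.
job.resolve_incident!
puts job.activatable?  # => true, even though the process instance is gone
```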
@npepinpe Let's triage this. This happens for jobs on which an error was thrown that was not caught in the process and then led to an incident. If the process instance is cancelled instead of the incident being resolved, the job is made activatable again. This means a worker can pick it up and work on it while there is no process instance anymore. The situation is quite specific, and is more likely to occur during development than in production due to the unhandled error, but it can happen. I'm quite certain the regression was introduced with 1.1.0, because of #6000.
Well, considering you've already root-caused this... 😄 To be clear, what happens when it's activated again and then completed/failed? Not much, I guess? So it's mostly taking up resources for no reason, correct?
I'd need to check what happens internally, but there's not much the engine can do with it. However, the worker would have executed its work for that job, and from the perspective of orchestration this is pretty bad.
What's your estimate on the severity? I would class it as high and have us work on this soon. You're correct that the impact is quite bad if applications perform work that was cancelled. Of course, the impact depends on the application's use case, but it's critical that Zeebe remains trusted as an orchestration engine.
As you already looked into it, can you work on it? Do you think it's something we will be able to get into 1.3.1? I know you're busy next week...
I should be able to find some time for it in week 4, but the solution ideas I described above are still pretty vague. So I'm not sure it will be done in time for 1.3.1.
I might downgrade it to Mid, because it requires throwing an uncaught error in the process, which is less likely to happen in production systems than in development systems. Still, the impact can be big if it does happen. As you say: "it's critical that Zeebe remains trusted as an orchestration engine."
So we assume everyone will test their processes thoroughly before deploying to production? 😉 I get your point, but as the tooling for testing is very fresh, I don't know how likely it is, and when it does happen it's still pretty bad. I would keep it as high for now, especially since it's not clear how quickly users would realize this is happening in a production system to fix it (and Zeebe wouldn't automatically correct it). It's fine if we miss 1.3.1, it was more hopeful than anything 🙂
I updated the description with @korthout's vastly-simpler method for reproducing the issue. My original method is tucked away in a collapsible section there if anyone is looking for it.
At present this problem has a significant impact on us, and we have received many user complaints.
Hi @jerry123888, thanks for raising your concern. I've had to switch focus from this issue and it wasn't on my mind anymore. I'll assign myself to keep it at the top of my task list. This will not make the upcoming release, but I expect that we can find some time to fix this in an upcoming patch release.
This adds a regression test against activating jobs of cancelled process instances, by making sure they are cancelled (i.e. deleted). In this specific case, an uncaught error was thrown on a job which led to an UNHANDLED_ERROR_EVENT incident. When the process instance is cancelled, the job was not cancelled, but instead was made activatable again. This happened because the incident was resolved as part of the process instance termination logic. Instead, we should guarantee that the job is cancelled. A cancelled job can no longer be made re-activatable. This is a terminal state for jobs, because a cancelled job is a deleted job. See #8588 for more details.
9219: Cancel job with incident when canceling the process instance r=korthout a=korthout ## Description This fixes a bug where canceling a process instance could lead to a job becoming activatable even though the process instance is terminated. See #8588 for more details. The bug is fixed by making sure the job is immediately canceled (i.e. deleted from the state) when we terminate the related element instance. We achieve this by writing a Job Canceled event when the TERMINATE_ELEMENT command is being processed. This PR fixes the bug in a way that can be backported because it doesn't introduce too many changes. Alternatively, the bug could be fixed by tackling the main issue: applying the Incident Resolved event should not affect the state of a job. Instead, the job should be made activatable by writing a separate Job event in the Resolve Incident Processor. That would be a larger change, with more impact on the event appliers than what this PR proposes. Changes to event appliers should generally be avoided. ## Related issues <!-- Which issues are closed by this PR or are related --> closes #8588 Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
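The fix described in this PR can be sketched with a small event-log model. All names here (`JobStore`, the event symbols) are illustrative assumptions rather than Zeebe's actual classes; the point is the ordering: a Job Canceled event is written when the element instance is terminated, and since a cancelled job is deleted from the state, the later incident-resolution applier has nothing left to re-activate.

```ruby
# Illustrative event-applier model (hypothetical names, not Zeebe source).
class JobStore
  def initialize
    @jobs = {}  # job_key => state symbol
  end

  def create(key)
    @jobs[key] = :activatable
  end

  def apply(event)
    key = event[:job_key]
    case event[:type]
    when :job_error_thrown
      @jobs[key] = :failed_with_incident
    when :job_canceled
      @jobs.delete(key)  # terminal: a cancelled job is a deleted job
    when :incident_resolved
      # re-activates the job only if it still exists in the state
      @jobs[key] = :activatable if @jobs.key?(key)
    end
  end

  def activatable?(key)
    @jobs[key] == :activatable
  end
end

store = JobStore.new
store.create(1)
store.apply(type: :job_error_thrown, job_key: 1)

# Fixed termination order: cancel (delete) the job first, then resolve the
# incident. The resolution applier finds no job to re-activate.
store.apply(type: :job_canceled, job_key: 1)
store.apply(type: :incident_resolved, job_key: 1)
puts store.activatable?(1)  # => false
```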
This adds a regression test against activating jobs of cancelled process instances, by making sure they are cancelled (i.e. deleted). In this specific case, an uncaught error was thrown on a job which led to an UNHANDLED_ERROR_EVENT incident. When the process instance is cancelled, the job was not cancelled, but instead was made activatable again. This happened because the incident was resolved as part of the process instance termination logic. Instead, we should guarantee that the job is cancelled. A cancelled job can no longer be made re-activatable. This is a terminal state for jobs, because a cancelled job is a deleted job. See #8588 for more details. (cherry picked from commit d43c344)
9250: [Backport 1.3] Cancel job with incident when canceling the process instance r=korthout a=korthout ## Description <!-- Link to the PR that is back ported --> Backport of #9219 ## Related issues <!-- Link to the related issues of the origin PR --> closes #8588 Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
9251: [Backport 8.0] Cancel job with incident when canceling the process instance r=korthout a=korthout ## Description <!-- Link to the PR that is back ported --> Backport of #9219 ## Related issues <!-- Link to the related issues of the origin PR --> closes #8588 Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@rusterholz The issue is fixed. The fix will be available in the upcoming patch releases 8.0.2 and 1.3.8.
Describe the Bug
A job can still be activated (that is, Zeebe will return it as part of an ActivateJobsResponse) even after its associated process instance has been cancelled, if an error was previously thrown on that job.
To Reproduce
@korthout's method is easier than mine, especially for Camunda folks:
My original method
ZIP file: reproduce_bug.zip
1. Download the `reproduce_bug.rb` and `one_task.bpmn` files, and put them in a directory together.
2. If your Zeebe gateway is not listening at `localhost:26500`, edit the `reproduce_bug.rb` file.
3. Run `ruby reproduce_bug.rb` and observe the output.

The script will deploy the included BPMN, create an instance of that process, and activate jobs for that process instance. If the jobs that Zeebe returns are for a running process instance, it will throw an error on them. It will then cancel that process instance, loop around, and create another instance. If the jobs that it receives from Zeebe are from a process instance that was already successfully cancelled, it will write a line of output including "[!!! BUG !!!]". The script will exit after 5 occurrences of the bug or 50 loops, whichever happens first.
Expected Behavior
After Zeebe successfully cancels a process instance, no jobs from that instance should ever be returned to a future ActivateJobs request.
Actual Behavior
Jobs which previously had an error thrown on them before their process instance was cancelled are being returned to ActivateJobs requests that are placed after their process instance was cancelled.
Log/Stacktrace
My original method
Here is the output of a sample run of the `reproduce_bug.rb` script:

Environment: