clustermesh: fix rare panic due to race condition on stop #32513

giorio94 · 2024-05-13T14:06:29Z

The clustermesh logic is currently affected by a rare race condition that can occur if the cluster configuration is being retrieved while the connection to the remote cluster is being stopped. Stopping the connection stops two controllers: the one handling the connection to the remote cluster and the one responsible for retrieving the cluster config. As a result, the getRemoteCluster function may return before the second controller has terminated, which in turn leads to a panic due to a send on a closed channel. Let's fix this issue by explicitly removing only the first controller and letting the other terminate normally once the parent context has been canceled. This ensures that the controller has always terminated before the cfgch channel is closed.

Fixes: #32179

Fix rare race condition afflicting clustermesh when disconnecting from a remote cluster, possibly causing the agent to panic

The clustermesh logic is currently affected by a rare race condition that can occur if the cluster configuration is being retrieved while the connection to the remote cluster is being stopped. Stopping the connection stops two controllers: the one handling the connection to the remote cluster and the one responsible for retrieving the cluster config. As a result, the getRemoteCluster function may return before the second controller has terminated, which in turn leads to a panic due to a send on a closed channel. Let's fix this issue by explicitly removing only the first controller and letting the other terminate normally once the parent context has been canceled. This ensures that the controller has always terminated before the cfgch channel is closed.

Fixes: cilium#32179
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2024-05-13T15:34:38Z

/test

giorio94 requested a review from a team as a code owner May 13, 2024 14:06

giorio94 requested a review from YutaroHayakawa May 13, 2024 14:06

YutaroHayakawa approved these changes May 16, 2024

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 16, 2024

julianwiedmann added this pull request to the merge queue May 16, 2024

Merged via the queue into cilium:main with commit 104a302 May 16, 2024
66 checks passed

YutaroHayakawa mentioned this pull request May 23, 2024

v1.15 Backports 2024-05-24 #32691

Merged

15 tasks

YutaroHayakawa added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels May 23, 2024

YutaroHayakawa mentioned this pull request May 24, 2024

v1.14 Backports 2024-05-24 #32695

Merged

12 tasks

YutaroHayakawa added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels May 24, 2024
