[R&D] Improve the recovery of NSM clients #1565

Open

denis-tingaikin opened this issue Dec 11, 2023 · 3 comments

@denis-tingaikin
Member

denis-tingaikin commented Dec 11, 2023

We have done a lot of work in the past to improve NSM recovery. Recently we found a few use cases that conceptually do not work with the existing code. I also noticed that the heal chain element could be simplified.

Problem scenarios

Scenario 1:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the forwarders on both nodes during a refresh.
    Actual: no connection is established after the forwarders are re-deployed.

Scenario 2:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill any forwarder and any NSMgr during the refresh.
    Actual: no connection is established after re-deployment.

Scenario 3:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the NSE on the backward path of a refresh.
    Actual: no connection is established after re-deployment.

Scenario 4:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the registry during a refresh.
    Actual: no connection is established after re-deployment.
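
The four scenarios above share the same shape, so a single test illustrates them. Below is a rough sketch of Scenario 1 as a sandbox test, modeled on the heal tests in this repo. The builder, NewClient/NewEndpoint/NewNSRegistryClient helpers, and GenerateTestToken come from pkg/tools/sandbox; the exact calls for killing and re-deploying a forwarder (the Forwarder[0].Cancel()/Restart() lines) are assumptions and may differ from the current sandbox API.

```go
// Reproduction sketch for Scenario 1 (the other scenarios differ only in
// which component is killed in step 6).
package heal_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"github.com/networkservicemesh/api/pkg/api/registry"

	"github.com/networkservicemesh/sdk/pkg/tools/sandbox"
)

func TestForwarderDeathDuringRefresh(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Steps 1-2: a two-node cluster with NSM deployed.
	domain := sandbox.NewBuilder(ctx, t).SetNodesCount(2).Build()

	// Step 5: register the network service.
	nsRegistryClient := domain.NewNSRegistryClient(ctx, sandbox.GenerateTestToken)
	_, err := nsRegistryClient.Register(ctx, &registry.NetworkService{Name: "ns"})
	require.NoError(t, err)

	// Step 4: NSE on node 2.
	domain.Nodes[1].NewEndpoint(ctx, &registry.NetworkServiceEndpoint{
		Name:                "nse",
		NetworkServiceNames: []string{"ns"},
	}, sandbox.GenerateTestToken)

	// Step 3: NSC on node 1 (no datapath-healing options configured).
	nsc := domain.Nodes[0].NewClient(ctx, sandbox.GenerateTestToken)
	conn, err := nsc.Request(ctx, &networkservice.NetworkServiceRequest{
		Connection: &networkservice.Connection{NetworkService: "ns"},
	})
	require.NoError(t, err)

	// Step 6: kill the forwarders on both nodes, then re-deploy them.
	// (Assumed sandbox calls; adjust to the actual forwarder entry API.)
	domain.Nodes[0].Forwarder[0].Cancel()
	domain.Nodes[1].Forwarder[0].Cancel()
	domain.Nodes[0].Forwarder[0].Restart()
	domain.Nodes[1].Forwarder[0].Restart()

	// Expected (currently failing): the connection is re-established.
	_, err = nsc.Request(ctx, &networkservice.NetworkServiceRequest{Connection: conn.Clone()})
	require.NoError(t, err)
}
```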

Solution

Rework a few chain elements and reorganize the code.

Changes

  1. heal chain element (see the sketch after this list):
    1.1. Simply starts the DP/CP monitoring goroutines.
    1.2. If something goes wrong with CP monitoring, a refresh request should be scheduled.
    1.3. If something goes wrong with DP monitoring, a refresh request with reselection should be scheduled.
  2. retry chain element:
    2.1. Should not be a wrapper.
    2.2. Should use begin for retries.
  3. begin should be able to cancel the current request if a reselect request is called.
  4. The RESELECT_REQUESTED state should be removed.
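
A minimal sketch of what the reworked heal element could look like per items 1.1-1.3. Only next.Client() is existing sdk API; the monitor and schedule fields are hypothetical stand-ins for the real monitoring code and for begin-based rescheduling.

```go
// Sketch only: monitorCtrlPlane/monitorDataPlane/scheduleRefresh are
// hypothetical stand-ins, not existing sdk functions.
package heal

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"

	"github.com/networkservicemesh/sdk/pkg/networkservice/core/next"
)

type healClient struct {
	chainCtx         context.Context
	monitorCtrlPlane func(ctx context.Context, conn *networkservice.Connection) error
	monitorDataPlane func(ctx context.Context, conn *networkservice.Connection) error
	// scheduleRefresh re-issues the request via begin; reselect=true asks
	// for a refresh with reselection.
	scheduleRefresh func(conn *networkservice.Connection, reselect bool)
}

func (h *healClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, opts ...grpc.CallOption) (*networkservice.Connection, error) {
	conn, err := next.Client(ctx).Request(ctx, request, opts...)
	if err != nil {
		return nil, err
	}
	// 1.1. heal simply starts the CP/DP monitoring goroutines.
	go func() {
		// 1.2. CP monitoring failed: schedule a plain refresh request.
		if err := h.monitorCtrlPlane(h.chainCtx, conn); err != nil {
			h.scheduleRefresh(conn, false)
		}
	}()
	go func() {
		// 1.3. DP monitoring failed: schedule a refresh with reselection.
		if err := h.monitorDataPlane(h.chainCtx, conn); err != nil {
			h.scheduleRefresh(conn, true)
		}
	}()
	return conn, nil
}
```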
@denis-tingaikin denis-tingaikin changed the title Improve the recovery of NSM clients [WIP] Improve the recovery of NSM clients Dec 16, 2023
@denis-tingaikin denis-tingaikin self-assigned this Dec 18, 2023
@glazychev-art
Contributor

glazychev-art commented Jan 11, 2024

Another problem scenario:
Scenario 5:

  1. Create a cluster.
  2. Deploy NSM.
  3. Deploy NSC and NSE.
  4. Deploy NetSvc.
  5. Kill the NSE after cmd-nsc-init has finished but before cmd-nsc has finished.
    Actual: the connection is not restored by cmd-nsc.

Description:
After a connection is established by the init container (cmd-nsc-init), it remains unattended for some time until cmd-nsc sends a request.
During this short window, any event that disrupts the connection can occur.
Monitoring and restoring such connections is apparently the responsibility of the heal chain element.

@denis-tingaikin denis-tingaikin changed the title [WIP] Improve the recovery of NSM clients Improve the recovery of NSM clients Jan 24, 2024
@glazychev-art
Contributor

We found that begin.Request() and eventFactory.Request() generally use very similar logic.
Over time the amount of code grows in both places, and we end up maintaining two versions.
The idea is to call eventFactory.Request() from begin.Request(), rather than using eventFactory.executor directly (see the sketch at the end of this comment).

Most likely, this can also help with reworking the retry functionality - we can drive it from the eventFactory (see Changes step 2 in the Description).

Question:
May we call Close() for the same connection multiple times?
Or do we consider Close() only as an addition to Request() (that is, the begin chain element skips Close() if the connection is not ready)?
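
A minimal sketch of the idea: begin.Request() goes through eventFactory.Request(), so both entry points share one code path. The beginClient fields and lookups are hypothetical, and the factory is assumed to store the latest request, since EventFactory.Request() takes no request argument.

```go
// Sketch only: loadOrCreate and currentConnection are hypothetical helpers.
package begin

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"
)

type Option func(*option)

type option struct{}

// EventFactory matches the interface discussed in this comment.
type EventFactory interface {
	Request(opts ...Option) <-chan error
	Close(opts ...Option) <-chan error
}

type beginClient struct {
	loadOrCreate      func(id string) EventFactory               // hypothetical per-connection lookup
	currentConnection func(id string) *networkservice.Connection // hypothetical: state held by the factory
}

func (b *beginClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, _ ...grpc.CallOption) (*networkservice.Connection, error) {
	id := request.GetConnection().GetId()
	eventFactory := b.loadOrCreate(id)
	// Single entry point: the same call that heal/retry would use, instead
	// of driving eventFactory.executor directly.
	if err := <-eventFactory.Request(); err != nil {
		return nil, err
	}
	return b.currentConnection(id), nil
}
```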

@denis-tingaikin denis-tingaikin changed the title Improve the recovery of NSM clients [R&D] Improve the recovery of NSM clients Feb 19, 2024
@glazychev-art
Contributor

retry problems

I've considered adding retry after the begin chain element.
And I found a few problems:

  1. retry calls next.Request() on error
    Problem case:
    [diagram: retry-problem1]
    If a refresh fails, retry keeps retrying until success or context timeout, so other eventFactory calls will be blocked by the executor. What if we need to heal the connection in the meantime?
  2. retry calls eventFactory on error
    Let's imagine that the first client Request failed.
    [diagram: retry-Page-3]
    There are two problems:
  • we need to pass the initial Request somehow to the eventFactory (it doesn't have a networkservice.NetworkServiceRequest argument):
    Request(opts ...Option) <-chan error
  • if we schedule the retry after an error, what should we return to the client?

Possible solution

In my opinion, the problem here is that we are calling eventFactory.executor from two different places.

I would suggest:

  • using the executor in only one place, inside the eventFactory, so that begin only uses the eventFactory (without accessing its private fields) (see also #1565 (comment))
  • as a next step, adding retry as an option to the eventFactory, so that retries are managed from one place (see the sketch below)
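
A minimal sketch of the second suggestion, assuming a hypothetical WithRetry option applied when the eventFactory is created; none of these names exist in the sdk today.

```go
// Sketch only: WithRetry and the eventFactory fields are hypothetical.
package begin

import (
	"context"
	"time"
)

type Option func(*eventFactory)

// WithRetry (hypothetical) makes every scheduled Request retry on error.
func WithRetry(interval time.Duration, maxAttempts int) Option {
	return func(f *eventFactory) {
		f.retryInterval = interval
		f.maxAttempts = maxAttempts
	}
}

type eventFactory struct {
	ctx           context.Context
	retryInterval time.Duration
	maxAttempts   int
	request       func(ctx context.Context) error // the actual chained Request call
}

// Request retries from this single place instead of a separate retry chain
// element, so begin, heal, and clients all get the same retry behavior.
// (Scheduling on the factory's executor is elided for brevity.)
func (f *eventFactory) Request(opts ...Option) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		defer close(errCh)
		var err error
		for attempt := 1; ; attempt++ {
			if err = f.request(f.ctx); err == nil {
				return
			}
			if attempt >= f.maxAttempts {
				break
			}
			select {
			case <-f.ctx.Done():
				errCh <- f.ctx.Err()
				return
			case <-time.After(f.retryInterval):
			}
		}
		errCh <- err
	}()
	return errCh
}
```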
