[R&D] Improve the recovery of NSM clients #1565

Open

denis-tingaikin opened this issue Dec 11, 2023 · 3 comments

@denis-tingaikin
Member

denis-tingaikin commented Dec 11, 2023

We have done a lot of work in the past to improve NSM recovery. Recently we found a few use cases that conceptually do not work with the existing code. I also noticed that the heal chain element could be simplified.

Problem scenarios

Scenario 1:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the forwarders on both nodes during a refresh.
    Actual: no connection is established after the forwarders are re-deployed.

Scenario 2:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill any forwarder and any NSMgr during the refresh.
    Actual: no connection is established after re-deployment.

Scenario 3:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the NSE on the backward path of a refresh.
    Actual: no connection is established after re-deployment.

Scenario 4:

  1. Create a cluster with two nodes.
  2. Deploy NSM.
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc.
  6. Kill the registry during a refresh.
    Actual: no connection is established after re-deployment.
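
The four scenarios above share the same shape, so a single test illustrates them. Below is a rough sketch of Scenario 1 as a sandbox test, modeled on the heal tests in this repo. The builder, NewClient/NewEndpoint/NewNSRegistryClient helpers, and GenerateTestToken come from pkg/tools/sandbox; the exact calls for killing and re-deploying a forwarder (the Forwarder[0].Cancel()/Restart() lines) are assumptions and may differ from the current sandbox API.

```go
// Reproduction sketch for Scenario 1 (the other scenarios differ only in
// which component is killed in step 6).
package heal_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"github.com/networkservicemesh/api/pkg/api/registry"

	"github.com/networkservicemesh/sdk/pkg/tools/sandbox"
)

func TestForwarderDeathDuringRefresh(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Steps 1-2: a two-node cluster with NSM deployed.
	domain := sandbox.NewBuilder(ctx, t).SetNodesCount(2).Build()

	// Step 5: register the network service.
	nsRegistryClient := domain.NewNSRegistryClient(ctx, sandbox.GenerateTestToken)
	_, err := nsRegistryClient.Register(ctx, &registry.NetworkService{Name: "ns"})
	require.NoError(t, err)

	// Step 4: NSE on node 2.
	domain.Nodes[1].NewEndpoint(ctx, &registry.NetworkServiceEndpoint{
		Name:                "nse",
		NetworkServiceNames: []string{"ns"},
	}, sandbox.GenerateTestToken)

	// Step 3: NSC on node 1 (no datapath-healing options configured).
	nsc := domain.Nodes[0].NewClient(ctx, sandbox.GenerateTestToken)
	conn, err := nsc.Request(ctx, &networkservice.NetworkServiceRequest{
		Connection: &networkservice.Connection{NetworkService: "ns"},
	})
	require.NoError(t, err)

	// Step 6: kill the forwarders on both nodes, then re-deploy them.
	// (Assumed sandbox calls; adjust to the actual forwarder entry API.)
	domain.Nodes[0].Forwarder[0].Cancel()
	domain.Nodes[1].Forwarder[0].Cancel()
	domain.Nodes[0].Forwarder[0].Restart()
	domain.Nodes[1].Forwarder[0].Restart()

	// Expected (currently failing): the connection is re-established.
	_, err = nsc.Request(ctx, &networkservice.NetworkServiceRequest{Connection: conn.Clone()})
	require.NoError(t, err)
}
```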

Solution

Rework a few chain elements and reorganize the code.

Changes

  1. heal chain element (see the sketch after this list):
    1.1. Simply starts the DP/CP monitoring goroutines.
    1.2. If something goes wrong with CP monitoring, a refresh request should be scheduled.
    1.3. If something goes wrong with DP monitoring, a refresh request with reselection should be scheduled.
  2. retry chain element:
    2.1. Should not be a wrapper.
    2.2. Should use begin for retries.
  3. begin should be able to cancel the current request if a reselect request is called.
  4. The RESELECT_REQUESTED state should be removed.
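
A minimal sketch of what the reworked heal element could look like per items 1.1-1.3. Only next.Client() is existing sdk API; the monitor and schedule fields are hypothetical stand-ins for the real monitoring code and for begin-based rescheduling.

```go
// Sketch only: monitorCtrlPlane/monitorDataPlane/scheduleRefresh are
// hypothetical stand-ins, not existing sdk functions.
package heal

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"

	"github.com/networkservicemesh/sdk/pkg/networkservice/core/next"
)

type healClient struct {
	chainCtx         context.Context
	monitorCtrlPlane func(ctx context.Context, conn *networkservice.Connection) error
	monitorDataPlane func(ctx context.Context, conn *networkservice.Connection) error
	// scheduleRefresh re-issues the request via begin; reselect=true asks
	// for a refresh with reselection.
	scheduleRefresh func(conn *networkservice.Connection, reselect bool)
}

func (h *healClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, opts ...grpc.CallOption) (*networkservice.Connection, error) {
	conn, err := next.Client(ctx).Request(ctx, request, opts...)
	if err != nil {
		return nil, err
	}
	// 1.1. heal simply starts the CP/DP monitoring goroutines.
	go func() {
		// 1.2. CP monitoring failed: schedule a plain refresh request.
		if err := h.monitorCtrlPlane(h.chainCtx, conn); err != nil {
			h.scheduleRefresh(conn, false)
		}
	}()
	go func() {
		// 1.3. DP monitoring failed: schedule a refresh with reselection.
		if err := h.monitorDataPlane(h.chainCtx, conn); err != nil {
			h.scheduleRefresh(conn, true)
		}
	}()
	return conn, nil
}
```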
@denis-tingaikin denis-tingaikin changed the title Improve the recovery of NSM clients [WIP] Improve the recovery of NSM clients Dec 16, 2023
@denis-tingaikin denis-tingaikin self-assigned this Dec 18, 2023
@glazychev-art
Contributor

glazychev-art commented Jan 11, 2024

Another problem scenario:
Scenario 5:

  1. Create a cluster.
  2. Deploy NSM.
  3. Deploy NSC and NSE.
  4. Deploy NetSvc.
  5. Kill the NSE after cmd-nsc-init has finished but before cmd-nsc has finished.
    Actual: the connection is not restored by cmd-nsc.

Description:
After a connection is established by the init container (cmd-nsc-init), it remains unattended for some time until cmd-nsc sends a request.
During this short window, any event that disrupts the connection can occur.
Monitoring and restoring such connections is apparently the responsibility of the heal chain element.

@denis-tingaikin denis-tingaikin changed the title [WIP] Improve the recovery of NSM clients Improve the recovery of NSM clients Jan 24, 2024
@glazychev-art
Contributor

We found that begin.Request() and eventFactory.Request() generally use very similar logic.
Over time the amount of code grows in both places, and we end up maintaining two versions.
The idea is to call eventFactory.Request() from begin.Request(), rather than using eventFactory.executor directly (see the sketch at the end of this comment).

Most likely, this can also help with reworking the retry functionality - we can drive it from the eventFactory (see Changes step 2 in the Description).

Question:
May we call Close() for the same connection multiple times?
Or do we consider Close() only as an addition to Request() (that is, the begin chain element skips Close() if the connection is not ready)?
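
A minimal sketch of the idea: begin.Request() goes through eventFactory.Request(), so both entry points share one code path. The beginClient fields and lookups are hypothetical, and the factory is assumed to store the latest request, since EventFactory.Request() takes no request argument.

```go
// Sketch only: loadOrCreate and currentConnection are hypothetical helpers.
package begin

import (
	"context"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
	"google.golang.org/grpc"
)

type Option func(*option)

type option struct{}

// EventFactory matches the interface discussed in this comment.
type EventFactory interface {
	Request(opts ...Option) <-chan error
	Close(opts ...Option) <-chan error
}

type beginClient struct {
	loadOrCreate      func(id string) EventFactory               // hypothetical per-connection lookup
	currentConnection func(id string) *networkservice.Connection // hypothetical: state held by the factory
}

func (b *beginClient) Request(ctx context.Context, request *networkservice.NetworkServiceRequest, _ ...grpc.CallOption) (*networkservice.Connection, error) {
	id := request.GetConnection().GetId()
	eventFactory := b.loadOrCreate(id)
	// Single entry point: the same call that heal/retry would use, instead
	// of driving eventFactory.executor directly.
	if err := <-eventFactory.Request(); err != nil {
		return nil, err
	}
	return b.currentConnection(id), nil
}
```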

@denis-tingaikin denis-tingaikin changed the title Improve the recovery of NSM clients [R&D] Improve the recovery of NSM clients Feb 19, 2024
@glazychev-art
Contributor

retry problems

I've considered adding retry after the begin chain element.
And I found a few problems:

  1. retry calls next.Request() on error
    Problem case:
    [diagram: retry-problem1]
    If a refresh fails, retry keeps retrying until success or context timeout, so other eventFactory calls will be blocked by the executor. What if we need to heal the connection in the meantime?
  2. retry calls eventFactory on error
    Let's imagine that the first client Request failed.
    [diagram: retry-Page-3]
    There are two problems:
  • we need to pass the initial Request somehow to the eventFactory (it doesn't have a networkservice.NetworkServiceRequest argument):
    Request(opts ...Option) <-chan error
  • if we schedule the retry after an error, what should we return to the client?

Possible solution

In my opinion, the problem here is that we are calling eventFactory.executor from two different places.

I would suggest:

  • using the executor in only one place, inside the eventFactory, so that begin only uses the eventFactory (without accessing its private fields) (see also #1565 (comment))
  • as a next step, adding retry as an option to the eventFactory, so that retries are managed from one place (see the sketch below)
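
A minimal sketch of the second suggestion, assuming a hypothetical WithRetry option applied when the eventFactory is created; none of these names exist in the sdk today.

```go
// Sketch only: WithRetry and the eventFactory fields are hypothetical.
package begin

import (
	"context"
	"time"
)

type Option func(*eventFactory)

// WithRetry (hypothetical) makes every scheduled Request retry on error.
func WithRetry(interval time.Duration, maxAttempts int) Option {
	return func(f *eventFactory) {
		f.retryInterval = interval
		f.maxAttempts = maxAttempts
	}
}

type eventFactory struct {
	ctx           context.Context
	retryInterval time.Duration
	maxAttempts   int
	request       func(ctx context.Context) error // the actual chained Request call
}

// Request retries from this single place instead of a separate retry chain
// element, so begin, heal, and clients all get the same retry behavior.
// (Scheduling on the factory's executor is elided for brevity.)
func (f *eventFactory) Request(opts ...Option) <-chan error {
	errCh := make(chan error, 1)
	go func() {
		defer close(errCh)
		var err error
		for attempt := 1; ; attempt++ {
			if err = f.request(f.ctx); err == nil {
				return
			}
			if attempt >= f.maxAttempts {
				break
			}
			select {
			case <-f.ctx.Done():
				errCh <- f.ctx.Err()
				return
			case <-time.After(f.retryInterval):
			}
		}
		errCh <- err
	}()
	return errCh
}
```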
