Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic sidecar shall have a docker healthcheck so that placement constraint is placed only if healthy #5837

Closed
Tracked by #5826
sanderegg opened this issue May 16, 2024 · 1 comment
Assignees
Labels
a:dynamic-sidecar dynamic-sidecar service

Comments

@sanderegg
Copy link
Member

As shown in the reference below, the node constrain on the dynamic-sidecar is placed too soon when the dynamic-sidecar is not event started.

@GitHK so after some bug tracking analysis I found the following issues and we can discuss them if you wish.
To follow-up on @matusdrobuliak66 findings I will add here a few things to summarize a bit what is going on.

  1. the user starts a service
  2. the dynamic scheduler somehow starts a dynamic sidecar and since there is no healthcheck defined in the dynamic sidecar, it is instantly reported as "running" by the docker swarm which is expected but is wrong and this is where the bug comes from.
  3. since the dynamic sidecar is deemed as running, the director-v2 sets the placement constraint on the sidecar
  4. after a while the dynamic sidecar fails due to whatever (in this case accessing RabbitMQ/storage or whatever), I agree that this is ok and would work if the machine is fixed
  5. the autoscaling runs every 5 seconds or so, so depending on the user's luck the sidecar might restart during that time and the autoscaling is looking for tasks on the nodes it created, if it does not find any it will set the node to "not ready" by changing a label on the node (this change was made after thorough analysis of @YuryHrytsuk )
  6. so assuming the dynamic sidecar service's task did not restart when the autoscaling monitors the active nodes, it will set it to "not ready" and when the next dynamic sidecar task starts there it is "rejected"
  7. then the dynamic sidecar service will revert to "pending" BUT with a node.id placement constraint
  8. autoscaling will see that guy again, but will skip it because it has the placement constraint (this is per design, as services that explicitly ask for a specific node can never start, whatever the autoscaling would do), therefore the dynamic sidecar will never ever start and remain pending forever
  9. autoscaling will terminate the machine after 3 minutes unless some other service is freshly started

Here the dynamic-sidecar Dockerfile part:

# disabled healthcheck as director-v2 is already taking care of it
# in oder to have similar performance a more aggressive healethcek
# would be required.
# removing the healthchek would not cause any issues at this point
# NOTE: When adding a healthcheck
# - remove UpdateHealth - no longer required
# - remove WaitForSidecarAPI  - no longer required
# - After `get_dynamic_sidecar_placement` inside CreateSidecars call
#   (the sidecar's API will be up and running; guaranteed by the docker engine healthck).
#   Add the following line `scheduler_data.dynamic_sidecar.is_ready = True`
#   The healthcheck guarantees that the API is available
HEALTHCHECK NONE

and here the code from the director-v2

async def _get_task_data_when_service_running(service_id: str) -> Mapping[str, Any]:
        """
        Waits for dynamic-sidecar task to be `running` and returns the
        task data.
        """
        task = await _get_service_latest_task(service_id)
        service_state = task["Status"]["State"]

        if service_state not in TASK_STATES_RUNNING:
            raise TryAgain
        return task

    task = await _get_task_data_when_service_running(service_id=service_id)

These is how it looks like in portainer:
Image
Image
Image
Image

So from here I will create one issue for the dynamic-sidecar. I think it just needs a docker Healthcheck like every other service we use.

I am going to test what happens when I restart a dynamic sidecar on an autoscaled node, as this might also break

Originally posted by @sanderegg in #5826

@sanderegg sanderegg added the a:dynamic-sidecar dynamic-sidecar service label May 16, 2024
@sanderegg sanderegg added this to the Leeroy Jenkins milestone May 16, 2024
@GitHK
Copy link
Contributor

GitHK commented May 17, 2024

@sanderegg so the idea from now on will be to have the http interface up and running before continuing, right?

@GitHK GitHK closed this as completed May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:dynamic-sidecar dynamic-sidecar service
Projects
None yet
Development

No branches or pull requests

2 participants