Container Placement Strategies not working #8836

Open
ChrisJBurns opened this issue Oct 9, 2023 · 5 comments
@ChrisJBurns

Summary

We have set our container placement strategies to the following:

concourse:
    web:
        checkContainerPlacementStrategies:
        - limit-active-containers
        - fewest-build-containers
        containerPlacementStrategies:
        - limit-active-containers
        - fewest-build-containers
        noInputContainerPlacementStrategies:
        - limit-active-containers
        - fewest-build-containers

Expected results

We expect containers to be spread across the workers.

Actual results

Some workers end up with a disproportionate number of containers. This often causes our builds to fail because those workers have hit the container limit, even though other workers have plenty of room.

image

Additional context

The default max limit is 250.

Triaging info

  • Concourse version: 7.10.0
  • Browser (if applicable): Chrome

ChrisJBurns added the bug label Oct 9, 2023

@xtremerui
Contributor

Just want to check: is CONCOURSE_MAX_ACTIVE_CONTAINERS_PER_WORKER set?

@ChrisJBurns
Author

ChrisJBurns commented Oct 11, 2023

This is on the web deployment:
image

This is on the stateful set for the workers:
image

We are adding this value in the Helm Chart: concourse.web.limitActiveContainers: 200

So it doesn't seem to be honouring the max container limit, since the screenshot in the original post shows workers with over 200 containers.

@mhlic

mhlic commented Oct 31, 2023

We are also seeing this behavior: containers are not being spread out across workers, and limit-active-containers is not being respected.

@ChrisJBurns
Author

@xtremerui Any updates on the above?

@taylorsilva
Member

taylorsilva commented Apr 4, 2024

So I have a guess as to why people run into issues like this still. I think it's because FindOrSelectWorker() can be called any number of times in parallel:

func (pool Pool) FindOrSelectWorker(

This function is how each step selects a worker and is the main entry point to container placement. We can see where it's getting called:
image

So if there's no limit on how many get/put/task steps are being run at any given time, I think it's highly likely that you can end up in a situation where there are a bunch of steps trying to find workers at the same time. They all look at the current data available to them and make the same choice which then goes over the limits set by operators and your worker falls over.
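
To make the suspected failure mode concrete, here is a minimal sketch in Go. It is not Concourse code: `pickFewest`, `placeFromStaleSnapshot`, and `placeFromFreshCounts` are hypothetical names, and the model assumes a simple "fewest containers under the limit" rule. It compares steps that all decide from the same snapshot of container counts with steps that each see the counts left by the previous placement.

```go
package main

import "fmt"

// pickFewest returns the index of the worker with the fewest containers
// that is still under limit, or -1 if every worker is at the limit.
func pickFewest(counts []int, limit int) int {
	best := -1
	for i, c := range counts {
		if c >= limit {
			continue
		}
		if best == -1 || c < counts[best] {
			best = i
		}
	}
	return best
}

// placeFromStaleSnapshot models steps that all choose a worker from one
// snapshot taken before any of them has created a container.
func placeFromStaleSnapshot(counts []int, limit, steps int) []int {
	snapshot := append([]int(nil), counts...)
	result := append([]int(nil), counts...)
	for s := 0; s < steps; s++ {
		if w := pickFewest(snapshot, limit); w >= 0 {
			result[w]++ // every step makes the same choice
		}
	}
	return result
}

// placeFromFreshCounts models steps that choose one at a time, each
// seeing the counts left behind by the previous placement.
func placeFromFreshCounts(counts []int, limit, steps int) []int {
	result := append([]int(nil), counts...)
	for s := 0; s < steps; s++ {
		if w := pickFewest(result, limit); w >= 0 {
			result[w]++
		}
	}
	return result
}

func main() {
	start := []int{198, 199, 199}
	fmt.Println(placeFromStaleSnapshot(start, 200, 4)) // [202 199 199]
	fmt.Println(placeFromFreshCounts(start, 200, 4))   // [200 200 200]
}
```

With a limit of 200, the stale-snapshot variant pushes the first worker to 202 containers while the others still have room, which matches the symptom reported in this issue.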

My current theory is that putting a rate limit on the number of calls to FindOrSelectWorker() would help with this general "workers failing" issue. It would likely mean steps take a bit longer to initialize, but I'd wager that situation is better than "my entire job failed because the worker fell over".
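
One way to picture this idea is the sketch below, which uses the strictest possible limit: only one selection at a time. It is an illustration under stated assumptions, not a patch for Concourse. The `placer` type, `selectAndReserve`, and `runSteps` are hypothetical, and a mutex stands in for whatever rate-limiting mechanism would actually be used. The key point is that the choice and the reservation of a slot happen in the same critical section, so concurrent steps cannot all act on the same stale view.

```go
package main

import (
	"fmt"
	"sync"
)

// placer is a hypothetical stand-in for a worker pool that tracks
// how many containers each worker currently holds.
type placer struct {
	mu     sync.Mutex
	counts []int
	limit  int
}

// selectAndReserve picks the least-loaded worker under the limit and
// reserves a slot on it while still holding the lock. It returns -1
// when every worker is already at the limit.
func (p *placer) selectAndReserve() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	best := -1
	for i, c := range p.counts {
		if c < p.limit && (best == -1 || c < p.counts[best]) {
			best = i
		}
	}
	if best >= 0 {
		p.counts[best]++
	}
	return best
}

// runSteps starts one goroutine per step, all selecting concurrently,
// and reports how many were placed and how many found no worker.
func runSteps(p *placer, steps int) (placed, rejected int) {
	var wg sync.WaitGroup
	var mu sync.Mutex // guards placed and rejected
	for i := 0; i < steps; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w := p.selectAndReserve()
			mu.Lock()
			defer mu.Unlock()
			if w >= 0 {
				placed++
			} else {
				rejected++
			}
		}()
	}
	wg.Wait()
	return placed, rejected
}

func main() {
	p := &placer{counts: []int{198, 199, 199}, limit: 200}
	placed, rejected := runSteps(p, 10)
	fmt.Println(placed, rejected, p.counts) // 4 6 [200 200 200]
}
```

In this model the steps that find no worker would have to wait or retry, which corresponds to the "steps take a bit longer to initialize" trade-off described above.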
