Stopping fluent-bit (in firelens configuration) hangs the application container #2787
Comments
Hi @kullu-ashish, I could replicate this issue. I will provide an update shortly. Thanks,
Note: I'm also currently working to replicate the issue. I'd recommend adding an explicit container dependency in your task definition: you can add dependencies for startup and for shutdown (see the sketch below).
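For illustration (this snippet is not from the original thread; the container names are placeholders), a startup dependency between the app container and the log router looks like this, and ECS honors it in reverse at shutdown:

```json
"containerDefinitions": [
  {
    "name": "app",
    "essential": true,
    "dependsOn": [
      { "containerName": "log-collector", "condition": "START" }
    ]
  },
  {
    "name": "log-collector",
    "essential": true
  }
]
```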
I am able to reproduce the issue with the supplied task definition. I also see that, as expected, the agent is trying to stop the remaining container. Even when I try to stop that container manually, it stays up. And when I describe the task, I see that it is STOPPED with the timeout error.
Hi there,
I recommend opening an issue with Docker to investigate why Docker is not able to stop the container.
@kullu-ashish Please feel free to reopen this issue if it needs more investigation from the ECS side.
@kullu-ashish how did you work around this issue in the end? Thanks!
More customers are seeing this; reopening for further investigation.
Repro'd the issue following the description. In fact, this is reproducible even without constantly generating logs. After some investigation, it appears this issue is highly related to moby/moby#40063. A container using the awsfirelens logging driver internally uses Docker's fluentd logging driver to forward logs to the FireLens sidecar container (which runs a fluentd server) [1], and when the FireLens container is stopped, the situation is equivalent to the one mentioned in the Docker issue: "Docker container hangs when using fluentd logger with fluentd-async-connect=true and unreachable (turned off) fluentd server." In fact, I was able to find the exact same stuck goroutine as the one posted in moby/moby#40063 (comment) at the time the container got stuck. Unfortunately, Docker has not fixed this issue and doesn't seem to have a plan to roll out a fix, so a path forward is TBD. [1] https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/
My company is currently evaluating using FireLens with ECS. If Fluent Bit fails to start for some reason (e.g. misconfiguration), the EC2 instance starts the main container again and again and fills up with stuck containers. This could be a showstopper for us. Is there any fix in sight?
@fschollmeyer Fluent Bit failing to start sounds like a separate issue? My understanding of this issue is that it's only triggered once the app container and the FireLens/Fluent Bit container are connected via the fluentd Docker log driver and logs start to flow, and then the Fluent Bit container is stopped. @fenxiong / anyone: is my understanding correct here? If you're having trouble with Fluent Bit, and you use the AWS Distro, please open an issue here and we will help you: https://github.com/aws/aws-for-fluent-bit
This confuses me a bit; can you elaborate? If you are running in ECS, the EC2 instance itself is not what triggers container restarts. The ECS service can re-schedule tasks if they fail to start. Sounds like that's what's happening in your case? Or are you seeing that the task is restarted, but the main container from each task never stops, and thus you get more and more instances of the app container?
Yes, that's my understanding. If the FireLens container has never started (e.g. due to misconfiguration), the task will just fail and the other containers won't start at all (so they won't get stuck either). If that's not the case, then that's a separate issue.
@PettitWesley @fenxiong We had configured a service with a fluent-bit FireLens sidecar. For the main container I had added only a START condition as a dependency on the sidecar. When Fluent Bit is misconfigured (e.g. invalid formatting in the injected config file), more and more instances of the main container would be started but never terminated. I modified the container dependency to a HEALTHY condition, which should work around this concrete error scenario (a sketch follows below). Still, a fix would be highly appreciated :)
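A sketch of that workaround, assuming the image ships curl and Fluent Bit's HTTP server with its health endpoint is enabled on port 2020 (names and thresholds are placeholders):

```json
{
  "name": "log-collector",
  "essential": true,
  "firelensConfiguration": { "type": "fluentbit" },
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -fsS http://localhost:2020/api/v1/health || exit 1"],
    "interval": 10,
    "retries": 3
  }
},
{
  "name": "app",
  "essential": true,
  "dependsOn": [
    { "containerName": "log-collector", "condition": "HEALTHY" }
  ]
}
```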
This issue was not resolved. The developer's closing note advised passing it on to Docker, but I haven't followed up after chasing it for months.
I was facing something similar. I had the app container marked as essential and FireLens as an auxiliary container. The FireLens container was not exiting even after the app exited. I solved it by downgrading the AWS Fluent Bit image tag to 2.17.0.
We are waiting for moby/moby#42979 to be merged in order to close this issue. In the meantime, the workaround suggested by @sabaodecoco11 could be used.
@sabaodecoco11 (and for others watching this)
Unfortunately, we have no reason to believe that changing the version of Fluent Bit will actually impact this issue. It seems to be slightly random, so it's possible to run a test where you don't see the behavior. When you use FireLens, the logs go through this path: app container stdout/stderr → Docker's fluentd log driver → local TCP connection → Fluent Bit in the FireLens sidecar → your log destination.
The root cause is in the runtime/log driver layer. When Fluent Bit stops, the log driver never closes its connection and apparently gets stuck retrying forever, which blocks Docker from stopping the app container. The problem has nothing to do with Fluent Bit, so changing its version shouldn't help. As Angel noted, we are closely watching the upstream Docker PR to merge the fix.
The Docker PR has been merged: moby/moby#42979. We will work to incorporate this into an ECS AMI release.
Also, I finally got around to double-checking the Docker PR myself, and it works! I didn't actually run a FireLens task yet; instead I used an ECS Optimized AMI to manually simulate FireLens in a scenario I have verified will often trigger the issue on existing Docker versions.
Custom Docker Build
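The build steps themselves were lost in this export; a plausible sketch of producing a daemon that contains the PR (the local branch name is made up) is:

```bash
# Fetch moby and check out the PR that carries the fluentd log driver fix
git clone https://github.com/moby/moby.git && cd moby
git fetch origin pull/42979/head:fluentd-fix
git checkout fluentd-fix

# moby's containerized build; binaries land in bundles/binary-daemon/
make binary

# Swap the freshly built dockerd in for the system one
sudo systemctl stop docker
sudo cp bundles/binary-daemon/dockerd /usr/bin/dockerd
sudo systemctl start docker
```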
Verify:
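The verification output was not preserved; presumably it amounted to checking that the daemon reports the custom (non-release) build, e.g.:

```bash
docker version --format '{{.Server.Version}}'
```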
Run Fluent Bit
Config:
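The config block was stripped from this export; a minimal Fluent Bit config that listens for forward-protocol traffic (what Docker's fluentd log driver sends) and echoes everything to stdout would be:

```
[INPUT]
    Name   forward
    Listen 0.0.0.0
    Port   24224

[OUTPUT]
    Name   stdout
    Match  *
```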
Run:
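An assumed invocation, mounting the config above over the image's default config path:

```bash
docker run -d --name fluent-bit -p 24224:24224 \
  -v "$(pwd)/fluent-bit.conf":/fluent-bit/etc/fluent-bit.conf \
  amazon/aws-for-fluent-bit
```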
Run some containers that use the fluentd log driver
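This mirrors what FireLens sets up internally: the fluentd log driver with async connect, pointed at the local Fluent Bit. The container name and the logging command are placeholders:

```bash
docker run -d --name app1 \
  --log-driver fluentd \
  --log-opt fluentd-address=127.0.0.1:24224 \
  --log-opt fluentd-async-connect=true \
  busybox sh -c 'while true; do echo hello; sleep 1; done'
```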
Stop everything in the wrong order
Stop Fluent Bit first:
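Using the placeholder name from above:

```bash
docker stop fluent-bit
```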
Then, wait about a minute for the logs to pile up in the log driver (in my experience, this seemed to more consistently trigger the issue), and then stop the app containers:
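Again with the placeholder name from above:

```bash
docker stop app1   # hangs on an unpatched daemon; returns once the fix is in
```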
On existing Docker versions this just hangs forever; with the new commit, it actually stops the containers.
Also, some bad news: this fix might take quite some time to be released, since it is marked as a feature in master for the 21.xx version, not a patch for the existing 20.xx series. There is no date from the Docker community on when that will be. (This is noted in the PR linked in an earlier comment.)
I'm attempting to get the Docker maintainers to agree to backporting this to the 20.10 branch, which means it could be released much sooner: moby/moby#43147
Yay, they finally released it! https://docs.docker.com/engine/release-notes/#201013
@PettitWesley do you know when the new Docker Engine version will be included in the official ECS AMIs?
@elisiariocouto Looks like the ECS AMI is on 20.10.7: https://github.com/aws/amazon-ecs-ami/blob/main/release.auto.pkrvars.hcl So it's still too old, unfortunately. We need 20.10.13: https://docs.docker.com/engine/release-notes/#201013
Hi @elisiariocouto, we've released a new set of ECS AMIs with docker 20.10.3. Please re-open this issue if you have any other concerns, thanks!
@singholt Your comment says 20.10.3 but I think you mean 20.10.13.
Yes, that's right.
Summary
The essential container does not stop if you kill the fluent-bit container in the same task definition.
Description
I am running an ECS service with 2 containers in the task: one is the main nginx container, and the other is the fluent-bit log collector. Both containers are essential. When the log-collector container is stopped manually, the nginx container does not stop. Checking the ECS agent and Docker logs shows that SIGTERM and SIGKILL are delivered for the main nginx container, and the nginx process on the host machine also stops. But docker ps still shows the nginx container as Up, and the task still shows as RUNNING.
Steps to replicate -
Use the following task definition -
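The task definition itself was stripped from this export; a minimal FireLens task definition matching the description (an nginx app container plus a fluent-bit log collector, both essential) would look roughly like this, with image names, region, and log destination as placeholders:

```json
{
  "family": "firelens-repro",
  "containerDefinitions": [
    {
      "name": "nginx",
      "image": "nginx:latest",
      "essential": true,
      "memory": 256,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "firelens-repro",
          "log_stream_prefix": "app-"
        }
      }
    },
    {
      "name": "log-collector",
      "image": "amazon/aws-for-fluent-bit:latest",
      "essential": true,
      "memory": 128,
      "firelensConfiguration": { "type": "fluentbit" }
    }
  ]
}
```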
And then I ran curl against the HTTPD container so that it generates constant logs, using the following bash script -
```bash
while [ 1 ];
do
  curl
done
```
Then I ran -
```bash
docker stop log-collector
```
The other container got stuck in the weird state, like the customer's.
Expected Behavior
Both containers should stop. docker ps should not show any running containers.
Observed Behavior
Only the fluent-bit container stops. The application container (nginx) is still seen as running with docker ps and even on the ECS console.
Environment Details
Supporting Log Snippets
Logs from the dockerd daemon -