
Stopping fluent-bit (in firelens configuration) hangs the application container #2787

Closed
kullu-ashish opened this issue Jan 7, 2021 · 30 comments

Comments

@kullu-ashish

Summary

The essential application container does not stop when the fluent-bit container in the same task definition is killed.

Description

I am running an ECS service with 2 containers in the task: the main nginx container and a fluent-bit log-collector. Both containers are marked essential. When the log-collector container is stopped manually, the nginx container does not stop. Checking the ECS agent and Docker logs, I saw that SIGTERM and SIGKILL are sent to the main nginx container, and the nginx process on the host machine also stops. But docker ps still shows the nginx container as Up, and the task still shows as RUNNING.

Steps to replicate:
Use the following task definition:

{
  "family": "firelens-http",
  "taskRoleArn": "arn:aws:iam::919027951404:role/ECSTaskRoleForCW",
  "containerDefinitions": [
    {
      "name": "continuelogs",
      "image": "httpd",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "ap-southeast-2",
          "log_group_name": "firelens-fluent-bit",
          "auto_create_group": "true",
          "log_stream_prefix": "from-fluent-bit"
        }
      },
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ],
      "memory": 50,
      "cpu": 102
    },
    {
      "name": "log_router",
      "essential": true,
      "image": "amazon/aws-for-fluent-bit:latest",
      "firelensConfiguration": {
        "type": "fluentbit"
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "firelens-container",
          "awslogs-region": "ap-southeast-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "memoryReservation": 50
    }
  ]
}

Then I generated constant logs by running curl against the httpd container in a loop, using the following bash script:

while [ 1 ];
do
curl 0.0.0.0:80
done

Then I ran:
docker stop log-collector
The other container got stuck in the same weird state as the customer's.

Expected Behavior

Both containers should stop. docker ps should not show any running containers.

Observed Behavior

Only the fluent-bit container stops. The application container (nginx) is still shown as running by docker ps and even in the ECS console.

Environment Details

Supporting Log Snippets

Logs from the dockerd daemon -

Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746807408Z" level=info msg="Container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be failed to exit within 1500 seconds of signal 15 - using the force"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746844525Z" level=debug msg="Sending kill signal 9 to container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750416996Z" level=debug msg="Running health check for container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56 ..."
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750616958Z" level=debug msg="starting exec command 2670ffe75a8b9ae084b0894c9367c86fe0687d4e6660c37a76280c6da8852946 in container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752485060Z" level=debug msg="attach: stdout: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752526920Z" level=debug msg="attach: stderr: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.802146328Z" level=debug msg="Client context cancelled, stop sending events"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.847525798Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal containerd[2972]: time="2020-09-18T00:34:06.864594387Z" level=info msg="shim reaped" id=5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be
@ubhattacharjya
Contributor

Hi @kullu-ashish,

I could replicate this issue. I will provide an update about this shortly.

Thanks,
Utsa

@fierlion
Member

fierlion commented Feb 1, 2021

Note: I'm also currently working to replicate the issue.

I'd recommend adding an explicit container dependency in your task definition:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_dependson

You can add dependencies for startup and for shutdown.
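
For reference, a minimal sketch of such a dependency on the application container, using the container names from the task definition above (illustrative only, not a confirmed fix for this hang; the condition can also be HEALTHY, COMPLETE, or SUCCESS):

    {
      "name": "continuelogs",
      "essential": true,
      "dependsOn": [
        {
          "containerName": "log_router",
          "condition": "START"
        }
      ]
    }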

@fierlion
Member

fierlion commented Feb 1, 2021

I am able to reproduce the issue with the supplied task definition:

$ docker ps
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                PORTS                NAMES
<ID>        httpd                              "httpd-foreground"       9 seconds ago       Up 8 seconds          0.0.0.0:80->80/tcp   ecs-firelens-http-1-continuelogs-d4e3ce9dd7e6ea820300
<ID>        amazon/aws-for-fluent-bit:latest   "/bin/sh -c /entrypo…"   14 seconds ago      Up 13 seconds         2020/tcp             ecs-firelens-http-1-logrouter-9ef0ffc9c8cdbae44100
<ID>        amazon/amazon-ecs-agent:latest     "/agent"                 3 days ago          Up 3 days (healthy)                        ecs-agent

$ curl 0.0.0.0:80
<html><body><h1>It works!</h1></body></html>

$ docker stop <logrouter ID>

$ curl 0.0.0.0:80
curl: (56) Recv failure: Connection reset by peer

$ docker ps

CONTAINER ID        IMAGE                            COMMAND              CREATED             STATUS                PORTS                NAMES
<ID>        httpd                            "httpd-foreground"   11 minutes ago      Up 11 minutes         0.0.0.0:80->80/tcp   ecs-firelens-http-1-continuelogs-d4e3ce9dd7e6ea820300
<ID>        amazon/amazon-ecs-agent:latest   "/agent"             3 days ago          Up 3 days (healthy)                        ecs-agent

@fierlion
Member

fierlion commented Feb 1, 2021

I also see that, as expected, the agent is trying to stop the remaining container:

level=info time=2021-02-01T18:27:17Z msg="Task engine [arn:aws:ecs:us-west-2:758282334450:task/execDemo/c5162c766891463cb683a60bb6534a83]: error transitioning container [continuelogs (Runtime ID: c233aff801f49f710b1efc08a3d53356a83dacdbc0c5af7001445236e220f116)] to [STOPPED]: Could not transition to stopped; timed out after waiting 2m30s" module=docker_task_engine.go

And even when I try to stop that httpd container manually, using both docker stop and docker kill, I am unable to.

Also, when I describe the task, I see that it is STOPPED with the timeout error:

                    "image": "httpd",
                    "lastStatus": "STOPPED",
                    "exitCode": 255,
                    "reason": "DockerTimeoutError: Could not transition to stopped; timed out after waiting 2m30s",

Also note: when I stop the httpd container first with docker stop, there is no issue. The container exits, and the fluent-bit container is also automatically stopped (because the httpd container is marked essential). This issue only occurs when the fluent-bit container is stopped before the httpd container.

@fierlion
Member

fierlion commented Feb 1, 2021

One more clue: I tested with polinux/httpd (Apache in a Docker with Supervisor - https://hub.docker.com/r/polinux/httpd) so I could get a shell inside the running container. This image doesn't show the same issue. I can stop the fluent-bit container and polinux/httpd exits as expected (after ~45 seconds).

I also tested the httpd image with the init process enabled, but still saw the container hang.

            "linuxParameters": {
                "initProcessEnabled": true
            },

@shubham2892
Contributor

It looks like the httpd docker container is hanging; I am not even able to run docker inspect on that container after stopping the log-router container.
Also, can you check docker logs <container id of httpd container>? For me, it showed an error message about setting up the logs for the container. In addition, can you try polinux/httpd to see if that resolves the issue, as recommended by @fierlion. Thanks.

@kullu-ashish
Author

Hi there,
I need to find the root cause of this issue. What do you recommend I do here:
Should I open an issue with Docker, or is it related to fluent-bit?

@shubham2892
Contributor

I recommend opening an issue with Docker to investigate why Docker is not able to stop the container on a docker stop command. This does not look like a fluent-bit issue to me.

@shubham2892
Contributor

@kullu-ashish Please feel free to reopen this issue if it needs more investigation from ECS side.

@angeldam

@kullu-ashish how did you work around this issue in the end? Thanks!

@fenxiong
Contributor

More customers are seeing this; reopening for further investigation.

@fenxiong fenxiong reopened this Sep 15, 2021
@fenxiong
Contributor

fenxiong commented Sep 17, 2021

I repro'd the issue following the description. In fact, this is reproducible even without constantly generating logs.

After some investigation, it appears this issue is highly related to moby/moby#40063. A container using the awsfirelens logging driver internally uses Docker's fluentd logging driver to forward logs to the firelens sidecar container (which runs a fluentd-compatible server) [1], and when the firelens container is stopped, the situation is equivalent to the one described in the Docker issue: "Docker container hangs when using fluentd logger with fluentd-async-connect=true and unreachable (turned off) fluentd server." In fact, I was able to find the exact same stuck goroutine as the one posted in moby/moby#40063 (comment) at the time the container got stuck.

Unfortunately, Docker has not fixed this issue and doesn't seem to have a plan to roll out a fix, so the path forward is TBD.

[1] https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/
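
In other words, under FireLens the application container is wired up roughly like the docker-level equivalent below (a sketch; the socket path and image are illustrative, and a full manual repro along these lines appears later in this thread):

# app container logging through Docker's fluentd log driver to a
# fluent-bit sidecar over a unix socket, which is what FireLens sets up internally
docker run -d \
  --log-driver fluentd \
  --log-opt "fluentd-address=unix:///var/run/fluent.sock" \
  --log-opt "fluentd-async-connect=true" \
  httpd
# stopping the sidecar that owns /var/run/fluent.sock first is what triggers the hang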

@fschollmeyer

My company is currently evaluating using Firelens with ECS. If fluentbit fails to start for some reason (e.g. misconfiguration), the EC2 instance starts the main container again and again and fills up with stuck containers.

This could be a showstopper for us. Is there any fix in sight?

@PettitWesley
Contributor

PettitWesley commented Sep 21, 2021

@fschollmeyer Fluent bit failing to start sounds like a separate issue?

My understanding of this issue is that it's only triggered once the app container and the firelens/fluent bit container are connected via the fluentd Docker log driver and logs start to flow, and then the fluent bit container is stopped. @fenxiong / anyone: is my understanding correct here?

If you're having trouble with Fluent Bit, and you use the AWS Distro, please open an issue here and we will help you: https://github.com/aws/aws-for-fluent-bit

the EC2 instance starts the main container again and again and fills up with stuck containers.

This confuses me a bit; can you elaborate? If you are running in ECS, the EC2 instance itself is not what triggers container restarts; the ECS service can re-schedule tasks if they fail to start. It sounds like that's what's happening in your case? Or are you seeing that the task is restarted, but the main container from each task never stops, and thus you get more and more instances of the app container?

@fenxiong
Contributor

My understanding of this issue is that it's only triggered once the app container and the firelens/fluent bit container are connected via the fluentd Docker log driver and logs start to flow, and then the fluent bit container is stopped. @fenxiong / anyone: is my understanding correct here?

Yes, that's my understanding. If the firelens container never started (e.g. due to misconfiguration), the task will just fail and the other container won't start at all (so it won't get stuck either). If that's not the case, then that's a separate issue.

@fschollmeyer

@PettitWesley @fenxiong
Sorry for confusing you guys. Let me explain the issue in detail.

We had configured a service with a fluentbit-firelens sidecar. For the main container, I had added only a START condition as a dependency on the sidecar.

When fluentbit is misconfigured (e.g. invalid formatting in the injected config file), the following would happen:

  • Fluentbit starts
  • Main service starts
  • Fluentbit fails and terminates
  • Main container is running
  • At some point the ECS agent tries to stop the main container; however, this fails. The task is removed from the running tasks in ECS, but the container is still running on the machine.
  • About five minutes later, the ECS agent starts the next instance, and the procedure repeats.

Therefore, more and more instances of the main container would be started but never terminated.

I modified the container dependency to a HEALTHY condition, which should work around this concrete error scenario. Still, a fix would be highly appreciated :)
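
For anyone else applying this workaround, the relevant pieces look roughly like the fragment below. This is a sketch only: the HEALTHY condition requires a healthCheck on the log router, and the curl-based command here is illustrative; it assumes curl exists in the image and that Fluent Bit's HTTP monitoring endpoint (port 2020) is enabled in its configuration.

"containerDefinitions": [
  {
    "name": "app",
    "essential": true,
    "dependsOn": [
      { "containerName": "log_router", "condition": "HEALTHY" }
    ]
  },
  {
    "name": "log_router",
    "essential": true,
    "image": "amazon/aws-for-fluent-bit:latest",
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -sf http://127.0.0.1:2020/api/v1/uptime || exit 1"],
      "interval": 5,
      "timeout": 2,
      "retries": 3,
      "startPeriod": 10
    }
  }
]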

@kullu-ashish
Author

This issue was not resolved. The developers' closing advice was to pass it on to Docker, but I stopped following up after chasing it for months.

@sabaodecoco11

I was facing something similar. I had the app container marked as essential and firelens as an auxiliary container. The firelens container was not exiting even after the app exited. I solved it by downgrading the fluentbit AWS image tag to 2.17.0.

@angelcar
Contributor

We are waiting for moby/moby#42979 to be merged in order to close this issue.

In the meantime, the workaround suggested by @sabaodecoco11 could be used.

@PettitWesley
Contributor

@sabaodecoco11 (and for others watching this)

I solved it by downgrading the fluentbit AWS image tag to 2.17.0.

Unfortunately, we have no reason to believe that changing the version of Fluent Bit will actually impact this issue. It seems to be slightly random, so it's possible to run a test where you don't see the behavior.

When you use FireLens, the logs go through this path:

App container stdout => Container Runtime (Docker) => Fluentd Docker Log Driver => Unix Socket connection => Fluent Bit

The root cause is in the runtime/log driver layer. When Fluent Bit stops, the log driver never closes its connection and apparently gets stuck retrying forever, which blocks Docker from stopping the app container. The problem has nothing to do with Fluent Bit itself, so changing its version shouldn't help.

As Angel noted, we are closely watching the upstream Docker PR for the fix to be merged.

@PettitWesley
Contributor

The Docker PR has been merged: moby/moby#42979 (review)

We will work to incorporate this in an ECS AMI release.

@PettitWesley
Contributor

Also, I finally got around to double-checking the Docker PR myself, and it works!

I haven't actually run a FireLens task yet; instead I used an ECS Optimized AMI to manually simulate FireLens in a scenario I have verified will often trigger the issue on existing Docker versions...

Custom Docker Build

# Install Go 1.17.4, clone the Docker (moby/moby) source, and then:

make                                                         # build the daemon binaries into bundles/
sudo systemctl stop docker                                   # stop the running engine
sudo cp bundles/binary-daemon/dockerd-dev /usr/bin/dockerd   # swap in the freshly built daemon
sudo systemctl start docker                                  # restart with the patched daemon

Verify:

$ docker version
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.15.14
 Git commit:        f0df350
 Built:             Wed Nov 17 03:05:36 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          dev
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.17.3
  Git commit:       f6848ae321
  Built:            Mon Dec 13 19:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0
  GitCommit:        84113eef6fc27af1b01b3181f31bbaf708715301
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

f6848ae321 is the commit in master for the fix.

Run Fluent Bit

Config:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock


[OUTPUT]
    Name stdout
    Match *

Run:

docker run --name fluentbit -it -v $(pwd):/fluent-bit/etc -v /var/run:/var/run public.ecr.aws/aws-observability/aws-for-fluent-bit:latest

Run some containers that use the fluentd log driver

docker run -d --log-driver fluentd --log-opt "fluentd-address=unix:///var/run/fluent.sock" --log-opt "fluentd-async-connect=true"  ubuntu /bin/bash -c -- "while true; do echo "hi" && sleep 1; done;"

Stop everything in the wrong order

Stop Fluent Bit first:

docker stop fluentbit && docker rm fluentbit

Then, wait about a minute for the logs to pile up in the log driver (in my experience, this seemed to more consistently trigger the issue), and then stop the app containers:

docker stop $(docker ps -aq) && docker rm $(docker ps -aq) 

On existing Docker versions this just hangs forever; with the new commit, it actually stops the containers.

@PettitWesley
Contributor

Also, some bad news: this fix might take quite some time to be released, since it is marked as a feature in master for the 21.xx version, not a patch for the existing 20.xx series. There is no date from the Docker community on when that will be. (This is noted in the PR linked in an earlier comment.)

@PettitWesley
Contributor

I'm attempting to get the Docker maintainers to agree to backporting this to the 20.10 branch, which means it could be released much sooner: moby/moby#43147

@PettitWesley
Contributor

Yay they finally released it! https://docs.docker.com/engine/release-notes/#201013

@elisiariocouto

elisiariocouto commented Apr 21, 2022

@PettitWesley do you know when the new Docker Engine version will be included in the official ECS AMIs?

@PettitWesley
Contributor

PettitWesley commented Apr 27, 2022

@elisiariocouto Looks like the ECS AMI is on 20.10.7: https://github.com/aws/amazon-ecs-ami/blob/main/release.auto.pkrvars.hcl

So it's still too old, unfortunately. We need 20.10.13: https://docs.docker.com/engine/release-notes/#201013
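
If you want to confirm what a given container instance is running in the meantime, a quick check (not from this thread, just the standard docker CLI) is:

# prints only the Docker engine version; the fix ships in 20.10.13 and later
docker version --format '{{.Server.Version}}'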

@singholt
Contributor

singholt commented May 10, 2022

Hi @elisiariocouto, We've released a new set of ECS AMIs with docker 20.10.3 20.10.13 - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-versions.html.

Please re-open this issue if you have any other concerns, thanks!

@PettitWesley
Contributor

@singholt Your comment says 20.10.3 but I think you mean 20.10.13

@singholt
Contributor

Yes, that's right: 20.10.13.
