
Stopping fluent-bit (in firelens configuration) hangs the application container #2787

Closed
kullu-ashish opened this issue Jan 7, 2021 · 30 comments

Comments

@kullu-ashish

Summary

The essential application container does not stop when the fluent-bit container in the same task definition is killed.

Description

I am running an ECS service with 2 containers in the task: the main nginx container and a fluent-bit log-collector. Both containers are marked essential. When the log-collector container is stopped manually, the nginx container does not stop. Checking the ECS agent and Docker logs, I saw that SIGTERM and SIGKILL are sent to the main nginx container, and the nginx process on the host machine also stops. But docker ps still shows the nginx container as Up, and the task still shows as RUNNING.

Steps to replicate:
Use the following task definition:

{
  "family": "firelens-http",
  "taskRoleArn": "arn:aws:iam::919027951404:role/ECSTaskRoleForCW",
  "containerDefinitions": [
    {
      "name": "continuelogs",
      "image": "httpd",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "ap-southeast-2",
          "log_group_name": "firelens-fluent-bit",
          "auto_create_group": "true",
          "log_stream_prefix": "from-fluent-bit"
        }
      },
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ],
      "memory": 50,
      "cpu": 102
    },
    {
      "name": "log_router",
      "essential": true,
      "image": "amazon/aws-for-fluent-bit:latest",
      "firelensConfiguration": {
        "type": "fluentbit"
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "firelens-container",
          "awslogs-region": "ap-southeast-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "memoryReservation": 50
    }
  ]
}

Then I generated constant logs by running curl against the httpd container in a loop, using the following bash script:

while [ 1 ];
do
curl 0.0.0.0:80
done

Then I ran:
docker stop log-collector
The other container got stuck in the same weird state as the customer's.

Expected Behavior

Both containers should stop. docker ps should not show any running containers.

Observed Behavior

Only the fluent-bit container stops. The application container (nginx) is still shown as running by docker ps and even in the ECS console.

Environment Details

Supporting Log Snippets

Logs from the dockerd daemon -

Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746807408Z" level=info msg="Container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be failed to exit within 1500 seconds of signal 15 - using the force"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746844525Z" level=debug msg="Sending kill signal 9 to container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750416996Z" level=debug msg="Running health check for container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56 ..."
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750616958Z" level=debug msg="starting exec command 2670ffe75a8b9ae084b0894c9367c86fe0687d4e6660c37a76280c6da8852946 in container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752485060Z" level=debug msg="attach: stdout: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752526920Z" level=debug msg="attach: stderr: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.802146328Z" level=debug msg="Client context cancelled, stop sending events"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.847525798Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal containerd[2972]: time="2020-09-18T00:34:06.864594387Z" level=info msg="shim reaped" id=5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be
@ubhattacharjya
Contributor

Hi @kullu-ashish,

I could replicate this issue. I will provide an update about this shortly.

Thanks,
Utsa

@fierlion
Member

fierlion commented Feb 1, 2021

Note: I'm also currently working to replicate the issue.

I'd recommend adding an explicit container dependency in your task definition:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_dependson

You can add dependencies for startup and for shutdown.
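
For reference, a minimal sketch of such a dependency on the application container, using the container names from the task definition above (illustrative only, not a confirmed fix for this hang; the condition can also be HEALTHY, COMPLETE, or SUCCESS):

    {
      "name": "continuelogs",
      "essential": true,
      "dependsOn": [
        {
          "containerName": "log_router",
          "condition": "START"
        }
      ]
    }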

@fierlion
Member

fierlion commented Feb 1, 2021

I am able to reproduce the issue with the supplied task definition:

$ docker ps
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                PORTS                NAMES
<ID>        httpd                              "httpd-foreground"       9 seconds ago       Up 8 seconds          0.0.0.0:80->80/tcp   ecs-firelens-http-1-continuelogs-d4e3ce9dd7e6ea820300
<ID>        amazon/aws-for-fluent-bit:latest   "/bin/sh -c /entrypo…"   14 seconds ago      Up 13 seconds         2020/tcp             ecs-firelens-http-1-logrouter-9ef0ffc9c8cdbae44100
<ID>        amazon/amazon-ecs-agent:latest     "/agent"                 3 days ago          Up 3 days (healthy)                        ecs-agent

$ curl 0.0.0.0:80
<html><body><h1>It works!</h1></body></html>

$ docker stop <logrouter ID>

$ curl 0.0.0.0:80
curl: (56) Recv failure: Connection reset by peer

$ docker ps

CONTAINER ID        IMAGE                            COMMAND              CREATED             STATUS                PORTS                NAMES
<ID>        httpd                            "httpd-foreground"   11 minutes ago      Up 11 minutes         0.0.0.0:80->80/tcp   ecs-firelens-http-1-continuelogs-d4e3ce9dd7e6ea820300
<ID>        amazon/amazon-ecs-agent:latest   "/agent"             3 days ago          Up 3 days (healthy)                        ecs-agent

@fierlion
Member

fierlion commented Feb 1, 2021

I also see that, as expected, the agent is trying to stop the remaining container:

level=info time=2021-02-01T18:27:17Z msg="Task engine [arn:aws:ecs:us-west-2:758282334450:task/execDemo/c5162c766891463cb683a60bb6534a83]: error transitioning container [continuelogs (Runtime ID: c233aff801f49f710b1efc08a3d53356a83dacdbc0c5af7001445236e220f116)] to [STOPPED]: Could not transition to stopped; timed out after waiting 2m30s" module=docker_task_engine.go

And even when I try to stop that httpd container manually, using both docker stop and docker kill, I am unable to.

Also, when I describe the task, I see that it is STOPPED with the timeout error:

                    "image": "httpd",
                    "lastStatus": "STOPPED",
                    "exitCode": 255,
                    "reason": "DockerTimeoutError: Could not transition to stopped; timed out after waiting 2m30s",

Also note: when I stop the httpd container first with docker stop, there is no issue. The container exits, and the fluent-bit container is also automatically stopped (because the httpd container is marked essential). This issue only occurs when the fluent-bit container is stopped before the httpd container.

@fierlion
Member

fierlion commented Feb 1, 2021

One more clue: I tested with polinux/httpd (Apache in a Docker with Supervisor - https://hub.docker.com/r/polinux/httpd) so I could get a shell inside the running container. This image doesn't show the same issue. I can stop the fluent-bit container and polinux/httpd exits as expected (after ~45 seconds).

I also tested the httpd image with the init process enabled, but still saw the container hang.

            "linuxParameters": {
                "initProcessEnabled": true
            },

@shubham2892
Contributor

It looks like the httpd docker container is hanging; I am not even able to run docker inspect on that container after stopping the log-router container.
Also, can you check docker logs <container id of httpd container>? For me, it showed an error message about setting up the logs for the container. In addition, can you try polinux/httpd to see if that resolves the issue, as recommended by @fierlion. Thanks.

@kullu-ashish
Author

Hi there,
I need to find the root cause of this issue. What do you recommend I do here:
Should I open an issue with Docker, or is it related to fluent-bit?

@shubham2892
Contributor

I recommend opening an issue with Docker to investigate why Docker is not able to stop the container on a docker stop command. This does not look like a fluent-bit issue to me.

@shubham2892
Contributor

@kullu-ashish Please feel free to reopen this issue if it needs more investigation from ECS side.

@angeldam

@kullu-ashish how did you work around this issue in the end? Thanks!

@fenxiong
Contributor

More customers are seeing this; reopening for further investigation.

@fenxiong fenxiong reopened this Sep 15, 2021
@fenxiong
Contributor

fenxiong commented Sep 17, 2021

I repro'd the issue following the description. In fact, this is reproducible even without constantly generating logs.

After some investigation, it appears this issue is highly related to moby/moby#40063. A container using the awsfirelens logging driver internally uses Docker's fluentd logging driver to forward logs to the firelens sidecar container (which runs a fluentd-compatible server) [1], and when the firelens container is stopped, the situation is equivalent to the one described in the Docker issue: "Docker container hangs when using fluentd logger with fluentd-async-connect=true and unreachable (turned off) fluentd server." In fact, I was able to find the exact same stuck goroutine as the one posted in moby/moby#40063 (comment) at the time the container got stuck.

Unfortunately, Docker has not fixed this issue and doesn't seem to have a plan to roll out a fix, so the path forward is TBD.

[1] https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/
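
In other words, under FireLens the application container is wired up roughly like the docker-level equivalent below (a sketch; the socket path and image are illustrative, and a full manual repro along these lines appears later in this thread):

# app container logging through Docker's fluentd log driver to a
# fluent-bit sidecar over a unix socket, which is what FireLens sets up internally
docker run -d \
  --log-driver fluentd \
  --log-opt "fluentd-address=unix:///var/run/fluent.sock" \
  --log-opt "fluentd-async-connect=true" \
  httpd
# stopping the sidecar that owns /var/run/fluent.sock first is what triggers the hang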

@fschollmeyer

My company is currently evaluating using Firelens with ECS. If fluentbit fails to start for some reason (e.g. misconfiguration), the EC2 instance starts the main container again and again and fills up with stuck containers.

This could be a showstopper for us. Is there any fix in sight?

@PettitWesley
Contributor

PettitWesley commented Sep 21, 2021

@fschollmeyer Fluent bit failing to start sounds like a separate issue?

My understanding of this issue is that it's only triggered once the app container and the firelens/fluent bit container are connected via the fluentd Docker log driver and logs start to flow, and then the fluent bit container is stopped. @fenxiong / anyone: is my understanding correct here?

If you're having trouble with Fluent Bit, and you use the AWS Distro, please open an issue here and we will help you: https://github.com/aws/aws-for-fluent-bit

the EC2 instance starts the main container again and again and fills up with stuck containers.

This confuses me a bit; can you elaborate? If you are running in ECS, the EC2 instance itself is not what triggers container restarts; the ECS service can re-schedule tasks if they fail to start. It sounds like that's what's happening in your case? Or are you seeing that the task is restarted, but the main container from each task never stops, and thus you get more and more instances of the app container?

@fenxiong
Contributor

My understanding of this issue is that it's only triggered once the app container and the firelens/fluent bit container are connected via the fluentd Docker log driver and logs start to flow, and then the fluent bit container is stopped. @fenxiong / anyone: is my understanding correct here?

Yes, that's my understanding. If the firelens container never started (e.g. due to misconfiguration), the task will just fail and the other container won't start at all (so it won't get stuck either). If that's not the case, then that's a separate issue.

@fschollmeyer

@PettitWesley @fenxiong
Sorry for confusing you guys. Let me explain the issue in detail.

We had configured a service with a fluentbit-firelens sidecar. For the main container, I had added only a START condition as a dependency on the sidecar.

When fluentbit is misconfigured (e.g. invalid formatting in the injected config file), the following would happen:

  • Fluentbit starts
  • Main service starts
  • Fluentbit fails and terminates
  • Main container is running
  • At some point the ECS agent tries to stop the main container; however, this fails. The task is removed from the running tasks in ECS, but the container is still running on the machine.
  • About five minutes later, the ECS agent starts the next instance, and the procedure repeats.

Therefore, more and more instances of the main container would be started but never terminated.

I modified the container dependency to a HEALTHY condition, which should work around this concrete error scenario. Still, a fix would be highly appreciated :)
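
For anyone else applying this workaround, the relevant pieces look roughly like the fragment below. This is a sketch only: the HEALTHY condition requires a healthCheck on the log router, and the curl-based command here is illustrative; it assumes curl exists in the image and that Fluent Bit's HTTP monitoring endpoint (port 2020) is enabled in its configuration.

"containerDefinitions": [
  {
    "name": "app",
    "essential": true,
    "dependsOn": [
      { "containerName": "log_router", "condition": "HEALTHY" }
    ]
  },
  {
    "name": "log_router",
    "essential": true,
    "image": "amazon/aws-for-fluent-bit:latest",
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -sf http://127.0.0.1:2020/api/v1/uptime || exit 1"],
      "interval": 5,
      "timeout": 2,
      "retries": 3,
      "startPeriod": 10
    }
  }
]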

@kullu-ashish
Author

This issue was not resolved. The developers' closing advice was to pass it on to Docker, but I stopped following up after chasing it for months.

@sabaodecoco11

I was facing something similar. I had the app container marked as essential and firelens as an auxiliary container. The firelens container was not exiting even after the app exited. I solved it by downgrading the fluentbit AWS image tag to 2.17.0.

@angelcar
Contributor

We are waiting for moby/moby#42979 to be merged in order to close this issue.

In the meantime, the workaround suggested by @sabaodecoco11 could be used.

@PettitWesley
Contributor

@sabaodecoco11 (and for others watching this)

I solved it by downgrading the fluentbit AWS image tag to 2.17.0.

Unfortunately, we have no reason to believe that changing the version of Fluent Bit will actually impact this issue. It seems to be slightly random, so it's possible to run a test where you don't see the behavior.

When you use FireLens, the logs go through this path:

App container stdout => Container Runtime (Docker) => Fluentd Docker Log Driver => Unix Socket connection => Fluent Bit

The root cause is in the runtime/log driver layer. When Fluent Bit stops, the log driver never closes its connection and apparently gets stuck retrying forever, which blocks Docker from stopping the app container. The problem has nothing to do with Fluent Bit itself, so changing its version shouldn't help.

As Angel noted, we are closely watching the upstream Docker PR for the fix to be merged.

@PettitWesley
Contributor

The Docker PR has been merged: moby/moby#42979 (review)

We will work to incorporate this in an ECS AMI release.

@PettitWesley
Contributor

Also, I finally got around to double-checking the Docker PR myself, and it works!

I haven't actually run a FireLens task yet; instead I used an ECS Optimized AMI to manually simulate FireLens in a scenario I have verified will often trigger the issue on existing Docker versions...

Custom Docker Build

# Install Go 1.17.4, clone the Docker (moby/moby) source, and then:

make                                                         # build the daemon binaries into bundles/
sudo systemctl stop docker                                   # stop the running engine
sudo cp bundles/binary-daemon/dockerd-dev /usr/bin/dockerd   # swap in the freshly built daemon
sudo systemctl start docker                                  # restart with the patched daemon

Verify:

$ docker version
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.15.14
 Git commit:        f0df350
 Built:             Wed Nov 17 03:05:36 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          dev
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.17.3
  Git commit:       f6848ae321
  Built:            Mon Dec 13 19:54:50 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0
  GitCommit:        84113eef6fc27af1b01b3181f31bbaf708715301
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

f6848ae321 is the commit in master for the fix.

Run Fluent Bit

Config:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock


[OUTPUT]
    Name stdout
    Match *

Run:

docker run --name fluentbit -it -v $(pwd):/fluent-bit/etc -v /var/run:/var/run public.ecr.aws/aws-observability/aws-for-fluent-bit:latest

Run some containers that use the fluentd log driver

docker run -d --log-driver fluentd --log-opt "fluentd-address=unix:///var/run/fluent.sock" --log-opt "fluentd-async-connect=true"  ubuntu /bin/bash -c -- "while true; do echo "hi" && sleep 1; done;"

Stop everything in the wrong order

Stop Fluent Bit first:

docker stop fluentbit && docker rm fluentbit

Then, wait about a minute for the logs to pile up in the log driver (in my experience, this seemed to more consistently trigger the issue), and then stop the app containers:

docker stop $(docker ps -aq) && docker rm $(docker ps -aq) 

On existing Docker versions this just hangs forever; with the new commit, it actually stops the containers.

@PettitWesley
Contributor

Also, some bad news: this fix might take quite some time to be released, since it is marked as a feature in master for the 21.xx version, not a patch for the existing 20.xx series. There is no date from the Docker community on when that will be. (This is noted in the PR linked in an earlier comment.)

@PettitWesley
Contributor

I'm attempting to get the Docker maintainers to agree to backporting this to the 20.10 branch, which means it could be released much sooner: moby/moby#43147

@PettitWesley
Contributor

Yay they finally released it! https://docs.docker.com/engine/release-notes/#201013

@elisiariocouto

elisiariocouto commented Apr 21, 2022

@PettitWesley do you know when the new Docker Engine version will be included in the official ECS AMIs?

@PettitWesley
Contributor

PettitWesley commented Apr 27, 2022

@elisiariocouto Looks like the ECS AMI is on 20.10.7: https://github.com/aws/amazon-ecs-ami/blob/main/release.auto.pkrvars.hcl

So it's still too old, unfortunately. We need 20.10.13: https://docs.docker.com/engine/release-notes/#201013
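
If you want to confirm what a given container instance is running in the meantime, a quick check (not from this thread, just the standard docker CLI) is:

# prints only the Docker engine version; the fix ships in 20.10.13 and later
docker version --format '{{.Server.Version}}'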

@singholt
Contributor

singholt commented May 10, 2022

Hi @elisiariocouto, We've released a new set of ECS AMIs with docker 20.10.3 20.10.13 - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-versions.html.

Please re-open this issue if you have any other concerns, thanks!

@PettitWesley
Contributor

@singholt Your comment says 20.10.3 but I think you mean 20.10.13

@singholt
Contributor

Yes, that's right: 20.10.13.
