
[BUG] agent Dockerfile started crashing #23048

Open
modosc opened this issue Feb 21, 2024 · 10 comments

modosc commented Feb 21, 2024

Agent Environment
we're using public.ecr.aws/datadog/agent:latest in a sidecar container and deploying to ECS.
we also have the following configuration set up via env variables:

        ECS_FARGATE: "true",
        DD_APM_ENABLED: "true",
        DD_LOGS_ENABLED: "true",
        DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: "true",
        DD_CONTAINER_EXCLUDE: "name:datadog-agent",
        DD_TAGS: "env:${DD_ENV},stage:${stage},version:${DD_VERSION}",
        DD_APM_IGNORE_RESOURCES: "SomeResourceNamesHere",
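
For context, here is a minimal sketch of how a sidecar like this looks inside an ECS task definition (the container name and layout are illustrative, not taken from our actual task definition; the environment values are the ones listed above):

```json
{
  "name": "datadog-agent",
  "image": "public.ecr.aws/datadog/agent:latest",
  "essential": true,
  "environment": [
    { "name": "ECS_FARGATE", "value": "true" },
    { "name": "DD_APM_ENABLED", "value": "true" },
    { "name": "DD_LOGS_ENABLED", "value": "true" },
    { "name": "DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL", "value": "true" },
    { "name": "DD_CONTAINER_EXCLUDE", "value": "name:datadog-agent" }
  ]
}
```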

Describe what happened:
this has worked fine for at least 6 months. since ~2024-02-18 we've started seeing these failures when attempting to deploy:

DATADOG ERROR - TRACING - Agent Error: {"agent_error":"Datadog::Core::Transport::InternalErrorResponse ok?: unsupported?:, not_found?:, client_error?:, server_error?:, internal_error?:true, payload:, error_type:Errno::ECONNREFUSED error:Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for \"127.0.0.1\" port 8126)"}

this causes our deploys to fail. re-running usually resolves this, although today these have happened more and more frequently.

here's a full log entry:

I, [2024-02-21T22:01:48.200466 #1]  INFO -- ddtrace: [ddtrace] DATADOG CONFIGURATION - TRACING - {"enabled":true,"agent_url":"http://127.0.0.1:8126?timeout=30","analytics_enabled":false,"sample_rate":null,"sampling_rules":null,"integrations_loaded":"action_mailer@7.1.3,action_cable@7.1.3,rails@7.1.3,faraday@2.9.0,rack@2.2.8,active_support@7.1.3,action_pack@7.1.3,action_view@7.1.3,active_job@7.1.3,active_record@7.1.3","partial_flushing_enabled":false,"priority_sampling_enabled":false,"integration_action_mailer_analytics_enabled":"false","integration_action_mailer_analytics_sample_rate":"1.0","integration_action_mailer_enabled":"true","integration_action_mailer_service_name":"","integration_action_mailer_email_data":"false","integration_action_cable_analytics_enabled":"false","integration_action_cable_analytics_sample_rate":"1.0","integration_action_cable_enabled":"true","integration_action_cable_service_name":"","integration_rails_analytics_enabled":"","integration_rails_analytics_sample_rate":"1.0","integration_rails_enabled":"true","integration_rails_service_name":"","integration_rails_distributed_tracing":"true","integration_rails_request_queuing":"false","integration_rails_exception_controller":"","integration_rails_middleware":"true","integration_rails_middleware_names":"false","integration_rails_template_base_path":"views/","integration_faraday_analytics_enabled":"false","integration_faraday_analytics_sample_rate":"1.0","integration_faraday_enabled":"true","integration_faraday_service_name":"faraday","integration_faraday_distributed_tracing":"true","integration_faraday_error_handler":"#\u003cProc:0x00007effb3b5cf20 /rails/vendor/ruby/3.2.0/gems/ddtrace-1.20.0/lib/datadog/tracing/contrib/faraday/configuration/settings.rb:14 
(lambda)\u003e","integration_faraday_on_error":"","integration_faraday_split_by_domain":"true","integration_faraday_peer_service":"","integration_rack_analytics_enabled":"","integration_rack_analytics_sample_rate":"1.0","integration_rack_enabled":"true","integration_rack_service_name":"","integration_rack_application":"#\u003cHorizonApi::Application:0x00007effb49ee9d0\u003e","integration_rack_distributed_tracing":"true","integration_rack_headers":"{:response=\u003e[\"Content-Type\", \"X-Request-ID\"]}","integration_rack_middleware_names":"false","integration_rack_quantize":"{}","integration_rack_request_queuing":"false","integration_rack_web_service_name":"web-server","integration_active_support_analytics_enabled":"false","integration_active_support_analytics_sample_rate":"1.0","integration_active_support_enabled":"true","integration_active_support_service_name":"","integration_active_support_cache_service":"active_support-cache","integration_action_pack_analytics_enabled":"","integration_action_pack_analytics_sample_rate":"1.0","integration_action_pack_enabled":"true","integration_action_pack_service_name":"","integration_action_pack_exception_controller":"","integration_action_view_analytics_enabled":"false","integration_action_view_analytics_sample_rate":"1.0","integration_action_view_enabled":"true","integration_action_view_service_name":"","integration_action_view_template_base_path":"views/","integration_active_job_analytics_enabled":"false","integration_active_job_analytics_sample_rate":"1.0","integration_active_job_enabled":"true","integration_active_job_service_name":"","integration_active_job_error_handler":"#\u003cProc:0x00007effaf758298 /rails/vendor/ruby/3.2.0/gems/ddtrace-1.20.0/lib/datadog/tracing/span_operation.rb:338\u003e","integration_active_record_analytics_enabled":"false","integration_active_record_analytics_sample_rate":"1.0","integration_active_record_enabled":"true","integration_active_record_service_name":"postgres"}

E, [2024-02-21T22:01:48.199084 #1] ERROR -- ddtrace: [ddtrace] (/rails/vendor/ruby/3.2.0/gems/ddtrace-1.20.0/lib/datadog/tracing/transport/http/client.rb:41:in `rescue in send_request') Internal error during Datadog::Tracing::Transport::HTTP::Client request. Cause: Errno::ECONNREFUSED Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126) Location: /rails/vendor/ruby/3.2.0/gems/net-http-0.4.1/lib/net/http.rb:1603:in `initialize'

I, [2024-02-21T22:01:47.684158 #1]  INFO -- ddtrace: [ddtrace] DATADOG CONFIGURATION - CORE - {"date":"2024-02-21T22:01:47+00:00","os_name":"x86_64-pc-linux","version":"1.20.0","lang":"ruby","lang_version":"3.2.2","env":"sandbox","service":"horizon-api-migrator","dd_version":"0d5cd92cfdd3bdf1c7d24eda79a04472bdbf8979","debug":false,"tags":"env:sandbox,version:0d5cd92cfdd3bdf1c7d24eda79a04472bdbf8979","runtime_metrics_enabled":false,"vm":"ruby-3.2.2","health_metrics_enabled":false}

I, [2024-02-21T22:01:47.683098 #1]  INFO -- ddtrace: [ddtrace] DATADOG CONFIGURATION - PROFILING - {"profiling_enabled":false}

Describe what you expected:
this shouldn't happen?

Steps to reproduce the issue:
see above. we cannot 100% reliably trigger this.

Additional environment details (Operating System, Cloud provider, etc):
aws
linux
also using the lambda log forwarder

is there more debugging we can enable on the dd side to understand what's going on?

was a change to this docker image pushed out?


bjclark13 commented Feb 22, 2024

I believe we are seeing similar issues using the Datadog Agent as a sidecar on Fargate. The Datadog Agent exits with a 137 status code.

We are also using ruby dd-trace, hopefully that's helpful in narrowing down the issue.

@JoshuaSchlichting

I'm having this issue as well. I'm pinning the container to version datadog/agent:7.50.3 to avoid this for now.


modosc commented Feb 27, 2024

edit: nevermind, we just saw another crash so the suggested fix did not help.

@bjclark13

Our experience is that pinning the agent version keeps the ECS task from failing, but we can still see the connection issue between the containers in the logs:

Internal error during Datadog::Tracing::Transport::HTTP::Client request. Cause: Errno::ECONNREFUSED Failed to open TCP connection to localhost:8126
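
Not a fix for the agent crash itself, but an ECONNREFUSED at startup can also happen when the application container races the agent at boot. A sketch of gating the app on the agent's health check in the task definition (assuming the agent image's `agent health` command and ECS `dependsOn` ordering; `app` and `my-app:latest` are placeholders):

```json
{
  "containerDefinitions": [
    {
      "name": "datadog-agent",
      "image": "public.ecr.aws/datadog/agent:7.50.3",
      "healthCheck": {
        "command": ["CMD-SHELL", "agent health"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 15
      }
    },
    {
      "name": "app",
      "image": "my-app:latest",
      "dependsOn": [
        { "containerName": "datadog-agent", "condition": "HEALTHY" }
      ]
    }
  ]
}
```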

@praveensudharsan

We are not able to use the datadog/agent:latest version of the agent as a sidecar in AWS EKS with Fargate nodes; the container terminates immediately with the following error:

s6-overlay-preinit: fatal: unable to chown /var/run/s6: Operation not permitted

Currently using datadog/agent:7.50.3 to avoid this issue for now.
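
The pinning workaround several commenters are using is just a matter of replacing the tag in the sidecar's container definition, e.g. (sketch, container name illustrative):

```json
{
  "name": "datadog-agent",
  "image": "public.ecr.aws/datadog/agent:7.50.3",
  "essential": true
}
```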

@LukaszBancarz

We have the same problem with the "latest" version running in an AWS ECS Fargate sidecar container. We needed to pin to 7.50.3 to stop our applications from crashing.

@clamoriniere
Contributor

Hi @modosc @JoshuaSchlichting @LukaszBancarz @praveensudharsan @bjclark13

Thanks for creating and commenting on this issue. However, it seems that several different issues are happening.
Could you all contact our support with an agent flare and the ECS task definition, so we can better understand the different scenarios that lead to an agent crash with the 7.51.x release? Then share the support ticket ID with us.


modosc commented Mar 20, 2024

@clamoriniere i've got 1566759 opened currently.

@bjclark13

Support case 1582012 for me

@Yorkerrr

It seems to be connected with the update of s6-overlay to v1.22.1.0 in the Docker image.

In /etc/s6/init/init-stage1 there is an additional block that fails:

if { /bin/s6-overlay-preinit }
