List API includes partially-created containers which cannot be inspected #44512

Closed
belugabehr opened this issue Nov 22, 2022 · 5 comments · Fixed by #44629
Labels
area/api, kind/bug, status/confirmed

Comments


belugabehr commented Nov 22, 2022

Description

I am working on an application that was implemented before the various docker system prune commands were available and therefore implements pruning manually from the client side. The logic is straightforward:

  1. Request a list of all docker containers
  2. For each container, inspect it for more details
  3. Make a decision about deleting the container or not (e.g., status, error code, creation date)

This process fails with some frequency: at times, inspecting a container that was just returned by the list call yields a 404. I've noticed that it consistently fails for containers in the "created" state.
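
For illustration, a minimal sketch of that loop using the Go client (github.com/docker/docker/client); the real application uses docker-java, so the client library and the decision logic shown here are stand-ins rather than the actual code:

package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// 1. Request a list of all containers (running and stopped).
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range containers {
		// 2. Inspect each container for more details.
		info, err := cli.ContainerInspect(ctx, c.ID)
		if client.IsErrNotFound(err) {
			// The surprising case described above: the ID was just returned
			// by the list call, yet inspect reports 404.
			log.Printf("container %s no longer exists", c.ID)
			continue
		}
		if err != nil {
			log.Fatal(err)
		}
		// 3. Decide whether to delete based on status, exit code,
		//    creation date, and so on (decision logic elided here).
		_ = info
	}
}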

In the application, there is one thread that is regularly performing this prune process, and another thread that is launching containers by first creating the container and then starting it.

# Example container: aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b

# Thread 1 - Creating the container
Nov 22 13:29:42 host.domain.com dockerd[1766]: time="2022-11-22T13:29:42.814004115Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/var, to f2044d2badae69af56bcc84b739b7e
Nov 22 13:29:43 host.domain.com dockerd[1766]: time="2022-11-22T13:29:43.064228098Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/root, to 16bb75cc1a51a5cd69a371b9b6cda
Nov 22 13:29:43 host.domain.com dockerd[1766]: time="2022-11-22T13:29:43.064312480Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/opt, to 844b97098daaf1040aa2cafcf1d3a4
Nov 22 13:29:44 host.domain.com dockerd[1766]: time="2022-11-22T13:29:44.381082225Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/run, to 87c3f7a34557a0b08e3be1d9404943
Nov 22 13:29:44 host.domain.com dockerd[1766]: time="2022-11-22T13:29:44.381892484Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/private, to 424c0af91b3c2cde0f7f44ab82
Nov 22 13:29:44 host.domain.com dockerd[1766]: time="2022-11-22T13:29:44.381916606Z" level=debug msg="copying image data from aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b:/tmp, to 5a77fcd42f70e14a8652a0a96df739

# Thread 2 - Prune process - List all of the containers and then inspect each container in a loop
Nov 22 13:29:47 host.domain.com dockerd[1766]: time="2022-11-22T13:29:47.211366766Z" level=debug msg="Calling GET /v1.23/containers/json?all=true"

# Thread 2 - Now returns a 404 error to the client - note these are subsequent calls in the logs
Nov 22 13:29:47 host.domain.com dockerd[1766]: time="2022-11-22T13:29:47.218430962Z" level=debug msg="Calling GET /v1.23/containers/aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b/json"

# Thread 2 - The container is in a very unexpected state - so try to delete it (also fails with a 404)
Nov 22 13:29:47 host.domain.com dockerd[1766]: time="2022-11-22T13:29:47.224637054Z" level=debug msg="Calling DELETE /v1.23/containers/aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b?v=true&force=true"

# Thread 2 - More pruning activity
Nov 22 13:29:47 host.domain.com dockerd[1766]: time="2022-11-22T13:29:47.230754029Z" level=debug msg="Calling GET /v1.23/volumes?filters=%7B%22dangling%22%3A%5B%22true%22%5D%7D"
Nov 22 13:29:47 host.domain.com dockerd[1766]: time="2022-11-22T13:29:47.230796460Z" level=debug msg=VolumeStore.Find ByType=service.andCombinator ByValue="[0x5581d5dc3500 false]"

# Thread 1 - Now the container gets started
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190019073Z" level=debug msg="Calling POST /v1.23/containers/aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b/start"
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190411705Z" level=warning msg="Specifying a kernel memory limit is deprecated and will be removed in a future release."
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190692420Z" level=debug msg="container mounted via layerStore: &{/local/docker/200000.200000/overlay2/dac334114dcbd314826c1b4a2531b9036b8a7a6a399067c4333d85e
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190905625Z" level=debug msg="Assigning addresses for endpoint confident_kilby's interface on network bridge"
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190928606Z" level=debug msg="RequestAddress(LocalDefault/X.X.0.0/16/X.X.X.X/16, <nil>, map[])"
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.190957426Z" level=debug msg="Request address PoolID:X.X.X.X/16 App: ipam/default/data, ID: LocalDefault/X.X.X.X/16, DBIndex: 0x0, Bits: 65536, Unselect
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.206877322Z" level=debug msg="Assigning addresses for endpoint confident_kilby's interface on network bridge"
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.210671885Z" level=debug msg="Programming external connectivity on endpoint confident_kilby (acde711aa29fb7c1a507e54744e19051cfb3af21d0d43d2a234efbd58284ddb5)
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.211956905Z" level=debug msg="EnableService aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b START"
Nov 22 13:29:50 host.domain.com dockerd[1766]: time="2022-11-22T13:29:50.211970514Z" level=debug msg="EnableService aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b DONE"

Subsequent attempts to prune the container succeed. It seems there is a small window of time during which the container is in a state that makes it a candidate for list but not for inspect.

Reproduce

Happens randomly. I was unable to reproduce it with the CLI.

Expected behavior

Assuming no external modification between calls, all containers returned by list are available to inspect.

docker version

Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.15.14
 Git commit:        f0df350
 Built:             Wed Nov 17 03:05:36 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.14
  Git commit:       b0f5bc3
  Built:            Wed Nov 17 03:06:14 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0
  GitCommit:        84113eef6fc27af1b01b3181f31bbaf708715301
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan (Redacted) null overlay
  Log: (Redacted)
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc version: 84113eef6fc27af1b01b3181f31bbaf708715301
 init version: de40ad0
 Security Options: (Redacted)
 Kernel Version: 5.10.144-111.639.amzn2int.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.84GiB
 Name: host.domain.com
 ID: (Redacted)
 Docker Root Dir: /local/docker/200000.200000
 Debug Mode: true
  File Descriptors: 38
  Goroutines: 48
  System Time: 2022-11-22T21:01:45.968617867Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries: (Redacted)
 Live Restore Enabled: false

Additional Info

This is from the client perspective (docker-java):

# First inspect failure
2022-11-22T13:29:47.218Z [INFO]   (Timer-0)  Was going to remove container aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b but it no longer exists
2022-11-22T13:29:47.218Z [DEBUG]  (Timer-0)  Container aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b detailed error
com.github.dockerjava.api.exception.NotFoundException: Status 404: No such container: aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b

# Since inspect failed unexpectedly, things are weird, just try to delete the container
2022-11-22T13:29:47.224Z [WARN]  (Timer-0)  Container aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b cannot be force removed. Manual removal is required.
2022-11-22T13:29:47.224Z [DEBUG] (Timer-0)  Container aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b detailed error
com.github.dockerjava.api.exception.NotFoundException: Status 404: No such container: aee19890438494eeecbf587b11221198ff199fe29437ecaa466c76156177b63b
belugabehr added the kind/bug and status/0-triage labels on Nov 22, 2022

corhere commented Nov 29, 2022

Are you able to reproduce the issue with a more recent version? This issue might have been fixed by #43166 in v20.10.13.

belugabehr (Author) commented

@corhere I am not familiar with the docker codebase, but #43166 seems related to 'Dead' and 'RemovalInProgress' containers. In my case, the container in question is being created and is not in either of those two states.

belugabehr (Author) commented

One correction I'll make: the title of this report says "Cannot Inspect", but in my scenario the requests for both inspect and delete returned 404 error codes.


belugabehr commented Dec 9, 2022

One other observation, based on some retry code in our product that works around this issue: the issue also happens with our currently deployed version, docker-1.12.6. For whatever reason, we are seeing it occur more regularly after upgrading to 20.10.7.


corhere commented Dec 9, 2022

I reproduced the issue on the latest master after widening the race window. Before the Create API request returns, the List endpoint already returns the container. Attempting to inspect a not-quite-created container doesn't just return a 404: the daemon panics with a nil-pointer dereference! Yikes. (Note: this is not a DoS or other security vulnerability. The daemon does not crash and remains responsive to subsequent API requests.)

diff --git a/daemon/create.go b/daemon/create.go
index b9efba80f7..9f3418055c 100644
--- a/daemon/create.go
+++ b/daemon/create.go
@@ -193,6 +193,10 @@ func (daemon *Daemon) create(ctx context.Context, opts createOpts) (retC *contai
 		return nil, err
 	}
 
+	logrus.WithField("id", ctr.ID).Info("Sleeping 30s in the middle of creating container")
+	time.Sleep(30 * time.Second)
+	logrus.WithField("id", ctr.ID).Info("Done sleeping")
+
 	var endpointsConfigs map[string]*networktypes.EndpointSettings
 	if opts.params.NetworkingConfig != nil {
 		endpointsConfigs = opts.params.NetworkingConfig.EndpointsConfig
root@c6cf37d0b0fe:/go/src/github.com/docker/docker# docker create hello-world &
[2] 9580
root@c6cf37d0b0fe:/go/src/github.com/docker/docker# Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:faa03e786c97f07ef34423fccceeec2398ec8a5759259f94d99078f264e9d7af
Status: Downloaded newer image for hello-world:latest
INFO[2022-12-09T19:36:33.953695745Z] Sleeping 30s in the middle of creating container  id=9a3ae6f15fbb4a85a66511356c46e2db325adc2f95ed321eafe3a26957716a88

root@c6cf37d0b0fe:/go/src/github.com/docker/docker# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
9a3ae6f15fbb        hello-world         "/hello"            6 seconds ago       Created                                 agitated_mcclintock
root@c6cf37d0b0fe:/go/src/github.com/docker/docker# docker inspect 9a3ae6f15fbb4a85a66511356c46e2db325adc2f95ed321eafe3a26957716a88
2022/12/09 19:36:44 http: panic serving @: runtime error: invalid memory address or nil pointer dereference
goroutine 298 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1850 +0xbf
panic({0x563f153c4e80, 0x563f169dd520})
	/usr/local/go/src/runtime/panic.go:890 +0x262
github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc00086ec80, {0xc00098e166?, 0x563f14c91f50?}, 0x0)
	/go/src/github.com/docker/docker/daemon/inspect.go:40 +0x54
github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0x563f1539c2c0?, {0xc00098e166, 0x40}, 0x4?, {0xc00098e156, 0x4})
	/go/src/github.com/docker/docker/daemon/inspect.go:29 +0xbe
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc000f3de00, {0x563f15760fa0, 0xc00130c7b0}, {0x563f1575f4f0, 0xc0014b20e0}, 0x563f152b1d20?, 0xc000c64dc0?)
	/go/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0xba
github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1({0x563f15760fa0, 0xc00130c7b0}, {0x563f1575f4f0?, 0xc0014b20e0?}, 0x563f1529afe0?, 0xc0007d2500?)
	/go/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0x15b
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1({0x563f15760fa0, 0xc00130c780}, {0x563f1575f4f0, 0xc0014b20e0}, 0x203000?, 0x40?)
	/go/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x4d7
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1({0x563f15760fa0, 0xc00130c780}, {0x563f1575f4f0?, 0xc0014b20e0}, 0xc000895000, 0xc0007d24a0?)
	/go/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x649
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1({0x563f1575f4f0, 0xc0014b20e0}, 0xc000894f00)
	/go/src/github.com/docker/docker/api/server/server.go:144 +0x1ce
net/http.HandlerFunc.ServeHTTP(0xc000894d00?, {0x563f1575f4f0?, 0xc0014b20e0?}, 0xc000cad9e0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000c0d500, {0x563f1575f4f0, 0xc0014b20e0}, 0xc000927200)
	/go/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:210 +0x1cf
net/http.serverHandler.ServeHTTP({0xc001250f00?}, {0x563f1575f4f0, 0xc0014b20e0}, 0xc000927200)
	/usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc00016c0a0, {0x563f15760fa0, 0xc0010fd080})
	/usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3102 +0x4db
2022/12/09 19:36:44 http: panic serving @: runtime error: invalid memory address or nil pointer dereference
goroutine 360 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1850 +0xbf
panic({0x563f153c4e80, 0x563f169dd520})
	/usr/local/go/src/runtime/panic.go:890 +0x262
github.com/docker/docker/daemon.(*Daemon).ContainerInspectCurrent(0xc00086ec80, {0xc00098e246?, 0x563f14c91f50?}, 0x0)
	/go/src/github.com/docker/docker/daemon/inspect.go:40 +0x54
github.com/docker/docker/daemon.(*Daemon).ContainerInspect(0x563f1539c2c0?, {0xc00098e246, 0x40}, 0x4?, {0xc00098e236, 0x4})
	/go/src/github.com/docker/docker/daemon/inspect.go:29 +0xbe
github.com/docker/docker/api/server/router/container.(*containerRouter).getContainersByName(0xc000f3de00, {0x563f15760fa0, 0xc00130ce10}, {0x563f1575f4f0, 0xc0014b22a0}, 0x563f152b1d20?, 0xc000c651e0?)
	/go/src/github.com/docker/docker/api/server/router/container/inspect.go:15 +0xba
github.com/docker/docker/api/server/middleware.ExperimentalMiddleware.WrapHandler.func1({0x563f15760fa0, 0xc00130ce10}, {0x563f1575f4f0?, 0xc0014b22a0?}, 0x563f1529afe0?, 0xc0007d2690?)
	/go/src/github.com/docker/docker/api/server/middleware/experimental.go:26 +0x15b
github.com/docker/docker/api/server/middleware.VersionMiddleware.WrapHandler.func1({0x563f15760fa0, 0xc00130cde0}, {0x563f1575f4f0, 0xc0014b22a0}, 0x203000?, 0x40?)
	/go/src/github.com/docker/docker/api/server/middleware/version.go:62 +0x4d7
github.com/docker/docker/pkg/authorization.(*Middleware).WrapHandler.func1({0x563f15760fa0, 0xc00130cde0}, {0x563f1575f4f0?, 0xc0014b22a0}, 0xc000895400, 0xc0007d2640?)
	/go/src/github.com/docker/docker/pkg/authorization/middleware.go:59 +0x649
github.com/docker/docker/api/server.(*Server).makeHTTPHandler.func1({0x563f1575f4f0, 0xc0014b22a0}, 0xc000895300)
	/go/src/github.com/docker/docker/api/server/server.go:144 +0x1ce
net/http.HandlerFunc.ServeHTTP(0xc000895200?, {0x563f1575f4f0?, 0xc0014b22a0?}, 0x800?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/docker/docker/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000c0d500, {0x563f1575f4f0, 0xc0014b22a0}, 0xc000895100)
	/go/src/github.com/docker/docker/vendor/github.com/gorilla/mux/mux.go:210 +0x1cf
net/http.serverHandler.ServeHTTP({0xc00130ca50?}, {0x563f1575f4f0, 0xc0014b22a0}, 0xc000895100)
	/usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc000ee41e0, {0x563f15760fa0, 0xc0010fd080})
	/usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3102 +0x4db
[]
error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.30/containers/9a3ae6f15fbb4a85a66511356c46e2db325adc2f95ed321eafe3a26957716a88/json: EOF

corhere changed the title from "Cannot Inspect Container Found in List Response" to "List API includes partially-created containers which cannot be inspected" on Dec 9, 2022
corhere added a commit to corhere/moby that referenced this issue Dec 12, 2022
(*Container).CheckpointTo() upserts a snapshot of the container to the
daemon's in-memory ViewDB and also persists the snapshot to disk. It
does not register the live container object with the daemon's container
store, however. The ViewDB and container store are used as the source of
truth for different operations, so having a container registered in one
but not the other can result in inconsistencies. In particular, the List
Containers API uses the ViewDB as its source of truth and the Container
Inspect API uses the container store.

The (*Daemon).setHostConfig() method is called fairly early in the
process of creating a container, long before the container is registered
in the daemon's container store. Due to a rogue CheckpointTo() call
inside setHostConfig(), there is a window of time where a container can
be included in a List Containers API response but "not exist" according
to the Container Inspect API and similar endpoints which operate on a
particular container. Remove the rogue call so that the caller has full
control over when the container is checkpointed and update callers to
checkpoint explicitly. No changes to (*Daemon).create() are needed as it
checkpoints the fully-created container via (*Daemon).Register().

Fixes moby#44512.

Signed-off-by: Cory Snider <csnider@mirantis.com>
corhere added a commit to corhere/moby that referenced this issue Dec 13, 2022
(cherry picked from commit 0141c6d)
Signed-off-by: Cory Snider <csnider@mirantis.com>
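
To illustrate the mechanism the commit message describes, here is a simplified, purely illustrative Go sketch; the types below are hypothetical stand-ins for the daemon's ViewDB (snapshots backing the List API) and container store (live objects backing the Inspect API), not the real moby code:

package main

import "fmt"

type container struct{ ID string }

type daemon struct {
	viewDB map[string]container  // snapshot view: source of truth for List
	store  map[string]*container // live store: source of truth for Inspect
}

// checkpoint mimics CheckpointTo: it publishes a snapshot to the ViewDB
// without registering the live object in the container store.
func (d *daemon) checkpoint(c *container) { d.viewDB[c.ID] = *c }

// register mimics (*Daemon).Register: the container finally becomes
// visible to Inspect.
func (d *daemon) register(c *container) { d.store[c.ID] = c }

func main() {
	d := &daemon{viewDB: map[string]container{}, store: map[string]*container{}}
	c := &container{ID: "aee198904384"}

	// The rogue CheckpointTo() inside setHostConfig() ran at this point,
	// early in the create path...
	d.checkpoint(c)

	// ...so during this window List sees the container but Inspect does not.
	_, listed := d.viewDB[c.ID]
	_, inspectable := d.store[c.ID]
	fmt.Println(listed, inspectable) // true false

	// Only much later does the create path reach Register().
	d.register(c)
}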