Ambiguous i/o timeouts #13337
Comments
@stevvooe anything here that sounds familiar to you?
I was also shown this, which may be relevant: golang/go#6336
@thaJeztah There are issues with timeouts in the registry client code within the daemon. We are actually working to make the transport instantiation much cleaner to address this. @jzelinskie Please take a look at docker-archive/docker-registry#286. It sounds really similar. The main issue was than IP address update was not picked up by DNS in the docker daemon. Read through the comments and see if the behavior seems similar. Other than that, there isn't much information to go on here. There is a lot of variation that can cause an io timeout, such as machine load and network load. Without knowing the specific conditions of each failure, this might be a wild goose chase. 🐦
Based on this, it seems like we may be dealing with an operating system issue (DNS cache or other problem). I'd recommend you keep collecting information and attempt to reproduce the issue. When this does happen, try to collect as much information as possible.
@stevvooe I've read over that issue (and those related) in the past when attempting to help people debug the issue. I've semi-disregarded it because I had reports of people experiencing it on docker versions that were using Go 1.3+. At this point, I'm not sure if all the different reports share the same root cause. Would you mind if I pointed people your way on IRC so we could collaborate in real time to collect information from the next report I receive?
@jzelinskie I'm not sure if you read all the way through, but did you see docker-archive/docker-registry#286 (comment)? It is not related to Go 1.3 but to fallback DNS servers. Considering the perceived behavior and that this is related to DNS, it is a very likely candidate. Considering there are different symptoms, it might not be the right thing to place all these users in the same bucket. It sounds like there is an issue that is resolved by restart and another that is slightly different. It's possible they have the same root cause, but that assumption may hinder your analysis. Yes, please do point them to IRC if they are experiencing an issue. We'll try to collect the right information.
Hi all, This is a big issue for us. We're seeing very frequent i/o timeout errors. We're seeing this problem on various Docker daemon versions ranging from 1.3.2 to 1.6.0. We're running on Amazon Linux on AWS. We've been trying to troubleshoot this problem with @jzelinskie for a long time now, to no avail. Help would be greatly appreciated. Let me know if there is any information you want that could help debug the problem. I'm also available on the #docker IRC channel.
@mpetazzoni I'd recommend dropping into the docker-distribution IRC channel where someone can help with your issue. If you could collect information about when it happens, when it does not happen, the state of the network (are packets being dropped?), round trip time, etc, it might help to divide the problem space. Does it happen only with quay or does it happen with other registries? Unfortunately, IO timeouts can happen for a number of reasons. Eliminating any possible cause without more information will hinder the investigation. @dmp42 Mind taking a look?
For reference, the error message from the Docker daemon's log:
@mpetazzoni so, the error unfortunately is non-descriptive. Like @stevvooe pointed out, different conditions might end up there. Might be the client's fault, might be the server's fault, might be a network issue. Bottom line: we need more data. Figuring out how often this occurs would definitely help (once in ten attempts? more? less?), and also whether this is affecting other registries (Docker Hub) similarly. Would you be able to collect network information (while this fails) using tcpdump or wireshark? I'll be on IRC (#docker-distribution) tomorrow at about 10AM PST. A final note: I don't know what quay.io is running - headers indicate tengine, and I know that at some point they forked the python registry.
@dmp42 We've never had anyone report getting this error in between requests in something like a pull or push -- only at first connection, which is always hitting the ping endpoint for every registry except the Docker Hub. If you need any more info, feel free to query me whenever on IRC: jzelinskie@freenode.
The errors are unfortunately seemingly random, but happen with a pretty high frequency in our automated deployment process, and from a variety of hosts in our environment. Again, which host is pretty much random, so getting any kind of tcpdump or network trace is close to impossible since (a) we can't predict if it will happen and (b) we can't predict on which host it will happen. We have not been able to correlate these failures with any kind of other network issues on our end. As far as we can tell, the network sees no glitches while this happens. I've even tried to reproduce the problem by pulling an image from Quay.io in a loop, and in two days it never happened. As @jzelinskie said, it's only at first connection, while hitting the ping endpoint.
@jzelinskie thanks for the info. Off the top of my head I see no reason why go would fail on _ping and not on other requests - on the other hand, since every communication with a private registry does start with a _ping, it would make sense that the symptom starts there if for some reason network communication is disrupted. @mpetazzoni unfortunately, there is no way forward without some network traces. Also, I absolutely need to understand the frequency of this - if only to understand the likelihood of this being a code issue or a network issue. Given how rare the symptom seems to be (you mentioned 2 days), doing curl at random moments as a sanity check probably holds little value. Others: if someone has the means to dig into tcpdump-ing this, can you reach out tomorrow at 10 PST? Thanks.
Also encountered this problem. A reboot resolved it. System info below. This is an EC2 instance running CoreOS. It had been up for 12 days, and up for a month before that.

$ docker info
Containers: 2
Images: 112
Storage Driver: btrfs
Build Version: Btrfs v3.17.1
Library Version: 101
Execution Driver: native-0.2
Kernel Version: 3.19.3
Operating System: CoreOS 647.0.0
CPUs: 2
Total Memory: 3.863 GiB
Name: ip-10-1-2-196
ID: QEOG:MEK5:LZRJ:N6QM:53O2:GYZS:CIGZ:2FFS:BNAO:H4EK:QFI4:GBGJ

$ docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.3.3
Git commit (client): a8a31ef-dirty
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.3.3
Git commit (server): a8a31ef-dirty

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=647.0.0
VERSION_ID=647.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 647.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

$ journalctl -u docker
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="POST /v1.17/images/create?fromImage=quay.io%2Fc2fo%2Fauth-manage&tag=1.7.1"
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="+job pull(quay.io/c2fo/auth-manage, 1.7.1)"
May 28 14:35:09 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:09Z" level="info" msg="+job resolve_repository(quay.io/c2fo/auth-manage)"
May 28 14:35:15 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:15Z" level="info" msg="-job resolve_repository(quay.io/c2fo/auth-manage) = OK (0)"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: Get https://quay.io/v1/_ping: dial tcp: i/o timeout
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="info" msg="-job pull(quay.io/c2fo/auth-manage, 1.7.1) = ERR (1)"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="error" msg="Handler for POST /images/create returned error: Get https://quay.io/v1/_ping: dial tcp: i/o timeout"
May 28 14:35:25 ip-10-1-2-196 dockerd[526]: time="2015-05-28T14:35:25Z" level="error" msg="HTTP Error: statusCode=500 Get https://quay.io/v1/_ping: dial tcp: i/o timeout"
@tim-kretschmer-c2fo can you clarify? If you can reproduce such a situation, can you get a tcpdump showing the failed requests? Also, "a8a31ef-dirty" for a git commit suggests you are running a modified version of docker. Can you provide details on what has been modified? Thanks.
@dmp42 this was a random occurrence, and the first time it happened to us. If the machine enters that state again I will try to get that dump and paste the result here. I do not think we modified anything in docker, whatever version is running on CoreOS is what we have been using.
@jzelinskie speaking about DNS resolution, this one is definitely an issue: #10863 - ipv6/ipv4 resolution preference at large is not well defined. Probably not a Docker issue per se, but still.
@jzelinskie Removing the milestone simply means this doesn't seem like it's going to make it for 1.7.0. We need more eyes from the networking team on this (@Madhu @mrjana).
@jzelinskie IIRC the understanding about this specific ticket so far is that in a mixed world (ipv6/ipv4) resolution precedence is not well defined - not to mention wildcard matching is a recipe for disaster. Either way, I would gladly do whatever it takes to figure out any registry issue - unfortunately, given the very nature of these, it's paramount that we manage to get access to a reproducible test case, or tcpdumps.
I'm not sure I understand exactly why this could be related to IPv4/IPv6 resolution precedence issues. Our DNS setup is completely vanilla from Amazon Linux, and Quay.io does not expose AAAA records at all. Plus, our error message clearly shows an IPv4 address that the i/o timeout happens on. Our
And, as done from one of our instances:
The only thing I can think of is that Quay.io's set of IP addresses changes, and when it does we somehow still try to hit one of the previous ones and it fails, so it would be more of a cache/TTL issue, either with Go's DNS code or with Amazon's DNS servers?
@mpetazzoni I was just pointing out one of the many reasons an error like that would happen. Since the error is non-specific, guessing is useless to solve your specific case. Without a tcpdump or a reproducible test case, there is nothing to do...
Any suggestions on how to get a tcpdump of something when I have no idea when, or even if, it will happen? I'm all ears.
@mpetazzoni capturing all docker traffic for a day on a given machine is the only way I can think of. I wish I had a better solution, but short of figuring out an obvious code bug, getting information or reproduction is the only way to figure out what's going on.
I faced the same issue on our staging server. Initially we thought there might be some issue with Docker Hub, but after doing some more investigation we found that there were a lot of dangling images. After we removed all of them we were able to push the image to Docker Hub. Looks like docker is not able to handle this scenario. Not sure if this is due to the dangling images or the number of images. The error is very confusing: "docker timeout exceeded".
Not sure if this helps w/ tracking down the root cause or not, but thought I'd toss it out there as something to investigate. I got an error similar to this after experimenting w/ setting up SkyDNS as the first nameserver in resolv.conf. I've since concluded that what I tried to do might not be a good idea and I've stopped doing it. As soon as I reordered the name servers in resolv.conf
@sporkmonger This sounds relevant. @mpetazzoni I wonder if this has something to do with the fact that the IP
So this would be the result of the two issues:
If I understand correctly, switching to a routable, more reliable DNS server address ( |
@mpetazzoni That conclusion probably goes a little too far (and is pretty unfair to Go's DNS library). We still don't have any evidence that it is part of the problem. Where is the
It's the default
@mpetazzoni Ok, so I understand your theory a little bit better. Go's DNS picks up the unresolved value and starts black-holing DNS requests. Is it possible that docker is coming up too early in the startup process (systemd, init.d, etc.)? Can we make it depend on a full and successful DHCP resolution, including the nameservers, before starting up?
I think our Docker daemon starts at the right time in the boot sequence and the networking is already all correctly set up by then. Would you suggest we try using

By the way, we haven't encountered this timeout in ~3 weeks. We've been progressively upgrading our system packages and Docker version on our instances over the past several weeks, so maybe that helped (a newer Go version?). But of course it doesn't prove anything.
@mpetazzoni Networking may be set up, but DHCP may not have set a nameserver in resolv.conf. Digressing: if you haven't seen it in a few weeks, perhaps the root cause has disappeared. Let's keep watching. If we see the timeout again, we'll look at the DHCP -> resolv.conf process and monitor from there. Thanks for sticking this out.
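For the startup-ordering idea above, on systemd-based hosts a drop-in unit like the following is the conventional way to hold the daemon until the network (including DHCP-provided nameservers) is actually online. This is a sketch, not docker's shipped unit, and the path is hypothetical; note that network-online.target only delays anything if the distro's wait-online service (systemd-networkd-wait-online or NetworkManager-wait-online) is enabled:

```ini
# /etc/systemd/system/docker.service.d/wait-for-network.conf (hypothetical path)
[Unit]
# Start docker only after the network is reported online, not merely
# configured, so DHCP has had a chance to write nameservers into
# /etc/resolv.conf before the daemon's first lookup.
After=network-online.target
Wants=network-online.target
```

After adding a drop-in like this, run `systemctl daemon-reload` for it to take effect.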
We're seeing this issue too. We've got our registry's hostname in /etc/hosts to try and work around any possible DNS issue, and we're still seeing it. We're running on precise; version info:

$ docker version
I work with @notnownikki and I can add that 107 out of the 481 "docker pull" commands our CI system has run over the past week resulted in the i/o timeout error.
@jesusaurus @notnownikki Please confirm that the error you're getting is dial tcp: i/o timeout.
Yes, this is dial tcp i/o timeout.
I'm seeing the same issue on my team's servers in our data center (not AWS).

Update: restarting the box did not fix the issue.

Command:

Error message:

Docker version:
I am receiving the i/o timeout on AWS.
I am getting a similar problem while running docker pull from Jenkins. I can't reproduce it when I SSH into the server and run a similar pull command. Once it starts happening, which is very random, it happens 3-4 times continuously and then stops. Most of the time I restart the daemon and the problem resolves. Surprisingly, in a given Jenkins job a few pulls from the same repo work fine and some time out.

First error was this:

time="2015-08-12T10:57:43-07:00" level=fatal msg="Failed to upload layer: Put https://ussf-prd-lndv03:5000/v1/images/bad577d45faa011a1fae043c32204690790e5aa90a618f44e6b790aee8695537/layer: dial tcp 10.50.76.27:5000: connection timed out"

which occurred twice and later turned into this error:

time="2015-08-12T11:16:13-07:00" level=fatal msg="Error response from daemon: v1 ping attempt failed with error: Get https://ussf-prd-lndv03:5000/v1/_ping: dial tcp 10.50.76.27:5000: i/o timeout. If this private registry supports only HTTP or HTTPS with an unknown CA certificate, please add

The problem is resolved for now (and was in the past) by restarting the daemon. Restarting the registry didn't help.

$ docker version
I am not entirely sure if this is related; however, this seems the most relevant issue. I had been using boot2docker and removed that in order to install the Docker Toolbox. The installation completes, and I am presented with the whale screen saying docker is configured to use the default machine. When I try to run any docker command I am presented with the following:
This is very baffling. I have found a few solutions people have claimed fix this issue, but I have had no luck. It's confusing that I would get the message saying docker is running on a given IP and then would immediately get a message saying it cannot connect.
@jcarpe While the error message for your issue is similar, the causes are very different. The problem you describe looks like an installation issue with toolbox (probably routing related). I'd suggest filing an issue over at https://github.com/docker/toolbox or seeking help through IRC.
I am seeing the following reproducibly in a VM that I am provisioning via Ansible. Executing the same steps from the AWS console results in an instance that does not exhibit this same behavior. I am not setting up any custom routing or anything, just using a base AWS Linux image and then installing docker.
Actually, I terminated the instances and started new ones from Ansible, and I can no longer duplicate this issue.
@everyone does anyone know if there's some sort of timeout and retry mechanism on a docker pull? I'm trying to automate a fresh install with the latest update on Debian jessie, and from time to time in Vagrant it hangs for a very long time. I'm just wondering if we can set up a timeout and retry for nasty dev/production environments. For instance, tonight I'm working on a laptop with a wifi connection and it has a hard time pulling an image. I'm just glad it happened in such conditions, because with cloud computing you can expect such behaviour when your cloud provider is shy about sharing such information due to SLAs. Here's some debug session on the docker process:

root@pxe:/home/vagrant# strace -p 12925
Attaching to process 12925

Sorry, I don't have the proper debug info :\ I guess I got a bit lazy here :)
@mrfoobar1 I'd recommend opening another issue to request that feature. The right approach here is a timeout to first response header, but this can be hard to control. Also, if you're not getting the

Since this issue is a little vague and we haven't had recent reports, I'm going to go ahead and close this.
Hello, We use the Go Docker library rather than the command-line utility. In our data center this only happens on two out of about a dozen ESXi machines (I'm saying about a dozen because the machines execute very different workloads wrt Docker operations; the machines in question run nightly builds, each of which performs roughly 500 operations a day). We are seeing this issue intermittently, not very often. It may happen twice a day, or it may not happen again for 3-4 days. We ran several network tests while running nightlies, and we don't see any networking problems. Packet drops are within expected figures. At the same time we are doing a lot of networking against Apache httpd, and we don't see any problems there. The exact error we are receiving is:
Docker client is built from revision

$ docker version
Client:
Version: 1.8.2
API version: 1.20
Go version: go1.4.2
Git commit: 0a8c2e3
Built: Thu Sep 10 19:08:45 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.8.2
API version: 1.20
Go version: go1.4.2
Git commit: 0a8c2e3
Built: Thu Sep 10 19:08:45 UTC 2015
OS/Arch: linux/amd64

$ docker info
Containers: 5
Images: 201
Storage Driver: devicemapper
Pool Name: docker-8:1-752829-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 5.853 GB
Data Space Total: 107.4 GB
Data Space Available: 9.154 GB
Metadata Space Used: 10.01 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.137 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-123.13.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 31.27 GiB
Name: loader6a
ID: ZXOZ:Y25P:V7YG:EQAJ:MIZA:TUEM:OFBV:4LCN:AZ6H:4ABH:7SAF:PKOV

What may be different in our setup is that we have 6 network interfaces available at any moment. Still, this shouldn't, in principle, confuse the client code, but looking at golang/go#6336 I just thought I'd mention it.
It seems we have a similar issue also. On a run we get:

error getting repository data: Get https://gcr.io/v1/repositories/google_containers/gci-mounter/images: dial tcp 74.125.70.82:443: i/o timeout

Could we reopen this?
@jingxu97 This issue describes a mostly generic i/o timeout.
My solution: I got this kind of error when I was trying to install TensorFlow in docker. Following the TensorFlow tutorial, I ran the command
Description of problem:
I've had numerous people report an issue connecting to registries (on prem, Docker Hub, and Quay.io) that has been quite tricky to track down. It can begin and end at seemingly random times.
It doesn't matter what API call is made (as long as it needs to connect to a registry): docker fails to establish a connection to the registry (in the case of every registry except the Docker Hub, this endpoint is /v1/_ping). This problem persists despite the docker daemon being restarted, but does not persist once the machine has been rebooted. Using curl to hit the endpoint works and dig resolves the domain correctly, yet the docker daemon will continue to fail connecting to the machine. This leads me to believe the issue is not related to the DNS cache.

The following data is taken from the last person who reported suffering from this issue.
docker version:
docker info:
N/A
uname -a:
Linux Mint
Uptime for this box was only a few hours.
Environment details (AWS, VirtualBox, physical, etc.):
I've seen this occur specifically on version 1.5.0, build a8a31ef on Debian, Ubuntu, and Amazon Linux via residential connections, GCE, and AWS. I'm not sure that this version is necessarily coupled with the issue, though.

How reproducible:
I haven't been able to personally reproduce the issue.
Steps to Reproduce:
Actual Results:
Receive tcp i/o timeouts from a perfectly functioning registry.
Expected Results:
Never receive tcp i/o timeouts from a perfectly functioning registry.
Additional info:
See description.