daemon.ContainerLogs(): fix resource leak on follow #37576
Conversation
Looks like there's a failure: https://jenkins.dockerproject.org/job/Docker-PRs/50259/console. That test was marked as flaky a couple of times (https://github.com/moby/moby/search?q=TestLogsFollowSlowStdoutConsumer&type=Issues); could that flakiness be related?
Fortunately I know a bit about this test. OK, so now we know why this code was needed :) Need to think.
Simplest reproducer:

    ID=$(docker run -d busybox sh -c "seq 1 100000"); docker logs --follow $ID | tail

This should output the last lines of the sequence, ending with 100000. Instead, the output stops well short of that.
@kolyshkin but that's only with | tail; i.e., this works without the pipe to tail, and so does a run with a tty attached (docker run -t).
Nah, tail is just there to avoid cluttering the terminal; it does not affect the test case outcome. What tail does is read all of its input into a circular buffer (sized N lines; N defaults to 10) until EOF, then print the buffer contents. It seems the problem is that logWatcher.Close() is called when a container is stopped, not when a reader is gone. I will sleep on it and hopefully come up with something tomorrow.
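As an illustration of that consumer behavior, here is a minimal Go sketch of a tail-like ring buffer (hypothetical code, not from this PR): it keeps reading, retaining only the last N lines, and prints nothing until EOF, which is why the reader stays attached to the stream for the whole run.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	const n = 10 // tail's default line count
	ring := make([]string, n)
	count := 0
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() { // keeps reading until EOF on stdin
		ring[count%n] = scanner.Text() // overwrite the oldest slot
		count++
	}
	// Print the retained lines in order, oldest first.
	start := 0
	if count > n {
		start = count - n
	}
	for i := start; i < count; i++ {
		fmt.Println(ring[i%n])
	}
}
```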
Lines 106 to 114 in e158451:

So we should not be trying to send logs when the log watcher is closed. It would also be nice, perhaps, to change the log reader interface to take a context instead of closing the log watcher; that may help simplify some of the logic. Honestly, there are too many goroutines here (one layer was added relatively recently to support swarm logs).
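For what such an interface change might look like, a hedged sketch in Go (the types and method signature here are assumptions for illustration, not the actual moby API):

```go
package logger

import "context"

// Message and ReadConfig stand in for moby's logger.Message and
// logger.ReadConfig; the fields here are illustrative only.
type Message struct{ Line []byte }
type ReadConfig struct{ Follow bool }

// LogReader is a hypothetical context-aware variant of the log reader
// interface: instead of the consumer calling Close() on a LogWatcher,
// the reader stops streaming as soon as ctx is cancelled.
type LogReader interface {
	ReadLogs(ctx context.Context, cfg ReadConfig) <-chan *Message
}
```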
The relevant code is in moby/daemon/logger/jsonfilelog/jsonfilelog.go, lines 169 to 177 in 09f5e9d:
So, it is called twice: once upon container stop, and once when the consumer is gone. Which call comes first depends on the situation. In this scenario the logger is fast and the consumer is slow(er), so the first call comes from the container stop, and thus (with the commit in this PR) the log consumer does not receive the remainder of the log. If we comment out this line (moby/daemon/logger/jsonfilelog/jsonfilelog.go, line 175 in 09f5e9d), the consumer receives the whole log.
For now, it seems like the best behavior would be to distinguish between "container stopped logging" and "logs consumer is gone" events, and act accordingly, i.e.: on "container stopped logging", deliver the remaining log messages and then return; on "logs consumer is gone", return immediately, releasing all the resources.
Unfortunately this further complicates the code, which I wanted to avoid :-\
Codecov Report
    @@           Coverage Diff            @@
    ##           master    #37576   +/-  ##
    ========================================
      Coverage        ?    36.13%
    ========================================
      Files           ?       609
      Lines           ?     45056
      Branches        ?         0
    ========================================
      Hits            ?     16283
      Misses          ?     26532
      Partials        ?      2241
We have to react to two distinct events here: the log producer (the container) going away, and the log consumer (the reader) going away.
A single context won't be enough here. Sigh :(
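A minimal sketch of what reacting to both events could look like, using two separate channels (the names follow the ProducerGone/ConsumerGone split that the final commit message below describes; the exact fields and constructor are assumptions):

```go
package logger

import "sync"

// Message stands in for moby's logger.Message.
type Message struct{ Line []byte }

// LogWatcher sketch with two independent "gone" signals.
type LogWatcher struct {
	Msg chan *Message // log messages flowing to the consumer

	producerOnce sync.Once
	producerGone chan struct{} // closed when the container stops logging

	consumerOnce sync.Once
	consumerGone chan struct{} // closed when the reader goes away
}

func NewLogWatcher() *LogWatcher {
	return &LogWatcher{
		Msg:          make(chan *Message),
		producerGone: make(chan struct{}),
		consumerGone: make(chan struct{}),
	}
}

// ProducerGone signals that the producer (container) stopped logging.
func (w *LogWatcher) ProducerGone() {
	w.producerOnce.Do(func() { close(w.producerGone) })
}

// WatchProducerGone returns a channel closed once the producer is gone.
func (w *LogWatcher) WatchProducerGone() <-chan struct{} { return w.producerGone }

// ConsumerGone signals that the consumer ("docker logs --follow") went away.
func (w *LogWatcher) ConsumerGone() {
	w.consumerOnce.Do(func() { close(w.consumerGone) })
}

// WatchConsumerGone returns a channel closed once the consumer is gone.
func (w *LogWatcher) WatchConsumerGone() <-chan struct{} { return w.consumerGone }
```

The sync.Once guards mirror the original Close() behavior discussed above, where a second call was a no-op rather than a double-close panic.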
Force-pushed from e6c8001 to 16bc899.
OK, a test case is needed. TODO: write a test case for #37391.
The ppc failure on DockerSwarmSuite.TearDownTest is #33041.
OK, I removed the fixing commit and added the test case; doing a CI cycle to see if the newly added test case detects the issue.
Yeah, so basically it looks like we just need to split the notifications up. It would be nice not to modify the log watcher for this.
Here's my current solution (will push here once the test case is confirmed to work): kolyshkin@16bc8991a. It is somewhat ugly and big, so I am wide open to any ideas on how to make it look better.
Force-pushed from 84681b8 to aa3bc5a.
Force-pushed from 206574e to a91ea0a.
Nope, the real fix is apparently commit 8269ff0dc5c (and that's another issue that #27782 masks/works around). Anyway, all three fixes to filenotify make some sense.
@cpuguy83 are you aware of any (unit or integration) test cases that check log rotation? Since this PR touches that precious code dealing with rotated log files, I want to assess whether it broke anything in that area...
CI failures (janky, power, z) are flaky and unrelated to this PR.
I think this PR is ready for review/merge.
Force-pushed from a91ea0a to 8c94b56.
Speaking of log rotation, I was not sure I hadn't broken anything in that area, so I wrote a small test:

    ID=$(docker run -d --log-opt=max-size=50K --log-opt=max-file=20 busybox sh -c 'i=0; while test $i -lt 10000; do echo $i; let i++; sleep 0; done')
    docker logs --follow $ID | tee $ID.logs | wc -l

If it works, it should print 10000 (the number of log lines received). It works on current master as well as with the code from this PR. I am not sure whether I should make an integration test case out of it, but in any case this can be addressed in a follow-up. PS note if I remove …
On my laptop, the success of following rotated logs depends on the output rate, not the log rotation rate. I guess that is because most of the time is spent decoding JSON :-\
This looks good; minor comment on timeouts in the test.
LGTM
This code has many return statements, and for some of them the "end logs" or "end stream" message was not printed, giving the impression that the "for" loop never ended. Make sure that "begin logs" is always followed by "end logs". Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This test case checks that followLogs() exits once the reader is gone. Currently it does not (i.e. this test is supposed to fail) due to moby#37391. [kolyshkin@: test case by Brian Goff; changelog and all bugs are by me] Source: https://gist.github.com/cpuguy83/e538793de18c762608358ee0eaddc197 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When daemon.ContainerLogs() is called with options.follow=true (as in "docker logs --follow"), the loggerutils.followLogs() function never returns (even when the logs consumer is gone). As a result, all the resources associated with it (including an opened file descriptor for the log file being read, two FDs for a pipe, and two FDs for inotify watch) are never released.

If this is repeated (such as by running "docker logs --follow" and pressing Ctrl-C a few times), this results in DoS caused by either hitting the limit of inotify watches, or the limit of opened files. The only cure is a daemon restart.

Apparently, what happens is:

1. The logs producer (a container) is gone, calling (*LogWatcher).Close() for all its readers (daemon/logger/jsonfilelog/jsonfilelog.go:175).

2. WatchClose() is properly handled by a dedicated goroutine in followLogs(), cancelling the context.

3. Upon receiving ctx.Done(), the code in followLogs() (daemon/logger/loggerutils/logfile.go#L626-L638) keeps sending messages _synchronously_ (which is OK for now).

4. The logs consumer is gone (Ctrl-C is pressed on a terminal running "docker logs --follow"). Method (*LogWatcher).Close() is properly called (see daemon/logs.go:114). Since it was called before, and due to once.Do(), nothing happens (which is kinda good, as otherwise it would panic on closing a closed channel).

5. The goroutine from item 3 above keeps sending log messages synchronously to the logWatcher.Msg channel. Since the channel reader is gone, the channel send operation blocks forever, and the resource cleanup set up in defer statements at the beginning of followLogs() never happens.

Alas, the fix is somewhat complicated:

1. Distinguish between a close from the logs producer and one from the logs consumer. To that effect:
   - yet another channel is added to LogWatcher;
   - {Watch,}Close() are renamed to {Watch,}ProducerGone();
   - {Watch,}ConsumerGone() are added.
   NOTE that the ProducerGone()/WatchProducerGone() pair is ONLY needed in order to stop ContainerLogs(follow=true) when a container is stopped; otherwise we're not interested in it. In other words, we're only using it in followLogs().

2. Code that was doing (*logWatcher).Close() is modified to call either ProducerGone() or ConsumerGone(), depending on the context.

3. Code that was waiting for WatchClose() is modified to wait for ConsumerGone() or ProducerGone(), or both, depending on the context.

4. followLogs() is modified accordingly:
   - context cancellation happens on WatchProducerGone(), and once it's received, the FileWatcher is closed and waitRead() returns errDone on EOF (i.e. the log rotation handling logic is disabled);
   - due to this, the code that was writing synchronously to logWatcher.Msg can be (and is) removed, as the code above it handles this case;
   - the function returns once ConsumerGone is received, freeing all the resources; this is the bugfix itself.

While at it:

1. Let's also remove the ctx usage, to simplify the code a bit. It was introduced by commit a69a59f ("Decouple removing the fileWatcher from reading") in order to fix a bug. The bug was actually a deadlock in fsnotify, and the fix was just a workaround. Since then the fsnotify bug has been fixed, and a new fsnotify was vendored in. For more details, please see moby#27782 (comment)

2. Since (*filePoller).Close() is fixed to remove all the files being watched, there is no need to explicitly call fileWatcher.Remove(name) anymore, so get rid of the extra code.

Should fix moby#37391

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
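To make item 4 of the fix concrete, here is a hedged sketch of the resulting wait logic (illustrative names only; the real code lives in daemon/logger/loggerutils/logfile.go and differs in detail):

```go
package loggerutils

import "errors"

// Sentinel errors; the names are illustrative, not the exact moby code.
var (
	errDone       = errors.New("consumer gone: stop following, clean up")
	errDrainToEOF = errors.New("producer gone: read remaining logs, stop")
)

// waitRead sketches the wait at the heart of followLogs(): block until
// the log file has more data, the producer is gone, or the consumer is
// gone, and let the caller act on whichever happened first.
func waitRead(fileChanged, producerGone, consumerGone <-chan struct{}) error {
	select {
	case <-fileChanged:
		return nil // more data to read; continue the follow loop
	case <-producerGone:
		return errDrainToEOF // container stopped; drain to EOF, then stop
	case <-consumerGone:
		return errDone // nobody is reading; return so the defers run
	}
}
```

Under this shape, consumer departure unblocks the follow loop immediately, so the deferred cleanup at the top of followLogs() finally runs.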
This should test that:
- all the messages produced are delivered (i.e. not lost);
- followLogs() exits.

Loosely based on the test having the same name by Brian Goff, see https://gist.github.com/cpuguy83/e538793de18c762608358ee0eaddc197

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Force-pushed from 8c94b56 to f845d76.
Test timeouts increased; rebased.
LGTM
CI is green 👯‍♂️
LGTM 🐯
SGTM, and all green; let's merge 👍
When daemon.ContainerLogs() is called with options.follow=true (as in docker logs --follow), the loggerutils.followLogs() function never returns (even when the logs consumer is gone). As a result, all the resources associated with it (including an opened file descriptor for the log file being read, two FDs for a pipe, and two FDs for inotify watch) are never released.

If this is repeated (such as by running docker logs --follow and pressing Ctrl-C a few times), this results in DoS caused by either hitting the limit of inotify watches, or the limit of opened files. The only cure is daemon restart.

Apparently, what happens is:

1. The logs consumer is gone, properly calling (*LogWatcher).Close().

2. WatchClose() is properly handled by a dedicated goroutine in followLogs(), cancelling the context.

3. Upon receiving ctx.Done(), the code tries to synchronously send the log message to the logWatcher.Msg channel. Since the channel reader is gone, this operation blocks forever, and the resource cleanup set up in defer statements at the beginning of followLogs() never happens.

The fix is to remove the synchronous send.

This commit also removes the code after it, which tried to read and send the rest of the log file. The code in question first appeared in commit c0391bf ("Split reader interface from logger interface"), but it is still unclear why it is/was needed (it makes no sense to write logs once the consumer has signaled it is no longer interested in them).

There are a few alternative approaches to fix the issue, such as:

a. Amend the consumer to read all the data from the channel until it is closed. This could be done in, say, (*LogWatcher).Close(). For one thing, this looks ugly (why read and write useless data?). Also, this approach requires closing the channel from the sending side, which, given the number of exit paths, looks less elegant than the solution in this commit.

b. Always write to the channel asynchronously (i.e. in a non-blocking fashion, i.e. inside a select). This is actually what happens in the code after this commit.

Should fix #37391
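For reference, alternative (b) means wrapping the send in a select, so a vanished reader can never block the sender forever. A minimal sketch under assumed names (this is the pattern, not the exact moby code):

```go
package logger

// Message stands in for moby's logger.Message.
type Message struct{ Line []byte }

// send delivers msg to the consumer unless the watcher is already
// closed: whichever case of the select becomes ready first wins, so a
// closed watchClose channel unblocks a send to a reader that is gone.
func send(msgs chan<- *Message, watchClose <-chan struct{}, msg *Message) bool {
	select {
	case msgs <- msg:
		return true // consumer took the message
	case <-watchClose:
		return false // consumer is gone; drop the message
	}
}
```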
image from https://dubikvit.livejournal.com/409094.html