
Fix awslogs batch size calculation #35726

Merged
merged 1 commit on Dec 19, 2017

Conversation

jahkeup
Contributor

@jahkeup jahkeup commented Dec 7, 2017

Fixes #35725

- What I did

  • Added a type to encapsulate the batch and its associated size counter.
  • Cleaned up types and declarations for readability

- How I did it

The added type holds the batching mechanism and the counter used for determining whether or not to submit a batch of log events.
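
For illustration, the shape of such a type can be sketched roughly as follows (a minimal sketch, not the exact merged code; wrappedEvent is the existing type in this package):

```
// eventBatch encapsulates the pending events together with the running
// byte counter used to decide when the batch must be submitted.
type eventBatch struct {
	events []wrappedEvent
	bytes  int
}

func newEventBatch() *eventBatch {
	return &eventBatch{events: make([]wrappedEvent, 0)}
}

// count reports how many events are currently queued.
func (b *eventBatch) count() int {
	return len(b.events)
}

// size reports the accumulated payload size in bytes, including the
// per-event overhead added on each add.
func (b *eventBatch) size() int {
	return b.bytes
}
```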

- How to verify it

The change can be verified by using the steps from #35725 to reproduce the issue and then comparing the results against a build from this changeset. The appropriate test and test mock have been updated to validate this behavior.

- Description for the changelog
awslogs: fix batch size calculation for large logs

- A picture of a cute animal (not mandatory but encouraged)
https://www.reddit.com/r/aww/comments/7hzq4m

cc @samuelkarp @adnxn

@thaJeztah
Member

ping @anusha-ragunathan PTAL

@jahkeup
Contributor Author

jahkeup commented Dec 7, 2017

I'm investigating these errors now; I didn't see these same errors when running tests locally.

@jahkeup
Contributor Author

jahkeup commented Dec 7, 2017

There's a race condition between the mock and the tests; I'm going to continue narrowing it down and fix it.

@jahkeup
Contributor Author

jahkeup commented Dec 8, 2017

Looks like Jenkins failed to get the janky and powerpc tests up and running. Can we restart those tests?

@thaJeztah
Member

Restarted 👍

@thaJeztah
Member

@jahkeup if you need to make additional changes, can you remove the fixes #xxx from the commit message, and move it to the PR's description? GitHub can become noisy if such commits are cherry-picked or people fork the repo and do merges etc. (I know, I know) 😅

@jahkeup
Contributor Author

jahkeup commented Dec 8, 2017

@thaJeztah no problem! I've removed the 'fixes' reference and made a mental note to not include that in the future 👍

@anusha-ragunathan PTAL: I tested the latest set of changes locally with ~500 rounds to see if the test flaked out; this seems stable now and is passing tests.

func (b *eventBatch) Events() []wrappedEvent {
	// events in a batch must be sorted by timestamp
	// see http://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html
	sort.Sort(byTimestamp(b.events))
Member

Isn't this undoing 443f251 (#24814)? /cc @samuelkarp

Contributor Author

This respects that: the code that stable-sorts by timestamp and then by insertion order is implemented in the sort for this type (https://github.com/jahkeup/moby/blob/b0046c3e68579640b5b08a4cb8b754924fbc2a30/daemon/logger/awslogs/cloudwatchlogs.go#L631).

The insertion order is recorded as the wrappedEvent is crafted.
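
For reference, the tie-breaking comparison can be sketched roughly like this (assuming wrappedEvent carries the SDK event in inputLogEvent plus an insertOrder field, as in the code linked above):

```
type byTimestamp []wrappedEvent

func (slice byTimestamp) Len() int { return len(slice) }

// Less orders primarily by timestamp; when two timestamps are equal,
// the original insertion order is preserved as a tie-breaker.
func (slice byTimestamp) Less(i, j int) bool {
	iTimestamp, jTimestamp := int64(0), int64(0)
	if slice[i].inputLogEvent.Timestamp != nil {
		iTimestamp = *slice[i].inputLogEvent.Timestamp
	}
	if slice[j].inputLogEvent.Timestamp != nil {
		jTimestamp = *slice[j].inputLogEvent.Timestamp
	}
	if iTimestamp == jTimestamp {
		return slice[i].insertOrder < slice[j].insertOrder
	}
	return iTimestamp < jTimestamp
}

func (slice byTimestamp) Swap(i, j int) { slice[i], slice[j] = slice[j], slice[i] }
```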

Member

Ah, thanks! Sorry, it wasn't immediately clear 😊 Do you think we need to change byTimestamp to something else? Alternatively, we could move the description of the sorting to the GoDoc of the Events() method and describe that it sorts by timestamp but preserves insertOrder when timestamps are equal.

Contributor Author

I'll update the GoDoc for this method to clarify the action and results of retrieving the events from the batch.

The byTimestamp type could probably be better named for its usage; however, I can't think of a more apt name without stringing together many words. Do you have any suggestions for a different name? Or do you think the GoDoc statement could be made clear enough without a name change?

// necessary overhead for an event to be logged.
func (b *eventBatch) Add(event wrappedEvent, size int) {
	b.bytes = b.bytes + size + perEventBytes
	b.events = append(b.events, event)
Contributor

Looks thread unsafe.

Contributor

Same case with Count and Size.

Contributor Author

@jahkeup jahkeup Dec 8, 2017

Indeed. This type isn't intended for concurrent use and isn't accessed concurrently as it stands, hence the unsafe operations. Do you have suggestions to improve this given the current implementation of its consumer? My goal here was to keep the type small and simple for this use case, but I'm happy to add some mechanisms for safety.

Member

May need a mention in the function description; also, given that it's only used in this file, we should probably un-export these methods.

Contributor Author

Sounds good. This type is only used here, so I'll unexport the methods. I'll also include a statement about the thread-unsafe nature in each method's GoDoc line to make sure future developers understand this design assumption.

@@ -460,37 +411,30 @@ func (l *logStream) processEvent(events []wrappedEvent, unprocessedLine []byte,
	}
	line := unprocessedLine[:lineBytes]
	unprocessedLine = unprocessedLine[lineBytes:]
-	if (len(events) >= maximumLogEventsPerPut) || (bytes+lineBytes+perEventBytes > maximumBytesPerPut) {
+	if (batch.Count() >= maximumLogEventsPerPut) || (batch.Size()+lineBytes+perEventBytes > maximumBytesPerPut) {
Member

Just thinking out loud here; wondering if we could make batch.Add() reject the event if the Size or Count limit was reached, something like (dummy code);

err := batch.Add(event)

switch err {
case <maximum size reached>, <maximum logs reached>:
	l.publishBatch(batch)
case nil:
	unprocessedLine = unprocessedLine[lineBytes:]
default:
	// something bad happened :D
}

Contributor Author

That would certainly clean up some of these checked Add()s! I like the indication of the error and the codification of the limit check in the batch itself; let me whip something up.
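
Sketched out, an add that enforces the limits itself might look like the following (illustrative only; it assumes the package's perEventBytes, maximumLogEventsPerPut, and maximumBytesPerPut constants plus the standard errors package, and the sentinel names match the ones that appear later in this review):

```
var (
	errBatchCountTooLarge = errors.New("batch would exceed the maximum event count per call")
	errBatchSizeTooLarge  = errors.New("batch would exceed the maximum byte size per call")
)

// add appends the event only if the batch stays within the service
// limits, returning a sentinel error that names the limit that was hit.
func (b *eventBatch) add(event wrappedEvent, size int) error {
	addBytes := size + perEventBytes
	if len(b.events)+1 > maximumLogEventsPerPut {
		return errBatchCountTooLarge
	}
	if b.bytes+addBytes > maximumBytesPerPut {
		return errBatchSizeTooLarge
	}
	b.bytes += addBytes
	b.events = append(b.events, event)
	return nil
}
```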

-func (l *logStream) publishBatch(events []wrappedEvent) {
-	if len(events) == 0 {
+func (l *logStream) publishBatch(batch *eventBatch) {
+	if batch.Count() == 0 {
Contributor

What if batch.Count is 0, but batch.bytes is non-zero? Things could get out of whack. Probably better to Reset before return anyway.
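
A reset along those lines only needs to clear both fields in one place so that the count and the byte size can never disagree (a sketch):

```
// reset empties the batch for reuse, clearing the events and the byte
// counter together.
func (b *eventBatch) reset() {
	b.bytes = 0
	b.events = b.events[:0]
}
```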

@samuelkarp
Member

I'm not wild about how you've reorganized the code here. Previously, the code was intended to be ordered such that caller was above callee, or most-abstract to most-concrete. Now you have concrete implementation details (like the eventBatch type) above the parts of the code that are the core logic of the log driver (like the exported methods for logStream). Might be worth breaking things out into separate files at this point.

	}

	switch batch.add(event, lineBytes) {
	case errBatchCountTooLarge, errBatchSizeTooLarge:
Member

Both of these errors result in the exact same code being run. Why not just return bool so you can do

if ok := batch.add(event, lineBytes); ok {
	unprocessedLine = unprocessedLine[lineBytes:]
} else {
	l.publishBatch(batch)
}

Contributor Author

@jahkeup jahkeup Dec 12, 2017

There's certainly simplicity in the bool approach, but I think having the result codified and explicitly handled reads more clearly. I prefer to know that there's a reason behind not adding the event, not just whether or not it was added, and the overhead of doing so is negligible.

Member

You're not recording that anywhere, though. When I was reading the code, the different errors confused me because I didn't know how they were handled differently. Since they're in fact handled the same way and we never do anything with the reason, it seems like an unnecessary distinction.

Contributor Author

You're right; either way we'll be handling it and arriving at the same result. This is a YAGNI situation, so I'll drop this.
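
With the distinction dropped, the add seen later in this review reduces to a boolean result, roughly:

```
// add appends the event if it fits within both service limits and
// reports whether it was accepted.
func (b *eventBatch) add(event wrappedEvent, size int) bool {
	addBytes := size + perEventBytes
	if len(b.events)+1 > maximumLogEventsPerPut || b.bytes+addBytes > maximumBytesPerPut {
		return false
	}
	b.bytes += addBytes
	b.events = append(b.events, event)
	return true
}
```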

// Reset occurs regardless of error. The payload itself may be
// causing issues and would otherwise prevent more events from being
// sent.
batch.Reset()
Member

Should batch.Reset() happen here or in processEvent after the call to publishBatch? My vote would probably be for the latter since processEvent is what owns the eventBatch.

Contributor Author

This would make it clearer that we're reusing the object if we lift the call up as you suggest; note that the batch is owned by collectBatch, where additional call sites would exist. This change would mean that each publishBatch() must be followed by batch.Reset(), otherwise a duplicate batch of events will be put into CloudWatch. It's simpler to leave it in publishBatch, but that hides its side effect; I'm going to add the additional calls to batch.Reset() unless you think that doesn't make sense or you had something else in mind.

Member

A different model (though I haven't thought about it enough to know whether I like the idea) is to make a new method batch.Publish() that is responsible for both publishing and resetting the internal buffer.

Contributor Author

I had a similar thought, but I dismissed it on the grounds of keeping the submission of the events in the log driver and out of the eventBatch container type. I'd rather have the submission responsibility on the log driver for now.

@anusha-ragunathan
Contributor

anusha-ragunathan commented Dec 11, 2017

DockerSwarmSuite.TearDownTest fails to shut down the daemon.

```
FAIL: check_test.go:366: DockerSwarmSuite.TearDownTest
00:49:43 
00:49:43 check_test.go:371:
00:49:43     d.Stop(c)
00:49:43 daemon/daemon.go:395:
00:49:43     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:49:43 ... Error: Error while stopping the daemon d7f3fe1f6d643 : exit status 130
```


I notice this in the logs:


```
time="2017-12-09T00:49:36.542132074Z" level=error msg="agent failed to clean up assignments" error="context deadline exceeded"
time="2017-12-09T00:49:36.542203730Z" level=error msg="failed to shut down cluster node: context deadline exceeded"
goroutine 1 [running]:
github.com/docker/docker/pkg/signal.DumpStacks(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/pkg/signal/trap.go:83 +0xc5
github.com/docker/docker/daemon/cluster.(*Cluster).Cleanup(0xc420850000)
	/go/src/github.com/docker/docker/daemon/cluster/cluster.go:365 +0x1c2
main.(*DaemonCli).start(0xc4204c4c60, 0xc4204c6a10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:314 +0x1b3b
main.runDaemon(0xc4204c6a10, 0xc42033eb40, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:78 +0x76
main.newDaemonCommand.func1(0xc4204646c0, 0xc420135d00, 0x0, 0x10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x5b
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0xc4204646c0, 0xc420010130, 0x10, 0x11, 0xc4204646c0, 0xc420010130)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:646 +0x44d
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4204646c0, 0x1accbc0, 0xc420464701, 0xc42047ce80)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:742 +0x30e
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4204646c0, 0xc42047ce80, 0xc420000260)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:695 +0x2b
main.main()
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:105 +0xe1
```


There is a followup panic in `TestAPISwarmServicesUpdateStartFirst`, due to several instances of dockerd running.


```
00:49:43 PANIC: docker_api_swarm_service_test.go:201: DockerSwarmSuite.TestAPISwarmServicesUpdateStartFirst
00:49:43 
00:49:43 [d7f3fe1f6d643] waiting for daemon to start
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 Attempt #2: daemon is still running with pid 9963
00:49:43 Attempt #3: daemon is still running with pid 9963
00:49:43 Attempt #4: daemon is still running with pid 9963
00:49:43 [d7f3fe1f6d643] exiting daemon
00:49:43 ... Panic: Fixture has panicked (see related PANIC)
```

@anusha-ragunathan
Contributor

@jahkeup : Can you look into these test failures? Do they happen when you run integration tests locally?

@anusha-ragunathan
Contributor

@jahkeup : This patch has code rearrangement that's not necessary, as mentioned in #35726 (comment). Can you update the change so that only the code relevant to the fix is touched? We want to keep code changes minimal whenever possible. This also becomes important when considering backports/crossports to other releases.

@jahkeup
Contributor Author

jahkeup commented Dec 12, 2017

@anusha-ragunathan yes, I think that's prudent, and it's what I had in mind as well. I will work with @samuelkarp offline to clean up and make some other improvements to this package.

I'm still trying to repro the above failures in the integration tests. I do have some failures, but they're in different tests from the ones observed on Jenkins, and they seem to happen irregularly. I'll try rebasing to pick up any test fixes and see if it makes a difference.

@thaJeztah
Member

The TearDownTest itself sometimes fails, so this isn't necessarily related (I haven't checked the full output).

The previous bytes counter was moved out of scope and was not counting
the total number of bytes in the batch. This type encapsulates the
counter and the batch for batch-size consideration and code ergonomics.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
@jahkeup
Contributor Author

jahkeup commented Dec 15, 2017

@anusha-ragunathan PTAL, I've updated the code and the tests are now passing.

Member

@samuelkarp samuelkarp left a comment

LGTM

// Warning: this type is not threadsafe and must not be used
// concurrently. This type is expected to be consumed in a single go
// routine and never concurrently.
type eventBatch struct {
Member

Can you move this down to be right above the methods for the type? Those start on line 636.

@anusha-ragunathan
Contributor

LGTM

Member

@thaJeztah thaJeztah left a comment

LGTM

Left some suggestions for a follow-up, but no showstoppers.

	}

	added := batch.add(event, lineBytes)
	if added {
Member

Nit; this could be (no need to change)

if ok := batch.add(event, lineBytes); ok {

@@ -615,3 +620,70 @@ func unwrapEvents(events []wrappedEvent) []*cloudwatchlogs.InputLogEvent {
	}
	return cwEvents
}

func newEventBatch() *eventBatch {
Member

Perhaps in a future rewrite we could pass in maximumLogEventsPerPut and maximumBytesPerPut (they feel like a property of the eventBatch);

func newEventBatch(maxEvents uint, maxBytes uint) *eventBatch { 

}

func (b *eventBatch) isEmpty() bool {
	zeroEvents := b.count() == 0
Member

nit: these variables are a bit redundant;

return b.count() == 0 && b.size() == 0

}
-	events = append(events, wrappedEvent{
+	event := wrappedEvent{
Member

Wrapping and unwrapping could be done in batch.add() / batch.events(), as they're really related to how the batch stores the events (i.e. just for preserving the event order).
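
As a sketch of that follow-up, add could accept the raw SDK event and record the insertion order itself (assuming the wrappedEvent shape referenced earlier; the limit checks are elided):

```
// add wraps the raw event internally, assigning its insertion order so
// callers never construct a wrappedEvent themselves.
func (b *eventBatch) add(event *cloudwatchlogs.InputLogEvent, size int) bool {
	// ... same limit checks as before ...
	b.bytes += size + perEventBytes
	b.events = append(b.events, wrappedEvent{
		inputLogEvent: event,
		insertOrder:   len(b.events),
	})
	return true
}
```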

Successfully merging this pull request may close these issues.

awslogs batch size exceeds service limits