
Fix awslogs batch size calculation #35726

Merged
merged 1 commit on Dec 19, 2017

Conversation

jahkeup
Contributor

@jahkeup jahkeup commented Dec 7, 2017

Fixes #35725

- What I did

  • Added a type to encapsulate the batch and its associated size counter.
  • Cleaned up types and declarations for readability

- How I did it

The added type holds the batching mechanism and the counter used for determining whether or not to submit a batch of log events.
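
For illustration, the shape of such a type can be sketched roughly as follows (a minimal sketch, not the exact merged code; wrappedEvent is the existing type in this package):

```
// eventBatch encapsulates the pending events together with the running
// byte counter used to decide when the batch must be submitted.
type eventBatch struct {
	events []wrappedEvent
	bytes  int
}

func newEventBatch() *eventBatch {
	return &eventBatch{events: make([]wrappedEvent, 0)}
}

// count reports how many events are currently queued.
func (b *eventBatch) count() int {
	return len(b.events)
}

// size reports the accumulated payload size in bytes, including the
// per-event overhead added on each add.
func (b *eventBatch) size() int {
	return b.bytes
}
```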

- How to verify it

The change can be verified by using the steps from #35725 to reproduce the issue and then comparing the results against a build from this changeset. The appropriate test and test mock have been updated to validate this behavior.

- Description for the changelog
awslogs: fix batch size calculation for large logs

- A picture of a cute animal (not mandatory but encouraged)
https://www.reddit.com/r/aww/comments/7hzq4m

cc @samuelkarp @adnxn

@thaJeztah
Member

ping @anusha-ragunathan PTAL

@jahkeup
Contributor Author

jahkeup commented Dec 7, 2017

I'm investigating these errors now; I didn't see these same errors when running tests locally.

@jahkeup
Contributor Author

jahkeup commented Dec 7, 2017

There's a race condition between the mock and the tests; I'm going to continue narrowing it down and fix it.

@jahkeup
Contributor Author

jahkeup commented Dec 8, 2017

Looks like Jenkins failed to get the janky and powerpc tests up and running. Can we restart those tests?

@thaJeztah
Member

Restarted 👍

@thaJeztah
Member

@jahkeup if you need to make additional changes, can you remove the fixes #xxx from the commit message, and move it to the PR's description? GitHub can become noisy if such commits are cherry-picked or people fork the repo and do merges etc. (I know, I know) 😅

@jahkeup
Contributor Author

jahkeup commented Dec 8, 2017

@thaJeztah no problem! I've removed the 'fixes' reference and made a mental note to not include that in the future 👍

@anusha-ragunathan PTAL: I tested the latest set of changes locally with ~500 rounds to see if the test flaked out; this seems stable now and is passing tests.

func (b *eventBatch) Events() []wrappedEvent {
	// events in a batch must be sorted by timestamp
	// see http://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html
	sort.Sort(byTimestamp(b.events))
Member

Isn't this undoing 443f251 (#24814)? /cc @samuelkarp

Contributor Author

This respects that: the code that stable-sorts by timestamp and then by insertion order is implemented in the sort for this type (https://github.com/jahkeup/moby/blob/b0046c3e68579640b5b08a4cb8b754924fbc2a30/daemon/logger/awslogs/cloudwatchlogs.go#L631).

The insertion order is recorded as the wrappedEvent is crafted.
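
For reference, the tie-breaking comparison can be sketched roughly like this (assuming wrappedEvent carries the SDK event in inputLogEvent plus an insertOrder field, as in the code linked above):

```
type byTimestamp []wrappedEvent

func (slice byTimestamp) Len() int { return len(slice) }

// Less orders primarily by timestamp; when two timestamps are equal,
// the original insertion order is preserved as a tie-breaker.
func (slice byTimestamp) Less(i, j int) bool {
	iTimestamp, jTimestamp := int64(0), int64(0)
	if slice[i].inputLogEvent.Timestamp != nil {
		iTimestamp = *slice[i].inputLogEvent.Timestamp
	}
	if slice[j].inputLogEvent.Timestamp != nil {
		jTimestamp = *slice[j].inputLogEvent.Timestamp
	}
	if iTimestamp == jTimestamp {
		return slice[i].insertOrder < slice[j].insertOrder
	}
	return iTimestamp < jTimestamp
}

func (slice byTimestamp) Swap(i, j int) { slice[i], slice[j] = slice[j], slice[i] }
```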

Member

Ah, thanks! Sorry, it wasn't immediately clear 😊 Do you think we need to change byTimestamp to something else? Alternatively, we could move the description of the sorting to the GoDoc of the Events() method and describe that it sorts by timestamp but preserves insertOrder when timestamps are equal.

Contributor Author

I'll update the GoDoc for this method to clarify the action and results of retrieving the events from the batch.

The byTimestamp type could probably be better named for its usage; however, I can't think of a more apt name without stringing together many words. Do you have any suggestions for a different name? Or do you think the GoDoc statement could be made clear enough without a name change?

// necessary overhead for an event to be logged.
func (b *eventBatch) Add(event wrappedEvent, size int) {
	b.bytes = b.bytes + size + perEventBytes
	b.events = append(b.events, event)
Contributor

Looks thread unsafe.

Contributor

Same case with Count and Size.

Contributor Author

@jahkeup jahkeup Dec 8, 2017

Indeed. This type isn't intended for concurrent use and isn't accessed concurrently as it stands, hence the unsafe operations. Do you have suggestions to improve this given the current implementation of its consumer? My goal here was to keep the type small and simple for this use case, but I'm happy to add some mechanisms for safety.

Member

May need a mention in the function description; also, given that it's only used in this file, we should probably un-export these methods.

Contributor Author

Sounds good. This type is only used here, so I'll unexport the methods. I'll also include a statement about the thread-unsafe nature in each method's GoDoc line to make sure future developers understand this design assumption.

@@ -460,37 +411,30 @@ func (l *logStream) processEvent(events []wrappedEvent, unprocessedLine []byte,
	}
	line := unprocessedLine[:lineBytes]
	unprocessedLine = unprocessedLine[lineBytes:]
-	if (len(events) >= maximumLogEventsPerPut) || (bytes+lineBytes+perEventBytes > maximumBytesPerPut) {
+	if (batch.Count() >= maximumLogEventsPerPut) || (batch.Size()+lineBytes+perEventBytes > maximumBytesPerPut) {
Member

Just thinking out loud here; wondering if we could make batch.Add() reject the event if the Size or Count limit was reached, something like (dummy code);

err := batch.Add(event)

switch err {
case <maximum size reached>, <maximum logs reached>:
	l.publishBatch(batch)
case nil:
	unprocessedLine = unprocessedLine[lineBytes:]
default:
	// something bad happened :D
}

Contributor Author

That would certainly clean up some of these checked Add()s! I like the indication of the error and the codification of the limit check in the batch itself; let me whip something up.
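
Sketched out, an add that enforces the limits itself might look like the following (illustrative only; it assumes the package's perEventBytes, maximumLogEventsPerPut, and maximumBytesPerPut constants plus the standard errors package, and the sentinel names match the ones that appear later in this review):

```
var (
	errBatchCountTooLarge = errors.New("batch would exceed the maximum event count per call")
	errBatchSizeTooLarge  = errors.New("batch would exceed the maximum byte size per call")
)

// add appends the event only if the batch stays within the service
// limits, returning a sentinel error that names the limit that was hit.
func (b *eventBatch) add(event wrappedEvent, size int) error {
	addBytes := size + perEventBytes
	if len(b.events)+1 > maximumLogEventsPerPut {
		return errBatchCountTooLarge
	}
	if b.bytes+addBytes > maximumBytesPerPut {
		return errBatchSizeTooLarge
	}
	b.bytes += addBytes
	b.events = append(b.events, event)
	return nil
}
```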

-func (l *logStream) publishBatch(events []wrappedEvent) {
-	if len(events) == 0 {
+func (l *logStream) publishBatch(batch *eventBatch) {
+	if batch.Count() == 0 {
Contributor

What if batch.Count is 0, but batch.bytes is non-zero? Things could get out of whack. Probably better to Reset before return anyway.
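
A reset along those lines only needs to clear both fields in one place so that the count and the byte size can never disagree (a sketch):

```
// reset empties the batch for reuse, clearing the events and the byte
// counter together.
func (b *eventBatch) reset() {
	b.bytes = 0
	b.events = b.events[:0]
}
```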

@samuelkarp
Member

I'm not wild about how you've reorganized the code here. Previously, the code was intended to be ordered such that caller was above callee, or most-abstract to most-concrete. Now you have concrete implementation details (like the eventBatch type) above the parts of the code that are the core logic of the log driver (like the exported methods for logStream). Might be worth breaking things out into separate files at this point.

	}

	switch batch.add(event, lineBytes) {
	case errBatchCountTooLarge, errBatchSizeTooLarge:
Member

Both of these errors result in the exact same code being run. Why not just return bool so you can do

if ok := batch.add(event, lineBytes); ok {
	unprocessedLine = unprocessedLine[lineBytes:]
} else {
	l.publishBatch(batch)
}

Contributor Author

@jahkeup jahkeup Dec 12, 2017

There's certainly simplicity in the bool approach, but I think having the result codified and explicitly handled reads more clearly. I prefer to know that there's a reason behind not adding the event, not just whether or not it was added, and the overhead of doing so is negligible.

Member

You're not recording that anywhere, though. When I was reading the code, the different errors confused me because I didn't know how they were handled differently. Since they're in fact handled the same way and we never do anything with the reason, it seems like an unnecessary distinction.

Contributor Author

You're right; either way we'll be handling it and arriving at the same result. This is a YAGNI situation, so I'll drop this.
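
With the distinction dropped, the add seen later in this review reduces to a boolean result, roughly:

```
// add appends the event if it fits within both service limits and
// reports whether it was accepted.
func (b *eventBatch) add(event wrappedEvent, size int) bool {
	addBytes := size + perEventBytes
	if len(b.events)+1 > maximumLogEventsPerPut || b.bytes+addBytes > maximumBytesPerPut {
		return false
	}
	b.bytes += addBytes
	b.events = append(b.events, event)
	return true
}
```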

// Reset occurs regardless of error. The payload itself may be
// causing issues and would otherwise prevent more events from being
// sent.
batch.Reset()
Member

Should batch.Reset() happen here or in processEvent after the call to publishBatch? My vote would probably be for the latter since processEvent is what owns the eventBatch.

Contributor Author

This would make it clearer that we're reusing the object if we lift the call up as you suggest; note that the batch is owned by collectBatch, where additional call sites would exist. This change would mean that each publishBatch() must be followed by batch.Reset(), otherwise a duplicate batch of events will be put into CloudWatch. It's simpler to leave it in publishBatch, but that hides its side effect; I'm going to add the additional calls to batch.Reset() unless you think that doesn't make sense or you had something else in mind.

Member

A different model (though I haven't thought about it enough to know whether I like the idea) is to make a new method batch.Publish() that is responsible for both publishing and resetting the internal buffer.

Contributor Author

I had a similar thought, but I dismissed it on the grounds of keeping the submission of the events in the log driver and out of the eventBatch container type. I'd rather have the submission responsibility on the log driver for now.

@anusha-ragunathan
Contributor

anusha-ragunathan commented Dec 11, 2017

DockerSwarmSuite.TearDownTest fails to shut down the daemon.

```
FAIL: check_test.go:366: DockerSwarmSuite.TearDownTest
00:49:43 
00:49:43 check_test.go:371:
00:49:43     d.Stop(c)
00:49:43 daemon/daemon.go:395:
00:49:43     t.Fatalf("Error while stopping the daemon %s : %v", d.id, err)
00:49:43 ... Error: Error while stopping the daemon d7f3fe1f6d643 : exit status 130
```


I notice this in the logs:


```
time="2017-12-09T00:49:36.542132074Z" level=error msg="agent failed to clean up assignments" error="context deadline exceeded"
time="2017-12-09T00:49:36.542203730Z" level=error msg="failed to shut down cluster node: context deadline exceeded"
goroutine 1 [running]:
github.com/docker/docker/pkg/signal.DumpStacks(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/docker/docker/pkg/signal/trap.go:83 +0xc5
github.com/docker/docker/daemon/cluster.(*Cluster).Cleanup(0xc420850000)
	/go/src/github.com/docker/docker/daemon/cluster/cluster.go:365 +0x1c2
main.(*DaemonCli).start(0xc4204c4c60, 0xc4204c6a10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:314 +0x1b3b
main.runDaemon(0xc4204c6a10, 0xc42033eb40, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:78 +0x76
main.newDaemonCommand.func1(0xc4204646c0, 0xc420135d00, 0x0, 0x10, 0x0, 0x0)
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x5b
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).execute(0xc4204646c0, 0xc420010130, 0x10, 0x11, 0xc4204646c0, 0xc420010130)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:646 +0x44d
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4204646c0, 0x1accbc0, 0xc420464701, 0xc42047ce80)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:742 +0x30e
github.com/docker/docker/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4204646c0, 0xc42047ce80, 0xc420000260)
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:695 +0x2b
main.main()
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:105 +0xe1
```


There is a followup panic in `TestAPISwarmServicesUpdateStartFirst`, due to several instances of dockerd running.


```
00:49:43 PANIC: docker_api_swarm_service_test.go:201: DockerSwarmSuite.TestAPISwarmServicesUpdateStartFirst
00:49:43 
00:49:43 [d7f3fe1f6d643] waiting for daemon to start
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 
00:49:43 [d7f3fe1f6d643] daemon started
00:49:43 Attempt #2: daemon is still running with pid 9963
00:49:43 Attempt #3: daemon is still running with pid 9963
00:49:43 Attempt #4: daemon is still running with pid 9963
00:49:43 [d7f3fe1f6d643] exiting daemon
00:49:43 ... Panic: Fixture has panicked (see related PANIC)
```

@anusha-ragunathan
Contributor

@jahkeup : Can you look into these test failures? Do they happen when you run integration tests locally?

@anusha-ragunathan
Contributor

@jahkeup : This patch has code rearrangement that's not necessary, as mentioned in #35726 (comment). Can you update the change so that only the code relevant to the fix is touched? We want to keep code changes minimal whenever possible. This also becomes important when considering backports/crossports to other releases.

@jahkeup
Contributor Author

jahkeup commented Dec 12, 2017

@anusha-ragunathan yes, I think that's prudent, and it's what I had in mind as well. I will work with @samuelkarp offline to clean up and make some other improvements to this package.

I'm still trying to repro the above failures in the integration tests. I do have some failures, but they're in different tests from the ones observed on Jenkins, and they seem to happen irregularly. I'll try rebasing to pick up any test fixes and see if it makes a difference.

@thaJeztah
Member

The TearDownTest itself sometimes fails, so this isn't necessarily related (I haven't checked the full output).

The previous bytes counter was moved out of scope and was not counting
the total number of bytes in the batch. This type encapsulates the
counter and the batch for batch-size consideration and code ergonomics.

Signed-off-by: Jacob Vallejo <jakeev@amazon.com>
@jahkeup
Contributor Author

jahkeup commented Dec 15, 2017

@anusha-ragunathan PTAL, I've updated the code and the tests are now passing.

Member

@samuelkarp samuelkarp left a comment

LGTM

// Warning: this type is not threadsafe and must not be used
// concurrently. This type is expected to be consumed in a single go
// routine and never concurrently.
type eventBatch struct {
Member

Can you move this down to be right above the methods for the type? Those start on line 636.

@anusha-ragunathan
Contributor

LGTM

Member

@thaJeztah thaJeztah left a comment

LGTM

Left some suggestions for a follow-up, but no showstoppers.

	}

	added := batch.add(event, lineBytes)
	if added {
Member

Nit; this could be (no need to change)

if ok := batch.add(event, lineBytes); ok {

@@ -615,3 +620,70 @@ func unwrapEvents(events []wrappedEvent) []*cloudwatchlogs.InputLogEvent {
	}
	return cwEvents
}

func newEventBatch() *eventBatch {
Member

Perhaps in a future rewrite we could pass in maximumLogEventsPerPut and maximumBytesPerPut (they feel like a property of the eventBatch);

func newEventBatch(maxEvents uint, maxBytes uint) *eventBatch { 

}

func (b *eventBatch) isEmpty() bool {
	zeroEvents := b.count() == 0
Member

nit: these variables are a bit redundant;

return b.count() == 0 && b.size() == 0

}
-	events = append(events, wrappedEvent{
+	event := wrappedEvent{
Member

Wrapping and unwrapping could be done in batch.add() / batch.events(), as they're really related to how the batch stores the events (i.e. just for preserving the event order).
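
As a sketch of that follow-up, add could accept the raw SDK event and record the insertion order itself (assuming the wrappedEvent shape referenced earlier; the limit checks are elided):

```
// add wraps the raw event internally, assigning its insertion order so
// callers never construct a wrappedEvent themselves.
func (b *eventBatch) add(event *cloudwatchlogs.InputLogEvent, size int) bool {
	// ... same limit checks as before ...
	b.bytes += size + perEventBytes
	b.events = append(b.events, wrappedEvent{
		inputLogEvent: event,
		insertOrder:   len(b.events),
	})
	return true
}
```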

Successfully merging this pull request may close these issues.

awslogs batch size exceeds service limits