
Fix discarded logs on chunk upload failure #2724

Conversation

anderseknert
Member

If the size of the buffered decision logs exceeds upload_size_limit_bytes, the upload will be split into chunks. If one of the attempted uploads returns with an error, that chunk is stored in the "new" buffer and will be re-attempted at the next invocation of the oneShot method. However, once that is done the function returns, and any decisions still left in the buffer are discarded.

I've reproduced this by queuing up more decisions in OPA than fit within upload_size_limit_bytes, with an external mock server on the receiving end that randomly responds with non-OK status codes. Not having written much Go (yet!), I could however use some guidance in porting this reproduction to the OPA test suite.

As for the fix itself: an alternative (and perhaps more sensible?) approach would be to stop uploading at the first error and instead push all remaining chunks from the old buffer into the new one, preserving them for the next invocation.

Signed-off-by: Anders Eknert anders.eknert@bisnode.com
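
For illustration, here is a minimal, self-contained sketch of the "stop at the first error and preserve the rest" behaviour described above. The names (uploadChunks, the upload callback) and the chunk representation are made up for the example and are not OPA's actual oneShot code:

package main

import (
	"errors"
	"fmt"
)

// uploadChunks tries to upload each chunk in order. On the first failure it
// stops and returns the failed chunk together with everything after it, so
// the caller can keep those chunks buffered for the next attempt.
func uploadChunks(chunks [][]byte, upload func([]byte) error) (remaining [][]byte, err error) {
	for i, chunk := range chunks {
		if err := upload(chunk); err != nil {
			return chunks[i:], err
		}
	}
	return nil, nil
}

func main() {
	calls := 0
	// A flaky "server" that fails on the second chunk.
	flaky := func(chunk []byte) error {
		calls++
		if calls == 2 {
			return errors.New("non-OK status code")
		}
		return nil
	}

	chunks := [][]byte{[]byte("chunk-1"), []byte("chunk-2"), []byte("chunk-3")}
	remaining, err := uploadChunks(chunks, flaky)

	// The failed chunk and the never-attempted one both survive for the
	// next attempt instead of being discarded.
	fmt.Printf("remaining=%d err=%v\n", len(remaining), err)
}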

@anderseknert
Member Author

@patrick-east I think I have a basic idea on how to test this. Something like:

  1. Start the fixture server configured with upload_size_limit_bytes set to a value low enough that only one decision log entry is sent at a time. The server config is currently hard coded, so this would probably need to be changed to take the config as a parameter or something, as we'd only want that config for this test.
  2. Not sure how to get the size of the buffer here, as the Len() method on fixture.plugin.buffer always seems to return 0, even right after the Log method has been called?
  3. So anyway, knowing that only one message will be uploaded at a time, simply log two messages and then call oneShot with the server configured to return an error. This should have the failing message requeued and the message left in the buffer discarded, or rather, with the fix in place, not discarded. Again this should just be a matter of checking the length of the buffer, if that's possible.

I'll try my best to get a basic test following the above steps done this weekend - if you have any pointers on that strategy that's much appreciated.

@patrick-east
Contributor

The fix looks good to me!

For the test, that sounds like a good plan. I'm not sure why checking the buffer length isn't working; a quick look through the code makes it seem like it should work as expected. Was the buffer maybe flushed between calling Log() and checking Len()?

@anderseknert
Member Author

anderseknert commented Sep 25, 2020

I'm not sure why checking the buffer length isn't working; a quick look through the code makes it seem like it should work as expected. Was the buffer maybe flushed between calling Log() and checking Len()?

Not sure either. This always fails for me, as Len() comes back with 0. Does it work for you, or can you spot a rookie mistake somewhere? :)

func TestPluginRequeSomeMore(t *testing.T) {
	ctx := context.Background()

	fixture := newTestFixture(t)
	defer fixture.server.stop()

	fixture.server.ch = make(chan []EventV1, 1)

	var input interface{} = map[string]interface{}{"method": "GET"}
	var result1 interface{} = false

	fixture.plugin.Log(ctx, &server.Info{
		DecisionID: "abc",
		Path:       "data.foo.bar",
		Input:      &input,
		Results:    &result1,
		RemoteAddr: "test",
		Timestamp:  time.Now().UTC(),
	})

	bufLen := fixture.plugin.buffer.Len()
	if bufLen != 1 {
		t.Fatalf("expected 1 log in the buffer, found %v", bufLen)
	}
}

@anderseknert anderseknert force-pushed the Fix-discarded-logs-on-chunk-upload-fail branch from 2ed61e2 to 4d234ca on September 26, 2020 22:05
@anderseknert
Member Author

@patrick-east alright, not sure why the Len() method returned 0 before, but it seems to be working now. I have added a test and it passes when running make on my MacBook, but on the CI server it fails due to "Log encoding failed: upload chunk size too small.". Admittedly the test currently feels somewhat flaky with upload_size_limit_bytes set to a fixed value (222 bytes), a value I picked by inspecting the size of the log message in my debugger. I'm guessing this might differ between platforms depending on the gzip encoder, and that's what's going on here? Or maybe something else entirely 😄 I'm hoping to have some more time tomorrow to dig into this.
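
As an aside, and only a guess rather than a verified diagnosis of the CI failure above: Go marshals time.Time to JSON as RFC 3339 with nanoseconds and trims trailing zeros, so the encoded timestamp (and with it the serialized event) is not a fixed length from one run to the next. A small standalone example, unrelated to OPA's encoder:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

func main() {
	// Two timestamps that differ only in how many trailing zeros their
	// nanosecond component has; Go trims those zeros when marshaling.
	timestamps := []time.Time{
		time.Date(2020, 9, 26, 12, 0, 0, 123456789, time.UTC),
		time.Date(2020, 9, 26, 12, 0, 0, 123000000, time.UTC),
	}
	for _, ts := range timestamps {
		b, err := json.Marshal(ts)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%d bytes: %s\n", len(b), b)
	}
}

If that is part of what is going on, giving the limit some headroom (as done below) avoids pinning the test to one exact event size.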

@anderseknert anderseknert force-pushed the Fix-discarded-logs-on-chunk-upload-fail branch from 1ace20f to 6a81f4a on September 27, 2020 07:21
@anderseknert
Member Author

Oh, the size reported there was even before gzip encoding, so how that would differ between platforms is even more mysterious. I just upped the upload_size_limit_bytes to 300 to get some margin and now the test passes on CI as well. All good then :)

@anderseknert anderseknert changed the title from "DRAFT: Fix discarded logs on chunk upload failure" to "Fix discarded logs on chunk upload failure" on Sep 28, 2020
@patrick-east patrick-east left a comment
Contributor

Looks good! Just one comment on the fixture config.

@@ -791,7 +840,7 @@ type testFixture struct {
 	server *testServer
 }
 
-func newTestFixture(t *testing.T) testFixture {
+func newTestFixture(t *testing.T, pluginConfigRaw string) testFixture {
patrick-east
Contributor

nit: we might want to change this option to be more of an overlay or additional config. Making the callers provide {"service": "example"} is kind of a hassle, especially when that service is being defined automatically in this helper, and any change to it may require the callers to update.

A pattern we use a bunch in OPA is to allow for optionally passing in config functions like:

type testPluginCustomizer func(c *Config)

and then to customize the config for a test have something like:

func newTestFixture(t *testing.T, options ...testPluginCustomizer) testFixture {
	...
	config, _ := ParseConfig([]byte(pluginConfig), manager.Services(), nil)
	for _, option := range options {
		option(config)
	}
	...
}

called with something like:

	fixture := newTestFixture(t, func(c *Config) {
		c.Reporting.UploadSizeLimitBytes = 300
	})

All the other callers can skip providing anything, so they stay as newTestFixture(t).

The only potential downside of that approach is that it skips the validateAndInjectDefaults step, but I'm guessing that's OK for unit tests where the test author is setting very explicit options.
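
For anyone less familiar with the pattern, a minimal, self-contained illustration of functional options is below. The Config fields and defaults are invented for the example and are not OPA's actual Config type or fixture code:

package main

import "fmt"

// Config stands in for the plugin configuration; the fields here are made up
// for illustration only.
type Config struct {
	Service              string
	UploadSizeLimitBytes int64
}

// Option mirrors the testPluginCustomizer idea above: a function that mutates
// the parsed config before the fixture is built.
type Option func(c *Config)

func newConfig(options ...Option) *Config {
	// Defaults that every caller gets unless an option overrides them.
	c := &Config{Service: "example", UploadSizeLimitBytes: 32768}
	for _, option := range options {
		option(c)
	}
	return c
}

func main() {
	// Most callers pass no options and keep the defaults.
	fmt.Printf("%+v\n", *newConfig())

	// A test that needs a tiny chunk size overrides just that one field.
	fmt.Printf("%+v\n", *newConfig(func(c *Config) {
		c.UploadSizeLimitBytes = 300
	}))
}

The appealing property, as noted above, is that existing callers stay as they are while an individual test overrides only the fields it cares about.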

anderseknert
Member Author

Oh, nice! I was reaching for something like optional/default parameters for exactly this but came up blank. I've never done much Go except for some small fixes here and there, so learning patterns like this is awesome. I'll update the PR with that once I get a few moments to spare sometime this week. Thanks @patrick-east 👍

@anderseknert
Member Author

Pushed a new commit addressing the optional config as suggested by @patrick-east. Also modified the chunk upload failure handling to stop the upload attempts at the first failure and just preserve whatever is in the buffer. We'll instead let the caller decide whether a retry should be done immediately or whether it can wait until the next configured interval has passed. Thanks for your input on that @srenatus!

patrick-east
patrick-east previously approved these changes Sep 30, 2020
@patrick-east
Contributor

Changes LGTM, go ahead and squash the commits and then we should be ready to merge 👍

If the size of the buffered decision logs exceeds `upload_size_limit_bytes`, the upload will be split into chunks. If one of the attempted uploads returns with an error, that chunk is stored in the "new" buffer and will be re-attempted at the next invocation of the `oneShot` method. However, once that is done the function returns, and any decisions still left in the buffer are discarded. This PR fixes that, preserving all failed chunk uploads in the buffer.

Signed-off-by: Anders Eknert <anders.eknert@bisnode.com>
@anderseknert anderseknert force-pushed the Fix-discarded-logs-on-chunk-upload-fail branch from 94bed31 to ad380b4 on September 30, 2020 22:00
@patrick-east patrick-east merged commit 777b31d into open-policy-agent:master Oct 1, 2020