
Fix max number of samples in experiments on non-64-bit systems. #528

Merged
merged 1 commit into onsi:master on Mar 25, 2022

Conversation

aznashwan
Contributor

The use of math.MaxInt64 as the maximum number of samples in the context
of general int variables can lead to inconsistent behavior on non-64-bit
systems.

This patch addresses this by using the arbitrary-precision math/big.Int
type to store and compare sample counts, ensuring that no
architecture-dependent int size mismatches can occur.

Signed-off-by: Nashwan Azhari nazhari@cloudbasesolutions.com

@aznashwan
Contributor Author

Additional context: I ran into this issue after opening this benchmarking-related PR on kubernetes-sigs/cri-tools, which has an automated GitHub Action that builds that project's binaries; that build failed to compile.

Weirdly enough, this is not an issue on any other non-GitHub system I tested it on (namely x64-based Linux, Windows, and macOS), so I suspect it's related either to the images/hosts used for the GitHub runners, or to the way the actions/setup-go GitHub Action configures the Go installation.

@aznashwan
Contributor Author

@onsi could I please get a quick look-over? (the code itself is trivial)

@onsi
Owner

onsi commented Mar 24, 2022

hey @aznashwan sorry for the delay. On the whole this looks fine; though I'm a bit surprised to see that this is an issue - are you actually intending to run math.MaxInt64 samples? I'd like to learn what the inconsistency you're seeing is.

Also - it's cool to see gmeasure being used in the wild; thanks for linking to your PR. I'm seeing that you've rolled your own data point aggregation and reporting - which is something gmeasure supports out of the box, so I'd love to learn more about what might be missing from gmeasure for your use case.

@aznashwan
Contributor Author

@onsi thank you so much for taking your time to check the linked PR as well!

I'd like to learn what the inconsistency you're seeing is.

The root cause of the issue is that gmeasure.SamplingConfig.N is defined as a plain int, whose bit size is platform-dependent.

Presuming we're on a 32-bit system, maxN, which is directly compared with SamplingConfig.N, will also have its type inferred as a plain 32-bit int, in which case the assignment maxN := math.MaxInt64 overflows at compile time and produces the compilation error we're seeing in the cri-tools PR.

As a result, we need to ensure any comparison between SamplingConfig.N (int) and maxN := math.MaxInt64 (int64) happens correctly, by one of:

  • changing SamplingConfig.N to a fixed-size type like int64 (which I believe would be an unreasonable constraint to impose)
  • casting SamplingConfig.N to int64 before any comparison to maxN (which would mean truncating it to 64 bits on a hypothetical platform where int is wider than 64 bits)
  • using an arbitrary-precision numeric type (I went with math/big.Int, but other suggestions are welcome)

Side-note: although this issue should never manifest on any >=64-bit CPU, it does, as mentioned, consistently happen on the GitHub-hosted job runners (which I've double-checked are advertised as 64-bit OSes on x64 CPUs), so I suspect something special is going on with the base images used for those runners or the way Go is configured on them.

I'm seeing that you've rolled your own data point aggregation and reporting - which is something gmeasure supports out of the box, so I'd love to learn more about what might be missing from gmeasure for your use case.

Seriously, thanks again for taking the time to parse the linked PR too!

TL;DR: we plan to also record system-related metrics (e.g. CPU/RAM/IO) associated with each individual measurement, and we'd like to do it in a manner that does not interfere with the tests at all, by having the system-related queries happen outside of the test itself (hence all the sending-results-to-a-channel-on-a-different-goroutine shenanigans).
These are things well outside the scope of a BDD testing framework, plus we'll need it to be cross-platform (Windows included), so please don't worry about it.

Full context:

We're focusing on benchmarking lifecycle operation durations against CRI-compatible container runtimes, which basically means the CRUD operations for a given resource. (just containers and pods for now)
Our goal is to run lots of lifecycles in parallel over a prolonged period of time to spot any degradation in performance, while keeping track of the CRUD operations individually and of which lifecycle each belonged to.

I had actually "inherited" the code which eventually became that PR from a colleague of mine who was initially using gmeasure.Stopwatch with a simple stopwatch.Record("C/R/U/D") call for each stage.

The problem was that if we ran multiple tests in parallel, we would NOT have any grouping of the operation measurements: the 3rd StartContainer duration in experiment.Get("StartContainer").Durations would not necessarily correspond to the 3rd value in experiment.Get("StatusContainer").Durations, because individual C/R/U/D operation lengths can vary even between container lifecycles started at the same time.

I realize there were workarounds we could have done to still use gmeasure.Stopwatch and have the samples grouped somehow, but given the added requirement of querying system resources for each measurement I decided to try to do our own thing (still a work in progress for now).

The only real suggestion I could give is that I believe it would be useful to have some sort of namespacing of gmeasure.Stopwatch within individual iterations of an experiment, which would prevent results from iteration N from intermingling with results from iteration M. I honestly don't know how that might look implemented in the library, though, and it's probably too specific to our use case to be worth your time anyway.

Many thanks again for everything!

@onsi
Owner

onsi commented Mar 25, 2022

The root cause of the issue is that gmeasure.SamplingConfig.N is defined as a plain int, whose bit size is platform-dependant.

Ah got it - thanks for clarifying. I think I was moving too quickly and didn't appreciate that this is, at its core, a compilation issue. We can go with the big.Int approach in your PR and I'm happy to pull it in, but since my intent was simply to say "set this to a big number", I think we could also use math.MaxInt32 - I believe that will avoid the compiler issue and be a simpler fix to the underlying problem without materially affecting the behavior of the library (I don't imagine people running billions of samples). WDYT?
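Roughly, that simpler fix amounts to something like this (a simplified sketch in terms of a paraphrased SamplingConfig — not gmeasure's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// SamplingConfig mirrors the relevant field of gmeasure.SamplingConfig
// for illustration only.
type SamplingConfig struct {
	N int // maximum number of samples; 0 means "no explicit limit"
}

// maxSamples returns the effective sample cap. math.MaxInt32 fits in a
// plain int on both 32- and 64-bit platforms, whereas math.MaxInt64
// fails to compile when int is 32 bits wide.
func maxSamples(c SamplingConfig) int {
	if c.N > 0 {
		return c.N
	}
	return math.MaxInt32
}

func main() {
	fmt.Println(maxSamples(SamplingConfig{}))      // 2147483647
	fmt.Println(maxSamples(SamplingConfig{N: 10})) // 10
}
```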


It's super cool to see your use case - thanks for sharing it.

The only real suggestion I could give is that I believe it would be useful to have some sort of namespacing of gmeasure.Stopwatch within individual iterations of an experiment, which would prevent results from iteration N from intermingling with results from iteration M. I honestly don't know how that might look implemented in the library, though, and it's probably too specific to our use case to be worth your time anyway.

I think that usecase is covered with the Annotation decorator but I may have misunderstood.
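For example, something along these lines (a fragment with illustrative names, to be used inside a test body):

```go
experiment := gmeasure.NewExperiment("container lifecycles")
stopwatch := experiment.NewStopwatch()

// Tag each recorded duration with the lifecycle it belongs to; the
// annotation travels with the individual measurement, so values from
// different lifecycles can be told apart after the fact.
stopwatch.Record("StartContainer", gmeasure.Annotation("container0"))
stopwatch.Reset()
stopwatch.Record("StatusContainer", gmeasure.Annotation("container0"))
```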

In any event, I don't want to block you. Let me know what you think about the MaxInt32 idea and we can finish this up and merge it in :)

The use of `math.MaxInt64` as the maximum number of samples in the context
of general int variables can lead to inconsistent behavior on non-64-bit
systems.

This patch addresses this by using `math.MaxInt32` as the default sample
size in case an explicit one is not provided in the `SamplingConfig`.

Signed-off-by: Nashwan Azhari <nazhari@cloudbasesolutions.com>
@aznashwan
Contributor Author

@onsi thanks a lot for the follow-up!

I think we could also use math.MaxInt32 - I believe that will avoid the compiler issue and be a simpler fix to the underlying problem without materially affecting the behavior of the library (I don't imagine people running billions of samples).

You're absolutely right - I've just reverted the PR to simply use math.MaxInt32, and can confirm this fixes our compilation issue on the GitHub runners (ignore the failed /vendor lint check; the release workflows are the ones actually compiling).

I think that usecase is covered with the Annotation decorator but I may have misunderstood.

Great call - using those annotations to "tag" values with the lifecycle they represent (e.g. "container0/1/2") would have achieved exactly what I was looking for.
As mentioned, though, the added requirements on our end would probably have led to a custom solution anyway; thanks for pointing it out, it'll surely come in handy in the future!

@onsi
Owner

onsi commented Mar 25, 2022

sweet! thanks for your patience. will pull this in and cut a new release soon

@onsi onsi merged commit 1c84497 into onsi:master Mar 25, 2022
@aznashwan
Contributor Author

Thank you so much as well for all your interest in how the library is used, you really went above and beyond!

Hope to be contributing back something more relevant to the project some day soon!

@onsi
Owner

onsi commented Mar 25, 2022

❤️thanks @aznashwan !
