
CRubyTraceBuffer can exceed its max size and that leads to flaky tests #1704

Closed
ivoanjo opened this issue Sep 29, 2021 · 4 comments · Fixed by #1715
ivoanjo commented Sep 29, 2021

In #1172 when we added CRubyTraceBuffer we documented that

# Under extreme concurrency scenarios, this class can exceed
# its +max_size+ by up to 4%.

I've found that this class can exceed that 4% margin; the problem surfaced as a flaky test in CI.

On closer look, the issue with the current implementation is that, while it's safe to operate on an array concurrently on MRI, the decision to add or replace an item in the buffer is not atomic with the operation itself.

This means that this simple change:

    def full?
      decision = @max_size > 0 && @items.length >= @max_size
      sleep 0.01
      decision
    end

easily triggers test failures and pushes the buffer over the expected 4%:

    #push
      does not have collisions
      with items exceeding maximum size
        does not exceed expected maximum size (FAILED - 1)

    Failures:

      1) Datadog::CRubyTraceBuffer behaves like thread-safe buffer #push with items exceeding maximum size does not exceed expected maximum size
         Failure/Error: expect(output).to have_at_most(max_size * max_size_leniency).items
           expected at most 104.0 items, got 199
         Shared Example Group: "thread-safe buffer" called from ./spec/ddtrace/buffer_spec.rb:756
         # ./spec/ddtrace/buffer_spec.rb:364:in `block (4 levels) in <top (required)>'
         # ./spec/spec_helper.rb:205:in `block (2 levels) in <top (required)>'
         # ./spec/spec_helper.rb:97:in `block (2 levels) in <top (required)>'
         # /Users/ivo.anjo/.rvm/gems/ruby-2.7.4/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'

    Finished in 12.55 seconds (files took 0.83056 seconds to load)
    5 examples, 1 failure

    Failed examples:

    rspec ./spec/ddtrace/buffer_spec.rb[6:2:1:2:1] # Datadog::CRubyTraceBuffer behaves like thread-safe buffer #push with items exceeding maximum size does not exceed expected maximum size
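The failure can also be reproduced outside the test suite with a minimal check-then-act sketch. `ToyBuffer` below is hypothetical and only mirrors the full?-then-append pattern, not the real `CRubyTraceBuffer`; the `sleep` widens the race window the same way the patched `full?` above does:

```ruby
# Minimal sketch of the check-then-act race: the full? decision and the
# append are two separate steps, so threads interleaving between them
# can push the buffer past max_size.
class ToyBuffer
  attr_reader :items

  def initialize(max_size)
    @max_size = max_size
    @items = []
  end

  def full?
    decision = @items.length >= @max_size
    sleep 0.001 # widen the window between decision and action
    decision
  end

  def push(item)
    # Another thread can append between the full? check and the <<
    @items << item unless full?
  end
end

buffer = ToyBuffer.new(10)
threads = 4.times.map { Thread.new { 10.times { |i| buffer.push(i) } } }
threads.each(&:join)

puts buffer.items.length # likely > 10: several threads saw full? == false
```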

In terms of correctness, I think we could fix this by executing a slice! operation after we add an item, to ensure that the array did not grow past max_size, something along the lines of...

    [4] pry(main)> arr = (1..10).to_a
    => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    [5] pry(main)> arr.slice!(8..)
    => [9, 10]
    [6] pry(main)> arr
    => [1, 2, 3, 4, 5, 6, 7, 8]

...but I decided to open this issue instead of a PR because we may want to take a different approach.
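For reference, a rough sketch of the always-append-then-trim idea (`TrimmingBuffer` is hypothetical, not the actual ddtrace code):

```ruby
# Hypothetical sketch: always append, then trim any overshoot with
# slice!, so the buffer can never stay above max_size.
class TrimmingBuffer
  def initialize(max_size)
    @max_size = max_size
    @items = []
  end

  def push(item)
    @items << item
    # Trim anything past max_size; a no-op when the buffer is not oversized.
    @items.slice!(@max_size..)
  end

  def length
    @items.length
  end
end

buffer = TrimmingBuffer.new(8)
20.times { |i| buffer.push(i) }
puts buffer.length # => 8
```

Note that trimming from the tail like this always evicts the most recently pushed items.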

marcotc commented Sep 29, 2021

CRubyTraceBuffer was originally intended to accept lax constraint requirements for performance gains.

I think the main issue here is that there's a way for the internal array to grow, but never to shrink, meaning over time it can slowly grow (if there are consistently more than 1000 traces/sec).

Adding slice! or any other operation to Buffer#add! will measurably affect performance, as this is the most common path taken and currently only performs a @items << item operation.

I suggest we add a way to shrink the array during Buffer#replace!, in case it's over sized. This means the shrink operation will only trigger when full? returns true. What do you think?
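A sketch of what shrinking on the replace! path could look like (method and variable names are illustrative, not the actual ddtrace implementation): the hot add! path stays a bare append, and overshoot is corrected only when the buffer is already full.

```ruby
# Hypothetical sketch: shrink only on the replace! path, dropping random
# elements (not the tail) so fair sampling is preserved for any overshoot
# that concurrent add! calls accumulated.
def replace!(items, max_size, new_item)
  # Drop random elements until we are back at max_size.
  items.delete_at(rand(items.length)) while items.length > max_size
  # Standard replace behaviour: overwrite a random slot with the new item.
  items[rand(items.length)] = new_item
end

items = (1..15).to_a # simulate a buffer that overshot max_size = 10
replace!(items, 10, :new_trace)
puts items.length # => 10
```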

ivoanjo commented Oct 4, 2021

I suggest we add a way to shrink the array during Buffer#replace!, in case it's over sized. This means the shrink operation will only trigger when full? returns true. What do you think?

I guess that would work, but I'm not entirely convinced that a slice! on the fast path is that expensive, especially given the common case is "nothing needs to be done".

E.g. a simple benchmark on my machine shows:

    require 'bundler/inline'

    gemfile do
      source 'https://rubygems.org'
      gem 'benchmark-ips'
    end

    require 'benchmark/ips'

    THE_ARRAY = [1, 2, 3, 4, 5]

    Benchmark.ips do |x|
      x.report(RUBY_DESCRIPTION) { THE_ARRAY.slice!(10...) }
      x.compare!
    end

    ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin19]
                              9.086M (± 2.9%) i/s -     46.032M in   5.070565s

On the "slow path" (e.g. slice does need to happen) it would probably be a bit slower BUT if the code correctly had gone for a replace rather than a add, the slice would happen anyway so I claim that accepting the same trade-off seems reasonable.

marcotc commented Oct 4, 2021

I was thinking about this, and slice! cannot be done on Buffer#add!, because slice! will always remove elements from one side of the array (I'm thinking of @items.slice!(0, @max_size)).

Because we concatenate at the end of the internal buffer, this means we are always dropping the most recent elements on slice!, which impacts fair sampling.

There are other options (do not slice at the end of buffer, or insert new elements not at the end of buffer), but they make the operations much more costly on a non-full buffer.

I'm thinking that we need to address this on replace!, to ensure we are handling the fair sampling correctly.

ivoanjo commented Oct 5, 2021

Good point, indeed slice! would drop the most recent events, rather than dropping at random.
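To make the fair-sampling concern concrete, a tiny illustration (plain Ruby, not ddtrace code) of how trimming overshoot from the tail always evicts the same items:

```ruby
# Trimming overshoot from the tail always evicts the newest items,
# so recent traces would be dropped deterministically, never at random.
max_size = 10
items = (1..max_size).to_a
items.concat([11, 12, 13]) # simulate overshoot from concurrent appends
items.slice!(max_size..)   # trim back to max_size
puts items.inspect # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -- 11..13 always lost
```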
