
CRubyTraceBuffer can exceed its max size and that leads to flaky tests #1704

Closed
ivoanjo opened this issue Sep 29, 2021 · 4 comments · Fixed by #1715
ivoanjo commented Sep 29, 2021

In #1172 when we added CRubyTraceBuffer we documented that

# Under extreme concurrency scenarios, this class can exceed
# its +max_size+ by up to 4%.

I've found that this class can exceed that 4% margin; the problem surfaced as a flaky test in CI.

On closer look, the issue with the current implementation is that, while it's safe to operate on an array concurrently on MRI, the decision to add or replace an item in the buffer is not atomic with the operation itself.

This means that this simple change:

    def full?
      decision = @max_size > 0 && @items.length >= @max_size
      sleep 0.01
      decision
    end

easily triggers test failures and pushes the buffer over the expected 4%:

    #push
      does not have collisions
      with items exceeding maximum size
        does not exceed expected maximum size (FAILED - 1)

    Failures:

      1) Datadog::CRubyTraceBuffer behaves like thread-safe buffer #push with items exceeding maximum size does not exceed expected maximum size
         Failure/Error: expect(output).to have_at_most(max_size * max_size_leniency).items
           expected at most 104.0 items, got 199
         Shared Example Group: "thread-safe buffer" called from ./spec/ddtrace/buffer_spec.rb:756
         # ./spec/ddtrace/buffer_spec.rb:364:in `block (4 levels) in <top (required)>'
         # ./spec/spec_helper.rb:205:in `block (2 levels) in <top (required)>'
         # ./spec/spec_helper.rb:97:in `block (2 levels) in <top (required)>'
         # /Users/ivo.anjo/.rvm/gems/ruby-2.7.4/gems/webmock-3.13.0/lib/webmock/rspec.rb:37:in `block (2 levels) in <top (required)>'

    Finished in 12.55 seconds (files took 0.83056 seconds to load)
    5 examples, 1 failure

    Failed examples:

    rspec ./spec/ddtrace/buffer_spec.rb[6:2:1:2:1] # Datadog::CRubyTraceBuffer behaves like thread-safe buffer #push with items exceeding maximum size does not exceed expected maximum size
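The failure can also be reproduced outside the test suite with a minimal check-then-act sketch. `ToyBuffer` below is hypothetical and only mirrors the full?-then-append pattern, not the real `CRubyTraceBuffer`; the `sleep` widens the race window the same way the patched `full?` above does:

```ruby
# Minimal sketch of the check-then-act race: the full? decision and the
# append are two separate steps, so threads interleaving between them
# can push the buffer past max_size.
class ToyBuffer
  attr_reader :items

  def initialize(max_size)
    @max_size = max_size
    @items = []
  end

  def full?
    decision = @items.length >= @max_size
    sleep 0.001 # widen the window between decision and action
    decision
  end

  def push(item)
    # Another thread can append between the full? check and the <<
    @items << item unless full?
  end
end

buffer = ToyBuffer.new(10)
threads = 4.times.map { Thread.new { 10.times { |i| buffer.push(i) } } }
threads.each(&:join)

puts buffer.items.length # likely > 10: several threads saw full? == false
```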

In terms of correctness, I think we could fix this by executing a slice! operation after we add an item, to ensure that the array did not grow past max_size, something along the lines of...

    [4] pry(main)> arr = (1..10).to_a
    => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    [5] pry(main)> arr.slice!(8..)
    => [9, 10]
    [6] pry(main)> arr
    => [1, 2, 3, 4, 5, 6, 7, 8]

...but I decided to open this issue instead of a PR because we may want to take a different approach.
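For reference, a rough sketch of the always-append-then-trim idea (`TrimmingBuffer` is hypothetical, not the actual ddtrace code):

```ruby
# Hypothetical sketch: always append, then trim any overshoot with
# slice!, so the buffer can never stay above max_size.
class TrimmingBuffer
  def initialize(max_size)
    @max_size = max_size
    @items = []
  end

  def push(item)
    @items << item
    # Trim anything past max_size; a no-op when the buffer is not oversized.
    @items.slice!(@max_size..)
  end

  def length
    @items.length
  end
end

buffer = TrimmingBuffer.new(8)
20.times { |i| buffer.push(i) }
puts buffer.length # => 8
```

Note that trimming from the tail like this always evicts the most recently pushed items.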

marcotc commented Sep 29, 2021

CRubyTraceBuffer was originally intended to accept lax constraint requirements for performance gains.

I think the main issue here is that there's a way for the internal array to grow, but never to shrink, meaning over time it can slowly grow (if there are consistently more than 1000 traces/sec).

Adding slice! or any other operation to Buffer#add! will measurably affect performance, as this is the most common path taken and currently only performs a @items << item operation.

I suggest we add a way to shrink the array during Buffer#replace!, in case it's over sized. This means the shrink operation will only trigger when full? returns true. What do you think?
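A sketch of what shrinking on the replace! path could look like (method and variable names are illustrative, not the actual ddtrace implementation): the hot add! path stays a bare append, and overshoot is corrected only when the buffer is already full.

```ruby
# Hypothetical sketch: shrink only on the replace! path, dropping random
# elements (not the tail) so fair sampling is preserved for any overshoot
# that concurrent add! calls accumulated.
def replace!(items, max_size, new_item)
  # Drop random elements until we are back at max_size.
  items.delete_at(rand(items.length)) while items.length > max_size
  # Standard replace behaviour: overwrite a random slot with the new item.
  items[rand(items.length)] = new_item
end

items = (1..15).to_a # simulate a buffer that overshot max_size = 10
replace!(items, 10, :new_trace)
puts items.length # => 10
```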

ivoanjo commented Oct 4, 2021

I suggest we add a way to shrink the array during Buffer#replace!, in case it's over sized. This means the shrink operation will only trigger when full? returns true. What do you think?

I guess that would work, but I'm not entirely convinced that a slice! on the fast path is that expensive, especially given the common case is "nothing needs to be done".

E.g. a simple benchmark on my machine shows:

    require 'bundler/inline'

    gemfile do
      source 'https://rubygems.org'
      gem 'benchmark-ips'
    end

    require 'benchmark/ips'

    THE_ARRAY = [1, 2, 3, 4, 5]

    Benchmark.ips do |x|
      x.report(RUBY_DESCRIPTION) { THE_ARRAY.slice!(10...) }
      x.compare!
    end

    ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin19]
                              9.086M (± 2.9%) i/s -     46.032M in   5.070565s

On the "slow path" (e.g. slice does need to happen) it would probably be a bit slower BUT if the code correctly had gone for a replace rather than a add, the slice would happen anyway so I claim that accepting the same trade-off seems reasonable.

marcotc commented Oct 4, 2021

I was thinking about this, and slice! cannot be done on Buffer#add!, because slice! will always remove elements from one side of the array (I'm thinking of @items.slice!(0, @max_size)).

Because we concatenate at the end of the internal buffer, this means we are always dropping the most recent elements on slice!, which impacts fair sampling.

There are other options (do not slice at the end of buffer, or insert new elements not at the end of buffer), but they make the operations much more costly on a non-full buffer.

I'm thinking that we need to address this on replace!, to ensure we are handling the fair sampling correctly.

ivoanjo commented Oct 5, 2021

Good point, indeed slice! would drop the most recent events, rather than dropping at random.
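To make the fair-sampling concern concrete, a tiny illustration (plain Ruby, not ddtrace code) of how trimming overshoot from the tail always evicts the same items:

```ruby
# Trimming overshoot from the tail always evicts the newest items,
# so recent traces would be dropped deterministically, never at random.
max_size = 10
items = (1..max_size).to_a
items.concat([11, 12, 13]) # simulate overshoot from concurrent appends
items.slice!(max_size..)   # trim back to max_size
puts items.inspect # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -- 11..13 always lost
```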
