Optimize cloning of job payload #4303

Merged

merged 1 commit into sidekiq:master on Oct 2, 2019

Conversation

fatkodima
Member

Profiling bin/sidekiqload, I got:

For CPU:
Before:

==================================
  Mode: cpu(1000)
  Samples: 15951 (1.51% miss rate)
  GC: 1804 (11.31%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
     10483  (65.7%)       10483  (65.7%)     Hiredis::Ext::Connection#read
      2226  (14.0%)        2226  (14.0%)     Sidekiq::Processor#cloned
      1804  (11.3%)        1804  (11.3%)     (garbage collection)
       704   (4.4%)         704   (4.4%)     #<Module:0x00007fd0ae32f718>.parse
       137   (0.9%)         107   (0.7%)     Sidekiq::LoggingUtils#add
        75   (0.5%)          75   (0.5%)     Redis::Connection::Hiredis#write
        67   (0.4%)          67   (0.4%)     Sidekiq::Processor#constantize
        59   (0.4%)          59   (0.4%)     Redis::Connection::Hiredis#timeout=
     10792  (67.7%)          32   (0.2%)     ConnectionPool#with

After:

==================================
  Mode: cpu(1000)
  Samples: 14098 (1.34% miss rate)
  GC: 1528 (10.84%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      9625  (68.3%)        9625  (68.3%)     Hiredis::Ext::Connection#read
      1528  (10.8%)        1528  (10.8%)     (garbage collection)
      1281   (9.1%)        1281   (9.1%)     Sidekiq::Processor#deep_dup
       656   (4.7%)         656   (4.7%)     #<Module:0x00007f94bcbeb768>.parse
       187   (1.3%)         127   (0.9%)     Sidekiq::LoggingUtils#add
        79   (0.6%)          79   (0.6%)     Redis::Connection::Hiredis#write
        68   (0.5%)          68   (0.5%)     Sidekiq::Processor#constantize
      9834  (69.8%)          62   (0.4%)     Redis#_bpop

So, 14.0% - 9.1% ≈ 5% CPU savings ⚡️

For object allocations:
Before:

==================================
  Mode: object(1)
  Samples: 8206264 (0.00% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
   2300072  (28.0%)     2300072  (28.0%)     Sidekiq::Processor#cloned
   1500042  (18.3%)     1500042  (18.3%)     #<Module:0x00007fc3751fb568>.parse
   4000164  (48.7%)      900028  (11.0%)     Sidekiq::Processor#dispatch
   2304025  (28.1%)      500591   (6.1%)     ConnectionPool#with
    400898   (4.9%)      400898   (4.9%)     Redis::Connection::Hiredis#write
   1200326  (14.6%)      400052   (4.9%)     Redis#_bpop

After:

==================================
  Mode: object(1)
  Samples: 7105331 (0.00% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
   1500042  (21.1%)     1500042  (21.1%)     #<Module:0x00007ff923367648>.parse
   1200036  (16.9%)     1200036  (16.9%)     Sidekiq::Processor#deep_dup
   2900112  (40.8%)      900028  (12.7%)     Sidekiq::Processor#dispatch
   2303422  (32.4%)      500491   (7.0%)     ConnectionPool#with
    400749   (5.6%)      400749   (5.6%)     Redis::Connection::Hiredis#write
   1200306  (16.9%)      400052   (5.6%)     Redis#_bpop

So, 28.0% - 16.9% ≈ 11% allocation savings.

Profiling with memory_profiler, I got more accurate results for memory:
Before:

Total allocated: 695737343 bytes (6904607 objects)
Total retained:  12702768 bytes (972 objects)

After:

Total allocated: 606804659 bytes (5704475 objects)
Total retained:  12699800 bytes (962 objects)

So, savings in bytes: (695737343 - 606804659) / 695737343.0 * 100 ≈ 13% ⚡️
Savings in objects: (6904607 - 5704475) / 6904607.0 * 100 ≈ 17.4% ⚡️
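
For anyone who wants to reproduce these numbers: the reports above are stackprof output (cpu and object modes) and memory_profiler output. A minimal harness looks roughly like this; run_load is a hypothetical stand-in for the sidekiqload work loop, not an actual method:

require "stackprof"
require "memory_profiler"

# CPU samples ("Mode: cpu" report above):
StackProf.run(mode: :cpu, out: "tmp/cpu.dump") { run_load }
# Object allocation samples ("Mode: object" report above):
StackProf.run(mode: :object, out: "tmp/object.dump") { run_load }
# Inspect either dump with: stackprof tmp/cpu.dump --text

# Byte-accurate allocation/retention totals:
MemoryProfiler.report { run_load }.pretty_print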

Another isolated benchmark:

# frozen_string_literal: true

require "benchmark/ips"

def deep_dup(obj)
  case obj
  when Integer, Float, TrueClass, FalseClass, NilClass
    return obj
  when String
    return obj.dup
  when Array
    duped = Array.new(obj.size)
    obj.each_with_index do |value, index|
      duped[index] = deep_dup(value)
    end
  when Hash
    duped = obj.dup
    duped.each_pair do |key, value|
      duped[key] = deep_dup(value)
    end
  else
    duped = obj.dup
  end

  duped
end

def deep_dup_without_case(obj)
  if Integer === obj || Float === obj || TrueClass === obj || FalseClass === obj || NilClass === obj
    return obj
  elsif String === obj
    return obj.dup
  elsif Array === obj
    duped = Array.new(obj.size)
    obj.each_with_index do |value, index|
      duped[index] = deep_dup_without_case(value)
    end
  elsif Hash === obj
    duped = obj.dup
    duped.each_pair do |key, value|
      duped[key] = deep_dup_without_case(value)
    end
  else
    duped = obj.dup
  end

  duped
end


obj = { "class" => "FooWorker", "args" => [1, 2, 3, "foobar"], "jid" => "123987123" }

Benchmark.ips do |x|
  x.report("marshal") do
    Marshal.load(Marshal.dump(obj))
  end

  x.report("deep_dup") do
    deep_dup(obj)
  end

  x.report("deep_dup_without_case") do
    deep_dup_without_case(obj)
  end

  x.compare!
end

Results:

Warming up --------------------------------------
             marshal    11.146k i/100ms
            deep_dup    13.970k i/100ms
deep_dup_without_case
                        16.752k i/100ms
Calculating -------------------------------------
             marshal    105.758k (±13.8%) i/s -    523.862k in   5.083462s
            deep_dup    178.748k (± 3.3%) i/s -    894.080k in   5.008318s
deep_dup_without_case
                        200.419k (± 2.0%) i/s -      1.005M in   5.017126s

Comparison:
deep_dup_without_case:   200419.0 i/s
            deep_dup:   178748.4 i/s - 1.12x  slower
             marshal:   105758.2 i/s - 1.90x  slower

So while if/elsif is a bit uglier than case/when, it is noticeably faster.

@mperham
Collaborator

mperham commented Oct 1, 2019 via email

@fatkodima
Member Author

Previous optimizations also basically optimized "hello world", but with a less observable effect.
Sure, it is uglier than the previous one-liner, but if you spend a minute or two studying this code, it seems straightforward and clear. At least to me. But in the end, you are the final decision maker.

@mperham
Collaborator

mperham commented Oct 2, 2019

I guess I don't know what the trade-offs are, so I'm unsure. Why is it faster than Marshal? Will there be any compatibility issues?

I like the fact that we can simplify because we know that we only need to handle JSON types. Is that why it’s faster?

@fatkodima
Member Author

Will there be any compatibility issues?

No, I don't see any compatibility issues.

Is that why it’s faster?

Yes. For deep-duping generic objects, the code would be a bit larger and uglier (because, at a minimum, we would have to deal with object instance variables).
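
To illustrate, here is a hypothetical sketch (not this PR's code) of what a generic deep-dup would have to add on top of the JSON-only type checks:

def generic_deep_dup(obj)
  duped = obj.dup
  # The extra work a generic version can't avoid: recursively copying
  # arbitrary instance variables (ignoring cycles and singletons for brevity).
  duped.instance_variables.each do |ivar|
    value = duped.instance_variable_get(ivar)
    duped.instance_variable_set(ivar, generic_deep_dup(value))
  end
  duped
end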

@mperham
Collaborator

mperham commented Oct 2, 2019

All right, you talked me into it. Thanks for your patience!

@mperham mperham merged commit acdfc59 into sidekiq:master Oct 2, 2019
@mperham
Collaborator

mperham commented Oct 2, 2019

I refactored things a bit and realized the code was calling cloned twice and I believe the second call is unnecessary, so even bigger win!

@eregon
Contributor

eregon commented Oct 3, 2019

@fatkodima The fact that if/elsif is faster than case seems like a performance bug in MRI; it should be exactly the same. Could you report it to https://bugs.ruby-lang.org/ ?

@fatkodima
Member Author

@eregon Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.
If this is a bug in newer rubies, I'm unobtrusively delegating it to you, as someone more familiar with all that kitchen 😀

@schneems

schneems commented Oct 3, 2019

Hello! I'm surprised the case statement is slower. In my experience, MRI optimizes the crap out of them; do you have any idea why it is slower?

The fact that if/elsif is faster than case seems like a performance bug in MRI; it should be exactly the same.

then

Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.

From the video: "the case/when does not have a call site cache". I'm not sure what internals are imposing that limitation, but this being faster than case/when feels like a bug. Even in the video he says "don't go changing your case/when statements to if statements, this is basically something we should fix", assuming he means ruby/ruby. He says that he's got a patch; I wonder whatever happened to it.

Another Optimization - Recursive Hash

I optimized similar code in sprockets, and I've got a killer pattern that I think you can reuse. Since you're comparing classes, you can use the class as a key into a hash and then return a proc that can be executed. The procs can reference the same hash, so it becomes a recursive data structure.

Hashes have an optimization that allows comparison essentially by object ID rather than by hashing the value, and since classes are singletons you can use it to get a really good perf increase.
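
To make the pattern concrete, here is a minimal sketch under my reading of the gist below; the names are illustrative:

# Class-keyed dispatch: the procs recurse through the same hash,
# so the hash itself becomes the recursive data structure.
DUP = Hash.new(->(obj) { obj.dup })  # fallback for unknown classes: shallow dup
immutable = ->(obj) { obj }
[Integer, Float, TrueClass, FalseClass, NilClass].each { |klass| DUP[klass] = immutable }
DUP[String] = ->(obj) { obj.dup }
DUP[Array]  = ->(obj) { obj.map { |e| DUP[e.class].call(e) } }
DUP[Hash]   = ->(obj) {
  duped = obj.dup
  duped.each_pair { |k, v| duped[k] = DUP[v.class].call(v) }
  duped
}

def hash_deep_dup(obj)
  # Exact-class lookup, so subclasses fall through to the fallback proc.
  DUP[obj.class].call(obj)
end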

Here is a benchmark: https://gist.github.com/schneems/3fb96b12d2f88d0700acc5dd7cb437ee. Assuming my spike at the code is technically correct, it is 1.17x faster than this PR:

       hash deep dup:   396849.2 i/s
deep_dup_without_case:   338379.3 i/s - 1.17x  slower
            deep_dup:   285229.5 i/s - 1.39x  slower
             marshal:   148201.7 i/s - 2.68x  slower

@schneems

schneems commented Oct 3, 2019

BTW great PR and thanks for working on perf! I saw this link in Ruby Weekly and wanted to check it out. Always happy to see more people working on perf in the Ruby space :)

@fatkodima
Member Author

He says that he's got a patch; I wonder whatever happened to it.

Same thought.

Yes, impressive speedup using the recursive hash! Btw, I rewatched your great(!) "Slow Mo" presentation not so long ago and considered this technique as well. But, as Mike mentioned, we are basically optimizing "hello world" jobs here, and most of the time will be spent in the job code itself, so I'm not 100% sure it is worth increasing the complexity further (not by much, as for me, but nevertheless).

@mhenrixon
Contributor

Like I commented in #4308 and mhenrixon/sidekiq-unique-jobs#427, this merge broke sidekiq completely :)

Sometimes performance isn't everything :) If you really need performance, instead of hunting for weird ways of getting more of it, maybe you should have a look at Rust, Go, or Crystal instead?

@mhenrixon
Contributor

mhenrixon commented Oct 3, 2019

https://code.dblock.org/2017/02/24/the-demonic-possession-of-hashie-mash.html is what you get when you try to do these things yourself (getting rid of json/marshal dump/parse)

@fatkodima
Member Author

@mhenrixon I will investigate what broke your build.

@palkan

palkan commented Oct 3, 2019

Wow, what a thread!

Here are my two cents: use Refinements!

Calculating -------------------------------------
             marshal    220.860k (± 2.8%) i/s -      1.116M in   5.056059s
            deep_dup    287.621k (± 1.8%) i/s -      1.443M in   5.019736s
deep_dup_without_case
                        325.894k (± 2.5%) i/s -      1.642M in   5.041585s
       hash deep dup    396.817k (± 2.0%) i/s -      1.994M in   5.026594s
    refined deep dup    408.649k (± 5.0%) i/s -      2.043M in   5.013803s

Comparison:
    refined deep dup:   408648.7 i/s
       hash deep dup:   396816.9 i/s - same-ish: difference falls within error
deep_dup_without_case:  325893.6 i/s - 1.25x  slower
            deep_dup:   287620.6 i/s - 1.42x  slower
             marshal:   220859.5 i/s - 1.85x  slower

The improvement compared to the recursive hash proposed by @schneems is negligible, and it's not a surprise: it works almost the same way but at a different level.

https://gist.github.com/palkan/822f5b324d0278e98e95770d39095588
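
For reference, a minimal sketch along the lines of the gist above (the exact method set here is illustrative). Refinements defined in the same module are active inside its refine blocks, so the recursive deep_dup calls resolve to the refined methods:

module DeepDup
  [Integer, Float, TrueClass, FalseClass, NilClass].each do |klass|
    refine(klass) { def deep_dup; self; end }  # immutable values: return as-is
  end

  refine(String) { def deep_dup; dup; end }

  refine(Array) do
    def deep_dup
      map { |e| e.deep_dup }
    end
  end

  refine(Hash) do
    def deep_dup
      duped = dup
      duped.each_pair { |k, v| duped[k] = v.deep_dup }
      duped
    end
  end
end

using DeepDup
{ "args" => [1, 2, 3, "foobar"] }.deep_dup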

@schneems

schneems commented Oct 3, 2019

Here is my two cents: use Refinements!

So...I actually did that in sprockets. Not sure which version of the "Slow Mo" talk you watched, but I actually got faster results with refinements than with a recursive hash. When I gave the talk at RubyConf in Cincinnati, Charles Nutter and Matz were in the audience, and both of them basically told me not to use refinements for performance. Turns out they are super not optimized on JRuby, but maybe that's no longer the case? You could probably ping him on Twitter; he's pretty responsive.

Here's my sprockets experiment rails/sprockets#417

@schneems

schneems commented Oct 3, 2019

https://code.dblock.org/2017/02/24/the-demonic-possession-of-hashie-mash.html is what you get when you try to do these things yourself (getting rid of json/marshal dump/parse)

I have thoughts about Hashie, especially when it comes to perf https://www.schneems.com/2014/12/15/hashie-considered-harmful.html

@headius

headius commented Oct 5, 2019

@palkan @schneems Here are JRuby numbers with MRI 2.6.4 for comparison. tl;dr: refinements don't seem to be that slow, but they aren't the fastest result.

[] ~/projects/jruby $ jruby bench_deep_dup.rb 
...
Calculating -------------------------------------
             marshal    232.812k (± 4.9%) i/s -      1.169M in   5.036242s
            deep_dup    469.939k (± 3.8%) i/s -      2.361M in   5.032599s
deep_dup_without_case
                        494.673k (± 2.0%) i/s -      2.471M in   4.997167s
       hash deep dup    608.382k (± 2.0%) i/s -      3.066M in   5.041291s
    refined deep dup    532.336k (± 1.9%) i/s -      2.687M in   5.048934s

Comparison:
       hash deep dup:   608382.1 i/s
    refined deep dup:   532336.4 i/s - 1.14x  slower
deep_dup_without_case:   494672.6 i/s - 1.23x  slower
            deep_dup:   469939.3 i/s - 1.29x  slower
             marshal:   232812.1 i/s - 2.61x  slower

[] ~/projects/jruby $ rvm ruby-2.6.4 do ruby bench_deep_dup.rb 
...
Calculating -------------------------------------
             marshal    132.230k (± 2.4%) i/s -    661.388k in   5.004781s
            deep_dup    264.685k (± 4.2%) i/s -      1.323M in   5.006854s
deep_dup_without_case
                        330.300k (± 1.5%) i/s -      1.675M in   5.070913s
       hash deep dup    411.352k (± 2.1%) i/s -      2.080M in   5.058988s
    refined deep dup    417.759k (± 1.7%) i/s -      2.116M in   5.066170s

Comparison:
    refined deep dup:   417759.4 i/s
       hash deep dup:   411351.9 i/s - same-ish: difference falls within error
deep_dup_without_case:   330299.9 i/s - 1.26x  slower
            deep_dup:   264684.6 i/s - 1.58x  slower
             marshal:   132229.9 i/s - 3.16x  slower

@palkan

palkan commented Oct 5, 2019

Hey @headius! Thanks for the benchmark. Glad to see that refinements are not slow 🙂

@eregon
Contributor

eregon commented Oct 6, 2019

Let's start with the obvious:

        obj.each_with_index do |value, index|
          duped[index] = deep_dup(value)
        end

could be the more idiomatic and faster:

        obj.map { |e| deep_dup(e) }

On MRI, before:

Comparison:
    refined deep dup:   528002.8 i/s
       hash deep dup:   511869.7 i/s - same-ish: difference falls within error
         deep_dup_if:   396657.9 i/s - 1.33x  slower
            deep_dup:   329887.6 i/s - 1.60x  slower
             marshal:   234576.1 i/s - 2.25x  slower

After:

Comparison:
    refined deep dup:   616183.4 i/s
       hash deep dup:   535666.8 i/s - 1.15x  slower
         deep_dup_if:   461013.3 i/s - 1.34x  slower
            deep_dup:   406440.3 i/s - 1.52x  slower
             marshal:   240398.7 i/s - 2.56x  slower

The same goes for JRuby and TruffleRuby, BTW; Array#map is just faster than dealing with indices manually.

I'll make a PR with that (#4313).

@eregon
Contributor

eregon commented Oct 6, 2019

@eregon Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.
If this is a bug in newer rubies, I'm unobtrusively delegating it to you, as someone more familiar with all that kitchen

It is a performance bug; I think MRI should inline-cache the call to === for case/when.
JRuby and TruffleRuby seem to have the same performance for case/when and if.
Thank you for the link; indeed, it seems we are not the first to find this performance bug in MRI.
I filed https://bugs.ruby-lang.org/issues/16243.

I think it's a good rule in general to file performance bugs against MRI when idiomatic code is slower than more verbose/less idiomatic code with the same semantics, like here.

For fun, here is what running the benchmark (using map) gives on TruffleRuby 19.2.0:

             marshal     67.959k (±23.2%) i/s -    321.640k in   5.146115s
            deep_dup      3.591M (±21.7%) i/s -     16.725M in   5.063378s
         deep_dup_if      4.302M (±21.4%) i/s -     19.779M in   5.010366s
       hash deep dup      2.552M (±23.9%) i/s -     11.698M in   5.007283s
    refined deep dup      4.725M (±22.4%) i/s -     21.664M in   5.057021s

Comparison:
    refined deep dup:  4725005.9 i/s
         deep_dup_if:  4301638.4 i/s - same-ish: difference falls within error
            deep_dup:  3591330.8 i/s - same-ish: difference falls within error
       hash deep dup:  2552401.9 i/s - 1.85x  slower
             marshal:    67958.6 i/s - 69.53x  slower

It's almost an order of magnitude faster than MRI. There is a large error margin because, once the code is well optimized, most of the time is spent in GC and allocations.
Refinements perform well because each call site implicitly inline-caches just the types it has seen and does not need to check other types.
Both case and if perform nicely as well.
The hash approach is slower; I'm not sure why, but probably partly because TruffleRuby currently uses a modulo operation to find the bucket, and multiple classes might end up in the same bucket. That approach would also not work if there can be subclasses.
Marshal is very slow; it probably needs a more optimized implementation in TruffleRuby.

Maybe we should make deep_dup a core method in Ruby. That would allow optimizing it even better.

eregon added a commit to eregon/sidekiq that referenced this pull request Oct 6, 2019
* On the benchmark from sidekiq#4303
  on MRI 2.6.4, before 396657.9 i/s, after 461013.3 i/s.
mperham pushed a commit that referenced this pull request Oct 6, 2019
* Prefer the faster Array#map to a manual implementation

* On the benchmark from #4303
  on MRI 2.6.4, before 396657.9 i/s, after 461013.3 i/s.

* Use `duped` as the return value in the only branch it's needed
@mperham
Collaborator

mperham commented Oct 7, 2019

So the cloning adds about 10% overhead on my Mac:

Stock:
~/src/sidekiq (master *=)$ bundle exec bin/sidekiqload
2019-10-07T15:53:59.778Z pid=738 tid=zqm5tm ERROR: Created 100000 jobs
2019-10-07T15:54:10.529Z pid=738 tid=100nmrm ERROR: Done, 100000 jobs in 10.750717 sec

No clone:
2019-10-07T15:53:25.577Z pid=701 tid=zr8gjh ERROR: Created 100000 jobs
2019-10-07T15:53:35.308Z pid=701 tid=101heoh ERROR: Done, 100000 jobs in 9.731436 sec

We can't freeze the args because of backwards compatibility -- there's likely middleware and LOTS of jobs which mutate their array or hash arguments somehow. We can't break them.

There is a way to rejigger the code so that we don't need to clone: instead, we keep the original JSON String and re-parse it if we need access to the original, untouched job.
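
A hypothetical sketch of that idea (the names are illustrative, not Sidekiq's actual internals):

require "json"

class WorkUnit
  def initialize(jobstr)
    @jobstr = jobstr  # the raw JSON payload, never mutated
  end

  # The parsed hash handed to middleware and the job; callers may mutate it.
  def job_hash
    @job_hash ||= JSON.parse(@jobstr)
  end

  # A pristine copy on demand: re-parse instead of deep-cloning up front.
  def pristine_job_hash
    JSON.parse(@jobstr)
  end
end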

So the moral of this story is: you can microbenchmark all you want but the fastest solution is no code at all. 😄

@fatkodima
Member Author

@mperham
True. Initially I had a similar idea, but at the time it seemed a little complicated to implement and less likely to be merged (😁), so I skipped it. Now I think I've figured out a better way of doing it than before. So if you haven't started and want to offload this to someone else, I will give it a shot 😁

@schneems

schneems commented Oct 7, 2019

@eregon I think deep_dup would be an awesome core extension. I also wonder if there could maybe be a deep method that takes a block and yields the original arguments; we could use something like that in the sprockets codebase (since we're not dup-ing, but still need to do this recursive deep dive to build a SHA).

@mperham agreed; it's also hard to know whether a micro-optimization affects the overall runtime of a system, or whether we're debating over fractions of fractions. Rails started rejecting perf patches unless they can show that they make things faster in a meaningful way on a Rails app. To that end, I coded up this feature in derailed benchmarks.

While it's geared towards Rails, there are docs on how to use it with another library such as sidekiq. There's no background-job integration out of the box right now, but you can get it pretty easily by doing something like this: zombocom/derailed_benchmarks#145 (comment)

We can't freeze the args because of backwards compatibility -- there's likely middleware and LOTS of jobs which mutable their array or hash arguments somehow. We can't break them.

It would be nice if there were a way to safely deprecate mutating an arg. I guess you could patch all the mutating core methods, though that sounds difficult. I'm wondering if there could potentially be a feature to make this easier: a callback proc when an object is mutated, perhaps?
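
A hypothetical sketch of that patching idea (the method lists are illustrative and far from complete):

MUTATORS = {
  Array => [:<<, :push, :pop, :shift, :unshift, :[]=, :concat, :clear],
  Hash  => [:[]=, :store, :delete, :merge!, :update, :clear]
}

def deprecate_mutation!(obj)
  (MUTATORS[obj.class] || []).each do |name|
    # Shadow each mutating method on the object's singleton class and warn when
    # it is called; explicit super forwards to the real implementation.
    obj.define_singleton_method(name) do |*args, &blk|
      warn "DEPRECATION: job args will be frozen in a future release"
      super(*args, &blk)
    end
  end
  obj
end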

@mperham
Collaborator

mperham commented Oct 7, 2019

@fatkodima Go ahead; it would be nice to get a fresh pair of eyes on the problem. The cloning was originally added so that the job_hash data sent to the Web UI in stats was not mutated. You can send jobstr instead.
