Optimize cloning of job payload #4303

Merged

merged 1 commit into sidekiq:master on Oct 2, 2019

Conversation

fatkodima
Member

Profiling bin/sidekiqload, I got:

For CPU:
Before:

==================================
  Mode: cpu(1000)
  Samples: 15951 (1.51% miss rate)
  GC: 1804 (11.31%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
     10483  (65.7%)       10483  (65.7%)     Hiredis::Ext::Connection#read
      2226  (14.0%)        2226  (14.0%)     Sidekiq::Processor#cloned
      1804  (11.3%)        1804  (11.3%)     (garbage collection)
       704   (4.4%)         704   (4.4%)     #<Module:0x00007fd0ae32f718>.parse
       137   (0.9%)         107   (0.7%)     Sidekiq::LoggingUtils#add
        75   (0.5%)          75   (0.5%)     Redis::Connection::Hiredis#write
        67   (0.4%)          67   (0.4%)     Sidekiq::Processor#constantize
        59   (0.4%)          59   (0.4%)     Redis::Connection::Hiredis#timeout=
     10792  (67.7%)          32   (0.2%)     ConnectionPool#with

After:

==================================
  Mode: cpu(1000)
  Samples: 14098 (1.34% miss rate)
  GC: 1528 (10.84%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      9625  (68.3%)        9625  (68.3%)     Hiredis::Ext::Connection#read
      1528  (10.8%)        1528  (10.8%)     (garbage collection)
      1281   (9.1%)        1281   (9.1%)     Sidekiq::Processor#deep_dup
       656   (4.7%)         656   (4.7%)     #<Module:0x00007f94bcbeb768>.parse
       187   (1.3%)         127   (0.9%)     Sidekiq::LoggingUtils#add
        79   (0.6%)          79   (0.6%)     Redis::Connection::Hiredis#write
        68   (0.5%)          68   (0.5%)     Sidekiq::Processor#constantize
      9834  (69.8%)          62   (0.4%)     Redis#_bpop

So, 14.0% - 9.1% ≈ 5% CPU savings ⚡️

For object allocations:
Before:

==================================
  Mode: object(1)
  Samples: 8206264 (0.00% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
   2300072  (28.0%)     2300072  (28.0%)     Sidekiq::Processor#cloned
   1500042  (18.3%)     1500042  (18.3%)     #<Module:0x00007fc3751fb568>.parse
   4000164  (48.7%)      900028  (11.0%)     Sidekiq::Processor#dispatch
   2304025  (28.1%)      500591   (6.1%)     ConnectionPool#with
    400898   (4.9%)      400898   (4.9%)     Redis::Connection::Hiredis#write
   1200326  (14.6%)      400052   (4.9%)     Redis#_bpop

After:

==================================
  Mode: object(1)
  Samples: 7105331 (0.00% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
   1500042  (21.1%)     1500042  (21.1%)     #<Module:0x00007ff923367648>.parse
   1200036  (16.9%)     1200036  (16.9%)     Sidekiq::Processor#deep_dup
   2900112  (40.8%)      900028  (12.7%)     Sidekiq::Processor#dispatch
   2303422  (32.4%)      500491   (7.0%)     ConnectionPool#with
    400749   (5.6%)      400749   (5.6%)     Redis::Connection::Hiredis#write
   1200306  (16.9%)      400052   (5.6%)     Redis#_bpop

So, 28.0% - 16.9% ≈ 11% allocation savings.

Profiling with memory_profiler, I got more accurate results for memory:
Before:

Total allocated: 695737343 bytes (6904607 objects)
Total retained:  12702768 bytes (972 objects)

After:

Total allocated: 606804659 bytes (5704475 objects)
Total retained:  12699800 bytes (962 objects)

So, savings in bytes: (695737343 - 606804659) / 695737343.0 * 100 ≈ 13% ⚡️
Savings in objects: (6904607 - 5704475) / 6904607.0 * 100 ≈ 17.4% ⚡️
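
For anyone who wants to reproduce these numbers: the reports above are stackprof output (cpu and object modes) and memory_profiler output. A minimal harness looks roughly like this; run_load is a hypothetical stand-in for the sidekiqload work loop, not an actual method:

require "stackprof"
require "memory_profiler"

# CPU samples ("Mode: cpu" report above):
StackProf.run(mode: :cpu, out: "tmp/cpu.dump") { run_load }
# Object allocation samples ("Mode: object" report above):
StackProf.run(mode: :object, out: "tmp/object.dump") { run_load }
# Inspect either dump with: stackprof tmp/cpu.dump --text

# Byte-accurate allocation/retention totals:
MemoryProfiler.report { run_load }.pretty_print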

Another isolated benchmark:

# frozen_string_literal: true

require "benchmark/ips"

def deep_dup(obj)
  case obj
  when Integer, Float, TrueClass, FalseClass, NilClass
    return obj
  when String
    return obj.dup
  when Array
    duped = Array.new(obj.size)
    obj.each_with_index do |value, index|
      duped[index] = deep_dup(value)
    end
  when Hash
    duped = obj.dup
    duped.each_pair do |key, value|
      duped[key] = deep_dup(value)
    end
  else
    duped = obj.dup
  end

  duped
end

def deep_dup_without_case(obj)
  if Integer === obj || Float === obj || TrueClass === obj || FalseClass === obj || NilClass === obj
    return obj
  elsif String === obj
    return obj.dup
  elsif Array === obj
    duped = Array.new(obj.size)
    obj.each_with_index do |value, index|
      duped[index] = deep_dup_without_case(value)
    end
  elsif Hash === obj
    duped = obj.dup
    duped.each_pair do |key, value|
      duped[key] = deep_dup_without_case(value)
    end
  else
    duped = obj.dup
  end

  duped
end


obj = { "class" => "FooWorker", "args" => [1, 2, 3, "foobar"], "jid" => "123987123" }

Benchmark.ips do |x|
  x.report("marshal") do
    Marshal.load(Marshal.dump(obj))
  end

  x.report("deep_dup") do
    deep_dup(obj)
  end

  x.report("deep_dup_without_case") do
    deep_dup_without_case(obj)
  end

  x.compare!
end

Results:

Warming up --------------------------------------
             marshal    11.146k i/100ms
            deep_dup    13.970k i/100ms
deep_dup_without_case
                        16.752k i/100ms
Calculating -------------------------------------
             marshal    105.758k (±13.8%) i/s -    523.862k in   5.083462s
            deep_dup    178.748k (± 3.3%) i/s -    894.080k in   5.008318s
deep_dup_without_case
                        200.419k (± 2.0%) i/s -      1.005M in   5.017126s

Comparison:
deep_dup_without_case:   200419.0 i/s
            deep_dup:   178748.4 i/s - 1.12x  slower
             marshal:   105758.2 i/s - 1.90x  slower

So while if/elsif is a bit uglier than case/when, it is noticeably faster.

@mperham
Collaborator

mperham commented Oct 1, 2019 via email

@fatkodima
Member Author

Previous optimizations also basically optimized "hello world", but with a less observable effect.
Sure, it is uglier than the previous one-liner, but if you spend a minute or two studying this code, it seems straightforward and clear. At least to me. But in the end, you are the final decision maker.

@mperham
Collaborator

mperham commented Oct 2, 2019

I guess I don't know what the trade-offs are, so I'm unsure. Why is it faster than Marshal? Will there be any compatibility issues?

I like the fact that we can simplify because we know that we only need to handle JSON types. Is that why it’s faster?

@fatkodima
Member Author

Will there be any compatibility issues?

No, I don't see any compatibility issues.

Is that why it’s faster?

Yes. For deep-duping generic objects, the code would be a bit larger and uglier (because, at a minimum, we would have to deal with object instance variables).
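
To illustrate, here is a hypothetical sketch (not this PR's code) of what a generic deep-dup would have to add on top of the JSON-only type checks:

def generic_deep_dup(obj)
  duped = obj.dup
  # The extra work a generic version can't avoid: recursively copying
  # arbitrary instance variables (ignoring cycles and singletons for brevity).
  duped.instance_variables.each do |ivar|
    value = duped.instance_variable_get(ivar)
    duped.instance_variable_set(ivar, generic_deep_dup(value))
  end
  duped
end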

@mperham
Collaborator

mperham commented Oct 2, 2019

All right, you talked me into it. Thanks for your patience!

@mperham mperham merged commit acdfc59 into sidekiq:master Oct 2, 2019
@mperham
Collaborator

mperham commented Oct 2, 2019

I refactored things a bit and realized the code was calling cloned twice and I believe the second call is unnecessary, so even bigger win!

@eregon
Contributor

eregon commented Oct 3, 2019

@fatkodima The fact that if/elsif is faster than case seems like a performance bug in MRI; it should be exactly the same. Could you report it to https://bugs.ruby-lang.org/ ?

@fatkodima
Member Author

@eregon Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.
If this is a bug in newer rubies, I'm unobtrusively delegating it to you, as someone more familiar with all that kitchen 😀

@schneems

schneems commented Oct 3, 2019

Hello! I'm surprised the case statement is slower. In my experience, MRI optimizes the crap out of them; do you have any idea why it is slower?

The fact that if/elsif is faster than case seems like a performance bug in MRI; it should be exactly the same.

then

Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.

From the video: "the case/when does not have a call site cache". I'm not sure what internals are imposing that limitation, but this being faster than case/when feels like a bug. Even in the video he says "don't go changing your case/when statements to if statements, this is basically something we should fix", assuming he means ruby/ruby. He says that he's got a patch; I wonder whatever happened to it.

Another Optimization - Recursive Hash

I optimized similar code in sprockets, and I've got a killer pattern that I think you can reuse. Since you're comparing classes, you can use the class as a key into a hash and then return a proc that can be executed. The procs can reference the same hash, so it becomes a recursive data structure.

Hashes have an optimization that allows comparison essentially by object ID rather than by hashing the value, and since classes are singletons you can use it to get a really good perf increase.
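
To make the pattern concrete, here is a minimal sketch under my reading of the gist below; the names are illustrative:

# Class-keyed dispatch: the procs recurse through the same hash,
# so the hash itself becomes the recursive data structure.
DUP = Hash.new(->(obj) { obj.dup })  # fallback for unknown classes: shallow dup
immutable = ->(obj) { obj }
[Integer, Float, TrueClass, FalseClass, NilClass].each { |klass| DUP[klass] = immutable }
DUP[String] = ->(obj) { obj.dup }
DUP[Array]  = ->(obj) { obj.map { |e| DUP[e.class].call(e) } }
DUP[Hash]   = ->(obj) {
  duped = obj.dup
  duped.each_pair { |k, v| duped[k] = DUP[v.class].call(v) }
  duped
}

def hash_deep_dup(obj)
  # Exact-class lookup, so subclasses fall through to the fallback proc.
  DUP[obj.class].call(obj)
end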

Here is a benchmark: https://gist.github.com/schneems/3fb96b12d2f88d0700acc5dd7cb437ee. Assuming my spike at the code is technically correct, it is 1.17x faster than this PR:

       hash deep dup:   396849.2 i/s
deep_dup_without_case:   338379.3 i/s - 1.17x  slower
            deep_dup:   285229.5 i/s - 1.39x  slower
             marshal:   148201.7 i/s - 2.68x  slower

@schneems

schneems commented Oct 3, 2019

BTW great PR and thanks for working on perf! I saw this link in Ruby Weekly and wanted to check it out. Always happy to see more people working on perf in the Ruby space :)

@fatkodima
Member Author

He says that he's got a patch; I wonder whatever happened to it.

Same thought.

Yes, impressive speedup using the recursive hash! Btw, I rewatched your great(!) "Slow Mo" presentation not so long ago and considered this technique as well. But, as Mike mentioned, we are basically optimizing "hello world" jobs here, and most of the time will be spent in the job code itself, so I'm not 100% sure it is worth increasing the complexity further (not by much, as for me, but nevertheless).

@mhenrixon
Contributor

Like I commented in #4308 and mhenrixon/sidekiq-unique-jobs#427, this merge broke sidekiq completely :)

Sometimes performance isn't everything :) If you really need performance, instead of hunting for weird ways of getting more of it, maybe you should have a look at Rust, Go, or Crystal instead?

@mhenrixon
Contributor

mhenrixon commented Oct 3, 2019

https://code.dblock.org/2017/02/24/the-demonic-possession-of-hashie-mash.html is what you get when you try to do these things yourself (getting rid of json/marshal dump/parse)

@fatkodima
Member Author

@mhenrixon I will investigate what broke your build.

@palkan

palkan commented Oct 3, 2019

Wow, what a thread!

Here are my two cents: use Refinements!

Calculating -------------------------------------
             marshal    220.860k (± 2.8%) i/s -      1.116M in   5.056059s
            deep_dup    287.621k (± 1.8%) i/s -      1.443M in   5.019736s
deep_dup_without_case
                        325.894k (± 2.5%) i/s -      1.642M in   5.041585s
       hash deep dup    396.817k (± 2.0%) i/s -      1.994M in   5.026594s
    refined deep dup    408.649k (± 5.0%) i/s -      2.043M in   5.013803s

Comparison:
    refined deep dup:   408648.7 i/s
       hash deep dup:   396816.9 i/s - same-ish: difference falls within error
deep_dup_without_case:  325893.6 i/s - 1.25x  slower
            deep_dup:   287620.6 i/s - 1.42x  slower
             marshal:   220859.5 i/s - 1.85x  slower

The improvement compared to the recursive hash proposed by @schneems is negligible, and it's not a surprise: it works almost the same way but at a different level.

https://gist.github.com/palkan/822f5b324d0278e98e95770d39095588
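
For reference, a minimal sketch along the lines of the gist above (the exact method set here is illustrative). Refinements defined in the same module are active inside its refine blocks, so the recursive deep_dup calls resolve to the refined methods:

module DeepDup
  [Integer, Float, TrueClass, FalseClass, NilClass].each do |klass|
    refine(klass) { def deep_dup; self; end }  # immutable values: return as-is
  end

  refine(String) { def deep_dup; dup; end }

  refine(Array) do
    def deep_dup
      map { |e| e.deep_dup }
    end
  end

  refine(Hash) do
    def deep_dup
      duped = dup
      duped.each_pair { |k, v| duped[k] = v.deep_dup }
      duped
    end
  end
end

using DeepDup
{ "args" => [1, 2, 3, "foobar"] }.deep_dup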

@schneems

schneems commented Oct 3, 2019

Here is my two cents: use Refinements!

So...I actually did that in sprockets. Not sure which version of the "Slow Mo" talk you watched, but I actually got faster results with refinements than with a recursive hash. When I gave the talk at RubyConf in Cincinnati, Charles Nutter and Matz were in the audience, and both of them basically told me not to use refinements for performance. Turns out they are super not optimized on JRuby, but maybe that's no longer the case? You could probably ping him on Twitter; he's pretty responsive.

Here's my sprockets experiment rails/sprockets#417

@schneems

schneems commented Oct 3, 2019

https://code.dblock.org/2017/02/24/the-demonic-possession-of-hashie-mash.html is what you get when you try to do these things yourself (getting rid of json/marshal dump/parse)

I have thoughts about Hashie, especially when it comes to perf https://www.schneems.com/2014/12/15/hashie-considered-harmful.html

@headius

headius commented Oct 5, 2019

@palkan @schneems Here are JRuby numbers with MRI 2.6.4 for comparison. tl;dr: refinements don't seem to be that slow, but they aren't the fastest result.

[] ~/projects/jruby $ jruby bench_deep_dup.rb 
...
Calculating -------------------------------------
             marshal    232.812k (± 4.9%) i/s -      1.169M in   5.036242s
            deep_dup    469.939k (± 3.8%) i/s -      2.361M in   5.032599s
deep_dup_without_case
                        494.673k (± 2.0%) i/s -      2.471M in   4.997167s
       hash deep dup    608.382k (± 2.0%) i/s -      3.066M in   5.041291s
    refined deep dup    532.336k (± 1.9%) i/s -      2.687M in   5.048934s

Comparison:
       hash deep dup:   608382.1 i/s
    refined deep dup:   532336.4 i/s - 1.14x  slower
deep_dup_without_case:   494672.6 i/s - 1.23x  slower
            deep_dup:   469939.3 i/s - 1.29x  slower
             marshal:   232812.1 i/s - 2.61x  slower

[] ~/projects/jruby $ rvm ruby-2.6.4 do ruby bench_deep_dup.rb 
...
Calculating -------------------------------------
             marshal    132.230k (± 2.4%) i/s -    661.388k in   5.004781s
            deep_dup    264.685k (± 4.2%) i/s -      1.323M in   5.006854s
deep_dup_without_case
                        330.300k (± 1.5%) i/s -      1.675M in   5.070913s
       hash deep dup    411.352k (± 2.1%) i/s -      2.080M in   5.058988s
    refined deep dup    417.759k (± 1.7%) i/s -      2.116M in   5.066170s

Comparison:
    refined deep dup:   417759.4 i/s
       hash deep dup:   411351.9 i/s - same-ish: difference falls within error
deep_dup_without_case:   330299.9 i/s - 1.26x  slower
            deep_dup:   264684.6 i/s - 1.58x  slower
             marshal:   132229.9 i/s - 3.16x  slower

@palkan

palkan commented Oct 5, 2019

Hey @headius! Thanks for the benchmark. Glad to see that refinements are not slow 🙂

@eregon
Contributor

eregon commented Oct 6, 2019

Let's start with the obvious:

        obj.each_with_index do |value, index|
          duped[index] = deep_dup(value)
        end

could be the more idiomatic and faster:

        obj.map { |e| deep_dup(e) }

On MRI, before:

Comparison:
    refined deep dup:   528002.8 i/s
       hash deep dup:   511869.7 i/s - same-ish: difference falls within error
         deep_dup_if:   396657.9 i/s - 1.33x  slower
            deep_dup:   329887.6 i/s - 1.60x  slower
             marshal:   234576.1 i/s - 2.25x  slower

After:

Comparison:
    refined deep dup:   616183.4 i/s
       hash deep dup:   535666.8 i/s - 1.15x  slower
         deep_dup_if:   461013.3 i/s - 1.34x  slower
            deep_dup:   406440.3 i/s - 1.52x  slower
             marshal:   240398.7 i/s - 2.56x  slower

The same goes for JRuby and TruffleRuby, BTW; Array#map is just faster than dealing with indices manually.

I'll make a PR with that (#4313).

@eregon
Contributor

eregon commented Oct 6, 2019

@eregon Looks like this is not a bug? Here is a video from 2016 (with proper timestamp included) from Aaron Patterson with an explanation.
If this is a bug in newer rubies, I'm unobtrusively delegating it to you, as someone more familiar with all that kitchen

It is a performance bug; I think MRI should inline-cache the call to === for case/when.
JRuby and TruffleRuby seem to have the same performance for case/when and if.
Thank you for the link; indeed, it seems we are not the first to find this performance bug in MRI.
I filed https://bugs.ruby-lang.org/issues/16243.

I think it's a good rule in general to file performance bugs against MRI when idiomatic code is slower than more verbose/less idiomatic code with the same semantics, like here.

For fun, here is what running the benchmark (using map) gives on TruffleRuby 19.2.0:

             marshal     67.959k (±23.2%) i/s -    321.640k in   5.146115s
            deep_dup      3.591M (±21.7%) i/s -     16.725M in   5.063378s
         deep_dup_if      4.302M (±21.4%) i/s -     19.779M in   5.010366s
       hash deep dup      2.552M (±23.9%) i/s -     11.698M in   5.007283s
    refined deep dup      4.725M (±22.4%) i/s -     21.664M in   5.057021s

Comparison:
    refined deep dup:  4725005.9 i/s
         deep_dup_if:  4301638.4 i/s - same-ish: difference falls within error
            deep_dup:  3591330.8 i/s - same-ish: difference falls within error
       hash deep dup:  2552401.9 i/s - 1.85x  slower
             marshal:    67958.6 i/s - 69.53x  slower

It's almost an order of magnitude faster than MRI. There is a large error margin because, once the code is well optimized, most of the time is spent in GC and allocations.
Refinements perform well because each call site implicitly inline-caches just the types it has seen and does not need to check other types.
Both case and if perform nicely as well.
The hash approach is slower; I'm not sure why, but probably partly because TruffleRuby currently uses a modulo operation to find the bucket, and multiple classes might end up in the same bucket. That approach would also not work if there can be subclasses.
Marshal is very slow; it probably needs a more optimized implementation in TruffleRuby.

Maybe we should make deep_dup a core method in Ruby. That would allow optimizing it even better.

eregon added a commit to eregon/sidekiq that referenced this pull request Oct 6, 2019
* On the benchmark from sidekiq#4303
  on MRI 2.6.4, before 396657.9 i/s, after 461013.3 i/s.
mperham pushed a commit that referenced this pull request Oct 6, 2019
* Prefer the faster Array#map to a manual implementation

* On the benchmark from #4303
  on MRI 2.6.4, before 396657.9 i/s, after 461013.3 i/s.

* Use `duped` as the return value in the only branch it's needed
@mperham
Collaborator

mperham commented Oct 7, 2019

So the cloning adds about 10% overhead on my Mac:

Stock:
~/src/sidekiq (master *=)$ bundle exec bin/sidekiqload
2019-10-07T15:53:59.778Z pid=738 tid=zqm5tm ERROR: Created 100000 jobs
2019-10-07T15:54:10.529Z pid=738 tid=100nmrm ERROR: Done, 100000 jobs in 10.750717 sec

No clone:
2019-10-07T15:53:25.577Z pid=701 tid=zr8gjh ERROR: Created 100000 jobs
2019-10-07T15:53:35.308Z pid=701 tid=101heoh ERROR: Done, 100000 jobs in 9.731436 sec

We can't freeze the args because of backwards compatibility -- there's likely middleware and LOTS of jobs which mutate their array or hash arguments somehow. We can't break them.

There is a way to rejigger the code so that we don't need to clone: instead, we keep the original JSON String and re-parse it if we need access to the original, untouched job.
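
A hypothetical sketch of that idea (the names are illustrative, not Sidekiq's actual internals):

require "json"

class WorkUnit
  def initialize(jobstr)
    @jobstr = jobstr  # the raw JSON payload, never mutated
  end

  # The parsed hash handed to middleware and the job; callers may mutate it.
  def job_hash
    @job_hash ||= JSON.parse(@jobstr)
  end

  # A pristine copy on demand: re-parse instead of deep-cloning up front.
  def pristine_job_hash
    JSON.parse(@jobstr)
  end
end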

So the moral of this story is: you can microbenchmark all you want but the fastest solution is no code at all. 😄

@fatkodima
Member Author

@mperham
True. Initially I had a similar idea, but at the time it seemed a little complicated to implement and less likely to be merged (😁), so I skipped it. Now I think I've figured out a better way of doing it than before. So if you haven't started and want to offload this to someone else, I will give it a shot 😁

@schneems

schneems commented Oct 7, 2019

@eregon I think deep_dup would be an awesome core extension. I also wonder if there could maybe be a deep method that takes a block and yields the original arguments; we could use something like that in the sprockets codebase (since we're not dup-ing, but still need to do this recursive deep dive to build a SHA).

@mperham agreed; it's also hard to know whether a micro-optimization affects the overall runtime of a system, or whether we're debating over fractions of fractions. Rails started rejecting perf patches unless they can show that they make things faster in a meaningful way on a Rails app. To that end, I coded up this feature in derailed benchmarks.

While it's geared towards Rails, there are docs on how to use it with another library such as sidekiq. There's no background-job integration out of the box right now, but you can get it pretty easily by doing something like this: zombocom/derailed_benchmarks#145 (comment)

We can't freeze the args because of backwards compatibility -- there's likely middleware and LOTS of jobs which mutable their array or hash arguments somehow. We can't break them.

It would be nice if there were a way to safely deprecate mutating an arg. I guess you could patch all the mutating core methods, though that sounds difficult. I'm wondering if there could potentially be a feature to make this easier: a callback proc when an object is mutated, perhaps?
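
A hypothetical sketch of that patching idea (the method lists are illustrative and far from complete):

MUTATORS = {
  Array => [:<<, :push, :pop, :shift, :unshift, :[]=, :concat, :clear],
  Hash  => [:[]=, :store, :delete, :merge!, :update, :clear]
}

def deprecate_mutation!(obj)
  (MUTATORS[obj.class] || []).each do |name|
    # Shadow each mutating method on the object's singleton class and warn when
    # it is called; explicit super forwards to the real implementation.
    obj.define_singleton_method(name) do |*args, &blk|
      warn "DEPRECATION: job args will be frozen in a future release"
      super(*args, &blk)
    end
  end
  obj
end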

@mperham
Collaborator

mperham commented Oct 7, 2019

@fatkodima Go ahead; it would be nice to get a fresh pair of eyes on the problem. The cloning was originally added so that the job_hash data sent to the Web UI in stats was not mutated. You can send jobstr instead.
