Optimize cloning of job payload #4303
Conversation
I’m not sure I want to maintain that ugly code to shave a usec off the marshal. Keep in mind we are optimizing “hello world” and most time will be spent in the user’s job code.
… On Oct 1, 2019, at 16:27, fatkodima ***@***.***> wrote:
Profiling bin/sidekiqload, I got:
For cpu:
Before:
==================================
Mode: cpu(1000)
Samples: 15951 (1.51% miss rate)
GC: 1804 (11.31%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
10483 (65.7%) 10483 (65.7%) Hiredis::Ext::Connection#read
2226 (14.0%) 2226 (14.0%) Sidekiq::Processor#cloned
1804 (11.3%) 1804 (11.3%) (garbage collection)
704 (4.4%) 704 (4.4%) #<Module:0x00007fd0ae32f718>.parse
137 (0.9%) 107 (0.7%) Sidekiq::LoggingUtils#add
75 (0.5%) 75 (0.5%) Redis::Connection::Hiredis#write
67 (0.4%) 67 (0.4%) Sidekiq::Processor#constantize
59 (0.4%) 59 (0.4%) Redis::Connection::Hiredis#timeout=
10792 (67.7%) 32 (0.2%) ConnectionPool#with
After:
==================================
Mode: cpu(1000)
Samples: 14098 (1.34% miss rate)
GC: 1528 (10.84%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
9625 (68.3%) 9625 (68.3%) Hiredis::Ext::Connection#read
1528 (10.8%) 1528 (10.8%) (garbage collection)
1281 (9.1%) 1281 (9.1%) Sidekiq::Processor#deep_dup
656 (4.7%) 656 (4.7%) #<Module:0x00007f94bcbeb768>.parse
187 (1.3%) 127 (0.9%) Sidekiq::LoggingUtils#add
79 (0.6%) 79 (0.6%) Redis::Connection::Hiredis#write
68 (0.5%) 68 (0.5%) Sidekiq::Processor#constantize
9834 (69.8%) 62 (0.4%) Redis#_bpop
So, 14.0 - 9.1 ≈ 5% ⚡️ cpu savings
For object allocations:
Before:
==================================
Mode: object(1)
Samples: 8206264 (0.00% miss rate)
GC: 0 (0.00%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
2300072 (28.0%) 2300072 (28.0%) Sidekiq::Processor#cloned
1500042 (18.3%) 1500042 (18.3%) #<Module:0x00007fc3751fb568>.parse
4000164 (48.7%) 900028 (11.0%) Sidekiq::Processor#dispatch
2304025 (28.1%) 500591 (6.1%) ConnectionPool#with
400898 (4.9%) 400898 (4.9%) Redis::Connection::Hiredis#write
1200326 (14.6%) 400052 (4.9%) Redis#_bpop
After:
==================================
Mode: object(1)
Samples: 7105331 (0.00% miss rate)
GC: 0 (0.00%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
1500042 (21.1%) 1500042 (21.1%) #<Module:0x00007ff923367648>.parse
1200036 (16.9%) 1200036 (16.9%) Sidekiq::Processor#deep_dup
2900112 (40.8%) 900028 (12.7%) Sidekiq::Processor#dispatch
2303422 (32.4%) 500491 (7.0%) ConnectionPool#with
400749 (5.6%) 400749 (5.6%) Redis::Connection::Hiredis#write
1200306 (16.9%) 400052 (5.6%) Redis#_bpop
So, 28.0 - 16.9 = 11% memory savings.
Profiling with memory_profiler, I got more accurate results for memory:
Before:
Total allocated: 695737343 bytes (6904607 objects)
Total retained: 12702768 bytes (972 objects)
After:
Total allocated: 606804659 bytes (5704475 objects)
Total retained: 12699800 bytes (962 objects)
So, savings in bytes: (695737343 - 606804659) / 695737343.0 * 100 = 13% ⚡️
Savings in objects: (6904607 - 5704475) / 6904607.0 * 100 = 17.5% ⚡️
Another isolated benchmark:
# frozen_string_literal: true
require "benchmark/ips"
def deep_dup(obj)
case obj
when Integer, Float, TrueClass, FalseClass, NilClass
return obj
when String
return obj.dup
when Array
duped = Array.new(obj.size)
obj.each_with_index do |value, index|
duped[index] = deep_dup(value)
end
when Hash
duped = obj.dup
duped.each_pair do |key, value|
duped[key] = deep_dup(value)
end
else
duped = obj.dup
end
duped
end
def deep_dup_without_case(obj)
if Integer === obj || Float === obj || TrueClass === obj || FalseClass === obj || NilClass === obj
return obj
elsif String === obj
return obj.dup
elsif Array === obj
duped = Array.new(obj.size)
obj.each_with_index do |value, index|
duped[index] = deep_dup_without_case(value)
end
elsif Hash === obj
duped = obj.dup
duped.each_pair do |key, value|
duped[key] = deep_dup_without_case(value)
end
else
duped = obj.dup
end
duped
end
obj = { "class" => "FooWorker", "args" => [1, 2, 3, "foobar"], "jid" => "123987123" }
Benchmark.ips do |x|
x.report("marshal") do
Marshal.load(Marshal.dump(obj))
end
x.report("deep_dup") do
deep_dup(obj)
end
x.report("deep_dup_without_case") do
deep_dup_without_case(obj)
end
x.compare!
end
Results:
Warming up --------------------------------------
marshal 11.146k i/100ms
deep_dup 13.970k i/100ms
deep_dup_without_case
16.752k i/100ms
Calculating -------------------------------------
marshal 105.758k (±13.8%) i/s - 523.862k in 5.083462s
deep_dup 178.748k (± 3.3%) i/s - 894.080k in 5.008318s
deep_dup_without_case
200.419k (± 2.0%) i/s - 1.005M in 5.017126s
Comparison:
deep_dup_without_case: 200419.0 i/s
deep_dup: 178748.4 i/s - 1.12x slower
marshal: 105758.2 i/s - 1.90x slower
So while if-else is a bit uglier than case/when, it is measurably faster (about 1.12x in this benchmark).
You can view, comment on, or merge this pull request online at:
#4303
Commit Summary
Optimize cloning of job payload
File Changes
M lib/sidekiq/processor.rb (24)
Patch Links:
https://github.com/mperham/sidekiq/pull/4303.patch
https://github.com/mperham/sidekiq/pull/4303.diff
Previous optimizations also basically optimized "hello world", but with a less observable effect.
I guess I don’t know what the trade-offs are, so I’m unsure. Why is it faster than Marshal? Will there be any compatibility issues? I like the fact that we can simplify because we know that we only need to handle JSON types. Is that why it’s faster?
No, I don't see any compatibility issues.
Yes. For deep-duping generic objects, the code would be a bit larger and uglier (because at minimum it would have to deal with instance variables).
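To illustrate what that would entail, here is a hypothetical sketch (the method name, `Wrapped` class, and overall shape are mine, not from this PR) of a generic deep_dup that also copies instance variables; it ignores cycles and other edge cases a real implementation would need:

```ruby
# Hypothetical sketch, not from this PR: a generic deep_dup that also
# copies instance variables of arbitrary objects. Ignores cycles.
def generic_deep_dup(obj)
  duped =
    case obj
    when Integer, Float, Symbol, TrueClass, FalseClass, NilClass
      return obj # immutable immediates are safe to share
    when Array
      obj.map { |e| generic_deep_dup(e) }
    when Hash
      obj.each_with_object({}) { |(k, v), h| h[k] = generic_deep_dup(v) }
    else
      obj.dup
    end
  # Arbitrary objects also carry state in instance variables,
  # which a shallow #dup copies only by reference.
  obj.instance_variables.each do |ivar|
    duped.instance_variable_set(ivar, generic_deep_dup(obj.instance_variable_get(ivar)))
  end
  duped
end

# Hypothetical carrier object for the example
class Wrapped
  attr_accessor :payload
end
```

This is exactly the extra surface area the comment above alludes to: the JSON-only version gets to skip the instance-variable walk entirely.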
All right, you talked me into it. Thanks for your patience!
I refactored things a bit and realized the code was calling |
@fatkodima The fact that |
@eregon Looks like this is not a bug? Here is a 2016 video (with the proper timestamp) from Aaron Patterson with an explanation.
Hello! I'm surprised the case statement is slower. In my experience MRI optimizes the crap out of them; do you have any idea why it is slower?
From the video: "the case/when does not have a call site cache". I'm not sure what internals are imposing that limitation, but good to know this is faster.
Another Optimization - Recursive Hash
I optimized similar code in sprockets and I've got a killer pattern that I think you can reuse. Since you're comparing classes, you can use the class as a key into a hash that returns a proc to be executed. The procs can reference the same hash, so it becomes a recursive data structure. Hashes have an optimization that allows them to compare essentially by object ID rather than hashing the value, and since classes are singletons you can use this to get a really good perf increase.
Here is a benchmark https://gist.github.com/schneems/3fb96b12d2f88d0700acc5dd7cb437ee; assuming my spike at the code is technically correct, it is 1.17x faster than this PR:
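As a rough illustration of that pattern (my own sketch, not the gist's exact code; `DUPERS` and `deep_dup_dispatch` are hypothetical names), the dispatch table is a hash keyed by class whose values are procs that recurse through the same hash:

```ruby
# Sketch of the class-keyed dispatch pattern described above. Hash lookup
# on a class object compares essentially by identity, avoiding the chain
# of === checks. The procs recurse via the same hash, making it a
# recursive data structure.
DUPERS = {}
DUPERS[Integer]    = ->(obj) { obj }
DUPERS[Float]      = ->(obj) { obj }
DUPERS[TrueClass]  = ->(obj) { obj }
DUPERS[FalseClass] = ->(obj) { obj }
DUPERS[NilClass]   = ->(obj) { obj }
DUPERS[String]     = ->(obj) { obj.dup }
DUPERS[Array]      = ->(obj) { obj.map { |e| DUPERS[e.class].call(e) } }
DUPERS[Hash]       = ->(obj) {
  duped = obj.dup
  duped.each_pair { |k, v| duped[k] = DUPERS[v.class].call(v) }
  duped
}
# Fallback for any class not in the table
DUPERS.default = ->(obj) { obj.dup }

def deep_dup_dispatch(obj)
  DUPERS[obj.class].call(obj)
end
```

Note the lookup is by exact class, so subclasses of Array or Hash fall through to the default proc; for JSON-derived payloads that never happens.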
BTW great PR and thanks for working on perf! I saw this link in Ruby Weekly and wanted to check it out. Always happy to see more people working on perf in the Ruby space :)
Same thought. Yes, impressive speedup using the recursive hash! Btw, I rewatched your great(!) "Slow Mo" presentation not long ago and considered this technique as well. But, as Mike mentioned, we are basically optimizing "hello world" jobs here and most time will be spent in the job code itself, so I am not 100% sure it is worth increasing the complexity further (not by much, in my opinion, but nevertheless).
Like I commented in #4308 and mhenrixon/sidekiq-unique-jobs#427, this merge broke sidekiq completely :) Sometimes performance isn't everything :) If you really need performance, instead of hunting for weird ways of getting more of it out of Ruby, maybe you should have a look at Rust, Go, or Crystal instead?
https://code.dblock.org/2017/02/24/the-demonic-possession-of-hashie-mash.html is what you get when you try to do these things yourself (getting rid of json/marshal dump/parse) |
@mhenrixon I will investigate what broke your build.
Wow, what a thread! Here's my two cents: use Refinements!
The improvement compared to the recursive hash proposed by @schneems is negligible, and it's no surprise: it works almost the same way, just at a different level. https://gist.github.com/palkan/822f5b324d0278e98e95770d39095588
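For readers who haven't opened the gist, here is a rough approximation of the refinement-based idea (my own sketch, not the gist's exact code). Dispatch goes through the ordinary method cache instead of explicit type checks; it assumes Ruby >= 2.4, where #dup on immediates (Integer, nil, true, ...) simply returns the receiver:

```ruby
# Each container type gets its own #deep_dup via a refinement; everything
# else falls back to Object#deep_dup, a plain shallow #dup. Refinements
# defined in a module are active inside that module's other refine blocks,
# so the recursive calls resolve correctly.
module DeepDup
  refine Object do
    def deep_dup
      dup
    end
  end

  refine Array do
    def deep_dup
      map { |e| e.deep_dup }
    end
  end

  refine Hash do
    def deep_dup
      duped = dup
      duped.each_pair { |k, v| duped[k] = v.deep_dup }
      duped
    end
  end
end

using DeepDup
```

The refinement stays invisible outside files that say `using DeepDup`, which is the main appeal over monkey-patching a `deep_dup` onto Object globally.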
So...I actually did that in sprockets. Not sure which version of the "Slow Mo" talk that was, but I actually got faster results with refinements than with a recursive hash. When I gave the talk at RubyConf in Cincinnati, Charles Nutter and Matz were in the audience, and both of them basically told me not to use refinements for performance. Turns out they are super not optimized on JRuby, but maybe that's no longer the case? You could probably ping him on Twitter; he's pretty responsive. Here's my sprockets experiment rails/sprockets#417
I have thoughts about Hashie, especially when it comes to perf https://www.schneems.com/2014/12/15/hashie-considered-harmful.html
@palkan @schneems Here are JRuby numbers, with MRI 2.6.4 for comparison. tl;dr: refinements don't seem to be that slow, but they aren't the fastest result.
Hey @headius! Thanks for the benchmark. Glad to see that refinements are not slow 🙂
Let's start with the obvious:

obj.each_with_index do |value, index|
  duped[index] = deep_dup(value)
end

could be the more idiomatic and faster:

obj.map { |e| deep_dup(e) }

On MRI, before:
After:
The same goes for JRuby and TruffleRuby. BTW, I'll make a PR with that (#4313).
It is a performance bug, I think MRI should inline cache the call to
I think it's a good rule in general to file performance bugs to MRI when idiomatic code which performs the same semantics is slower than more verbose/less idiomatic code like here.
For fun, here is what running the benchmark (using
It's almost an order of magnitude faster than MRI. There is a large error margin, because most time is spent in GC and allocations once the code is well optimized. Maybe we should make
* Prefer the faster Array#map to a manual implementation
* On the benchmark from #4303 on MRI 2.6.4, before 396657.9 i/s, after 461013.3 i/s.
* Use `duped` as the return value in the only branch it's needed
So the cloning adds about 10% overhead on my Mac:
We can't freeze the args because of backwards compatibility -- there's likely middleware and LOTS of jobs which mutate their array or hash arguments somehow. We can't break them. There is a way to rejigger the code so that we don't need to clone: instead we keep the original String of JSON and re-parse it if we need access to the original, untouched job. So the moral of this story is: you can microbenchmark all you want, but the fastest solution is no code at all. 😄
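A minimal sketch of that "keep the raw JSON" idea (the `JobRecord` class and method names here are hypothetical, not Sidekiq's actual API):

```ruby
require "json"

# Sketch only: keep the raw JSON string and lazily parse it. The mutable
# hash is created once on demand; a pristine copy (for retries, the Web
# UI, etc.) is obtained by re-parsing the untouched string, so no deep
# clone is ever needed.
class JobRecord
  def initialize(jobstr)
    @jobstr = jobstr    # original, untouched JSON payload
    @job_hash = nil     # mutable copy, parsed on demand
  end

  # Mutable hash handed to middleware and the job itself.
  def job_hash
    @job_hash ||= JSON.parse(@jobstr)
  end

  # Pristine copy, re-parsed from the raw string on each call.
  def pristine
    JSON.parse(@jobstr)
  end
end
```

The trade-off: each `pristine` call pays a JSON parse, but the common path (no one needs the original) pays nothing at all.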
@mperham |
@eregon I think @mperham agreed; it's also hard to know if a micro-optimization affects the overall runtime of a system, or if we're debating over fractions of fractions. Rails started rejecting perf patches unless they can show that it made things faster in a meaningful way on a Rails app. To that end, I coded up this feature in derailed benchmarks. While it's geared towards Rails, there are docs on how to use it with another library such as sidekiq. There's no background job integration out of the box right now, but you can get it pretty easily doing something like this: zombocom/derailed_benchmarks#145 (comment)
It would be nice if there was a way to safely deprecate mutating an arg. I guess you could patch all the mutating core methods, though that sounds difficult. I'm wondering if there could be a feature to make this easier: a callback proc when an object is mutated, perhaps?
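One blunt, hypothetical way to surface argument mutation (failing hard with FrozenError rather than emitting a deprecation warning, so not the gentle deprecation path discussed above) is to deep-freeze the args:

```ruby
# Illustration only, not an actual Sidekiq feature: recursively freeze a
# JSON-shaped structure so any mutating call raises FrozenError
# (Ruby >= 2.5; earlier Rubies raise RuntimeError instead).
def deep_freeze(obj)
  case obj
  when Array
    obj.each { |e| deep_freeze(e) }
  when Hash
    obj.each_pair { |k, v| deep_freeze(k); deep_freeze(v) }
  end
  obj.freeze
end
```

This finds mutators loudly in tests, but as Mike notes above, shipping it would break existing middleware and jobs that mutate their arguments.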
@fatkodima Go ahead, it would be nice to get a fresh pair of eyes on the problem. The cloning was originally added so that the job_hash data that was sent to the Web UI in |