Improve Oj.dump performance #674
I collected benchmark results on an Intel Mac.
Env
Before
After
93ec83a to c37668a
I made a few comments, but please don't take them the wrong way. Your catch on the use of strncpy was a good one. I'd be curious whether memcpy would be a bit better. Would you mind running the benchmarks to compare? As for the other comments, I'd like to keep the constants on the calls that always return the same length. The two that use the sprintf return value are keepers, so it is better to remove the unused calculations of len earlier in the code.
This patch uses the standard C library to copy the string, because copying one byte at a time is slow.

This patch improves `Oj.dump` performance as follows.

- | before | after | result
-- | -- | -- | --
Oj.dump | 689.236k | 1.853M | 2.69x
Oj.dump (compat) | 476.107k | 827.446k | 1.74x
Oj.dump (rails) | 464.545k | 644.494k | 1.39x

### Environment
- MacBook Air (M1, 2020)
- macOS 12.0 beta 3
- Apple M1
- Ruby 3.0.2

### Before
```
Warming up --------------------------------------
             Oj.dump    69.210k i/100ms
    Oj.dump (compat)    47.123k i/100ms
     Oj.dump (rails)    45.911k i/100ms
Calculating -------------------------------------
             Oj.dump    689.236k (± 0.2%) i/s -      3.460M in   5.020801s
    Oj.dump (compat)    476.107k (± 0.9%) i/s -      2.403M in   5.048128s
     Oj.dump (rails)    464.545k (± 0.9%) i/s -      2.341M in   5.040711s
```

### After
```
Warming up --------------------------------------
             Oj.dump   187.096k i/100ms
    Oj.dump (compat)    82.879k i/100ms
     Oj.dump (rails)    64.371k i/100ms
Calculating -------------------------------------
             Oj.dump      1.853M (± 0.3%) i/s -      9.355M in   5.049406s
    Oj.dump (compat)    827.446k (± 0.2%) i/s -      4.144M in   5.008145s
     Oj.dump (rails)    644.494k (± 0.2%) i/s -      3.283M in   5.093814s
```

### Test code
```ruby
require 'benchmark/ips'
require 'oj'

data = {
  'short_string': 'a' * 50,
  'long_string': 'b' * 255,
  'utf8_string': 'あいうえお' * 10
}

Benchmark.ips do |x|
  x.report('Oj.dump') { Oj.dump(data) }
  x.report('Oj.dump (compat)') { Oj.dump(data, mode: :compat) }
  x.report('Oj.dump (rails)') { Oj.dump(data, mode: :rails) }
end
```
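For anyone re-checking the "result" column above, here is a minimal sketch that recomputes the speedup from the reported iterations-per-second figures. The helper names `to_ips` and `speedup` are hypothetical and are not part of Oj or benchmark-ips. The sketch assumes only the `k` and `M` suffixes that appear in the output above.

```ruby
# Multipliers for the unit suffixes seen in the benchmark output
# (k = thousand, M = million). Hypothetical helper, not a library API.
UNITS = { 'k' => 1_000, 'M' => 1_000_000 }.freeze

# Convert a figure such as "689.236k" or "1.853M" into iterations per second.
def to_ips(text)
  match = text.match(/\A([\d.]+)([kM])?\z/)
  raise ArgumentError, "unrecognized value: #{text}" unless match

  match[1].to_f * UNITS.fetch(match[2], 1)
end

# Speedup of `after` over `before`, rounded to two decimal places.
def speedup(before, after)
  (to_ips(after) / to_ips(before)).round(2)
end

# Figures copied from the table in the pull request description.
results = {
  'Oj.dump'          => ['689.236k', '1.853M'],
  'Oj.dump (compat)' => ['476.107k', '827.446k'],
  'Oj.dump (rails)'  => ['464.545k', '644.494k']
}

results.each do |label, (before, after)|
  puts format('%-18s %.2fx', label, speedup(before, after))
end
```

The rounded ratios should agree with the 2.69x, 1.74x, and 1.39x reported in the table.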
Thank you for the nice improvement. It will go nicely with the parser rewrite I'm in the middle of, but I'll release your improvement sooner than the parser.
This patch uses the standard C library to copy the string. (Ref. ohler55#674)

− | before | after | result
-- | -- | -- | --
Oj.dump | 1.046M | 1.102M | 1.054x

### Environment
- Zorin OS 16
- AMD Ryzen 7 5700G
- gcc version 11.1.0
- Ruby 3.1.0

### Before
```
Warming up --------------------------------------
             Oj.dump   106.035k i/100ms
Calculating -------------------------------------
             Oj.dump      1.046M (± 1.0%) i/s -     15.799M in  15.098842s
```

### After
```
Warming up --------------------------------------
             Oj.dump   112.786k i/100ms
Calculating -------------------------------------
             Oj.dump      1.102M (± 1.1%) i/s -     16.580M in  15.051956s
```

### Test code
```ruby
require 'benchmark/ips'
require 'oj'

data = {
  float: 3.141592653589793,
  fixnum: 2 ** 60
}

Benchmark.ips do |x|
  x.time = 15
  x.report('Oj.dump') { Oj.dump(data, mode: :compat) }
end
```
The standard C library may use SIMD instructions, which would make it faster than our own code.

Similar:
- ohler55#734
- ohler55#674

− | before | after | result
-- | -- | -- | --
Oj.dump (macOS) | 1.699M | 2.020M | 1.189x
Oj.dump (Linux) | 1.849M | 2.260M | 1.222x

### Environment
- macOS
  - macOS 12.1
  - Apple M1 Max
  - Apple clang version 13.0.0 (clang-1300.0.29.30)
  - Ruby 3.1.0
- Linux
  - Zorin OS 16
  - AMD Ryzen 7 5700G
  - gcc version 11.1.0
  - Ruby 3.1.0

### macOS
#### Before
```
Warming up --------------------------------------
             Oj.dump   169.730k i/100ms
Calculating -------------------------------------
             Oj.dump      1.699M (± 0.7%) i/s -     25.629M in  15.089624s
```

#### After
```
Warming up --------------------------------------
             Oj.dump   201.206k i/100ms
Calculating -------------------------------------
             Oj.dump      2.020M (± 0.9%) i/s -     30.382M in  15.044372s
```

### Linux
#### Before
```
Warming up --------------------------------------
             Oj.dump   180.943k i/100ms
Calculating -------------------------------------
             Oj.dump      1.849M (± 1.1%) i/s -     27.865M in  15.072276s
```

#### After
```
Warming up --------------------------------------
             Oj.dump   224.695k i/100ms
Calculating -------------------------------------
             Oj.dump      2.260M (± 1.4%) i/s -     33.929M in  15.012352s
```

### Test code
```ruby
require 'benchmark/ips'
require 'oj'

data = {
  true: (0..10).map { true },
  false: (0..10).map { false },
  null: (0..10).map { nil },
}

Benchmark.ips do |x|
  x.time = 15
  x.report('Oj.dump') { Oj.dump(data) }
end
```