sha256 has very wildly varying performance compared to ring between computers (same binary) #565

VorpalBlade · 2024-02-20T20:20:25Z

On my desktop, ring and sha256 are very close in performance when built with level 2 optimisation and fat LTO. On my older laptop, the exact same binaries show a 1.76x performance difference, with ring being faster.

To put this into more specific terms I decided to clone this repo (revision c38787b as it happened) and run your sha2 bench on each system. I hope I'm running this correctly: cargo +nightly bench --package=sha2 is the command I used on both (since you seem to use nightly benching, not criterion).

Ring and sha2 don't have comparable benchmarks unfortunately, so I adapted yours (code at the bottom of this issue for adding ring):

Desktop:

test sha256_10         ... bench:           6 ns/iter (+/- 0) = 1666 MB/s
test sha256_100        ... bench:          51 ns/iter (+/- 0) = 1960 MB/s
test sha256_1000       ... bench:         441 ns/iter (+/- 1) = 2267 MB/s
test sha256_10000      ... bench:       4,348 ns/iter (+/- 9) = 2299 MB/s

test sha256_ring_10    ... bench:           7 ns/iter (+/- 0) = 1428 MB/s
test sha256_ring_100   ... bench:          50 ns/iter (+/- 0) = 2000 MB/s
test sha256_ring_1000  ... bench:         442 ns/iter (+/- 1) = 2262 MB/s
test sha256_ring_10000 ... bench:       4,348 ns/iter (+/- 8) = 2299 MB/s

test sha512_10         ... bench:          15 ns/iter (+/- 1) = 666 MB/s
test sha512_100        ... bench:         137 ns/iter (+/- 0) = 729 MB/s
test sha512_1000       ... bench:       1,320 ns/iter (+/- 1) = 757 MB/s
test sha512_10000      ... bench:      12,957 ns/iter (+/- 27) = 771 MB/s

test sha512_ring_10    ... bench:          17 ns/iter (+/- 0) = 588 MB/s
test sha512_ring_100   ... bench:         136 ns/iter (+/- 0) = 735 MB/s
test sha512_ring_1000  ... bench:       1,297 ns/iter (+/- 4) = 771 MB/s
test sha512_ring_10000 ... bench:      12,869 ns/iter (+/- 29) = 777 MB/s

Laptop:

test sha256_10         ... bench:          43 ns/iter (+/- 0) = 232 MB/s
test sha256_100        ... bench:         410 ns/iter (+/- 17) = 243 MB/s
test sha256_1000       ... bench:       3,983 ns/iter (+/- 143) = 251 MB/s
test sha256_10000      ... bench:      39,179 ns/iter (+/- 423) = 255 MB/s

test sha256_ring_10    ... bench:          29 ns/iter (+/- 0) = 344 MB/s
test sha256_ring_100   ... bench:         241 ns/iter (+/- 1) = 414 MB/s
test sha256_ring_1000  ... bench:       2,244 ns/iter (+/- 25) = 445 MB/s
test sha256_ring_10000 ... bench:      21,814 ns/iter (+/- 115) = 458 MB/s

test sha512_10         ... bench:          25 ns/iter (+/- 0) = 400 MB/s
test sha512_100        ... bench:         227 ns/iter (+/- 3) = 440 MB/s
test sha512_1000       ... bench:       2,131 ns/iter (+/- 21) = 469 MB/s
test sha512_10000      ... bench:      20,873 ns/iter (+/- 518) = 479 MB/s

test sha512_ring_10    ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512_ring_100   ... bench:         171 ns/iter (+/- 2) = 584 MB/s
test sha512_ring_1000  ... bench:       1,593 ns/iter (+/- 50) = 627 MB/s
test sha512_ring_10000 ... bench:      15,198 ns/iter (+/- 610) = 657 MB/s
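For reference, the throughput and ratio figures can be re-derived from the ns/iter columns above. This is a minimal sketch: the helper names `throughput_mb_s` and `slowdown` are made up for illustration, and the integer-division formula is an assumption that happens to reproduce the MB/s values printed by the bench harness.

```rust
/// Hypothetical helper: MB/s from bytes processed per iteration and ns per iteration.
/// Assumes truncating integer division, which matches the figures in the output above.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Hypothetical helper: how many times longer sha2 takes than ring for the same input.
fn slowdown(sha2_ns: u64, ring_ns: u64) -> f64 {
    sha2_ns as f64 / ring_ns as f64
}

fn main() {
    // Figures copied from the sha256 10000-byte rows above.
    let (desktop_sha2, desktop_ring) = (4_348, 4_348);
    let (laptop_sha2, laptop_ring) = (39_179, 21_814);

    println!("laptop sha2: {} MB/s", throughput_mb_s(10_000, laptop_sha2));
    println!("laptop ring: {} MB/s", throughput_mb_s(10_000, laptop_ring));
    println!("desktop sha2/ring: {:.2}x", slowdown(desktop_sha2, desktop_ring));
    println!("laptop sha2/ring: {:.2}x", slowdown(laptop_sha2, laptop_ring));
}
```

At the 10000-byte size the two crates are level on the desktop, while on the laptop sha2 takes roughly 1.8 times as long as ring.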

I want to stress, this is the exact same binary both machines are running. Something really strange is going on with sha2 here, and whatever that is should be fixed. I would expect both ring and sha2 to scale similarly when moving between machines. Here are the specs for each machine. I expect that it is the CPU that is most interesting.

Desktop:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 5600X 6-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  73%
    CPU max MHz:         4650,2920
    CPU min MHz:         2200,0000
    BogoMIPS:            7402,12
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good
                          nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy
                          svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstat
                         e ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cq
                         m_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassi
                         sts pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    32 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; Safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Laptop:

$ lscpu             
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
    CPU family:          6
    Model:               142
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            10
    CPU(s) scaling MHz:  20%
    CPU max MHz:         4000,0000
    CPU min MHz:         400,0000
    BogoMIPS:            3999,93
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art a
                         rch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse
                         4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpi
                         d ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_no
                         tify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

Both machines have 32 GB RAM. The laptop is a Thinkpad T480. The desktop I built myself.

Finally here is the code I added at the end of the sha2 bench (I also ran cargo add --package=sha2 --dev ring):

#[macro_export]
macro_rules! bench_update_ring {
    (
        $init:expr;
        $($name:ident $bs:expr;)*
    ) => {
        $(
            #[bench]
            fn $name(b: &mut Bencher) {
                let mut d = $init;
                let data = [0; $bs];

                b.iter(|| {
                    d.update(&data[..]);
                });

                b.bytes = $bs;
            }
        )*
    };
}

bench_update_ring!(
    ring::digest::Context::new(&ring::digest::SHA256);
    sha256_ring_10 10;
    sha256_ring_100 100;
    sha256_ring_1000 1000;
    sha256_ring_10000 10000;
);

bench_update_ring!(
    ring::digest::Context::new(&ring::digest::SHA512);
    sha512_ring_10 10;
    sha512_ring_100 100;
    sha512_ring_1000 1000;
    sha512_ring_10000 10000;
);

Obviously I'm going with ring for my use case, since it is much faster than sha2 on one of the computers I will run it on.
I plan on compiling for baseline x86-64 regardless since I want a redistributable binary, so targeting the native CPU for each is not really relevant.

Let me know if you need any additional info, because I don't believe sha2 should scale so differently than ring.

newpavlov · 2024-02-21T05:14:12Z

Nothing strange here. sha2 supports autodetection of target features, and on x86 it will use SHA-NI instructions if they are available on the host CPU. Your AMD CPU has them (note sha_ni in its lscpu flags) and the Intel one does not. We know that our software backend is somewhat slower than ring's implementation, see this issue for more information: #327
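To check a given machine, runtime detection of the SHA extensions can be sketched with the standard library alone. This is not sha2's actual detection code; `host_has_sha_ni` and `backend_name` are hypothetical helpers for illustration, and the only real API used is the `is_x86_feature_detected!` macro with the `"sha"` feature name.

```rust
// Hypothetical helper, not part of sha2 or ring: reports whether the host CPU
// advertises the x86 SHA extensions (listed as `sha_ni` in lscpu output).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn host_has_sha_ni() -> bool {
    std::arch::is_x86_feature_detected!("sha")
}

// On non-x86 targets the x86 SHA extensions cannot be present.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn host_has_sha_ni() -> bool {
    false
}

// Hypothetical label for the kind of backend a runtime-dispatching
// implementation would be expected to pick.
fn backend_name(has_sha_ni: bool) -> &'static str {
    if has_sha_ni {
        "hardware (SHA-NI)"
    } else {
        "software"
    }
}

fn main() {
    let detected = host_has_sha_ni();
    println!("SHA-NI detected: {detected}");
    println!("expected backend: {}", backend_name(detected));
}
```

Because the check happens at runtime, the same baseline x86-64 binary can take different code paths on different machines, which is consistent with the identical binary behaving differently on the two CPUs above.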

tarcieri · 2024-02-21T15:10:33Z

Closing in favor of #327

tarcieri closed this as not planned on Feb 21, 2024