
Shift in performance characteristic between 1.6.2 and master #848

Open
Licenser opened this issue Aug 3, 2020 · 18 comments

@Licenser (Contributor) commented Aug 3, 2020

NOTE: This is not a complaint, just an observation that might be useful for discussing / designing / improving the scheduling model.

Hi, I ran our benchmarks with 1.6.2 and with current master after #836 was merged, and I noticed a rather significant change in performance characteristics.

The following two tables show the benchmark throughput in MB/s at different numbers of concurrent tasks and different CPU setups.

The benchmarks are run on a Ryzen Threadripper 24-core (48-thread) CPU. It's worth noting that this CPU houses (simplified) 8 CCXs (3 cores / 6 threads each) that each share their cache; otherwise the CCXs communicate over an internal bus (somewhat acting as different NUMA domains).

So in the 3-core test the three cores are specifically picked to be on the same CCX, meaning they share their cache; in the 48-core tests all cores (or threads) are used, meaning no caches are shared by default.
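For reference, which logical CPUs share an L3 cache (and therefore sit on the same CCX) can be read from sysfs on Linux. The sketch below is purely illustrative and not part of the benchmark setup; it only assumes the standard `/sys/devices/system/cpu/*/cache/index3/shared_cpu_list` files:

```rust
// Group logical CPUs by the set of CPUs they share an L3 cache with
// (on Zen 2 this corresponds to a CCX). Illustrative sketch only.
use std::collections::BTreeMap;
use std::fs;

fn main() -> std::io::Result<()> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().unwrap().to_string_lossy().into_owned();
        // only cpu0, cpu1, ... (skip cpufreq, cpuidle, ...)
        if !name.starts_with("cpu") || !name[3..].chars().all(|c| c.is_ascii_digit()) {
            continue;
        }
        // index3 is the (shared) L3 cache on typical x86 systems
        let l3 = path.join("cache/index3/shared_cpu_list");
        if let Ok(shared) = fs::read_to_string(&l3) {
            groups.entry(shared.trim().to_owned()).or_default().push(name);
        }
    }
    for (shared, cpus) in groups {
        // e.g. "L3 shared by 0-2,24-26: ["cpu0", "cpu1", "cpu2", ...]"
        println!("L3 shared by {}: {:?}", shared, cpus);
    }
    Ok(())
}
```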

It looks like in 1.6.2 related tasks were rather likely to be scheduled on the same executor (or an executor on the same CCX), so the three steps in the benchmark, deserialize -> process -> serialize, were more often executed on the same CCX (leading to higher performance in the 48-thread case), but it was overall less performant (in the 3-core case).

master switches this around: it seems to more often than not schedule related tasks on new cores, leading to (sometimes much) lower performance in the 48-core case but much better performance on the limited core set with the shared cache.

We observed the same with threads (to a stronger degree): threads outperformed tasks on the 3-core setup, but unless pinned they would be scheduled around so much that the loss of cache locality would obliterate performance.

It seems that master is very friendly to UMA systems (or generally systems where all cores share a cache) but less friendly to NUMA systems that don't have that luxury (counting the Ryzen platform here, even though it's a bit of a special case).

1.6.2 (throughput in MB/s)

| # tasks / 2 | 3 cores | 48 cores |
|-------------|---------|----------|
| 1           | 181.7   | 84.9     |
| 2           | 276.1   | 337.3    |
| 4           | 291.5   | 333.3    |
| 8           | 260.9   | 325.9    |
| 16          | 254.7   | 320.2    |
| 32          | 250.3   | 302.8    |
| 64          | 243.2   | 305.8    |

master (throughput in MB/s)

| # tasks / 2 | 3 cores | 48 cores |
|-------------|---------|----------|
| 1           | 273.8   | 128.6    |
| 2           | 339.1   | 292.3    |
| 4           | 342.7   | 234.2    |
| 8           | 308.3   | 209.9    |
| 16          | 265.6   | 210.9    |
| 32          | 245.9   | 201.2    |
| 64          | 229.8   | 197.4    |
@yoshuawuyts (Contributor) commented:

cc/ @stjepang

@ghost commented Aug 3, 2020

Ok this actually seems like great news. I'm pretty confident I know what's going on and am hopeful that we can get the best of both worlds! I'll make an experiment later today and make a branch of async-std so we can compare it to 1.6.2 and master.

Briefly:

  • 1.6.2 was notifying executor threads in a silly round-robin manner, while master notifies executor threads in a LIFO manner. This is better because by waking the thread that went to sleep most recently, we have the highest chances of keeping caches warm. I believe this is what contributes to better performance on UMA systems.
  • 1.6.2 schedules tasks onto the local executor thread, while master schedules tasks onto the global queue (lots of contention!) because I was lazy when making multitask. I believe this is what makes performance worse on NUMA systems and is what I'll try fixing today.
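A minimal sketch of the two scheduling strategies being contrasted in the second point; this is illustrative only and not the actual multitask/async-std code:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// A task here is just "something runnable"; the real executors use
// async-task handles instead.
type Task = Box<dyn FnOnce() + Send>;

// What master/multitask does at the moment: a single global queue that every
// executor thread pushes to and pops from, i.e. one contended lock that also
// gets bounced between CCXs/NUMA nodes.
#[derive(Clone, Default)]
struct GlobalQueue(Arc<Mutex<VecDeque<Task>>>);

impl GlobalQueue {
    fn schedule(&self, task: Task) {
        self.0.lock().unwrap().push_back(task);
    }
}

thread_local! {
    // What 1.6.2 does for wakeups on an executor thread: push onto that
    // thread's own queue, so the task tends to run where its data is warm.
    static LOCAL: RefCell<VecDeque<Task>> = RefCell::new(VecDeque::new());
}

fn schedule_local(task: Task) {
    LOCAL.with(|q| q.borrow_mut().push_back(task));
}

fn main() {
    let global = GlobalQueue::default();
    global.schedule(Box::new(|| println!("ran via global queue")));
    schedule_local(Box::new(|| println!("ran via local queue")));

    // Drain both queues on this thread just to show the bookkeeping.
    if let Some(t) = global.0.lock().unwrap().pop_front() {
        t();
    }
    LOCAL.with(|q| {
        if let Some(t) = q.borrow_mut().pop_front() {
            t();
        }
    });
}
```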

@Licenser (Contributor, Author) commented Aug 3, 2020

> Ok this actually seems like great news. I'm pretty confident I know what's going on and am hopeful that we can get the best of both worlds! I'll make an experiment later today and make a branch of async-std so we can compare it to 1.6.2 and master.

I totally agree :) again, this wasn't meant as a complaint at all. We are mostly on the left side (UMA + few cores) anyway in production so the speed boost here is absolutely great!

I just have the Threadripper to work on / run benchmarks on, and figured the change in behavior is worth sharing since, if nothing else, it's interesting and perhaps not obvious.

BTW, great work on the changes to master, this comes really, really close to thread performance <3.

As a side note: master is within 10-15% of what we used to get with ideally laid-out threaded code (i.e. 3 threads on 3 cores, each working over a crossbeam::channel and doing roughly the same amount of work). This is really amazing!

@ghost commented Aug 3, 2020

@Licenser Could you perhaps try this branch? https://github.com/stjepang/async-std/tree/experiment

@Licenser (Contributor, Author) commented Aug 4, 2020

Sorry for the late reply, the tests take a while :)

I re-ran the 1.6.2 branch since I did some CPU fiddling yesterday evening and wanted to make sure it's a fair comparison.

The general bump in throughput is due to the aforementioned CPU fiddling, but the delta between the 48-thread variants looks slightly lower.

1.6.2 (throughput in MB/s)

| # tasks / 2 | 3 cores | 48 threads |
|-------------|---------|------------|
| 1           | 133.0   | 111.4      |
| 2           | 313.1   | 365.8      |
| 4           | 331.9   | 374.2      |
| 8           | 300.3   | 357.2      |
| 16          | 287.6   | 341.5      |
| 32          | 279.4   | 327.8      |
| 64          | 267.1   | 330.5      |

experiment (throughput in MB/s)

| # tasks / 2 | 3 cores | 48 threads |
|-------------|---------|------------|
| 1           | 381.2   | 138.6      |
| 2           | 421.3   | 324.6      |
| 4           | 378.8   | 253.4      |
| 8           | 353.9   | 244.6      |
| 16          | 324.7   | 239.8      |
| 32          | 315.3   | 233.4      |
| 64          | 319.7   | 227.1      |

ratio 1.6.2 / candidate (computed on the 48-thread columns)

| # tasks / 2 | master | experiment |
|-------------|--------|------------|
| 1           | 0.6602 | 0.8038     |
| 2           | 1.1553 | 1.1269     |
| 4           | 1.4231 | 1.4767     |
| 8           | 1.5526 | 1.4603     |
| 16          | 1.5183 | 1.4241     |
| 32          | 1.5050 | 1.4045     |
| 64          | 1.5491 | 1.4553     |

perf

For completeness, I added a perf recording of the 64-task / 48-thread run:

```
   1,78%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*, unsigned int)::$_0::operator()
   1,67%  rustc            rustc                             [.] free
   1,63%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::Linker::linkInModule
   1,19%  rustc            rustc                             [.] malloc
   1,04%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] combineInstructionsOverFunction
   0,93%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PointerMayBeCaptured
   0,71%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::removeUnreachableBlocks
   0,71%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::SmallPtrSetImplBase::insert_imp_big
   0,69%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] computeKnownBitsFromOperator
   0,66%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::ValueHandleBase::AddToUseList
   0,65%  async-std/runti  tremor-server                     [.] rust_alloc
   0,60%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::Instruction::getMetadataImpl
   0,54%  rustc            libc-2.31.so                      [.] __memcmp_avx2_movbe
   0,48%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::DILocation::getImpl
   0,44%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::SelectionDAG::Combine
   0,43%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::MetadataTracking::track
   0,42%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::MetadataTracking::untrack
   0,40%  async-std/runti  tremor-server                     [.] simd_json::stage2::<impl simd_json::Deserializer>::build_tape
   0,39%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] (anonymous namespace)::CallAnalyzer::analyze
   0,39%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::DataLayout::getTypeSizeInBits
   0,38%  async-std/runti  [kernel.kallsyms]                 [k] psi_task_change
   0,37%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PMTopLevelManager::findAnalysisPass
   0,36%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::CallBase::onlyReadsMemory
   0,36%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::InlineFunction
   0,35%  async-std/runti  libc-2.31.so                      [.] __memmove_avx_unaligned_erms
   0,35%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PMDataManager::removeNotPreservedAnalysis
   0,34%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] eliminateDeadStores
   0,34%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] (anonymous namespace)::LiveDebugValues::ExtendRanges
   0,33%  async-std/runti  tremor-server                     [.] simd_json::value::borrowed::serialize::FastGenerator::write_json
   0,31%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] computeKnownBits
   0,31%  rustc            libLLVM-10-rust-1.45.0-stable.so  [.] llvm::wouldInstructionBeTriviallyDead
   0,30%  async-std/runti  [kernel.kallsyms]                 [k] delay_mwaitx
   0,29%  rustc            libc-2.31.so                      [.] __memmove_sse2_unaligned_erms
```

@ghost commented Aug 4, 2020

Thank you for such a detailed report! Two more questions:

  1. When testing with async-std v1.6.2, what is the smol version used (what version appears in Cargo.lock or cargo tree)?
  2. Where is this benchmark from? I'd like to run it on my own to investigate further...

@Licenser (Contributor, Author) commented Aug 4, 2020

No worries, I appreciate the help! smol and async-std are incredible :D

  1. cargo tree says (from the top level it's imported again):

     ```
     ├── async-std v1.6.2
     │   └── smol v0.1.18
     ├── smol v0.3.3
     ```

  2. https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait
     You can compile it with `cargo build -p tremor-server --release` and then run `sh bench/scale.sh` to run the stack of tests. It uses `taskset -c 0,1,2` in there (you might need to adjust that to fit your system).

@ghost commented Aug 4, 2020

Do you know how to fix this error?

```
tremor-runtime source-trait $ target/release/tremor-server --no-api -c ./bench/real-workflow-throughput-json-8-8-1.yaml bench/link.yaml
tremor version: 0.9.0
rd_kafka version: 0x000002ff, 1.4.2
[2020-08-04T14:25:57Z ERROR tremor_pipeline::op::runtime::tremor] Error:
        1 | use std::array;
          |                 Module `std/array` not found or not readable error in module path:
        2 | match event.application of
        3 |   case "app1" => let $class = "applog_app1",  let $rate = 1250, let $dimension = event.application, emit event


[2020-08-04T14:25:57Z ERROR tremor_server] error: Module `std/array` not found or not readable error in module path:
error: Module `std/array` not found or not readable error in module path:
```

@Licenser (Contributor, Author) commented Aug 4, 2020

Sorry about that, you've got to tell tremor where its standard library is. This should work:

`export TREMOR_PATH=./bench/../tremor-script/lib`

@ghost commented Aug 4, 2020

Thank you, that worked! :) I haven't made much progress, however, because compilation is really really slow. Do you think it would be possible to make a benchmark that does roughly the same thing, but in a simpler program that is quicker to compile?

@Licenser (Contributor, Author) commented Aug 4, 2020

I'll give it a try tomorrow morning. A simple deserialize on one side, running some script in the middle, and serialize on the other might be good enough (famous last words). I'll see how that goes; it lacks a bit of fidelity, but perhaps it models the workload well enough.

@Licenser (Contributor, Author) commented Aug 5, 2020

Heya, I added a little (very simplified) test:

```sh
cargo build -p runtime-bench --release && strip target/release/runtime-bench
for i in 1 2 4 8 16 32 64 128
do
  taskset -c 0,1,2 ./target/release/runtime-bench $i
  ./target/release/runtime-bench $i
done
```

You can run it, and with a bit of text editing it spits out some throughput measurements.

It's not a true replication of the original benchmark: it leaves out the whole return-channel shenanigans we're doing, as well as the mini interpreter we run against each event. The differences are not as big, but I think we can observe the same pattern: in 1.6.2, low core count / shared cache is slower, but high core count / no shared cache is faster compared to experiment.
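The shape of the simplified benchmark is roughly a pipeline of spawned tasks connected by channels. The following is only a rough sketch of that shape (using the current `async_std::channel` API), not the actual runtime-bench code:

```rust
// Rough sketch of the benchmark's shape: a deserialize -> process -> serialize
// pipeline of spawned tasks connected by channels. Stage bodies are stand-ins
// for the real work.
use async_std::channel::{bounded, Receiver, Sender};
use async_std::task;

async fn stage(rx: Receiver<Vec<u8>>, tx: Sender<Vec<u8>>) {
    while let Ok(mut event) = rx.recv().await {
        // stand-in for deserialize / script / serialize work
        for b in event.iter_mut() {
            *b = b.wrapping_add(1);
        }
        if tx.send(event).await.is_err() {
            break;
        }
    }
}

fn main() {
    task::block_on(async {
        let (in_tx, in_rx) = bounded(64);
        let (mid_tx, mid_rx) = bounded(64);
        let (out_tx, out_rx) = bounded(64);

        // the "deserialize" and "serialize" ends of the pipeline
        task::spawn(stage(in_rx, mid_tx));
        task::spawn(stage(mid_rx, out_tx));

        // producer: feeds events into the pipeline, then closes it by dropping in_tx
        task::spawn(async move {
            for _ in 0..100_000u32 {
                if in_tx.send(vec![0u8; 1024]).await.is_err() {
                    break;
                }
            }
        });

        // consumer: counts what made it through all stages
        let mut events = 0u64;
        while out_rx.recv().await.is_ok() {
            events += 1;
        }
        println!("processed {} events", events);
    });
}
```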

I was unsure why it said 0 for a concurrency of 128, since the code shouldn't be locking up. (This turned out to be thanks to the wonders of non-monotonic time <3, I updated the numbers.)

experiment

| concurrency | 3c         | 48c        |
|-------------|------------|------------|
| 1           | 382.3 MB/s | 244.1 MB/s |
| 2           | 486.5 MB/s | 238.6 MB/s |
| 4           | 425.3 MB/s | 199.0 MB/s |
| 8           | 312.6 MB/s | 205.3 MB/s |
| 16          | 282.6 MB/s | 201.6 MB/s |
| 32          | 452.8 MB/s | 193.4 MB/s |
| 64          | 433.4 MB/s | 186.3 MB/s |
| 128         | 398.4 MB/s | 182.9 MB/s |

1.6.2

| concurrency | 3c         | 48c        |
|-------------|------------|------------|
| 1           | 320.2 MB/s | 198.6 MB/s |
| 2           | 275.0 MB/s | 240.3 MB/s |
| 4           | 342.5 MB/s | 262.0 MB/s |
| 8           | 381.5 MB/s | 274.2 MB/s |
| 16          | 376.2 MB/s | 289.4 MB/s |
| 32          | 371.6 MB/s | 330.1 MB/s |
| 64          | 369.3 MB/s | 394.7 MB/s |
| 128         | 381.0 MB/s | 396.2 MB/s |

-- edit below

threads

For the sake of it, I translated the benchmark to threads as well (same repo, thread-bench):

| concurrency | 3c         | 48c        |
|-------------|------------|------------|
| 1           | 341.4 MB/s | 146.0 MB/s |
| 2           | 416.6 MB/s | 296.2 MB/s |
| 4           | 284.0 MB/s | 269.9 MB/s |
| 8           | 143.7 MB/s | 203.6 MB/s |
| 16          | 46.2 MB/s  | 190.9 MB/s |
| 32          | 17.4 MB/s  | 184.6 MB/s |
| 64          | 4.0 MB/s   | 164.4 MB/s |
| 128         | 0.8 MB/s   | 153.6 MB/s |

code is here:
https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait/thread-bench
https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait/runtime-bench

@Licenser (Contributor, Author) commented Aug 6, 2020

Since it ended up not needing any dependencies on tremor-specific code, I moved this out to its own repo (less cloning / .lock stuff): https://github.com/wayfair-tremor/rt-bench

@ghost commented Sep 7, 2020

Some thoughts... Now that async-executor is stable (docs), I wonder if it would be more worthwhile to detect NUMA nodes and then create one executor and a threadpool per NUMA node. What do you think? Have you perhaps tried that? It may also make sense for async-std to figure out the NUMA architecture of the system and then create a threadpool per node.
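A rough sketch of what "one executor plus a threadpool per NUMA node" could look like on top of async-executor; the thread pinning is left as a comment and the overall shape here is just an illustration, not a finished design:

```rust
// Sketch: one async-executor instance per NUMA node, each driven by its own
// pool of worker threads. Assumes the `async-executor` and `futures-lite`
// crates; core pinning is elided.
use async_executor::Executor;
use futures_lite::future;
use std::sync::Arc;
use std::thread;

fn executor_for_node(node: usize, threads: usize) -> Arc<Executor<'static>> {
    let ex = Arc::new(Executor::new());
    for _ in 0..threads {
        let ex = ex.clone();
        thread::spawn(move || {
            // Pin this thread to one of `node`'s cores here
            // (e.g. via sched_setaffinity); elided in this sketch.
            let _ = node;
            // Run the executor forever on this worker thread.
            future::block_on(ex.run(future::pending::<()>()));
        });
    }
    ex
}

fn main() {
    // e.g. two NUMA nodes with 3 worker threads each
    let nodes: Vec<_> = (0..2).map(|n| executor_for_node(n, 3)).collect();

    // Related tasks go to the same node's executor so they share caches.
    let task = nodes[0].spawn(async { 1 + 1 });
    let result = future::block_on(task);
    println!("{}", result);
}
```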

Looking at the numbers again, it seems that 3 cores are always faster than 48 cores :) So perhaps being less efficient makes you faster at higher core counts (ironically). I don't have proof for this hypothesis, but it might be why v1.6.2 is winning at 48c benchmarks.

@Licenser (Contributor, Author) commented Sep 8, 2020

I've been thinking about that too, but I'm not super sure how to best approach that.

I think to automate that, tasks would need to "understand" how they are related, i.e. if task-a and task-b communicate they probably want to share a NUMA executor; if they don't, they probably don't want to.

It would be super cool to be able to describe those relations, but to be fair I'm not sure whether the overhead of that would slow things down (due to the extra logic etc.).

The whole thing gets a bit more complicated with systems like Ryzen, where NUMA isn't enough to explain the architecture. Even though a CCD exposes itself as a single NUMA node, the two CCXs inside it don't share all the caches, so having a task move between CCXs inside a CCD would still cause a slowdown. I found no way to determine that other than measuring distance with crossbeam channels (see: https://github.com/Licenser/trb), so if there are NUMA-aware executors, perhaps making the grouping of cores pluggable would be nice. That way it would be possible to write discoverers to create groupings.
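For reference, the core of that measuring approach is just a pinned ping-pong over crossbeam channels, roughly like the sketch below (using the `core_affinity` crate for pinning is an assumption here; trb may do it differently):

```rust
// Measure the "distance" between two cores via a crossbeam-channel ping-pong.
// Cores on the same CCX should show a noticeably lower round-trip time than
// cores on different CCXs/CCDs. Illustrative sketch only.
use crossbeam_channel::bounded;
use std::thread;
use std::time::{Duration, Instant};

fn ping_pong(core_a: usize, core_b: usize, rounds: u64) -> Duration {
    let (tx_ab, rx_ab) = bounded::<u64>(1);
    let (tx_ba, rx_ba) = bounded::<u64>(1);

    let pong = thread::spawn(move || {
        // pin the responder to core_b (assumed helper crate)
        core_affinity::set_for_current(core_affinity::CoreId { id: core_b });
        for _ in 0..rounds {
            let v = rx_ab.recv().unwrap();
            tx_ba.send(v).unwrap();
        }
    });

    // pin the pinger to core_a and time the round trips
    core_affinity::set_for_current(core_affinity::CoreId { id: core_a });
    let start = Instant::now();
    for i in 0..rounds {
        tx_ab.send(i).unwrap();
        rx_ba.recv().unwrap();
    }
    let elapsed = start.elapsed();
    pong.join().unwrap();
    elapsed / (rounds as u32)
}

fn main() {
    println!("0 <-> 1:  {:?}", ping_pong(0, 1, 100_000));
    println!("0 <-> 47: {:?}", ping_pong(0, 47, 100_000));
}
```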

@ghost commented Sep 8, 2020

I'm curious: when deploying this to production, do you use all available cores or just a small number of them (which I would expect to deliver the best performance)?

Some interesting reading:

EDIT: fixed the second link

@Licenser (Contributor, Author) commented Sep 8, 2020

The answer to that is tricky, so the answer to deployment will be a bit longer ;)

tremor.rs (where we use smol/async-std) is open source so we often have no control over the deployment. Internally at Wayfair, we currently deploy mostly on VMs with small core numbers (2c4t) so they will be on the same NUMA node. (That deployment strategy isn't ideal for everyone.)

We recently moved from threads to async (some writeup on that: https://www.tremor.rs/blog/2020-08-06-to-async-or-not-to-async/ ) with the goal of including parallelism where possible, i.e. instead of having 3 threads on a system we might have 30 tasks. Naive thread implementations don't scale well; async and async-std/smol do a lot of the heavy lifting (yay, thank you! <3).

For benchmarks, we run on both Intel and AMD systems to get a feeling for how the software behaves on each, given that the characteristics of the platforms are now different enough that simply optimizing for one might bite you on the other. And the differences are quite interesting, especially since, unlike NUMA, they are not easily or automatically discoverable.

I've read the 3rd article, can't access the 2nd (thanks google!) and the 1st one is new :) I'll go take a read.

@Licenser (Contributor, Author) commented:

cc @mfelsche
