
Don't use RwLocks #3

Open
asonix opened this issue Jan 11, 2019 · 15 comments

@asonix

asonix commented Jan 11, 2019

Using RwLocks in a tokio threadpool can cause performance issues: one future on one thread can block futures on other threads, even when those threads could otherwise be doing useful work.
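
Roughly, the failure mode looks like this minimal sketch (tokio 0.1 / futures 0.1, with a placeholder Vec instead of the real game state): while the writer task sleeps with the write guard held, the reader task parks its entire worker thread on read(), and that thread can't pick up any other queued futures in the meantime.

// Minimal sketch of the problem, not code from the repo.
// [dependencies] futures = "0.1", tokio = "0.1"
use std::sync::{Arc, RwLock};
use std::{thread, time::Duration};

use futures::future::lazy;

fn main() {
    let state = Arc::new(RwLock::new(Vec::<u32>::new()));

    let writer = {
        let state = state.clone();
        lazy(move || {
            let mut guard = state.write().unwrap();
            thread::sleep(Duration::from_millis(50)); // stand-in for slow work under the lock
            guard.push(1);
            Ok::<(), ()>(())
        })
    };

    let reader = {
        let state = state.clone();
        lazy(move || {
            // Blocks this worker thread until the writer releases the lock,
            // starving every other future queued on the same thread.
            println!("entities: {}", state.read().unwrap().len());
            Ok::<(), ()>(())
        })
    };

    tokio::run(lazy(move || {
        tokio::spawn(writer);
        tokio::spawn(reader);
        Ok::<(), ()>(())
    }));
}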

I don't have a clean solution for this example project using tokio_threadpool::blocking, because blocking has to go inside a poll_fn, which requires the values moved into the PollFn to be Clone. Part of what we need inside these blocking sections is the write half of a TcpStream, which is not Clone.

My idea for a way around this is to have a stream that folds over some state, with other futures interacting with that state by sending and receiving messages.
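
Something like this minimal sketch (futures 0.1 / tokio 0.1; the Msg enum and Entity type here are placeholders, not the actual types from the repo):

// [dependencies] futures = "0.1", tokio = "0.1"
use std::collections::HashMap;

use futures::{future::lazy, sync::mpsc, Future, Stream};

// Placeholder message type; the real project would carry client ids,
// websocket sinks, movement commands, etc.
enum Msg {
    Join(u32),
    Move(u32, i32, i32),
    Leave(u32),
}

#[derive(Default)]
struct Entity {
    x: i32,
    y: i32,
}

fn main() {
    let (tx, rx) = mpsc::unbounded::<Msg>();

    // The single task that owns the state: a stream folding over a HashMap.
    let state_task = rx
        .fold(HashMap::<u32, Entity>::new(), |mut entities, msg| {
            match msg {
                Msg::Join(id) => {
                    entities.insert(id, Entity::default());
                }
                Msg::Move(id, dx, dy) => {
                    if let Some(e) = entities.get_mut(&id) {
                        e.x += dx;
                        e.y += dy;
                    }
                }
                Msg::Leave(id) => {
                    entities.remove(&id);
                }
            }
            Ok::<_, ()>(entities)
        })
        .map(|_| ());

    tokio::run(lazy(move || {
        tokio::spawn(state_task);

        // Other futures never touch the state directly; they only send messages.
        // When every sender is dropped, the fold sees the end of the stream and finishes.
        tokio::spawn(lazy(move || {
            tx.unbounded_send(Msg::Join(1)).ok();
            tx.unbounded_send(Msg::Move(1, 0, 1)).ok();
            tx.unbounded_send(Msg::Leave(1)).ok();
            Ok::<(), ()>(())
        }));

        Ok::<(), ()>(())
    }));
}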

@asonix
Author

asonix commented Jan 11, 2019

https://git.asonix.dog/asonix/mmo-rust-server

Here's how I'd do it

@jaybutera
Owner

This is absolutely beautiful. I love your solution to cloning the sink, and you managed to rebuild the project with no mutexes. I can tell you're becoming very proficient in Rust.

So in a nutshell, your idea is to spawn an actor in the thread pool to handle client updates synchronously in a single task. In theory this sacrifices no speed from the original program because any task that locks the state for writing blocks any other task from using the state at the same time anyway. On top of that, the design also avoids tasks blocking as they wait on another task that has locked the resource. Instead of needing the lock, a task just passes a message to the mpsc queue for the actor to process at a later point.

I don't want to update this repository because it would no longer match the article. But what do you think about me doing a v2 article explaining your design?

@asonix
Author

asonix commented Jan 12, 2019

Go for it!

I've been playing a lot with Tokio for various projects of mine, and this is the cleanest I've been able to come up with for doing asynchronous state-management with pure tokio. I do like Actix, though.

@asonix
Author

asonix commented Jan 12, 2019

oh also, there's part of your code that does

match true {
    true => Ok(Loop::Continue(/* state */)),
    false => Ok(Loop::Break(/* state */)),
}

but you could do

Ok(Loop::Continue(/* state */))

and it would do the same thing

@kwarrick

@asonix It makes sense that a blocking write lock could potentially be very bad for performance, but my intuition is that an RwLock is OK so long as the critical section (lifetime) is small and fast.
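
For what it's worth, the "small and fast" version of that intuition looks something like this sketch (plain std types, nothing from the repo): copy what you need out under the read guard and drop the lock before doing any slow work.

use std::sync::{Arc, RwLock};

// Take a snapshot under the lock; the guard only lives for this one expression.
fn snapshot(state: &Arc<RwLock<Vec<u32>>>) -> Vec<u32> {
    state.read().unwrap().clone()
}

fn main() {
    let state = Arc::new(RwLock::new(vec![1, 2, 3]));

    let copy = snapshot(&state);      // short critical section
    let text = format!("{:?}", copy); // serialization happens outside the lock
    println!("{}", text);
}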

I benchmarked the two solutions with artillery and found that the RwLock implementation actually has better performance.

I am definitely a novice when it comes to benchmarking, but I played with it for a while and I'm reasonably convinced.

[benchmark graphs: rwlock and messages]

config:
  target: "ws://127.0.0.1:8080/"
  phases:
    - duration: 60
      arrivalRate: 20
scenarios:
  - engine: "ws"
    flow:
      - loop:
        - send: "down"
        - send: "right"
        - send: "left"
        - send: "up"
        count: 10
      - think: 1

Note that I commented out println! statements in both projects, ran both with cargo run --release, and I added a max_y: 6 option to the RwLock graph so the scale would be the same.

@asonix
Author

asonix commented Jan 22, 2019

Here's the really interesting thing, though.

Using artillery with your config, I'm getting opposite results from what you've shown. I'm wondering if this is an OS or CPU (or both) thing.

In this screenshot, the messaging implementation is on the left and the lock implementation is on the right. The messaging implementation performs better at the p95 and p99, and in max latency.

Here's the specs for the machine I'm testing on:
Operating System: Solus 3.9999 with linux kernel 4.20.2
CPU: Intel Core i5-6600

I've run the tests a couple times, and get similar results each time.

[screenshot: artillery latency results for both implementations]

Note: before testing each project, I commented out all println!()s and ran cargo update to ensure the latest dependencies.

my branch: https://git.asonix.dog/asonix/mmo-rust-server/src/branch/asonix/artillery
jay's branch: https://git.asonix.dog/asonix/mmo-rust-server/src/branch/jay/artillery

@asonix
Author

asonix commented Jan 22, 2019

In related news, tokio just merged some synchronization primitives: tokio-rs/tokio#839

This means that using mpsc and oneshot channels should be faster (if the tokio versions are used over the futures versions), and there's a semaphore and an AtomicTask primitive for exclusive data access.
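
For anyone following along, the request/response shape with mpsc plus oneshot looks roughly like this (sketched with the futures 0.1 channels; the tokio-sync ones linked above should slot into the same pattern, and the CountRequest type is just a made-up example):

// [dependencies] futures = "0.1", tokio = "0.1"
use futures::{
    future::lazy,
    sync::{mpsc, oneshot},
    Future, Stream,
};

// Hypothetical message: ask the state-owning task how many entities exist.
struct CountRequest {
    respond_to: oneshot::Sender<usize>,
}

fn main() {
    let (tx, rx) = mpsc::unbounded::<CountRequest>();

    // State-owning task: answers each request on its oneshot channel.
    let actor = rx.for_each(|req| {
        let count = 42; // stand-in for looking at real state
        let _ = req.respond_to.send(count);
        Ok::<(), ()>(())
    });

    tokio::run(lazy(move || {
        tokio::spawn(actor);

        let (resp_tx, resp_rx) = oneshot::channel();
        tx.unbounded_send(CountRequest { respond_to: resp_tx }).ok();

        // Await the answer without ever sharing the state itself.
        tokio::spawn(
            resp_rx
                .map(|count| println!("entities: {}", count))
                .map_err(|_| ()),
        );

        Ok::<(), ()>(())
    }));
}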

@kwarrick

I think my early "scenarios" were far too simplistic to find the problems with lock contention.

Originally I was testing with,
Operating System: macOS 10.14.2
CPU: i7-4770HQ

I switched to testing with,
Operating System: Ubuntu 18.04.1 @ 4.15.0
CPU: Intel i5-2400

and I got similar results to your own (messages win).

However, as I started to increase the complexity of the benchmark I saw locks start to win, again.

I really didn't want to believe locks would hurt performance that much, mostly because reworking every shared data structure into a separate future, channel, and set of messages feels like cruft. But the impression started to build that while messaging might impose some constant overhead up front, it holds up under heavy load.

So, to really stress the server, I think we need to add more "thinks" to the script so clients hang around to receive messages. I'd be interested to see if you get the same results, but when you add the "thinks" and increase the number of clients, I believe you can see lock contention begin to cause connection timeouts and a really significant degradation in connectivity.

[benchmark graph]

I used a ulimit -n unlimited before cargo run, and my config is here:
https://gist.github.com/kwarrick/039761f00e1a41f92f97454eaf05af2f

@asonix
Author

asonix commented Jan 23, 2019

I'm going to be honest, I don't know enough about benchmarking and artillery to understand what adding more thinks does. The docs say "To pause the virtual user for N seconds, use a think action", but does that mean pausing artillery? Does it pause in the middle of a request, or between requests?

@asonix
Author

asonix commented Jan 23, 2019

Another reason my implementation may fare better than Jay's when many requests are made is that mine properly handles dropping the streams and sinks for disconnected clients. I'm not sure it has an effect over short time periods, since the server might not notice the client is gone immediately, but if artillery properly disconnects rather than just vanishing, the server will immediately drop the associated stream and sink.

@kwarrick

kwarrick commented Jan 23, 2019

You are probably tired of this thread and my blundering, long-winded comments, but Jay, your post was perfectly timed for a project I am working on. So, thank you, and please forgive all my noise.

Just glanced at tokio-sync, and it looks really promising. Might make this all moot, but I really want to know how bad locks are in tokio futures.

I'm a bit stubborn, so I couldn't let it go and decided to profile Jay's solution, and I've found that benchmarking the two against each other is not a fair comparison.

First, I found valgrind's DRD tool, which can detect lock contention:

$ valgrind \
      --tool=drd \
      --exclusive-threshold=5 \
      --shared-threshold=5 \
      ./target/debug/game_server 2>&1 \
  | grep 'Lock on'
==7868== Lock on mutex 0x651f5e0 was held during 40 ms (threshold: 5 ms).
==7868== Lock on mutex 0x65217c0 was held during 24 ms (threshold: 5 ms).
==7868== Lock on rwlock 0x650eeb0 was held during 14 ms (threshold: 5 ms).

After startup, no significant contention is reported, even under heavy load. So, if the locks aren't causing the service degradation, what is? Right?

Back to valgrind:

$  valgrind --tool=callgrind ./target/debug/game_server
$ callgrind_annotate callgrind.out.10279 | less
...
--------------------------------------------------------------------------------
            Ir  file:function
--------------------------------------------------------------------------------
32,366,192,376  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5 [/lib/x86_64-linux-gnu/libc-2.27.so]
 3,694,475,221  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:_int_malloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 3,301,726,832  /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b//src/libcore/fmt/mod.rs:core::fmt::write [/home/warrick/locks/rwlock/target/release/game_server]
 2,086,559,182  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:free [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,886,001,052  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:realloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,361,337,220  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:_int_realloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,315,626,578  /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b//src/libcore/fmt/mod.rs:core::fmt::Formatter::pad_integral [/home/warrick/locks/rwlock/target/release/game_server]
...

It looks like memcpy, malloc, and fmt are dominating.

This brings me to why benchmarking the solutions against each other is not fair for comparing messaging and locking.

  1. There is a logical memory leak; "entities" are added but never removed. So, each subsequent connection receives larger and larger game state.
  2. Serialization is inside a loop, so entities are serialized redundantly for each connection, which, as callgrind shows, is very expensive.
  3. Every "tick", all of the connections are removed and then re-added to the HashMap.

@asonix actually fixes all of these problems, so the benchmarks aren't actually revealing anything about locking vs messages.

I say "fixes", but I guess if the game is designed so that client state should persist after disconnect, number one is not a problem and there isn't really a "memory leak"; either way, the two game servers do behave differently.
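
For point 2 specifically, the fix amounts to something like this hypothetical sketch (placeholder types, not the actual ones from either branch): serialize the state once per tick and hand the same string to every connection.

// [dependencies] serde = { version = "1", features = ["derive"] }, serde_json = "1"
use std::collections::HashMap;

use serde::Serialize;

#[derive(Serialize)]
struct Entity {
    x: i32,
    y: i32,
}

fn broadcast_tick(
    entities: &HashMap<u32, Entity>,
    connections: &mut HashMap<u32, Vec<String>>, // stand-in for the websocket sinks
) {
    // One serialization per tick...
    let payload = serde_json::to_string(entities).expect("state serializes");

    // ...reused for every connection instead of re-serializing in the loop.
    for outbox in connections.values_mut() {
        outbox.push(payload.clone());
    }
}

fn main() {
    let mut entities = HashMap::new();
    entities.insert(1, Entity { x: 0, y: 0 });

    let mut connections = HashMap::new();
    connections.insert(1, Vec::new());

    broadcast_tick(&entities, &mut connections);
    println!("{:?}", connections[&1]);
}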

Anyways, I used your solution, asonix, to refactor the code, and I am going to try to benchmark again to see if lock contention causes performance problems.

My changes: jaybutera:0d22d4fc636...kwarrick:9dc1698e94

Let me know what you think, and I will come back with my results if you are interested.

@jaybutera
Owner

Nice catch! I'm looking forward to seeing the results.

@asonix
Author

asonix commented Jan 24, 2019

hey @kwarrick, can I ask what you're doing with this sink here?
[screenshot of the code in question]

It doesn't seem to get used anywhere, and when a tx half of a channel is dropped, the rx side's stream ends.
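
That drop behavior is easy to see in isolation with the futures 0.1 channels, as a quick illustration:

// [dependencies] futures = "0.1"
use futures::{sync::mpsc, Future, Stream};

fn main() {
    let (tx, rx) = mpsc::unbounded::<u32>();

    tx.unbounded_send(1).ok();
    drop(tx); // last sender gone: the stream below ends once it drains

    let total = rx.fold(0, |acc, n| Ok::<_, ()>(acc + n)).wait().unwrap();
    assert_eq!(total, 1);
}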

@asonix
Author

asonix commented Jan 24, 2019

oh nevermind, you insert it into connections

@asonix
Author

asonix commented Jan 24, 2019

I misread that as id.sink instead of id, sink
