
Don't use RwLocks #3

Open
asonix opened this issue Jan 11, 2019 · 15 comments

@asonix

asonix commented Jan 11, 2019

Using RwLocks in a tokio threadpool can cause performance issues: one future on one thread can block futures on other threads, even when those threads could otherwise be doing useful work.
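
Roughly, the failure mode looks like this minimal sketch (tokio 0.1 / futures 0.1, with a placeholder Vec instead of the real game state): while the writer task sleeps with the write guard held, the reader task parks its entire worker thread on read(), and that thread can't pick up any other queued futures in the meantime.

// Minimal sketch of the problem, not code from the repo.
// [dependencies] futures = "0.1", tokio = "0.1"
use std::sync::{Arc, RwLock};
use std::{thread, time::Duration};

use futures::future::lazy;

fn main() {
    let state = Arc::new(RwLock::new(Vec::<u32>::new()));

    let writer = {
        let state = state.clone();
        lazy(move || {
            let mut guard = state.write().unwrap();
            thread::sleep(Duration::from_millis(50)); // stand-in for slow work under the lock
            guard.push(1);
            Ok::<(), ()>(())
        })
    };

    let reader = {
        let state = state.clone();
        lazy(move || {
            // Blocks this worker thread until the writer releases the lock,
            // starving every other future queued on the same thread.
            println!("entities: {}", state.read().unwrap().len());
            Ok::<(), ()>(())
        })
    };

    tokio::run(lazy(move || {
        tokio::spawn(writer);
        tokio::spawn(reader);
        Ok::<(), ()>(())
    }));
}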

I don't have a clean solution for this example project using tokio_threadpool::blocking, because blocking has to go inside a poll_fn, which requires the values moved into the PollFn to be Clone. Part of what we need inside these blocking sections is the write half of a TcpStream, which is not Clone.

My idea for a way around this is to have a stream that folds over some state, with other futures interacting with that state by sending and receiving messages.
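
Something like this minimal sketch (futures 0.1 / tokio 0.1; the Msg enum and Entity type here are placeholders, not the actual types from the repo):

// [dependencies] futures = "0.1", tokio = "0.1"
use std::collections::HashMap;

use futures::{future::lazy, sync::mpsc, Future, Stream};

// Placeholder message type; the real project would carry client ids,
// websocket sinks, movement commands, etc.
enum Msg {
    Join(u32),
    Move(u32, i32, i32),
    Leave(u32),
}

#[derive(Default)]
struct Entity {
    x: i32,
    y: i32,
}

fn main() {
    let (tx, rx) = mpsc::unbounded::<Msg>();

    // The single task that owns the state: a stream folding over a HashMap.
    let state_task = rx
        .fold(HashMap::<u32, Entity>::new(), |mut entities, msg| {
            match msg {
                Msg::Join(id) => {
                    entities.insert(id, Entity::default());
                }
                Msg::Move(id, dx, dy) => {
                    if let Some(e) = entities.get_mut(&id) {
                        e.x += dx;
                        e.y += dy;
                    }
                }
                Msg::Leave(id) => {
                    entities.remove(&id);
                }
            }
            Ok::<_, ()>(entities)
        })
        .map(|_| ());

    tokio::run(lazy(move || {
        tokio::spawn(state_task);

        // Other futures never touch the state directly; they only send messages.
        // When every sender is dropped, the fold sees the end of the stream and finishes.
        tokio::spawn(lazy(move || {
            tx.unbounded_send(Msg::Join(1)).ok();
            tx.unbounded_send(Msg::Move(1, 0, 1)).ok();
            tx.unbounded_send(Msg::Leave(1)).ok();
            Ok::<(), ()>(())
        }));

        Ok::<(), ()>(())
    }));
}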

@asonix
Author

asonix commented Jan 11, 2019

https://git.asonix.dog/asonix/mmo-rust-server

Here's how I'd do it

@jaybutera
Owner

This is absolutely beautiful. I love your solution to cloning the sink, and you managed to rebuild the project with no mutexes. I can tell you're becoming very proficient in Rust.

So in a nutshell, your idea is to spawn an actor in the thread pool to handle client updates synchronously in a single task. In theory this sacrifices no speed from the original program because any task that locks the state for writing blocks any other task from using the state at the same time anyway. On top of that, the design also avoids tasks blocking as they wait on another task that has locked the resource. Instead of needing the lock, a task just passes a message to the mpsc queue for the actor to process at a later point.

I don't want to update this repository because it would no longer match the article. But what do you think about me doing a v2 article explaining your design?

@asonix
Author

asonix commented Jan 12, 2019

Go for it!

I've been playing a lot with Tokio for various projects of mine, and this is the cleanest I've been able to come up with for doing asynchronous state-management with pure tokio. I do like Actix, though.

@asonix
Author

asonix commented Jan 12, 2019

oh also, there's part of your code that does

match true {
    true => Ok(Loop::Continue(/* state */)),
    false => Ok(Loop::Break(/* state */)),
}

but you could do

Ok(Loop::Continue(/* state */))

and it would do the same thing

@kwarrick

@asonix It makes sense that a blocking write lock could potentially be very bad for performance, but my intuition is that an RwLock is OK so long as the critical section (lifetime) is small and fast.
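
For what it's worth, the "small and fast" version of that intuition looks something like this sketch (plain std types, nothing from the repo): copy what you need out under the read guard and drop the lock before doing any slow work.

use std::sync::{Arc, RwLock};

// Take a snapshot under the lock; the guard only lives for this one expression.
fn snapshot(state: &Arc<RwLock<Vec<u32>>>) -> Vec<u32> {
    state.read().unwrap().clone()
}

fn main() {
    let state = Arc::new(RwLock::new(vec![1, 2, 3]));

    let copy = snapshot(&state);      // short critical section
    let text = format!("{:?}", copy); // serialization happens outside the lock
    println!("{}", text);
}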

I benchmarked the two solutions with artillery and found that the RwLock implementation actually has better performance.

I am definitely a novice when it comes to benchmarking, but I played with it for a while and I'm reasonably convinced.

[benchmark graphs: rwlock and messages]

config:
  target: "ws://127.0.0.1:8080/"
  phases:
    - duration: 60
      arrivalRate: 20
scenarios:
  - engine: "ws"
    flow:
      - loop:
        - send: "down"
        - send: "right"
        - send: "left"
        - send: "up"
        count: 10
      - think: 1

Note that I commented out println! statements in both projects, ran both with cargo run --release, and I added a max_y: 6 option to the RwLock graph so the scale would be the same.

@asonix
Author

asonix commented Jan 22, 2019

Here's the really interesting thing, though.

Using artillery with your config, I'm getting opposite results from what you've shown. I'm wondering if this is an OS or CPU (or both) thing.

In this screenshot, the messaging implementation is on the left and the lock implementation is on the right. The messaging implementation performs better at the p95 and p99, and in max latency.

Here's the specs for the machine I'm testing on:
Operating System: Solus 3.9999 with linux kernel 4.20.2
CPU: Intel Core i5-6600

I've run the tests a couple times, and get similar results each time.

[screenshot: artillery latency results for both implementations]

Note: before testing each project, I commented out all println!()s and ran cargo update to ensure the latest dependencies.

my branch: https://git.asonix.dog/asonix/mmo-rust-server/src/branch/asonix/artillery
jay's branch: https://git.asonix.dog/asonix/mmo-rust-server/src/branch/jay/artillery

@asonix
Author

asonix commented Jan 22, 2019

In related news, tokio just merged some synchronization primitives: tokio-rs/tokio#839

This means that using mpsc and oneshot channels should be faster (if the tokio versions are used over the futures versions), and there's a semaphore and an AtomicTask primitive for exclusive data access.
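
For anyone following along, the request/response shape with mpsc plus oneshot looks roughly like this (sketched with the futures 0.1 channels; the tokio-sync ones linked above should slot into the same pattern, and the CountRequest type is just a made-up example):

// [dependencies] futures = "0.1", tokio = "0.1"
use futures::{
    future::lazy,
    sync::{mpsc, oneshot},
    Future, Stream,
};

// Hypothetical message: ask the state-owning task how many entities exist.
struct CountRequest {
    respond_to: oneshot::Sender<usize>,
}

fn main() {
    let (tx, rx) = mpsc::unbounded::<CountRequest>();

    // State-owning task: answers each request on its oneshot channel.
    let actor = rx.for_each(|req| {
        let count = 42; // stand-in for looking at real state
        let _ = req.respond_to.send(count);
        Ok::<(), ()>(())
    });

    tokio::run(lazy(move || {
        tokio::spawn(actor);

        let (resp_tx, resp_rx) = oneshot::channel();
        tx.unbounded_send(CountRequest { respond_to: resp_tx }).ok();

        // Await the answer without ever sharing the state itself.
        tokio::spawn(
            resp_rx
                .map(|count| println!("entities: {}", count))
                .map_err(|_| ()),
        );

        Ok::<(), ()>(())
    }));
}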

@kwarrick

I think my early "scenarios" were far too simplistic to find the problems with lock contention.

Originally I was testing with,
Operating System: macOS 10.14.2
CPU: i7-4770HQ

I switched to testing with,
Operating System: Ubuntu 18.04.1 @ 4.15.0
CPU: Intel i5-2400

and I got similar results to your own (messages win).

However, as I started to increase the complexity of the benchmark I saw locks start to win, again.

I really didn't want to believe locks would hurt performance that much, mostly because reworking every shared data structure into a separate future, channel, and set of messages feels like cruft. But the impression started to build that while messaging might impose some constant overhead up front, it holds up under heavy load.

So, to really stress the server, I think we need to add more "thinks" to the script so clients hang around to receive messages. I'd be interested to see if you get the same results, but when you add the "thinks" and increase the number of clients, I believe you can see lock contention begin to cause connection timeouts and a really significant degradation in connectivity.

[benchmark graph]

I used a ulimit -n unlimited before cargo run, and my config is here:
https://gist.github.com/kwarrick/039761f00e1a41f92f97454eaf05af2f

@asonix
Author

asonix commented Jan 23, 2019

I'm going to be honest, I don't know enough about benchmarking and artillery to understand what adding more thinks does. The docs say "To pause the virtual user for N seconds, use a think action", but does that mean pausing artillery? Does it pause in the middle of a request, or between requests?

@asonix
Author

asonix commented Jan 23, 2019

Another reason my implementation may fare better than Jay's when many requests are made is that mine properly handles dropping the streams and sinks for disconnected clients. I'm not sure it has an effect over short time periods, since the server might not notice the client is gone immediately, but if artillery properly disconnects rather than just vanishing, the server will immediately drop the associated stream and sink.

@kwarrick

kwarrick commented Jan 23, 2019

You are probably tired of this thread and my blundering, long-winded comments, but Jay, your post was perfectly timed for a project I am working on. So, thank you, and please forgive all my noise.

Just glanced at tokio-sync, and it looks really promising. Might make this all moot, but I really want to know how bad locks are in tokio futures.

I'm a bit stubborn, so I couldn't let it go and decided to profile Jay's solution, and I've found that benchmarking the two against each other is not a fair comparison.

First, I found valgrind's DRD tool, which can detect lock contention:

$ valgrind \
      --tool=drd \
      --exclusive-threshold=5 \
      --shared-threshold=5 \
      ./target/debug/game_server 2>&1 \
  | grep 'Lock on'
==7868== Lock on mutex 0x651f5e0 was held during 40 ms (threshold: 5 ms).
==7868== Lock on mutex 0x65217c0 was held during 24 ms (threshold: 5 ms).
==7868== Lock on rwlock 0x650eeb0 was held during 14 ms (threshold: 5 ms).

After startup, no significant contention is reported, even under heavy load. So, if the locks aren't causing the service degradation, what is? Right?

Back to valgrind:

$  valgrind --tool=callgrind ./target/debug/game_server
$ callgrind_annotate callgrind.out.10279 | less
...
--------------------------------------------------------------------------------
            Ir  file:function
--------------------------------------------------------------------------------
32,366,192,376  /build/glibc-OTsEL5/glibc-2.27/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5 [/lib/x86_64-linux-gnu/libc-2.27.so]
 3,694,475,221  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:_int_malloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 3,301,726,832  /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b//src/libcore/fmt/mod.rs:core::fmt::write [/home/warrick/locks/rwlock/target/release/game_server]
 2,086,559,182  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:free [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,886,001,052  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:realloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,361,337,220  /build/glibc-OTsEL5/glibc-2.27/malloc/malloc.c:_int_realloc [/lib/x86_64-linux-gnu/libc-2.27.so]
 1,315,626,578  /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b//src/libcore/fmt/mod.rs:core::fmt::Formatter::pad_integral [/home/warrick/locks/rwlock/target/release/game_server]
...

It looks like memcpy, malloc, and fmt are dominating.

This brings me to why benchmarking the solutions against each other is not fair for comparing messaging and locking.

  1. There is a logical memory leak; "entities" are added but never removed. So, each subsequent connection receives larger and larger game state.
  2. Serialization is inside a loop, so entities are serialized redundantly for each connection, which, as callgrind shows, is very expensive.
  3. Every "tick", all of the connections are removed and then re-added to the HashMap.

@asonix actually fixes all of these problems, so the benchmarks aren't actually revealing anything about locking vs messages.

I say "fixes", but I guess if the game is designed so that client state should persist after disconnect, number one is not a problem and there isn't really a "memory leak"; either way, the two game servers do behave differently.
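
For point 2 specifically, the fix amounts to something like this hypothetical sketch (placeholder types, not the actual ones from either branch): serialize the state once per tick and hand the same string to every connection.

// [dependencies] serde = { version = "1", features = ["derive"] }, serde_json = "1"
use std::collections::HashMap;

use serde::Serialize;

#[derive(Serialize)]
struct Entity {
    x: i32,
    y: i32,
}

fn broadcast_tick(
    entities: &HashMap<u32, Entity>,
    connections: &mut HashMap<u32, Vec<String>>, // stand-in for the websocket sinks
) {
    // One serialization per tick...
    let payload = serde_json::to_string(entities).expect("state serializes");

    // ...reused for every connection instead of re-serializing in the loop.
    for outbox in connections.values_mut() {
        outbox.push(payload.clone());
    }
}

fn main() {
    let mut entities = HashMap::new();
    entities.insert(1, Entity { x: 0, y: 0 });

    let mut connections = HashMap::new();
    connections.insert(1, Vec::new());

    broadcast_tick(&entities, &mut connections);
    println!("{:?}", connections[&1]);
}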

Anyways, I used your solution, asonix, to refactor the code, and I am going to try to benchmark again to see if lock contention causes performance problems.

My changes: jaybutera:0d22d4fc636...kwarrick:9dc1698e94

Let me know what you think, and I will come back with my results if you are interested.

@jaybutera
Owner

Nice catch! I'm looking forward to seeing the results.

@asonix
Author

asonix commented Jan 24, 2019

hey @kwarrick, can I ask what you're doing with this sink here?
[screenshot of the code in question]

It doesn't seem to get used anywhere, and when a tx half of a channel is dropped, the rx side's stream ends.
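
That drop behavior is easy to see in isolation with the futures 0.1 channels, as a quick illustration:

// [dependencies] futures = "0.1"
use futures::{sync::mpsc, Future, Stream};

fn main() {
    let (tx, rx) = mpsc::unbounded::<u32>();

    tx.unbounded_send(1).ok();
    drop(tx); // last sender gone: the stream below ends once it drains

    let total = rx.fold(0, |acc, n| Ok::<_, ()>(acc + n)).wait().unwrap();
    assert_eq!(total, 1);
}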

@asonix
Author

asonix commented Jan 24, 2019

oh nevermind, you insert it into connections

@asonix
Author

asonix commented Jan 24, 2019

I misread that as id.sink instead of id, sink
