performance problems with large experiments #3238

jtracey · 2023-11-26T02:37:58Z

jtracey
Nov 26, 2023

I'm trying to run some relatively large experiments (a few million concurrent connections for bootstrapping on top of a 10% Tor network), but am running into performance issues. There seems to be an elbow in performance: ~1 million connections works okay, but much beyond that I can barely make any progress. Poking around with perf and gdb seems to point at this call path (as of git rev d7966a7):

#0  0x000055f7150358c0 in shadow_rs::host::descriptor::socket::inet::InetSocket::canonical_handle (self=0x7f744519fe40) at main/host/descriptor/socket/inet/mod.rs:83
#1  0x000055f71509c3ce in compatsocket_getCanonicalHandle (socket=0x7fccbedf3c20) at host/descriptor/compat_socket.c:74
#2  0x000055f7150c3acb in _compareTaggedSocket (a=0x7f744519fe42, b=0x7f6cab3e7402) at host/network/network_queuing_disciplines.c:26
#3  0x000055f7150ca678 in _priorityqueue_find_helper (key=0x7f744519fe42, value=0x55f8f0f29840, user_data=0x7fccbedf3d20) at utility/priority_queue.c:166
#4  0x00007fcdf9e0d7b7 in g_hash_table_find () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x000055f7150ca736 in priorityqueue_find_custom (q=0x55f7875de600, data=0x7f6cab3e7402, compareFunc=0x55f7150c3a5a <_compareTaggedSocket>) at utility/priority_queue.c:185
#6  0x000055f7150c4675 in fifosocketqueue_find (self=0x55f787c725e0, socket=0x7f6cab3e7420) at host/network/network_queuing_disciplines.c:168
#7  0x000055f7150c368f in networkinterface_wantsSend (interface=0x55f787c725c0, socket=0x7f6cab3e7420) at host/network/network_interface.c:370
#8  0x000055f715055832 in shadow_rs::host::network::interface::NetworkInterface::add_data_source (self=<optimized out>, socket_ptr=0x7f6cab3e7420) at main/host/network/interface.rs:129
#9  shadow_rs::host::host::Host::notify_socket_has_packets (self=0x55f787c72be0, addr=..., socket_ptr=0x7f6cab3e7420) at main/host/host.rs:992
#10 0x000055f715362a9d in shadow_rs::utility::legacy_callback_queue::export::socket_wants_to_send_with_global_cb_queue::{closure#0}::{closure#0}::{closure#0}::{closure#0} (host=0x55f787c72be0)
    at main/utility/legacy_callback_queue.rs:116
#11 shadow_rs::core::worker::{impl#0}::with_active_host::{closure#0}::{closure#0}<shadow_rs::utility::legacy_callback_queue::export::socket_wants_to_send_with_global_cb_queue::{closure#0}::{closure#0}::{closure#0}::{closure_env#0}, ()>
    (h=<optimized out>) at main/core/worker.rs:114

The stack trace is much deeper than that, but that seems to cover the relevant bits. In particular, it looks like the problem is that, when trying to find a socket, the hash table lookup misses and a linear search is done over something that's way too big.
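To illustrate why frames #3 to #5 scale badly: `g_hash_table_find` calls a predicate on each entry until one matches, so a search for a socket that is not in the table touches every entry. The sketch below is not Shadow's code; `Handle`, `find_by_scan`, and `find_by_key` are hypothetical names, and it only assumes a table that is keyed by something other than the value being searched for.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a socket's canonical handle.
type Handle = u64;

/// Find-by-predicate: visit entries one by one and compare each value.
/// Returns whether the handle was found and how many comparisons were made.
fn find_by_scan(table: &HashMap<u64, Handle>, wanted: Handle) -> (bool, usize) {
    let mut comparisons = 0;
    for handle in table.values() {
        comparisons += 1;
        if *handle == wanted {
            return (true, comparisons);
        }
    }
    (false, comparisons)
}

/// Keyed lookup: the handle itself is the key, so the expected cost is
/// constant regardless of table size.
fn find_by_key(table: &HashMap<Handle, u64>, wanted: Handle) -> bool {
    table.contains_key(&wanted)
}

fn main() {
    let n: u64 = 1_000_000;
    // Table keyed by an unrelated value (here an index), handles as values.
    let by_index: HashMap<u64, Handle> = (0..n).map(|i| (i, i + n)).collect();
    // Same handles, but used as the key.
    let by_handle: HashMap<Handle, u64> = (0..n).map(|i| (i + n, i)).collect();

    let missing: Handle = 0; // never inserted: handles are in n..2n
    let (found, comparisons) = find_by_scan(&by_index, missing);
    println!("scan:  found={}, comparisons={}", found, comparisons);
    println!("keyed: found={}", find_by_key(&by_handle, missing));
}
```

A miss in the scan costs one comparison per entry, so if each socket that wants to send triggers such a search, the total work grows roughly quadratically with the number of queued sockets.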

Should I be spreading these connections over a larger number of hosts (I'm currently just scaling connections on a fixed number of clients), or is this hash table global? Are there patches to shadow I can make to avoid the linear lookup without breaking things?


stevenengler · 2023-11-26T04:26:02Z

stevenengler
Nov 26, 2023
Collaborator

In the simulation, is it a 10% network that's been bootstrapped, and then you're running something on top which establishes ~1 million additional connections? And you're scaling up the number of these concurrent connections while using the same number of hosts?

Ideally there shouldn't be an elbow in performance like that. As for the hash table stuff, that linear search was needed at one point to support the migration of C sockets to Rust, but at the time I don't think I considered that it would become performance-sensitive. I think it's only used to prevent adding duplicate sockets to the network interface's priority queue, so it would make sense that performance gets worse as the network interface has more sockets trying to send at the same time (so the queue is generally larger).

I'll take a look tomorrow to see if that's still needed and what can be done about that. There's also the round-robin queuing discipline, but that has always used a GQueue linear search.
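One common way to keep the duplicate check described above without scanning the queue is to track membership in a side set. The following is a minimal sketch under that assumption, not the change made in Shadow; `SendQueue` and `Handle` are hypothetical names, and a plain FIFO stands in for the interface's priority queue.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical stand-in for a socket's canonical handle.
type Handle = u64;

/// FIFO of sockets that want to send, plus a set that answers
/// "is this socket already queued?" in expected constant time.
#[derive(Default)]
struct SendQueue {
    order: VecDeque<Handle>,
    queued: HashSet<Handle>,
}

impl SendQueue {
    /// Enqueue the socket unless it is already waiting.
    /// Returns true if the socket was added.
    fn push(&mut self, socket: Handle) -> bool {
        // HashSet::insert returns false when the value was already present.
        if !self.queued.insert(socket) {
            return false;
        }
        self.order.push_back(socket);
        true
    }

    /// Dequeue the next socket and drop it from the membership set.
    fn pop(&mut self) -> Option<Handle> {
        let socket = self.order.pop_front()?;
        self.queued.remove(&socket);
        Some(socket)
    }

    /// Number of sockets currently waiting to send.
    fn len(&self) -> usize {
        self.order.len()
    }
}

fn main() {
    let mut q = SendQueue::default();
    assert!(q.push(7));
    assert!(!q.push(7)); // duplicate rejected without scanning the queue
    assert!(q.push(9));
    assert_eq!(q.len(), 2);
    assert_eq!(q.pop(), Some(7));
    assert!(q.push(7)); // allowed again once it has been dequeued
}
```

The trade-off is a hash of the handle on every push and pop, in exchange for a check whose cost no longer depends on how many sockets are waiting.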

3 replies

stevenengler Nov 27, 2023
Collaborator

@jtracey Can you try with the commits from #3239 (edit: one commit was split off into #3240) and see if it scales any better? It might be a few percent slower in the general case due to a more-expensive hash function, but removes the linear search.

Answer selected by jtracey

jtracey Nov 27, 2023
Author

Wow, that made an absolutely massive difference! I don't have exact numbers, but easily several orders of magnitude. perf top now puts the kernel symbol syscall_exit_to_user_mode as the main performance culprit, though it is much less dominant. I can poke around the performance more and grab some backtraces if you'd like, but this looks likely to be good enough for our purposes for now.

> In the simulation, is it a 10% network that's been bootstrapped, and then you're running something on top which establishes ~1 million additional connections? And you're scaling up the number of these concurrent connections while using the same number of hosts?

Yes, that's right. I have an application serving a similar (but distinct) role to tgen that performs its own bootstrap after Tor's by registering each simulated client (where a single application simulates multiple clients) with a central uncapped server, and those initial connections were causing the simulation to choke. It's possible I should be running the clients over more nodes anyway.

robgjansen Nov 27, 2023
Maintainer

Glad Steve was able to remove the main bottleneck quickly!

@jtracey if you find another performance bottleneck that you think might be worth fixing, please do post a backtrace (perhaps in a new discussion post since this one seems to be closed out now).
