
Simulations limited by PID_MAX_LIMIT #3323

Open
jtracey opened this issue Apr 3, 2024 · 6 comments
Labels
Type: Bug Error or flaw producing unexpected results

Comments

@jtracey
Contributor

jtracey commented Apr 3, 2024

From the proc man page:

On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).

This is a hard limit that can't be modified the way many of the other limits in Linux can. It essentially bounds the size of experiment threads * number of parallel experiments for one machine, where "experiment threads" is the total number of threads across all processes in the experiment (less a little headroom for the rest of the OS). I ran into this from running arguably too many experiments at once for one machine (though the machine had plenty of RAM to spare), but I suspect this might even prevent running large single experiments. When's the last time someone tried to run a 100% Tor network?
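For reference, here's a quick sketch of checking the tunable limit against the hard cap (the 2^22 constant is from the man page quoted above; the procfs path is standard Linux):

```python
# Compare the tunable pid_max with the hard PID_MAX_LIMIT (2^22 on 64-bit Linux).
PID_MAX_LIMIT = 1 << 22  # 4,194,304

with open("/proc/sys/kernel/pid_max") as f:
    pid_max = int(f.read())

print(f"pid_max={pid_max}, hard limit={PID_MAX_LIMIT}")
```

Raising pid_max up to that cap is just `sysctl -w kernel.pid_max=4194304`; going past it is what's not possible without patching the kernel.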

This seems unlikely to have a quick fix. Some possible solutions are:

  • as a workaround, if running lots of moderate or small size simulations, use a hypervisor to split up the machine
  • make that less of a workaround by letting shadow run across multiple hosts (see Multi-process simulation support #176, which is outdated but mentions this), possibly with some fancy shared memory techniques
  • find a way to get Shadow to run VMs as processes, and put all of the host's processes in that
  • find some other way to reduce or combine process threads
  • patch Linux to bump PID_MAX_LIMIT

I suppose running pre-phantom Shadow is another workaround, but that sounds like a bad and increasingly difficult idea.

In the meantime, it's probably a good idea to document this limit.

@jtracey jtracey added the Type: Bug Error or flaw producing unexpected results label Apr 3, 2024
@stevenengler
Contributor

You're probably already using it (and it's included in simulations generated by tornettools), but for reference just mentioning the existence of the NumCPUs torrc option. IIRC setting it to 1 will cause tor to use 2 threads, the main thread and a worker thread. It defaults to a max of maybe 8, depending on the number of available CPUs.
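For reference, the relevant line in a torrc (tornettools puts it in the generated common torrc; shown here as a config fragment):

```
NumCPUs 1
```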

@jtracey
Contributor Author

jtracey commented Apr 3, 2024

Right, I think tornettools does this (or I did and forgot), but yes tor.common.torrc has NumCPUs 1. I see 3 TIDs associated with each tor PID, only one of which has any notable CPU time.
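For anyone wanting to reproduce that count: the TIDs of a process appear as entries under `/proc/<pid>/task`. A sketch, counting the current process's threads (substitute a tor PID to check a simulation):

```python
import os

def thread_count(pid: int) -> int:
    # Each entry under /proc/<pid>/task is one TID belonging to that process.
    return len(os.listdir(f"/proc/{pid}/task"))

print(thread_count(os.getpid()))
```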

@robgjansen
Member

robgjansen commented Apr 4, 2024

When's the last time someone tried to run a 100% Tor network?

I think that was us + Ian :) in Table 2 from our USENIX paper:
100%: 6,489 relays and 792k users

We don't typically run a tor+tgen for those 792k users, we instead use tornettools --process_scale=0.01 to create the necessary user load with fewer processes. Thus, I expect we would need on the order of tens of thousands to maybe a hundred thousand processes for a 100% network. Then if we factor in the threads each process is using, I think we're still below a million?
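Back-of-envelope with those Table 2 numbers, assuming ~3 TIDs per tor process as observed above (this ignores tgen's own threads and Shadow's, so it's a lower bound):

```python
relays = 6_489
users = 792_000
threads_per_tor_process = 3  # assumed: NumCPUs 1 -> ~3 TIDs per tor process

# tornettools --process_scale=0.01: one process per 100 users
processes = relays + users // 100
threads = processes * threads_per_tor_process

print(processes, threads)  # 14409 processes, 43227 threads
```

So a single 100% network looks comfortably under the ~4M cap; it's stacking many such simulations on one machine that hits the limit.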

I wonder why they chose a hard upper limit for PID_MAX_LIMIT. Systems people don't like limits ;)

@robgjansen
Member

robgjansen commented Apr 4, 2024

On a more serious note, many people are going to have access to smaller machines and not that many people are going to have access to giant near-supercomputers. So I think designing for the general case is the correct strategy for Shadow. Thus, multi-machine simulation support would be the feature I would support on the Shadow side, and it would have other benefits as well. It may allow people to utilize many small cheaper machines more effectively.

For those of us wanting to run a crazy number of simulations on one machine, the hypervisor approach could work. I never played around with the type of configuration we want, but it might be worth documenting if we figure out how to do it.

@sporksmith
Contributor

For the multiple-simulation use-case, I wonder if this limit is actually global or if it's per PID namespace? https://www.man7.org/linux/man-pages/man7/pid_namespaces.7.html

If the latter, then maybe putting each sim in its own PID namespace would at least be a somewhat lighter-weight solution than putting them each in a full VM.

@jtracey
Contributor Author

jtracey commented Apr 10, 2024

On a more serious note, many people are going to have access to smaller machines and not that many people are going to have access to giant near-super computers. So I think designing for the general case is the correct strategy for Shadow. Thus, multi-machine simulation support would be the feature I would support on the Shadow side, and it would have other benefits as well. It may allow people to utilize many small cheaper machines more effectively.

Agreed, more commonly available setups should definitely be the priority. I was just discussing it with Ian, and he wanted me to make sure this limitation is documented somewhere, since it did ultimately limit the size of experiments we could run in a feasible amount of time. :)

For the multiple-simulation use-case, I wonder if this limit is actually global or if it's per PID namespace?

That's a good idea. I suspect there will still be some kernel data structure somewhere that won't allow it, but I'll try to test that and see what happens.
