State space exploration based on Shadow #3203

Jevaites · 2023-10-09T14:40:05Z

Hello,
For a research project I'm considering using Shadow to enable state space exploration of distributed systems.
I would simulate said distributed systems with Shadow, but add a replay mechanism to explore the execution with different message delivery orders, then go back to continue the simulation, and so on, exploring the state space that way.

I wanted to ask you if you think using Shadow as a base is a good idea for this, if this sounds feasible to you, and also if you could give me some pointers as to how I would actually implement this without breaking everything, since the codebase is quite dense, and your insight might save me a lot of time and trouble.

Is it easy/feasible to access the state of the simulated program from the simulation controller, to modify the message delivery order, and to modify the simulation flow to introduce this replay mechanism?

I appreciate any help, thanks in advance! (If needed, I'd be happy to hop on a call.)

PS: I have already read the paper and the docs, run the examples with Shadow, and read the codebase (superficially). I'm looking for pointers to go deeper.

stevenengler · 2023-10-09T20:40:42Z

Collaborator

This sounds like an interesting use-case for Shadow.

> Is it easy/feasible to access the state of the simulated program from the simulation controller

I would say it's easy to access the state of the simulated program, but it wouldn't be easy to restore that state. For example, if you were envisioning a checkpoint/restore type feature that lets you resume from an earlier point in the simulation, that would probably be tricky to get working. You would need to restore Shadow's internal simulation state, and also the state of the Linux process (memory such as the stack and heap, registers, misc syscalls that Shadow passes through to Linux, etc.). The checkpoint/restore project might be able to help for the Linux part. So it's probably possible, but would be a lot of work.

Since Shadow is (mostly) deterministic, it might be easier to just restart the simulation from the beginning to get back to your "checkpoint" state. But this could be costly in terms of simulation time if you have a large space to explore.
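To make the restart-from-the-beginning idea concrete, here is a minimal sketch of a "stateless" exploration loop. Instead of restoring a checkpoint, every state is reached by re-running from time zero with a recorded prefix of delivery-order choices. The `run` callback is hypothetical: it stands in for launching a deterministic simulation and forcing those choices, which would need hooks that this thread does not claim Shadow has today.

```python
from typing import Callable, Sequence


def explore(run: Callable[[Sequence[int]], int], max_depth: int) -> list[tuple[int, ...]]:
    """Enumerate schedules by replaying from the start for every prefix.

    run(prefix) is a hypothetical callback: it re-runs the simulation from
    time zero, applies the choices in `prefix` at successive choice points,
    and returns how many alternatives exist at the next choice point
    (0 when the run has finished).
    """
    complete = []
    stack = [()]  # prefixes of choices still to be replayed
    while stack:
        prefix = stack.pop()
        alternatives = run(prefix)  # one full re-run from the beginning
        if alternatives == 0 or len(prefix) == max_depth:
            complete.append(prefix)
            continue
        # Push in reverse so that choice 0 is explored first (depth-first).
        for choice in reversed(range(alternatives)):
            stack.append(prefix + (choice,))
    return complete


def toy_run(prefix: Sequence[int]) -> int:
    """Toy stand-in: three pending messages, each choice delivers one of them."""
    return 3 - len(prefix)


schedules = explore(toy_run, max_depth=10)
```

With the toy stand-in, `schedules` holds one entry per ordering of the three messages. Note that every entry on the stack costs a full re-run, which is the simulation-time cost mentioned above.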

> to modify the message delivery order

Shadow delivers packets from one host to another by creating an event at a time based on the configured latency between the two hosts. So if the latency between hosts A and B is 50 ms, a packet sent from A to B will be delivered in exactly 50 ms. What way would you plan to modify the message delivery order? Just reorder packets that arrive at the same time? Or add some network jitter so that packets don't arrive at a consistent time?
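As a toy illustration of the two options in that question (not Shadow's actual scheduler), the sketch below models delivery as events in a priority queue keyed by arrival time. With a fixed latency the arrival time is fully determined by the send time, while an optional jitter term lets packets overtake each other. The function name, the tie-breaking by send order, and the uniform jitter are all assumptions made for the example.

```python
import heapq
import random


def deliver(sends, latency_ms, jitter_ms=0, seed=0):
    """Return (arrival_ms, packet) pairs in delivery order.

    sends: list of (send_time_ms, packet) pairs from host A to host B.
    jitter_ms: upper bound of a uniform random delay added per packet
               (0 reproduces the fixed-latency behaviour described above).
    """
    rng = random.Random(seed)  # seeded so that a given run is repeatable
    queue = []
    for seq, (send_time, packet) in enumerate(sends):
        extra = rng.uniform(0, jitter_ms) if jitter_ms else 0
        arrival = send_time + latency_ms + extra
        # seq breaks ties between packets that arrive at the same time.
        heapq.heappush(queue, (arrival, seq, packet))
    order = []
    while queue:
        arrival, _, packet = heapq.heappop(queue)
        order.append((arrival, packet))
    return order


fixed = deliver([(0, "p1"), (10, "p2")], latency_ms=50)
jittered = deliver([(0, "p1"), (10, "p2")], latency_ms=50, jitter_ms=20, seed=1)
```

Here `fixed` is `[(50, "p1"), (60, "p2")]`. In `jittered` each arrival is shifted by up to 20 ms, so the relative order can change from seed to seed.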

5 replies

stevenengler Oct 9, 2023
Collaborator

One option to look into could be running Shadow in a Docker container, and using Docker with CRIU to checkpoint/restore the entire container.

sporksmith Oct 10, 2023
Collaborator

A process checkpoint/restore project I happen to be familiar with is Flashback. It's quite old now, but might also be worth a look. I think a similar idea could work with an implementation inside Shadow (instead of in the Linux kernel), but handling file descriptors and threads could be difficult. (I think Flashback also had some challenges there)

sporksmith Oct 10, 2023
Collaborator

MineSweeper might also be interesting to look at; I worked on that one, but more on the symbolic execution part than the checkpoint/restore part. IIRC we used QEMU to checkpoint/restore the whole VM. I suppose you could do something similar and roll back a VM that's running a Shadow simulation, rather than trying to build checkpoint/restore inside Shadow itself. Then you'd "just" need to add some hooks in Shadow to let you alter message ordering etc. on replay.

sporksmith Oct 10, 2023
Collaborator

I guess the latter is similar to @stevenengler's CRIU idea, which would probably be a bit lighter weight 😆

robgjansen Oct 10, 2023
Maintainer

If your distributed system's bootstrap time is small, then just restarting Shadow from the beginning as @stevenengler suggested makes the most sense to me. Otherwise, I agree that an external tool would probably work more smoothly and be far easier to get started with than instrumenting Shadow itself.

For example, the following seems plausible to me. You could set up your base simulation so that the nodes read in a params.config file at e.g. 601 seconds, and the config file tells each node which part of the state space to explore. Then to initialize, you run Shadow to 600 seconds, send it a SIGSTOP signal, and then save a container checkpoint. From there, you just need a script that restores your checkpoint, overwrites params.config with whatever params you want to test in this run, and then sends a SIGCONT to Shadow. When Shadow advances to 601 seconds, your nodes will read the params.config file and run the test as configured.
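A rough sketch of the restore script described above could look like the following. It assumes several things that are not confirmed in this thread: that Shadow runs as the main process of a Docker container, that Docker's experimental checkpoint support (backed by CRIU) is enabled, that the restored process is still stopped after the restore, and that params.config lives on a path the host can write to (e.g. a bind mount). The container name, checkpoint name, and the `key=value` file format are placeholders.

```python
import subprocess
from pathlib import Path


def run_variant(container, checkpoint, params_file, params, run=subprocess.run):
    """Restore the saved checkpoint, swap in new params, then resume Shadow.

    `run` is injectable so the command sequence can be inspected without
    Docker; by default the commands are executed with subprocess.run.
    """
    # 1. Restart the container from the checkpoint taken at 600 seconds
    #    (assumes Docker's experimental checkpoint feature is available).
    run(["docker", "start", "--checkpoint", checkpoint, container], check=True)

    # 2. Overwrite params.config; the key=value layout is a made-up format.
    text = "".join(f"{key}={value}\n" for key, value in params.items())
    Path(params_file).write_text(text)

    # 3. Resume the process that was stopped with SIGSTOP before checkpointing.
    run(["docker", "kill", "--signal", "SIGCONT", container], check=True)
```

The one-time setup would be the mirror image: stop the container's main process with `docker kill --signal SIGSTOP`, then save the state with `docker checkpoint create`. How to detect that the simulation has reached 600 seconds is left open here.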
