
Give access to subsystem handle from toplevel #55

Closed

Conversation

@Ten0 commented Mar 12, 2023

This allows accessing a more permissive `start` function, as well as starting nested subsystems without introducing one additional level of indirection that requires `Send + 'static`.
@Ten0 force-pushed the give_access_to_subsystem_handle_from_toplevel branch from 041f191 to c435783 on March 12, 2023 10:39

// Obviously these two things are more complex in reality
// thing needs &mut monitoring to initialize, and then that needs to be moved
// to the Monitoring task
let mut monitoring_registry = ();

let tasks = Toplevel::<anyhow::Error>::nested(toplevel.subsystem_handle(), "Tasks");
for task in 0..2 {
    let thing_to_work_on = init_thing(&mut monitoring_registry);
    // ...
        .start(&format!("Task {task}"), move |subsystem| async move {
            // This simulates normal work, but obviously it would normally
            // select on `thing` and on_shutdown_requested
            let _thing = thing_to_work_on;
            // ...
            res
        })
// ...
    .start("Monitoring", move |ss| async move {
        let _monitoring = monitoring_registry;
        // ...

@Finomnis (Owner) commented Mar 13, 2023

I think there are a couple of misconceptions in your example:

  • `ss.request_global_shutdown()` is probably the wrong function in "Tasks"; `ss.request_shutdown()` is probably what you want.
  • `tasks_ss.request_shutdown()` does not shut down `tasks_ss`; it shuts down the `Toplevel` it is contained in. The only way of shutting down a nested subsystem is from its parent.
  • Don't delay returning `res` until the end of the function in "Tasks". Returning an error triggers a shutdown automatically.

Out of curiosity/feedback:

  • Why do you use a nested subsystem in the first place? Because you want to shut down as soon as all the nested tasks are finished? I ask because I'm trying to understand what API I should improve in a potential rework. The current implementation is quite robust, but it is almost impossible to extend any further in functionality. I ask about this specifically because one piece of functionality I already found to be desperately needed is `wait_for_all_children()` inside of a subsystem. That would remove the need to use a nested `Toplevel`, correct?
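A rough sketch of the semantics meant here, expressed with a plain `tokio::task::JoinSet` rather than the crate's actual subsystem machinery (illustrative only, not the crate's API):

```rust
use tokio::task::JoinSet;

async fn parent_subsystem() {
    // Spawn some child tasks that belong to this parent.
    let mut children = JoinSet::new();
    children.spawn(async { /* child 1 */ });
    children.spawn(async { /* child 2 */ });

    // "wait_for_all_children": don't return until every child has finished,
    // which is roughly what the nested Toplevel is used for in the example above.
    while let Some(_result) = children.join_next().await {}
}
```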

I'm not sure I like exposing the SubsystemHandle in the Toplevel object. It's more of an implementation detail, and it's not supposed to be used the way you intend to use it here. I feel like a refactoring in your code could achieve a better result. Maybe I'm wrong, though; I might have to think about it more. Sadly I currently have very little time, so it might have to wait a little :/ I apologize.

@Ten0 (Author) commented Mar 14, 2023

Apart from the awkward syntax when spawning tasks in a loop (which I'd like to avoid) and the impossibility of creating, from the main thread, objects that hold a subsystem handle so that they can spawn new tasks autonomously (a need my example doesn't really showcase, but it goes with the restrictive interface): yes, the way this example is built showcases a need to shut down in a particular order, where the monitoring stops only after all the other tasks have finished shutting down.

While one vision could be that the subsystem handle is an implementation detail, another could be that it's a remote referencing a Toplevel: something that enables starting tasks, stopping that Toplevel, stopping globally, and checking whether one should stop. While this may imply some renaming, it seems like a reasonable public API, and a much less restrictive one.

@Finomnis (Owner) commented Mar 14, 2023

Are you 100% sure this does not break anything? For example, it might allow spawning new tasks while the system is already shutting down. I hid the toplevel subsystem on purpose, because I don't think it's safe to be used directly in all cases.

I think what your problem showcases instead is that a rework is indeed necessary.

I sadly have little time at the moment, though. Maybe someone can use my solution as inspiration for a better API?

@Ten0 (Author) commented Mar 14, 2023

> For example, it might allow spawning new tasks while the system is already shutting down

IIUC all of that is already possible once the subsystem handle is leaked through any task.
I guess if I wanted to circumvent this limitation, I could spawn a task that sends its subsystem handle through a oneshot channel, and await the receiver right after. That just seems like a terribly wrong way to do it compared to opening this PR.
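For concreteness, a minimal sketch of that workaround, assuming the `start`/`SubsystemHandle` closure style used elsewhere in this thread (exact signatures are an assumption, and this is exactly the indirection the PR tries to avoid):

```rust
use tokio::sync::oneshot;

let (tx, rx) = oneshot::channel();
let toplevel = toplevel.start("handle_leaker", move |subsys| async move {
    // Leak a clone of the handle to the outside, then just wait for shutdown.
    let _ = tx.send(subsys.clone());
    subsys.on_shutdown_requested().await;
    Ok::<(), anyhow::Error>(())
});
// Back outside: await the receiver right after, as described above.
let leaked_handle = rx.await.expect("subsystem was started");
```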

> `ss.request_global_shutdown()` is probably the wrong function in "Tasks"; `ss.request_shutdown()` is probably what you want.

IIUC these achieve the same effect in this case, because `ss` at that point is the top-level `Toplevel` (naming ^^). I just wanted to make it very clear that at that point I really wanted to shut down everything and not just the tasks.

> `tasks_ss.request_shutdown()` does not shut down `tasks_ss`; it shuts down the `Toplevel` it is contained in. The only way of shutting down a nested subsystem is from its parent.

Yes, and since that `Toplevel` is `tasks`, it means I'm only requesting shutdown of the tasks, and in particular not of the monitoring.

@Finomnis (Owner)

> For example, it might allow spawning new tasks while the system is already shutting down

I think this might be covered, because there is already a mechanism in place that prevents new tasks from getting spawned once the system is shutting down. There is an option somewhere that gets collected on shutdown. So this might be safe after all.
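In case it helps, a generic sketch of that kind of guard (not the crate's actual internals): registration goes through an `Option` that is taken on shutdown, so late spawn attempts are simply refused:

```rust
use std::sync::Mutex;
use tokio::task::JoinHandle;

struct Spawner {
    // `Some(...)` while running; taken (set to `None`) when shutdown begins.
    handles: Mutex<Option<Vec<JoinHandle<()>>>>,
}

impl Spawner {
    /// Returns `false` if shutdown has already started.
    fn try_spawn<F>(&self, fut: F) -> bool
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        match self.handles.lock().unwrap().as_mut() {
            Some(handles) => {
                handles.push(tokio::spawn(fut));
                true
            }
            None => false, // shutting down: refuse to spawn new tasks
        }
    }

    /// Called once at shutdown: after this, `try_spawn` always fails.
    fn begin_shutdown(&self) -> Vec<JoinHandle<()>> {
        self.handles.lock().unwrap().take().unwrap_or_default()
    }
}
```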

@Finomnis (Owner) commented Mar 14, 2023

I feel like my crate is too opinionated for your use case...

I really don't feel comfortable with leaking the SubsystemHandle. I kind of regret the decision to make it owned/clonable in the first place. The handle was originally meant to stay within the subsystem, to interact with the shutdown system. The value returned from `start` is what should be used as an external handle.

But I think the API got too convoluted and the naming is all wrong. And now people use the internal handle as an external handle and store it in structs and such.

I'd like to perform a rework where the internal token is a scoped `&mut` reference that can't be leaked out of the subsystem, plus an external handle that is more powerful than the current one.
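To make the idea concrete, a hypothetical sketch (not the crate's API) of what such a scoped token could look like: the subsystem body only ever sees a `&mut` borrow, so the borrow checker prevents it from being stored in a `'static` struct or moved into another task:

```rust
// Hypothetical type, for illustration only.
struct SubsystemToken {
    // shutdown signalling, child registration, ...
}

impl SubsystemToken {
    fn request_shutdown(&mut self) { /* ... */ }
}

async fn my_subsystem(token: &mut SubsystemToken) {
    // Fine: the token is used in place, within the subsystem.
    token.request_shutdown();

    // Does NOT compile: the borrow is not `'static`, so it cannot be leaked
    // into a spawned task (or stored in a long-lived struct).
    // tokio::spawn(async move { token.request_shutdown() });
}
```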

@Ten0 (Author) commented Mar 14, 2023

Yet as far as I could tell from searching lib.rs for a library that handles graceful shutdown in the tokio universe, this one is the best suited to my use case ;)

@Finomnis (Owner)

I think I'll reject this for now as it doesn't really fit into existing design decisions.

@Finomnis Finomnis closed this Apr 22, 2023
@Ten0 (Author) commented Jun 12, 2023

> I'd like to perform a rework where the internal token is a scoped `&mut` reference that can't be leaked out of the subsystem, plus an external handle that is more powerful than the current one.

FYI (in case it matches the rework you had in mind): I ran into issues using this crate where it would sometimes freeze when tasks were supposed to time out. That was very hard to investigate, because there is a lot of code and I couldn't extract a minimal repro, so I figured the easiest way to fix it would be to rewrite a simpler version of this crate from scratch.

I've published it as tokio_tasks_shutdown (it has about a third of the code of tokio-graceful-shutdown).

@Finomnis (Owner)

Is the freeze related to #50?

If so, I don't think you will get rid of it in a rewrite; it's deeply rooted in tokio...

But sure, that's why I like the open-source community: if something isn't good enough, people can always fork/rewrite/improve. Kudos for the effort :) I might try it out some day.

@Finomnis (Owner)

But I would indeed be happy about a reproducible hang; a lot of effort was put into making this entire thing as reliable as possible, so if I missed some weird corner case, I'd love to hear about it.

@Ten0 (Author) commented Jun 12, 2023

> I would indeed be happy about a reproducible hang; a lot of effort was put into making this entire thing as reliable as possible, so if I missed some weird corner case, I'd love to hear about it.

I really tried: I did spend two hours trying to extract a repro and didn't manage (that was made even harder by the fact that this would only happen in production). It seemed that either `TaskHandle::abort()` was never called for some tasks, or it had no effect (we had the timeout log), and then we were hanging on

`result = self.attempt_clean_shutdown() => result`

I had a look at #50 back then, and it seemed that wasn't it, since:

  • We didn't get a successful shutdown (it hangs before)
  • Before adding tokio-graceful-shutdown to the program, it wouldn't hang, which led me to think that a hanging task wasn't what was causing the issue

However, it is for this scenario that I added an extra timeout that stops waiting on aborted hanging tasks, so the user can make sure this won't hang even in that case.
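For reference, a minimal sketch of that extra timeout in plain tokio terms (generic, not either crate's actual code): after aborting, don't wait indefinitely for the aborted tasks to actually terminate:

```rust
use std::time::Duration;
use tokio::task::JoinHandle;
use tokio::time::timeout;

async fn abort_with_deadline(handles: Vec<JoinHandle<()>>, deadline: Duration) {
    for handle in &handles {
        handle.abort();
    }
    for handle in handles {
        // An aborted task can still hang if it is stuck in non-cancellable
        // code (e.g. blocking inside `spawn_blocking`); give up after
        // `deadline` instead of hanging the whole shutdown.
        if timeout(deadline, handle).await.is_err() {
            // log and move on
        }
    }
}
```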

Design-wise, I'm using a single control task selecting on channels instead of mutexes. I think that makes it easier to reason about (and of course even simpler, since I didn't include the subsystems concept), so hopefully if there is such a potential corner case I didn't hit it 🤷. And if #50 was indeed the issue, the new timeout should solve it (worst-case scenario, the user can always call `process::exit` as long as the function returns).
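As an illustration of that design (message names and shapes here are made up, not the published crate's code), everything goes through one channel-driven control task, so there are no locks to reason about:

```rust
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

enum ControlMsg {
    Register { name: String, handle: JoinHandle<()> },
    Finished { name: String },
    ShutdownRequested,
}

async fn control_task(mut rx: mpsc::Receiver<ControlMsg>) {
    // All bookkeeping lives in this one task.
    let mut running: Vec<(String, JoinHandle<()>)> = Vec::new();
    let mut shutting_down = false;

    while let Some(msg) = rx.recv().await {
        match msg {
            ControlMsg::Register { name, handle } if !shutting_down => {
                running.push((name, handle));
            }
            ControlMsg::Register { handle, .. } => {
                // Late registration during shutdown: refuse it.
                handle.abort();
            }
            ControlMsg::Finished { name } => {
                running.retain(|(n, _)| n != &name);
                if shutting_down && running.is_empty() {
                    break; // clean shutdown complete
                }
            }
            ControlMsg::ShutdownRequested => {
                shutting_down = true;
                // A real implementation would first signal graceful shutdown
                // (e.g. via a CancellationToken) and only abort after a timeout.
                if running.is_empty() {
                    break;
                }
            }
        }
    }
}
```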

I'll let you know if that doesn't work. 😊
