Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stopping postprocessing service #9096

Open
jvillafanez opened this issue May 7, 2024 · 4 comments
Open

Stopping postprocessing service #9096

jvillafanez opened this issue May 7, 2024 · 4 comments

Comments

@jvillafanez
Copy link
Member

Is your feature request related to a problem? Please describe.

There isn't a clean way to gracefully stop the postprocessing service.

Describe the solution you'd like

The postprocessing service should be able to be stopped in a graceful way on demand. Right now, the service is killed, which could lead to problems due to inconsistent states that could happen if the service is abruptly stopped.

In order to provided a graceful shutdown of the services there are some things to consider:

  • The postprocessing service can store metadata in the memory with no persistence. This could be a problem because even if the restart the service (without killing the whole app), the previous metadata would likely be inaccessible; we might need to process events whose metadata has been lost.
  • I don't think there is any guarantee about the order of the events coming from the event queue (nats I guess). This means we might get events for uploads 2 and 3 but the postprocessing of the upload 1 is still ongoing, and the "postprocessing done" event might be queued after other events. This has a couple of consequences:
    • As said in the previous point, we might need to ensure that all the postprocessing of all ongoing uploads are completed. By doing this, we shouldn't need to persist any metadata because all the uploads would be done.
      This should follow the standard procedure of a server shutdown: no new event will be processed, and we'll let any ongoing upload to fully finish. However, this could mean that shutting down the service could take a while (the timeout of the runner is 10 secs by default, so we might need to adjust the value for this specific service)
    • Since we don't have any event order guaranteed, we'll need to skip over some events. Skipping implies that the event will remain in the queue (so it's processed later or by a different postprocessing replica), but we still need to get events in order to finish the current uploads.

Some technical considerations:

  • We're connected to an event stream. We'll need to disconnect from that stream so we don't get any additional event.
  • From the outside, the same way there is a Run method to start the postprocessing service, we should have a Stop method in order to stop it. This will hide all the complexity described above.
  • Times for the shutdown of the service should be measured in order to provide a reasonable timeout value for the runners. We want to give enough time for the service to shutdown properly, and the default value of 10 secs might not be enough.
  • Ideally, the Run method should return only when everything is done and the service has completely finished. This is what the runners expect. Alternatively, the responsibility could fall into the Stop method, which should return when the service has completely finished. Go-micro seems to use this other approach.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

@kobergj
Copy link
Collaborator

kobergj commented May 8, 2024

The postprocessing service can store metadata in the memory with no persistence.

This is not correct. Postprocessing stores its data in a configurable store (nats by default). This store is persistent across restarts. That means postprocessing can be killed and restartet without losing data.

It only needs to finish working on events it is currently working on.

Note: Almost all services are connected to the event stream. If this is a problem for graceful shutdown, all services have it.

Since we don't have any event order guaranteed, we'll need to skip over some events.

Why? The order of the events does not matter to postprocessing service.

@jvillafanez
Copy link
Member Author

The postprocessing service can store metadata in the memory with no persistence.

This is not correct. Postprocessing stores its data in a configurable store (nats by default). This store is persistent across >restarts. That means postprocessing can be killed and restartet without losing data.

The docs mentions an in-memory store used by default, which seems to be accessible. Maybe we should either deprecate or remove that option at least from the docs.

Since we don't have any event order guaranteed, we'll need to skip over some events.

Why? The order of the events does not matter to postprocessing service.

I assume the events are persisted in nats, so it depends on whether the metadata is also persisted or not.

If the metadata is persisted, then we could stop the postprocessing service any time. After restarting the service, we receive the next event, we get the metadata and we continue the postprocessing. The behavior would be similar to the service having a big delay.

Since the in-memory store is a valid option (not yet deprecated / removed), we can't rely on the metadata to be persisted. If it isn't persisted, we have to process any event relying on metadata before stopping the service, otherwise, by the time we need to process the event, its metadata will be gone and will cause issues.

A possible scenario is that we start the postprocessing of upload 10 and we need to stop the service at that point. If the metadata isn't persisted, we'd need to keep processing events until we're done with the upload 10, however, there could be more events between the start and the end of the upload 10. These events should be skipped because we don't want to start processing new uploads when we've received a stop signal.

If we can guarantee that the metadata is persisted, it would simplify the handling because we could stop processing the events anytime.

@kobergj
Copy link
Collaborator

kobergj commented May 8, 2024

The docs mentions an in-memory store used by default

Yes true, postprocessing docs are outdated. nats-js-kv is used by default: https://github.com/owncloud/ocis/blob/master/services/postprocessing/pkg/config/defaults/defaultconfig.go#L37-L41

Maybe we should either deprecate or remove that option at least from the docs.

Imo not needed. We can add a section to postprocessing docs that inmem should only be used for testing or in small installations. If files get stuck because of the postprocessing service dying, one needs to manually restart the postprocessing of these uploads.

I assume the events are persisted in nats, so it depends on whether the metadata is also persisted or not.

I don't understand. What do you mean by metadata?

Since the in-memory store is a valid option (not yet deprecated / removed), we can't rely on the metadata to be persisted. If it isn't persisted, we have to process any event relying on metadata before stopping the service, otherwise, by the time we need to process the event, its metadata will be gone and will cause issues.

Let's keep it simple. If you restart your inmem postprocessing service you need to restart your uploads too. I think that is quite fair. We even have a simple ocis command to do so.

@jvillafanez
Copy link
Member Author

Initial solution included in #9048 .

There is a small problem though: it seems reva spawns a goroutine in order to deliver the events to the postprocessing service through a channel, however, this goroutine won't finish and will get stuck waiting for the event to be read from the channel. It's also unclear what happens down below, specially using natsjs, because we aren't unregistering the connection.
This should be fine if the goal of stopping the service is to provide a graceful shutdown, which means the program will end shortly after. However, recreating the service could lead to memory leaks because reva's goroutines would still block resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants