bug: Concurrent requests with the streaming feature produce parallel calls to the runner #4624

Open

Hubert-Bonisseur opened this issue Mar 29, 2024 · 0 comments

Labels: bug Something isn't working

Hubert-Bonisseur commented Mar 29, 2024

Describe the bug

To enable streaming in BentoML, the runnable method must return an AsyncGenerator. Calling the method therefore returns immediately, even though the computation that produces the output is still running. The runner thus always considers the method complete and starts processing the next incoming request right away, regardless of whether a previous generator is still computing. As a result, there is no bound on the number of concurrent computations, and hence no limit on the memory footprint of the runner.
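
For concreteness, here is a minimal sketch of the kind of runnable I mean (class and method names are illustrative, and the decorator and class attributes follow my reading of the bentoml 1.1 Runnable API):

    import asyncio
    from typing import AsyncGenerator

    import bentoml


    class StreamRunnable(bentoml.Runnable):
        SUPPORTED_RESOURCES = ("cpu",)
        SUPPORTS_CPU_MULTI_THREADING = False

        @bentoml.Runnable.method()
        async def predict(self, prompt: str) -> AsyncGenerator[str, None]:
            # The call returns this generator object immediately, before any
            # output has been computed, so the runner treats it as complete.
            for i in range(100):
                await asyncio.sleep(0.1)  # stand-in for expensive model work
                yield f"token-{i}"

With two overlapping requests, both generators are created at once and advance concurrently, so the runner's memory use grows with the number of in-flight requests.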

To reproduce

No response

Expected behavior

The service should wait for the first AsyncGenerator to complete before requesting a new one.

A simple fix to this issue is to add a lock at the start of the runnable method:

    import threading
    from typing import AsyncGenerator


    class StreamRunnable:
        def __init__(self):
            self.predict_lock = threading.Lock()

        def predict(self, input) -> AsyncGenerator[str, None]:
            # The lock is held for the lifetime of the generator, so a new
            # stream cannot start until the previous one has finished.
            with self.predict_lock:
                # compute and yield whatever
                yield ...
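
Note that if predict is actually an async generator running on the runner's event loop (as the AsyncGenerator annotation suggests), a blocking threading.Lock could stall or even deadlock that loop while a second request waits. An asyncio.Lock held for the lifetime of the generator gives the same serialization without blocking the loop; this is only a sketch of the idea (names illustrative, not BentoML's own API):

    import asyncio
    from typing import AsyncGenerator


    class StreamRunnable:
        def __init__(self):
            self.predict_lock = asyncio.Lock()

        async def predict(self, prompt: str) -> AsyncGenerator[str, None]:
            async with self.predict_lock:
                # The lock is released only when this generator is exhausted
                # or closed, so streams are computed one at a time.
                for i in range(3):
                    yield f"token-{i}"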

I think this locking mechanism should either be implemented on the BentoML side, or its necessity should be made clear in the documentation.

Environment

bentoml==1.1.4
