Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slot supplier interface & fixed-size implementation #2014

Merged
merged 12 commits into from
Apr 10, 2024

Conversation

Sushisource
Copy link
Member

@Sushisource Sushisource commented Mar 22, 2024

What was changed

Implements the slot management proposal

Why?

Users have expressed the desire for more control over when workers choose to poll for new tasks

Checklist

  1. Closes

  2. How was this tested:
    Existing tests (adding some additional UTs in progress)

  3. Any docs updates needed?
    There will be, after this has had some time in preview

@Sushisource Sushisource requested a review from a team as a code owner March 22, 2024 21:21
@Sushisource Sushisource marked this pull request as draft March 22, 2024 21:22

void markSlotUsed(SlotInfo info, SlotPermit permit);

void releaseSlot(SlotReleaseReason reason, SlotPermit permit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can releaseSlots be blocking? If we had some sort of remote slot provider it seems like it could be?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can allow it to be blocking. The code has no expectation now that it might, and allowing it to could I think gum up stuff in unexpected ways. I mentioned in the proposal that users can delegate any blocking work to a background thread or to calls to reserve slot, and I think that's what they ought to do.

@Sushisource Sushisource force-pushed the slot-supplier branch 2 times, most recently from d814a14 to 76a5921 Compare March 25, 2024 23:54
@Sushisource Sushisource force-pushed the slot-supplier branch 9 times, most recently from 767552a to 7221d33 Compare April 5, 2024 20:44
@Sushisource Sushisource marked this pull request as ready for review April 5, 2024 20:44
@Sushisource Sushisource changed the title [DRAFT] Slot supplier interface & fixed-size implementation Slot supplier interface & fixed-size implementation Apr 5, 2024

releaseSlots(this.outstandingReservationSlotsCount - activityTasksCount);
// Release any slots that we won't be using
releaseSlots(this.reservedSlots.size() - activityTasksCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the slots are reserved based on the task queue don't we also need to keep track of that when using the slot and releasing the slot?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. I wouldn't say they're based on so much as that information is provided when making the reservation (implementations may or may not care). It does make sense, of course, that we might also want to provide that information when releasing.

Arguably, users could stuff that in the permit data, but maybe we want to make it first-class. In that case, I'd probably just stick it as a field in the permit. However, you could also argue that implementations that want to balance across multiple queues maybe don't want to associate a permit with a specific queue, in which case it makes more sense on some ReleaseContext object to provide as another arg when releasing.

The latter maybe makes a bit more sense to me?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to release context

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to clarify what I meant is your slot reservations are not fungible since they are tied to the task queue they are requested for. So you can't just blindly release first x slots you need to release the slots based on task queue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah that makes sense. I don't know that it necessarily must work that way, we could only make the guarantee that we'll return permits since they all came from the same supplier regardless of queue, but we can do it fairly easily and it's probably helpful so I'll make that change.

*
* @return the maximum number of slots that can ever be in use at one type for this slot type.
*/
int maximumSlots();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I wonder now if we really need this? It is totally valid to have an unbounded threadpool in Java by setting maximumPoolSize to Integer.MAX_VALUE.

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm just a little worried that could be a big footgun? They can always return a huge number here and accomplish the same thing if they want to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm just a little worried that could be a big footgun?

What is the footgun?

They can always return a huge number here and accomplish the same thing if they want to?

True, but why isn't that just the default then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The footgun is possibly way over-subscribing the CPU in the case of activity executions, for example. You can easily end up running way too many concurrently. Buuuuut, I suppose that's definitely not happening if they use our fixed size supplier, since it'll do the right thing, and if they implement their own, well, then it should be fairly clear to them not to do that, or at least is their responsibility.

So, yeah, I can see allowing it to default to something big.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The footgun is possibly way over-subscribing the CPU in the case of activity executions, for example.

That is a problem of handing out to many slots. Basically I don't think there is any technical reason we need this in Java so if we add it here we should add it to all SDKs.

Comment on lines 93 to 96
this.maxConcurrentActivityExecutionSize = o.maxConcurrentActivityExecutionSize;
this.maxConcurrentWorkflowTaskExecutionSize = o.maxConcurrentWorkflowTaskExecutionSize;
this.maxConcurrentLocalActivityExecutionSize = o.maxConcurrentLocalActivityExecutionSize;
this.workflowSlotSupplier = o.workflowSlotSupplier;
this.activitySlotSupplier = o.activitySlotSupplier;
this.localActivitySlotSupplier = o.localActivitySlotSupplier;
Copy link
Member

@cretz cretz Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two recommendations here:

  1. Leave the existing fields and make new fields mutually exclusive with them, don't need to try to have multiple getters/setters reference the same fields, the builders as simple getters/setters is clearer IMO.
  2. Add a new interface for WorkerTuner that wraps the three suppliers (e.g. can just have three getters that are given basic worker info as params and default return null and a default implementation that accepts these three nullable values as constructor or something). We've had good success with the singular option field for interceptors because from a user perspective, they like to just set one thing and forget it (e.g. otel impl). In most cases these three will be intertwined I think, e.g. builder().setTuner(new ResourceLimitedTuner(someOptions)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Quinn-With-Two-Ns Do you have any thoughts here? I feel like there are a lot of options here and it's mostly preferential. I think Chad's suggestions make sense, but there are lots of viable approaches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with 1. , I don't have a strong opinion on 2 , but I slightly prefer 3 separate options for each SlotSupplier since it fits with a builder pattern better and handle a mix of SlotSupplier and legacy config.

Copy link
Member

@cretz cretz Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about it from a user POV. What is the simplest configuration for a user to choose "resource limited tuning"? I am happy we didn't make a user set 3 different interceptor fields for use of OTel interceptors. It's not that the 3 things can't still be provided, it's just from a user and default impl, people shouldn't have to set three things if they don't need to. Even for fixed, they should be able to construct one thing with options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's dramatically easier for the user to provide separate suppliers for separate kinds, which is I think more likely the more common case than a shared implementation across different kinds (at least for custom implementations) - and it makes that case no harder. Setting two more options fields is effectively zero effort.

If it was one option, we either have to have something like WorkerTuner.activity(a).workflow(b).localActivity(c) which is now the same thing, but with one extra step of putting that whole thing in the worker config. Or we have to change the whole interface to pass the type on every reservation call which I think is much messier and conceptually more difficult for most implementations.

I'll change the options to keep the old values, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus, since we now already have mutually exclusive options, there's zero reason we can't do builder().setResourceLimitedTuner(options) and under the hood it sets all three.

Copy link
Member

@cretz cretz Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it was one option, we either have to have something like WorkerTuner.activity(a).workflow(b).localActivity(c)

It'd be more like builder().setWorkerTuner(new TheTuner(a, b, c)) which is nice because people can extend that one or get/share one from a third party lib or whatever

Plus, since we now already have mutually exclusive options, there's zero reason we can't do builder().setResourceLimitedTuner(options) and under the hood it sets all three.

That means only we get the single-option approach because we are Temporal. Everyone should be allowed to have users use their tuner via a single option too. In the meantime, they may have to provide a static applyToWorkerOptionsBuilder to make sure all three suppliers are set right.

Same thing for interceptors. Datadog was able to provide https://pkg.go.dev/go.temporal.io/sdk/contrib/datadog/tracing#NewTracingInterceptor that is set as a single place but it intercepts three things: client, activity, and workflow. We definitely could have required that users give us all three interceptors separately but I'm glad we didn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's an apples-to-apples comparison at all though. Your tracing interceptor isn't making decisions about what happens where the first and most important criteria is the type of slot - rather it is dealing with application-wide traces, and may or may not care about the specifics of the thing being traced. Here, you always care about the type of slot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation always cares about the type of thing it's intercepting or the slot, the one setting the worker option does not. Well, they may in some cases (e.g. fixed size things they may provide three options to this tuner), but they may not in other cases (e.g. they only want to provide resource limits to tuner or just leave defaults). But it should be up to the tuner implementation to decide whether the configurer cares about separate things or not (the implementation itself of course will always care about separate things).

Even in cases where you may want to mix (e.g. fixed and resource-limited), I think still requiring an impl that returns different slot suppliers for those specific cases is better UX than making everyone care about those separate slot suppliers in every case.

I think it may be worth seeing how a user would provide their own implementation of a resource tuner that needs to work across the three (we shouldn't be special in that we're the only ones that get single-option config).

* the reservation, or during shutdown. You may perform cleanup, and then should rethrow the
* exception.
*/
SlotPermit reserveSlot(SlotReservationContext<SI> ctx) throws InterruptedException;
Copy link
Member

@cretz cretz Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SlotPermit reserveSlot(SlotReservationContext<SI> ctx) throws InterruptedException;
Object reserveSlot(SlotReservationContext<SI> ctx) throws InterruptedException

IMO we should return an Object (can make this generic, your choice). There's no benefit to the SlotPermit class IMO, anyone can make their own or if they don't care, they can return new Object(). They don't need IDs (objects already have IDs) and therefore the whole object is "user data".

Note, I originally was going to suggest returning a future since they have better cancellation mechanism than forced thread interruption (so we can have better performing things like virtual threads in the future), but I figured it'd be too much work for users currently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having the internal ID because it ensures equality & hashing always work properly, regardless of what the user ends up doing with their data (yeah, objects have an ID, but they could end up implementing eq/hash at the top level in a way that potentially breaks things - but to your point we could get rid of the internal ID and still have that work seemingly), and it also makes for a nice, clear type at the top-level. The type name alone feels worth it to me.

Copy link
Member

@cretz cretz Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, objects have an ID, but they could end up implementing eq/hash at the top level in a way that potentially breaks things - but to your point we could get rid of the internal ID and still have that work seemingly

I'd have to understand this a bit better. Equals/hashCode will work for new Object() directly and if someone overrides those for a specific object, I would expect them to work too. I don't think we should ever invoke equals/hashCode on this object for anything though.

Using user data inside the permit is a bit rough, but if we are expecting that to be the user method for extensibility, can you make the class final so it can't be extended? I would have originally thought we would encourage people to return their own instance to keep state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hash/eq is used because we use it in the hash map in the tracking supplier. So, sure, we expect it to work just fine - but our implementation will always be correct (and likely cheaper) than a user-supplied one.

Copy link
Member

@cretz cretz Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should then make SlotPermit final (so nobody extends it and messes up this expectation) and you should rely on the default equals/hashCode which is identity based (i.e. don't override it). I think it's clearer that way. We should also document to users that the exact same instance of slot permit is the one given back later so they can have similar expectations.

private final String workerBuildId;
private final boolean fromStickyQueue;

public WorkflowSlotInfo(
Copy link
Member

@cretz cretz Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would suggest hiding this object-building detail (can consider making slot infos abstract classes or interfaces or something and move construction to internal if wanted, or just change the constructor for every field in here and potentially add javadoc on this one and activity one saying not to rely on stability of constructor signature).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the lack of granular enough visibility specifiers is annoying. I'll use a doc comment for now and we can potentially change it later if/when we change more here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, but we should consider a POJO constructor for these fields too. People will be instantiating this to test their suppliers I think. (I wonder if we need to account for that with the context classes too and make those easy to make, but meh as interfaces they should be easy to mock)

this.userData = null;
}

public SlotPermit(Object userData) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect slot providers will often be implemented by chaining or wrapping other slot providers, where each slot provider may want to attach data to the permit. Maybe an API like a Go context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking something like permit.withData(obj)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where allowing the Java author to return the type of their choice (even if they were forced to extend SlotPermit though not as much of a fan there) gives more freedom than attaching an object to an existing object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not want users to be able to return an object of their choice. I would like to maintain control over the object and provide an interface to attach and get metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Ok, no prob. Still not a big fan of the ID in here, but no prob there too. Mentioned in other comment, but we should probably make this class final, consider removing the equals/hashCode override (i.e. default to identity-based), and document on the context interfaces that the permits are the exact same instances as the ones returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to do that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify I was taking about having a map of userdata like a Go context does so different providers wrapping each other can add their own metadata without impacting each other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like a typemap of <T -> Data> would be useful. But, perhaps we add that later if requested

Fix todos

Give eq/hash to SlotPermit

Rename package

Add more to slot infos

Move metric tracking inside of tracking supplier

Publish junit test report / set some task queue kinds

Don't bother with futures since the value they're providing is unclear

Make use of in-use slot tracking map

Internalize tracking slot supplier

[Trial] Seeing if a dedicated thread fixes fixed-size w/ futures

Use completable future, fix sched-to-start timeout test

Fix accidentally removed call to newAttempt for LAs

Mark slots as used
& Apparently I can't update year in license header

Stupid licenses

Tests are mostly working? Local timeouts that I think are unrelated.

Plumbed through & compiling, but a number of TODOS

Add interface, supporting classes, and fixed size impl
*
* @return the maximum number of slots that can ever be in use at one type for this slot type.
*/
default int maximumSlots() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default int maximumSlots() {
default int getMaximumSlots() {

Pedantic and don't have strong opinion here, just saw #2014 (comment) marked resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, must have marked it by accident.

this.userData = null;
}

public SlotPermit(Object userData) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Ok, no prob. Still not a big fan of the ID in here, but no prob there too. Mentioned in other comment, but we should probably make this class final, consider removing the equals/hashCode override (i.e. default to identity-based), and document on the context interfaces that the permits are the exact same instances as the ones returned.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly there, things left IMO:

  • At least make SlotPermit final and document we use same instances (should also consider removing unnecessarily ID and removing unnecessary equals/hashCode)
  • Figure out the worker options (would like to leave existing ones alone and just make mutually exclusive at build time, and would like a single option to set a tuner instead of 3)
  • Add @Experimental on all types in tuning package and on the new worker option(s)

private final String workerBuildId;
private final boolean fromStickyQueue;

public WorkflowSlotInfo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, but we should consider a POJO constructor for these fields too. People will be instantiating this to test their suppliers I think. (I wonder if we need to account for that with the context classes too and make those easy to make, but meh as interfaces they should be easy to mock)

import java.util.Objects;

/** Contains information about a slot that is being used to execute an activity task. */
public class ActivitySlotInfo extends SlotInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is gonna be annoying I know, but I think you should mark every class in this package and the worker options as @Experimental (this is an annotation in the io.temporal.common package).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The worker options are marked experimental as is the interface class, which seems sufficient to me, but sure I'll mark them all that way.

*/
@Experimental
public final class SlotPermit {
public final Object userData;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it's final, it's a bit of a faux pas to have public fields. Would prefer getter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@Sushisource Sushisource force-pushed the slot-supplier branch 2 times, most recently from 253dace to a298f26 Compare April 9, 2024 22:06
@@ -71,12 +71,12 @@ public void markSlotUsed(SI slotInfo, SlotPermit permit) {
publishSlotsMetric();
}

public void releaseSlot(SlotReleaseReason reason, SlotPermit permit) {
public void releaseSlot(SlotReleaseReason reason, SlotPermit permit, String taskQueue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taskQueue should not be needed I think, it should always be the same as what was provided in SlotReservationData

@Quinn-With-Two-Ns
Copy link
Contributor

Can we make issue (if we don't already) for:

  • Pausable slot provider
  • Resource based slot provider
  • Springboot integration for slot supplier

}

public int getMaxConcurrentLocalActivityExecutionSize() {
return maxConcurrentLocalActivityExecutionSize;
return localActivitySlotSupplier.getMaximumSlots();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 getMaxConcurrent*ExecutionSize() should be zero if the corresponding maxConcurrent*ExecutionSize is not set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a buidlers setter should only effect one getter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah forgot to change those ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I was hoping not to retain the property in the built workflow options since having mutually exclusive stuff represented in the built type is less than good from a type design perspective. It also means the logic for the threadpool size has to get pushed out of the options, but, that's not so bad.

@Quinn-With-Two-Ns
Copy link
Contributor

Look good! my only issue is the handling of eager slots looks incorrect now since we made slots non fungible, my other points are more nits or things we can tackle later.

@Sushisource
Copy link
Member Author

Can we make issue (if we don't already) for:

* Pausable slot provider

* Resource based slot provider

* Springboot integration for slot supplier

Added #2030 and #2031

Won't add the resource one until we're done with the PoC, and then I'll make items for that everywhere.

@Sushisource Sushisource merged commit d2a06fc into temporalio:master Apr 10, 2024
7 checks passed
@Sushisource Sushisource deleted the slot-supplier branch April 10, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants