Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread Tracker #336

Open
numeralnathan opened this issue Mar 4, 2022 · 10 comments
Open

Thread Tracker #336

numeralnathan opened this issue Mar 4, 2022 · 10 comments

Comments

@numeralnathan
Copy link

Please create a thread tracker policy. It tracks what threads are currently executing inside the policy. Please add the following methods to the policy.

  • getThreads() returns a snapshot of the current threads executing inside the policy
  • isExecuting(Thread thread) returns true if the given thread is currently executing inside the policy
  • getCallStacks() returns a String of the call stacks of all the threads currently executing inside the policy

Let's say a master thread executes inside a timeout policy. While doing so, it creates a bunch of child tasks to execute using a thread pool. The master thread then waits for the child tasks to finish. If the master thread times out, the thread tracker policy can be used to log the call stacks of the threads executing the child tasks. Hence, one can diagnose where the child tasks hung.

@Tembrel
Copy link
Contributor

Tembrel commented Mar 4, 2022

This seems like a very big ask. In general, Failsafe doesn't have access to the innards of the scheduler being used. Even if it did, you wouldn't want to saddle all Failsafe users with the performance hit of keeping track of all those threads.

Instead, you'd want to be able to opt in. I guess one could provide an instrumented scheduler/thread pool that knew how to look into the running states of its threads, but that would inevitably slow down and distort the operation of the pool. This might be acceptable in some contexts, but it would be very hard to provide guidance on how and when to use this.

@numeralnathan
Copy link
Author

Why would Failsafe need access to the scheduler? The thread tracker can capture Thread.currentThread() before and after execution. For example...

private static final ThreadTracker<Widget> TRACKER = ThreadTracker.builder().build();

public Widget createWidget()
{
   return TRACKER.get(() -> makeWidget());
}

private Widget makeWidget()
{
   return new Widget();  // An operation that could block for a long time
}

public static void logCallStacks()
{
   LOGGER.warn(TRACKER.getCallStacks());
}

The TRACKER is only involved in creating Widgets. This won't impact all the threads. When another thread detects a problem, it can call logCallStacks() to dump diagnostic information.

Since the thread tracker is a policy, the programmer has to create a thread tracker policy and then use it. The programmer will have to be informed of the scalability problem if too many threads hit the policy too quickly. However, I suspect the lock contention to be low since the tracking part will be something like.

private final Map<Thread, Integer> m_threads = new HashMap<>();

public void addThread(Thread thread)
{
   synchronized (m_threads)
   {
      m_threads.compute(thread, (key, count) -> count == null ? 1 : count + 1);
   }
}

public void removeThread(Thread thread)
{
   synchronized (m_threads)
   {
      m_threads.compute(thread, (key, count) -> (count == null) || (count == 1) ? null : count - 1);
   }
}

As you can see the critical region of the synchronized statements is quite small.

@jhalterman
Copy link
Member

jhalterman commented Mar 4, 2022

So the goal for this idea is to be able to identify threads that appear to be blocked and to dump call stack for Failsafe threads? Presumably, we'd also want a way to indicate which threads are associated with which Failsafe policies or tasks, such as a Timeout thread, retry, etc?

Even if it did, you wouldn't want to saddle all Failsafe users with the performance hit of keeping track of all those threads.

Definitely. Unless the tracking is very light, it should be optional.

@numeralnathan
Copy link
Author

Perhaps, I am asking for Timeout to allow for configuration by the programmer to dump the call stack of the executing thread when the deadline is reached. Tracking the executing thread can be done with something as simple as...

private volatile Thread m_thread;

public R execute()
{
   m_thread = Thread.currentThread();

   try
   {
      return /* do the operation */;
   }
   finally
   {
      m_thread = null;
   }
}

@jhalterman
Copy link
Member

In describing this ability in the docs, what would you say it's used for? Maybe you could talk more about your use case.

It's worth noting that Failsafe uses threads for a few different things. Are you just interested in tracking threads that are used for executing user-provided suppliers/runnables, or other threads that Failsafe uses internally also, such as for waiting on a retry, timeout, or rate limiter permit?

@numeralnathan
Copy link
Author

I am interested in tracking threads that are used for executing user-provided suppliers and runnables.

I have a program that launches a bunch of async tasks as a group. The entire group must complete before moving on. I use Timeout to ensure to flag a problem when the group takes too long. I would like to log the call stacks of the async tasks at the timeout so I can see where the async tasks hung.

@numeralnathan
Copy link
Author

If Timeout were enhanced to capture the call stack at the timeout of the thread executing the user-provided supplier or runnable, then the onFailure() could log that call stack.

@numeralnathan
Copy link
Author

I am looking at TimeoutExecutor.apply(). It can be enhanced to capture the call stack of the thread at the timeout.

On the line before timeoutFuture = Scheduler.DEFAULT.schedule(() -> {, add the line Thread thread = Thread.currentThread();.

On the line before ExecutionResult<R> cancelResult = ExecutionResult.exception(new TimeoutExceededException(policy));, add the line StackTraceElement stack[] = thread.getCallStack(); and pass stack to TimeoutExceededException constructor.

Add to TimeoutExceededException the method public StackTraceElement[] getCallStack().

Does this need an opt-in configuration? Yes.

The result of Thread.currentThread() comes from a register so this call will have zero impact. Passing the thread to the cancellation Callable lambda will use an extra 4-8 bytes of heap. These aren't the problem.

The problem is that calling thread.getCallStack() can have a significant impact on some JVM versions. Older JVMs will require a stop the world. Newer JVMs simply tell the thread to capture its call stack. When the thread does its safepoint, it will execute the getCallStack() operation and continue on its way. If the thread is blocked, then the calling thread does the work and momentarily prevents the thread from waking up. So for newer JVMs the impact is insignificant.

@jhalterman
Copy link
Member

jhalterman commented Mar 4, 2022

Grabbing the thread refs is not a challenge, DelegatingScheduler, which is used to schedule executions against an ExecutorService/ForkJoinPool, already does this to add interruption support to ForkJoinPool threads. The biggest challenge for me is making a feature like this fit nicely into the API, since the use case might evolve over time as it's better understood, leading to potential breaking changes in the API.

Right now Failsafe only allows a single execution at a time, but it may allow concurrent executions under certain scenarios in the future (via #231 and #291). Assuming the status quo remains, an API would probably only return a single thread, ex: Thread getExecutionThread().

Some possible places that getExecutionThread() could be exposed in the API include:

  1. On the FailsafeExecutor instance
  2. On an ExecutionCompletedEvent available via one of the policy event listeners.

I somewhat like option 2 better since it implies that an execution thread is tied to an individual execution attempt or moment in time (such as whenever a timeout happens to trigger), and can be made immutable in that event.

There are caveats to exposing the execution thread since by the time a user fetches it via getExecutionThread(), or sometime after, the thread could have been returned to a pool and is now doing something else. That something else could be something outside of Failsafe, or it could be another Failsafe execution attempt, in which case grabbing a thread dump could be misleading. This whole problem is only avoided if we don't expose threads after all, and instead prefer something like getCallStack which could only return something if the thread is still in use.

On the line before ExecutionResult cancelResult = ExecutionResult.exception(new TimeoutExceededException(policy));, add the line StackTraceElement stack[] = thread.getCallStack(); and pass stack to TimeoutExceededException constructor.

For async executions, the thread that sets up a timer might not be the same thread that later runs a Supplier/Runnable. We intentionally don't use a separate thread for the Supplier/Runnable until just before they're executed, in case the execution never happens, such as when a CircuitBreaker or RateLimiter prevents the executions.

@numeralnathan
Copy link
Author

numeralnathan commented Mar 4, 2022

To accommodate async, we can change the solution for both async and sync. TimeoutExecutor can implement something like...

private volatile Thread m_thread;

public void preExecute() {
   m_thread = Thread.currentThread();
}

public void preExecuteAsync() {
   m_thread = Thread.currentThread();
}

Scenario 1: The Execution Thread has not Started

Let's say the execution thread has not started because there is a delay in its start (e.g., RateLimiter). Let's say the cancel thread; however, managed to start.

  1. The cancel thread sees that m_thread is null.
  2. The cancel thread then passes null to TimeoutExceededException.
  3. When the user code eventually catches the TimeoutExceededException, the user code will see null for the stack and the Javadoc will say that null means the execution thread never started.

Scenario 2: The Execution Thread is Running

Let's say the execution thread started the operation but has not finished. Let's say the cancel thread starts running.

  1. The cancel thread sees that m_thread is not null and captures the execution thread's stack.
  2. The cancel thread passes the stack to TimeoutExceeededException.
  3. The cancel thread then executes resultRef.compareAndSet() and this returns true because the execution thread has not finished. This means the call stack is valid.
  4. When the user code catches the TimeoutExceededException, the user code will be able to log the stack.

Scenario 3: The Execution Thread Finished

Let's say the execution thread started the operation but has not finished. Let's say the cancel thread starts running.

  1. The execution thread finishes and does resultRef.compareAndSet().
  2. The execution thread goes back to the pool and gets something else to work on.
  3. The cancel thread then captures the execution thread's stack. The stack is for something else entirely.
  4. The cancel thread passes the stack to TimeoutExceeededException.
  5. The cancel thread executes resultRef.compareAndSet() and this returns false because the execution thread has already finished.
  6. Hence, the stack and exception are never seen by user code.

Because of resultRef.compareAndSet(), the stack is guaranteed to be null or valid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants