Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout ThreadGroups should not be destroyed #1652

Closed
akarnokd opened this issue Feb 11, 2020 · 27 comments · Fixed by #1687
Closed

Timeout ThreadGroups should not be destroyed #1652

akarnokd opened this issue Feb 11, 2020 · 27 comments · Fixed by #1687
Assignees
Milestone

Comments

@akarnokd
Copy link

Environment:

Eclipse: Version: 2019-09 R (4.13.0), Build id: 20190917-1200
EclEmma 3.1.2.201903112331 org.eclipse.eclemma.feature.feature.group Eclipse EclEmma
JUnit 4.13
Windows 10 x64
Java 1.8u241

Issue:

I've run into an odd error while running coverage on an unit test having some methods use Google Guava's CacheBuilder in Eclipse. The coverage fails with the error code 5013 and error report indicating java.net.SocketException: Connection reset. If I remove these methods, the coverage succeeds.

If I revert back to JUnit 4.12, both the test file and the individual test succeeds in producing the coverage data.

Example method:
https://github.com/ReactiveX/RxJava/blob/5f6aafcbaa97f330ad2007f4ad02c5e33db8914b/src/test/java/io/reactivex/rxjava3/internal/operators/flowable/FlowableGroupByTest.java#L1917

This is where the cache is created:
https://github.com/ReactiveX/RxJava/blob/5f6aafcbaa97f330ad2007f4ad02c5e33db8914b/src/test/java/io/reactivex/rxjava3/internal/operators/flowable/FlowableGroupByTest.java#L2194

Additional details:

  • If I run "gradlew test jacocoTestReport" on the entire project, it succeeds in creating the report HTML apparently.
  • If I run "Coverage As > JUnit test" on the entire project from Eclipse, it fails with the same 5013 error.
  • If I run any other test class/package/method in the project, the coverage is created successfully.

Please advise.

@marcphilipp
Copy link
Member

Could you please provide a stacktrace and/or Gradle build scan?

@akarnokd
Copy link
Author

There is no JVM crash file or failure stacktrace. Only an Eclipse error about losing connection (java.net.SocketException: Connection). I don't have any issues when I run the test(s) in Gradle from command line so not sure how that would help.

@marcphilipp
Copy link
Member

In that case, I suspect an Eclipse or EclEmma issue. Could you please bring it up with them?

@akarnokd
Copy link
Author

Okay. One last question. Could it be that the JUnit 4.13 integration has some trouble with Eclipse?

@marcphilipp
Copy link
Member

Well, there‘s always a chance we broke something. But we didn‘t change any APIs use by integrations as far as I know.

@kcooney
Copy link
Member

kcooney commented Feb 23, 2020

When I searched for the error message on StackOverflow I found https://stackoverflow.com/a/42904311/95725 which indicates that it might be a memory problem. Maybe with the upgrade to 4.13 you cross some kind of threshold memory-wise.

There is also always the possibility that you have multiple versions of JUnit on your classpath.

@akarnokd
Copy link
Author

Yes I tried that before posting to no avail. Also it is quite peculiar to fail only at those cache-methods and not for the other 12000+ tests. I've yet to hear back from the EclEmma people.

@akarnokd
Copy link
Author

akarnokd commented Mar 2, 2020

Hi. I got some feedback from the EclEmma people: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/jacoco/SHw3R6cvZSs/yaFgd7XUAQAJ

Looks like shutdown hooks are not executed the same way in 4.13 as it used to happen in 4.12 and thus the coverage failure.

@akarnokd
Copy link
Author

akarnokd commented Mar 2, 2020

Here is a JUnit demonstrating the issue (similar to how EclEmma would do it) that sometimes prints "hook executed", sometimes doesn't with 4.13:

import java.util.concurrent.TimeUnit;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

import com.google.common.cache.CacheBuilder;

public class ExampleTest {

  static {
    System.err.println("registering hook");
    Thread thread = new Thread(new Runnable() {
      @Override
      public void run() {
        System.err.println("hook executed");
      }
    });
    Runtime.getRuntime().addShutdownHook(thread);
    System.err.println(thread.getThreadGroup());
  }

  @Rule
  public Timeout timeout = new Timeout(5, TimeUnit.MINUTES);

  @Test
  public void test() {
    CacheBuilder.newBuilder() //
      .<Integer, Object>build();
  }
}

@Godin
Copy link
Contributor

Godin commented Mar 3, 2020

Hi guys 👋

@akarnokd thank you for forwarding here my responses from JaCoCo Mailing List.

@marcphilipp I think that the following happens in the case of the above example (#1652 (comment)).

After execution of static block there is one shutdown hook in java.lang.ApplicationShutdownHooks.hooks

{Thread@1552} "Thread[Thread-0,5,main]" -> {Thread@1552} "Thread[Thread-0,5,main]"

And after execution of CacheBuilder.newBuilder() there are two

{LogManager$Cleaner@1551} "Thread[Logging-Cleaner,5,FailOnTimeoutGroup]" -> {LogManager$Cleaner@1551} "Thread[Logging-Cleaner,5,FailOnTimeoutGroup]"
{Thread@1552} "Thread[Thread-0,5,main]" -> {Thread@1552} "Thread[Thread-0,5,main]"

The LogManager$Cleaner@1551 is created by java.util.logging.LogManager (https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/util/logging/LogManager.java#L253) because com.google.common.cache.CacheBuilder uses java.util.logging (https://github.com/google/guava/blob/v28.2/guava/src/com/google/common/cache/CacheBuilder.java#L226).

Note that it also belongs to FailOnTimeoutGroup created by org.junit.rules.Timeout (https://github.com/junit-team/junit4/blob/r4.13/src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java#L123).

So start of this hook after the test throws IllegalThreadStateException (at https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/lang/Thread.java#L709), because the group is already destroyed (https://github.com/junit-team/junit4/blob/r4.13/src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java#L140).

This exception stops the execution of subsequent hooks (https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L101-L103) and silently ignored (https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/lang/Shutdown.java#L115-L129).

The sequence in which hooks start depends on keySet (https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L97) of IdentityHashMap (https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L39) and that's why varies from run to run.

This doesn't happen in case of JUnit 4.12, because destroy of FailOnTimeoutGroup was added in JUnit 4.13 (63fd277). And this change also explains the difference between 4.12 and 4.13 observed during investigations (https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/jacoco/SHw3R6cvZSs/yaFgd7XUAQAJ) for the following example:

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

import java.util.concurrent.TimeUnit;

public class ExampleTest {
  @Rule
  public Timeout globalTimeout = new Timeout(5, TimeUnit.MINUTES);

  @Test
  public final void test() {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      @Override
      public void run() {
        System.err.println("hook executed");
      }
    }));
  }
}

The same can be demonstrated without JUnit as following

import java.util.logging.LogManager;

public class Example {
  public static void main(String[] args) throws Exception {
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
      @Override
      public void run() {
        System.err.println("example hook");
      }
    }));

    ThreadGroup group = new ThreadGroup("example group");
    Thread thread = new Thread(group, new Runnable() {
      @Override
      public void run() {
        LogManager.getLogManager();
      }
    }, "example thread");
    thread.start();
    thread.join();
    group.destroy();
  }
}

So I was also thinking about creation of a ticket in OpenJDK, and while thinking about what can be suggested in such ticket, found already existing https://bugs.openjdk.java.net/browse/JDK-6704979

Nevertheless wondering if there is a chance to do something about this in JUnit.

@marcphilipp
Copy link
Member

@Godin Thanks for the detailed analysis! 👍

@kcooney Do you know/remember what the intention of using a ThreadGroup here was?

@kcooney
Copy link
Member

kcooney commented Mar 3, 2020

@marcphilipp my guess is this breakage was caused by 63fd27710 which was proposed by @elrodro83

The ThreadGroup was added in 1768d08d0

@marcphilipp
Copy link
Member

@akarnokd As a workaround, have you tried initializing JUL early, e.g. in a static initializer of a base test class?

@akarnokd
Copy link
Author

akarnokd commented Mar 4, 2020

No. I rarely run the full project coverage locally and so far, Travis CI managed to produce coverage data. I worked around by commenting out those CacheBuilder tests while I needed to work with the class being tested.

@jglick
Copy link
Contributor

jglick commented Apr 2, 2020

caused by 63fd277

I.e., #1517, for the backlink.

@kcooney
Copy link
Member

kcooney commented Apr 3, 2020

Hey, everyone. @jglick suggested in #1517 that we roll back #1517 or make it opt-in.

@boaks
Copy link

boaks commented Dec 29, 2020

My experience with the ThreadGroup.destroy()of PR #1517 isn't positive as well.

In Eclipse Californium it causes unit tests to fail randomly .
The randomness is caused by a "race" of terminating threads and the call of ThreadGroup.destroy().
And there is also a not to obvious order condition between the Timeout rule and static ThreadGroups. If the group is instantiated outside the Timeout rule, it's preserved, otherwise destroyed.

In my opinion, if such a feature is required, please

  1. add a "opt-in" (or "opt-out").
  2. One simple idea maybe to not "destroy", if any of the sub-groups are "none-daemon-groups" (daemon groups are destroy with the last terminating thread, for these groups the destroy should work. For none-daemon-groups, it may cause trouble, if such thread-groups are destroyed, they maybe considered to be static.)
  3. prevent/reduce the race-condition with terminating threads. One idea maybe to use a "ThreadGroup.interrupt()" before joining the "Thread". Maybe it's required to first join all Threads of the groups and sub-groups.

As temporary workaround I try to ensure, that all "static" thread-groups are initialized before any Timeout rule.
But that's not too easy to see.

@marcphilipp
Copy link
Member

I think an opt-out flag as a system property might suffice given the relatively low number of reports of this problem.

@kcooney What do you suggest we do?

@boaks
Copy link

boaks commented Dec 29, 2020

The Timeout rule has already a Builder. FMPOV it's easier to use, if a parameter is added there.

@kcooney
Copy link
Member

kcooney commented Dec 31, 2020

It's possible there are low reports of this problem because of low usage of 4.13. I don't think we had many reports of the ThreadGroup leakages either, and JUnit as been using ThreadGroup since 4.12 (see 1768d08).

If we hadn't released 4.13, I would suggest making the new behavior opt-in. If we make it opt-in now, then that seems like a visible enough change that we should make the new release 4.14.

If we make it opt in or opt out, I agree that we should use the Builder. JUnit 4.x rarely uses system properties to control behavior, and David Saff has objected to adding them in the past, so I'm not sure if that's a good option.

I wonder if instead of destroying the ThreadGroup we should just mark it as a daemon group. I think that would resolve the original issue (child ThreadGroups inherit the parent's daemon status) and we could do that change in a bug fix release.

@marcphilipp
Copy link
Member

I think making it a daemon thread group is a good idea. Does someone here have time to work on a PR?

kcooney added a commit to kcooney/junit that referenced this issue Jan 1, 2021
@kcooney kcooney self-assigned this Jan 1, 2021
@marcphilipp marcphilipp changed the title EclEmma coverage failure when running a test method with Guava Cache on 4.13 Timeout ThreadGroups should not be destroyed Jan 2, 2021
@marcphilipp marcphilipp added this to the 4.13.2 milestone Jan 2, 2021
kcooney added a commit that referenced this issue Jan 2, 2021
Mark ThreadGroup created by FailOnTimeout as a daemon group.

Previously, FailOnTimeout destroyed the ThreadGroup, which could cause race conditions if the ThreadGroup was referenced by other threads.

Fixes #1652
@j256
Copy link

j256 commented Jan 3, 2021

Can we somehow vote for 4.13.2 to be released sooner because of this? This is somewhat blocking me because of the security issues < 4.13.1. Thanks much for the work here.

FYI my minimum reproducible example can be found on this stackoverflow post and this github repo.

@kcooney
Copy link
Member

kcooney commented Jan 3, 2021

@j256 Thanks for the feedback. We haven't discussed timelines yet. I've started associating things with the 4.13.2 milestone.

@howlger
Copy link

howlger commented Jan 7, 2021

The following workaround works for me:

/** Workaround for the already fixed JUnit 4 issue #1652 (<https://github.com/junit-team/junit4/issues/1652>):
 * Can be used in JUnit 4.13 and 4.13.1 in tests with a timeout to prevent the thread group from being destroyed
 * and thus JaCoCo (used e.g. by EclEmma) loosing its collected data. */
public static void workaround_for_junit4_issue_1652() {
    String version = junit.runner.Version.id();
    if (!"4.13.1".equals(version) && !"4.13".equals(version))
        fail("Workaround for JUnit 4 issue #1652 required for JUnit 4.13 and 4.13.1 only; actual version: "
           + version);
    Thread thread = Thread.currentThread();
    if (!"Time-limited test".equals(thread.getName())) fail("Workaround only required for a test with a timeout.");
    new Thread(thread.getThreadGroup(), new Runnable() {
        @Override
        public void run() {
            try {
                while (!thread.isInterrupted()) Thread.sleep(100);
            } catch (InterruptedException e) {}
        }
    }).start();
}

@Test(timeout = 42)
public void test() {
    workaround_for_junit4_issue_1652();
    // ...
}

@j256
Copy link

j256 commented Feb 11, 2021

I'm still waiting for 4.13.2 with this fix. Any update @kcooney? Again, because of the security problems before 4.13.1, I'm stuck with now with a lot of projects that aren't publishing CI unit test information because of this bug. Thanks.

@marcphilipp
Copy link
Member

@j256 4.13.2 is now released.

@j256
Copy link

j256 commented Feb 13, 2021

Thanks much @marcphilipp !

henryju added a commit to SonarSource/orchestrator that referenced this issue May 22, 2023
henryju added a commit to SonarSource/orchestrator that referenced this issue May 23, 2023
Bombe added a commit to Bombe/jFCPlib that referenced this issue May 27, 2024
The JUnit update fixes a problem with coverage recording, see
jacoco/jacoco#1310 (comment) and
junit-team/junit4#1652.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants