Failures in ML TooManyJobsIT on Debian 8 #66885

Closed
original-brownbear opened this issue Dec 30, 2020 · 27 comments
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

@original-brownbear
Member

This has been failing a bunch of times on 7.10 recently:

https://gradle-enterprise.elastic.co/s/ppzyiud65lopu


org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testSingleNode FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-7] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=6}], because this node is full. Number of opened jobs [6], xpack.ml.max_open_jobs [6]]
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:FAEAE838E7C43CA8]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:129)
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
:x-pack:plugin:ccr:qa:restart:followClusterRestartTest
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=zh -Dtests.timezone=Etc/GMT-1 -Druntime.java=8

org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testMultipleNodes FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t2}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]|Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t3}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]|Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]]
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:B4CFD6BE498345BD]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testMultipleNodes(TooManyJobsIT.java:133)
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testMultipleNodes" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=zh -Dtests.timezone=Etc/GMT-1 -Druntime.java=8

Interestingly enough, instances of this failure have coincided with the following REST test failure twice today.

REPRODUCE WITH: ./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=de-DE -Dtests.timezone=Africa/Ouagadougou -Druntime.java=8 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/ml_info/Test ml info} FAILED
    java.lang.AssertionError: Failure at [ml/ml_info:21]: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
    Expected: an instance of java.lang.String
         but: null
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:920E425A25F0F6EA]:0)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:414)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:391)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.lang.Thread.run(Thread.java:748)

        Caused by:
        java.lang.AssertionError: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
        Expected: an instance of java.lang.String
             but: null
            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
            at org.junit.Assert.assertThat(Assert.java:956)
            at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:63)
            at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
            at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:407)
            ... 37 more
@original-brownbear original-brownbear added >test-failure Triaged test failures from CI :ml Machine learning v7.10.2 labels Dec 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

Interestingly enough, instances of this failure have coincided with the following REST test failure twice today

Both these failures indicate a problem determining the amount of memory on the machine.

All the failures seem to happen on Debian 8. I think this is a special case of #66629.

@droberts195
Contributor

This isn't just 7.10. https://gradle-enterprise.elastic.co/s/blt6fjge3bkus is an example from 7.x and https://gradle-enterprise.elastic.co/s/xtx7u77ntetcy is an example from 7.11.

Debian 8 isn't newly added to the test matrix, so I am not sure what changed 17 days ago when #66629 was opened.

The worry is that this isn't purely a test issue and is affecting end users on Debian 8.

@droberts195
Contributor

Between https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/141/consoleFull (success) and https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/142/consoleFull (failure) the runtime JDK used on the Debian 8 CI workers was upgraded from 8u241 to 8u271 by the look of it, so maybe it’s the combination of Debian 8 with Java 8u271.

JDK 8 is no longer supported with 8.x, so this would also explain why the failures aren't seen on master.

Hopefully this will also limit the number of affected end users, as the bundled JDK is not Java 8.

@ywangd
Member

ywangd commented Jan 5, 2021

Another bunch of similar failures: https://gradle-enterprise.elastic.co/s/a4wwanx2zcwyk

This time it's 7.x with JVM 14 on debian-8. The failed tests are:

org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=ml/ml_info/Test ml info}
org.elasticsearch.smoketest.MlWithSecurityIT test {yaml=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT testCloseUnassignedLazyJobAndDatafeed
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testMultipleNodes
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testSingleNode

And the error messages are as follows:

org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open} FAILED
    java.lang.AssertionError: Failure at [ml/jobs_crud:169]: node didn't match expected value:
                              node: expected String [] but was String [3SSSUCb2Rn2_Q6TZcgSliA]
org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/ml_info/Test ml info} FAILED
    java.lang.AssertionError: Failure at [ml/ml_info:21]: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
    Expected: an instance of java.lang.String
         but: null
org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testSingleNode FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-7] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=6}{ml.max_jvm_size=517996544}], because this node is full. Number of opened jobs [6], xpack.ml.max_open_jobs [6]]
        at __randomizedtesting.SeedInfo.seed([DCE06580A050FE78:3C50F038CC9859C2]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:129)

A new failure message is

org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT > testCloseUnassignedLazyJobAndDatafeed FAILED
    java.lang.AssertionError: expected:<opening> but was:<opened>
        at __randomizedtesting.SeedInfo.seed([DCE06580A050FE78:52F6FE440E82148C]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:144)
        at org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testCloseUnassignedLazyJobAndDatafeed(BasicDistributedJobsIT.java:449)

It is not reproducible with:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testCloseUnassignedLazyJobAndDatafeed" -Dtests.seed=DCE06580A050FE78 -Dtests.security.manager=true -Dtests.locale=fr -Dtests.timezone=Atlantic/St_Helena -Druntime.java=8

@ywangd
Member

ywangd commented Jan 5, 2021

Another one, this time 6.8 on debian-8: https://gradle-enterprise.elastic.co/s/i73shvns4cngk

Three failures:

org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT testJobRelocationIsMemoryAware
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testMultipleNodes
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testSingleNode

and the error messages are similar:

FAILURE 2.44s J11 | TooManyJobsIT.testSingleNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-13] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=12}{ml.enabled=true}], because this node is full. Number of opened jobs [12], xpack.ml.max_open_jobs [12]]
   > 	at __randomizedtesting.SeedInfo.seed([24BE172061E6C688:C40E82980D2E6132]:0)
   > 	at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:165)
   > 	at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:127)
   > 	at java.lang.Thread.run(Thread.java:748)

Reproduce line:

./gradlew ':x-pack:plugin:ml:internalClusterTest' -Dtests.seed=24BE172061E6C688 -Dtests.class=org.elasticsearch.xpack.ml.integration.TooManyJobsIT -Dtests.method="testSingleNode" -Dtests.security.manager=true -Dtests.locale=cs-CZ -Dtests.timezone=America/Santo_Domingo -Dcompiler.java=11 -Druntime.java=8

Not reproducible.

@ywangd
Member

ywangd commented Jan 5, 2021

Another similar one for 7.11 on debian-8: https://gradle-enterprise.elastic.co/s/4yd24wpnfyfom

@droberts195 changed the title from "Failures in ML TooManyJobsIT on 7.10" to "Failures in ML TooManyJobsIT on Debian 8" Jan 5, 2021
@droberts195
Contributor

@ywangd just wanted to clarify when you say “not reproducible” are you trying on Debian 8? (I am not saying you should try on Debian 8, just that since every failure has happened on Debian 8 it’s probably not worth bothering on other distributions.)

@ywangd
Member

ywangd commented Jan 5, 2021

@ywangd just wanted to clarify when you say “not reproducible” are you trying on Debian 8? (I am not saying you should try on Debian 8, just that since every failure has happened on Debian 8 it’s probably not worth bothering on other distributions.)

No, I didn't try on Debian 8; it was on macOS. I should have made that explicit. The original title had 7.10 in it and other 7.x versions were also mentioned, but I was seeing a failure from 6.8, so I thought I'd give it a try locally. As you said, I didn't bother with the 7.x and 7.11 failures.

@breskeby breskeby added v7.10.3 and removed v7.10.2 labels Jan 6, 2021
@probakowski
Contributor

Another one in 7.x: https://gradle-enterprise.elastic.co/s/d3hwrnww2wbie

@droberts195
Contributor

#67089 (comment) contains the likely explanation. I'm not sure what changed in mid-December though that made this start failing.

@henningandersen
Contributor

henningandersen commented Jan 11, 2021

The JDK 8 version on Debian 8 was upgraded between December 15th and December 17th, at which time OsProbeTests started failing because memory is reported as 0.

December 15th build using jdk 8u241:
https://gradle-enterprise.elastic.co/s/p7ck5qdodvx7s/console-log?task=:x-pack:plugin:searchable-snapshots:internalClusterTest#L10996

December 17th build, first OsProbeTests failure, using 8u271:
https://gradle-enterprise.elastic.co/s/iffwufwtdxcjk/console-log?task=:rest-api-spec:yamlRestTest#L7144

This java bug was marked fixed for 8u272:
https://bugs.openjdk.java.net/browse/JDK-8251515

Checking the code of Oracle Java 8u271, it does include at least the Java parts of that change, in which a missing memory subsystem is interpreted as 0 memory.

Given that this is fixed in Java 15, and that at least in the past it was normal not to have a memory subsystem, it looks like a Java bug.
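
As a quick way to see whether a given OS/JDK combination is affected, a small diagnostic along the following lines can be run on the box. This is only a sketch, under the assumption that the zero figure surfaces through the standard com.sun.management.OperatingSystemMXBean (which is what the ml.machine_memory=0 values above suggest); the class name is made up for illustration.

import java.lang.management.ManagementFactory;

// Diagnostic sketch: on an affected JDK (e.g. 8u271 on Debian 8 without the
// memory cgroup controller mounted) this prints 0 instead of the real RAM size.
public class MemoryProbeCheck {
    public static void main(String[] args) {
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("JVM-reported physical memory: " + os.getTotalPhysicalMemorySize() + " bytes");
    }
}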

@droberts195
Contributor

in which a missing memory subsystem is interpreted as 0 memory.

I had a look at some CI servers for various supported platforms to see what this looks like in the file system.

Debian 8:

droberts195@elasticsearch-ci-immutable-debian-8-1610362485218858680:~$ grep -i cgroup /boot/config-3.16.0-11-amd64 
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-8-1610362485218858680:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 2 root root  0 Jan 11 10:58 blkio
lrwxrwxrwx 1 root root 11 Jan 11 10:56 cpu -> cpu,cpuacct
dr-xr-xr-x 2 root root  0 Jan 11 10:58 cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 11 10:56 cpuacct -> cpu,cpuacct
dr-xr-xr-x 2 root root  0 Jan 11 10:58 cpuset
dr-xr-xr-x 2 root root  0 Jan 11 10:58 devices
dr-xr-xr-x 2 root root  0 Jan 11 10:58 freezer
lrwxrwxrwx 1 root root 16 Jan 11 10:56 net_cls -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Jan 11 10:58 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 11 10:56 net_prio -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Jan 11 10:58 perf_event
dr-xr-xr-x 4 root root  0 Jan 11 10:58 systemd

Debian 9:

droberts195@elasticsearch-ci-immutable-debian-9-1610467520318032810:~$ grep -i cgroup /boot/config-4.9.0-14-amd64 
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-9-1610467520318032810:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 6 root root  0 Jan 12 16:09 blkio
lrwxrwxrwx 1 root root 11 Jan 12 16:07 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 12 16:07 cpuacct -> cpu,cpuacct
dr-xr-xr-x 6 root root  0 Jan 12 16:09 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Jan 12 16:09 cpuset
dr-xr-xr-x 6 root root  0 Jan 12 16:09 devices
dr-xr-xr-x 3 root root  0 Jan 12 16:09 freezer
dr-xr-xr-x 6 root root  0 Jan 12 16:09 memory
lrwxrwxrwx 1 root root 16 Jan 12 16:07 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:09 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 12 16:07 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:09 perf_event
dr-xr-xr-x 6 root root  0 Jan 12 16:09 pids
dr-xr-xr-x 6 root root  0 Jan 12 16:09 systemd

Debian 10:

droberts195@elasticsearch-ci-immutable-debian-10-1610467306619099145:~$ grep -i cgroup /boot/config-4.19.0-13-cloud-amd64 
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-10-1610467306619099145:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 5 root root  0 Jan 12 16:04 blkio
lrwxrwxrwx 1 root root 11 Jan 12 16:02 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 12 16:02 cpuacct -> cpu,cpuacct
dr-xr-xr-x 5 root root  0 Jan 12 16:04 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Jan 12 16:04 cpuset
dr-xr-xr-x 5 root root  0 Jan 12 16:04 devices
dr-xr-xr-x 3 root root  0 Jan 12 16:04 freezer
dr-xr-xr-x 5 root root  0 Jan 12 16:04 memory
lrwxrwxrwx 1 root root 16 Jan 12 16:02 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:04 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 12 16:02 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:04 perf_event
dr-xr-xr-x 5 root root  0 Jan 12 16:04 pids
dr-xr-xr-x 2 root root  0 Jan 12 16:04 rdma
dr-xr-xr-x 6 root root  0 Jan 12 16:04 systemd
dr-xr-xr-x 5 root root  0 Jan 12 16:04 unified

CentOS 7:

[droberts195@elasticsearch-ci-immutable-centos-7-pkg-1610468233716122392 ~]$ grep -i cgroup /boot/config-3.10.0-1160.11.1.el7.x86_64 
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_NETPRIO_CGROUP=y

[droberts195@elasticsearch-ci-immutable-centos-7-pkg-1610468233716122392 ~]$ ls -l /sys/fs/cgroup
total 0
drwxr-xr-x. 4 root root  0 Jan 12 16:17 blkio
lrwxrwxrwx. 1 root root 11 Jan 12 16:17 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Jan 12 16:17 cpuacct -> cpu,cpuacct
drwxr-xr-x. 4 root root  0 Jan 12 16:17 cpu,cpuacct
drwxr-xr-x. 2 root root  0 Jan 12 16:17 cpuset
drwxr-xr-x. 4 root root  0 Jan 12 16:17 devices
drwxr-xr-x. 2 root root  0 Jan 12 16:17 freezer
drwxr-xr-x. 2 root root  0 Jan 12 16:17 hugetlb
drwxr-xr-x. 4 root root  0 Jan 12 16:17 memory
lrwxrwxrwx. 1 root root 16 Jan 12 16:17 net_cls -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jan 12 16:17 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Jan 12 16:17 net_prio -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jan 12 16:17 perf_event
drwxr-xr-x. 4 root root  0 Jan 12 16:17 pids

CentOS 6:

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ grep -i cgroup /boot/config-2.6.32-754.35.1.el6.x86_64 
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_PERF=y
CONFIG_NET_CLS_CGROUP=y
CONFIG_NETPRIO_CGROUP=y

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ ls -l /sys/fs/cgroup
ls: cannot access /sys/fs/cgroup: No such file or directory

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ ls -l /cgroup
total 0

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 cgroup]$ lscgroup
cgroups can't be listed: Cgroup is not mounted

There is a comment in https://stackoverflow.com/questions/21337522/trying-to-use-cgroups-in-debian-wheezy-and-no-daemons that "Debian disables the memory subsystem by default in the kernel, so you need to activate it if you need it". That dates back to Wheezy (Debian 7), so it appears that statement was still true in Jessie (8), but by Stretch (9) it was no longer the case.

CentOS 6 appears not to mount cgroups at all by default, and this doesn't appear to confuse Java 8u271, so it must be that a missing memory subsystem is only a problem when other cgroup subsystems are present; if everything is missing then that's OK.

Ubuntu is based on Debian, so I suspect the versions based on Debian 7/8 will suffer the same problem. Thankfully this doesn't impact our support matrix enormously, as the last such Ubuntu version was 15.10. Ubuntu 16.04 is based on Debian 9, and that's the oldest we support in ES 7.x. Based on this, the only currently supported combination affected apart from Debian 8 would be ES 6.8 on Ubuntu 14.04.

To summarize, this problem affects:

  1. ES 7.x and 6.8 on Debian 8
  2. ES 6.8 on Ubuntu 14.04

The workarounds, then, would be either to enable the memory subsystem (instructions in https://dawnbringer.net/blog/1033/cgroup%20support) or to upgrade Java to a fixed version.

Since failure to obtain the amount of memory on a node is really bad for ML, we will document this as a known issue for ML.

@droberts195
Contributor

The 4 affected tests will be selectively muted on Debian 8 when #67422 is merged and backported.
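
For reference, here is a minimal sketch of the kind of condition such a utility method could check. The names and the exact checks are illustrative assumptions only, not the code merged in #67422:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Illustrative sketch only -- not the actual utility method from #67422.
public class MemoryReportingCheck {

    // True when this platform is likely to report machine memory as 0:
    // Linux with cgroups mounted but no memory controller (as on Debian 8),
    // running on a Java 8 runtime (the affected updates back-ported the
    // container-awareness change).
    public static boolean machineMemoryLikelyReportedAsZero() {
        boolean onLinux = System.getProperty("os.name", "").toLowerCase(Locale.ROOT).contains("linux");
        boolean memoryCgroupMissing = onLinux
            && Files.isDirectory(Paths.get("/sys/fs/cgroup"))
            && Files.notExists(Paths.get("/sys/fs/cgroup/memory"));
        boolean java8Runtime = "1.8".equals(System.getProperty("java.specification.version"));
        return memoryCgroupMissing && java8Runtime;
    }
}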

droberts195 added a commit that referenced this issue Jan 13, 2021
The selective muting implemented for autoscaling in #67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates #66885
droberts195 added a commit that referenced this issue Jan 13, 2021
The selective muting implemented for autoscaling in #67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates #66885
Backport of #67422
@astefan
Contributor

astefan commented Jan 15, 2021

@droberts195 This morning I spotted another failure for tests already mentioned in this issue: 7.11 with Debian 8. Judging by the investigation done and the code merged two days ago, there shouldn't be any more failures. For the failing tests, should we proactively mute them using the willSufferDebian8MemoryProblem() assumption?

Build scan: https://gradle-enterprise.elastic.co/s/d4utquzs42lsm
A snippet from the error:

        java.lang.AssertionError: node didn't match expected value:
                                  node: expected String [] but was String [bqImCgWoQqST2HEEZE2T7g]
            at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:93)
            at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
            at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:409)

@droberts195
Contributor

For the failing tests, should we proactively mute them using the willSufferDebian8MemoryProblem() assumption?

The two failing tests are YAML tests. I don't know of a way to selectively mute based on a complex condition in the skip section of a YAML test. It's probably best to mute them altogether to stop the noise. I believe a fix is being worked on for the underlying problem in #66629, so hopefully they won't have to be muted forever.
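
Since the YAML skip section cannot express such a condition, one conceivable alternative, sketched here purely for illustration (it is not the mechanism of #67681), would be a JUnit assumption in the Java suite class that drives the YAML tests, reusing a check like the one sketched above:

import org.junit.Assume;
import org.junit.Before;

// Sketch only; the class name is hypothetical and MemoryReportingCheck refers
// to the helper sketched earlier in this thread.
public class SkipOnZeroMemoryReportingExample {

    @Before
    public void skipWhenMachineMemoryIsMisreported() {
        // Skips every test in the suite on platforms where the JVM reports memory as 0.
        Assume.assumeFalse("machine memory is reported as 0 on this platform",
            MemoryReportingCheck.machineMemoryLikelyReportedAsZero());
    }
}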

@martijnvg
Member

I noticed the following 3 YAML test failures today on Debian 8:

XPackRestIT » test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
MlWithSecurityIT » test {yaml=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
XPackRestIT » test {p0=ml/ml_info/Test ml info}

These tests have assertion failures that have already been mentioned in this issue (the tests in TooManyJobsIT no longer fail, as they have been muted).

I think these failures are related to this issue. Would someone be able to confirm this?
I think #67681 would be great in order to selectively mute these tests for Debian 8.

@droberts195
Contributor

I think these failures are related to this issue. Would someone be able to confirm this?

Yes, all the test failures on Debian 8 that relate to memory being reported as zero are basically the same thing.

I think #67681 would be great in order to selectively mute these tests for Debian 8.

👍

@albertzaharovits
Contributor

Tests are still failing, e.g. https://gradle-enterprise.elastic.co/s/nmzh3f4juwn4e.
@hendrikmuhs I think it's worth pushing #67681 to completion.

@cbuescher
Member

Some tests fitting this issue still fail today on 7.10, e.g.
https://gradle-enterprise.elastic.co/s/m76jhemdg5bqq

I'm not sure whether this should still be happening now that elastic/infra#26251 has been merged; maybe @albertzaharovits can see whether this is a different configuration/issue?

@cbuescher
Member

Also 7.x just now: https://gradle-enterprise.elastic.co/s/nndh4zezzcr7i

@cbuescher
Member

Looks like at least https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.11+multijob-unix-compatibility/os=debian-8&&immutable/86/ is using oracle-8u281 now but is apparently still experiencing issues similar to those reported here.

@cbuescher
Member

I have now muted the failing YAML tests in 7.x, 7.11 and 7.10. Please remove the general skip on all platforms when #67681 makes it possible to be more selective here.

@cbuescher
Member

FYI: d975922, 4e19256 and 7bcb3fc

@hendrikmuhs
Contributor

Tests are still failing, e.g. gradle-enterprise.elastic.co/s/nmzh3f4juwn4e.
@hendrikmuhs I think it's worth pushing #67681 to completion.

#67681 is ready for review

@droberts195
Contributor

droberts195 commented Feb 8, 2021

This should be fixed by #68542.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 22, 2021
Unmute the YAML tests that were muted due to the problem
of elastic#66885.

The underlying problem was fixed by elastic#68542.
droberts195 added a commit that referenced this issue Feb 22, 2021
Unmute the YAML tests that were muted due to the problem
of #66885.

The underlying problem was fixed by #68542.
alyokaz pushed a commit to alyokaz/elasticsearch that referenced this issue Mar 10, 2021
The selective muting implemented for autoscaling in elastic#67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates elastic#66885