Failures in ML TooManyJobsIT on Debian 8 #66885

Closed
original-brownbear opened this issue Dec 30, 2020 · 27 comments
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

@original-brownbear
Member

This has been failing a bunch of times on 7.10 recently:

https://gradle-enterprise.elastic.co/s/ppzyiud65lopu


org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testSingleNode FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-7] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=6}], because this node is full. Number of opened jobs [6], xpack.ml.max_open_jobs [6]]
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:FAEAE838E7C43CA8]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:129)
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
:x-pack:plugin:ccr:qa:restart:followClusterRestartTest
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=zh -Dtests.timezone=Etc/GMT-1 -Druntime.java=8

org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testMultipleNodes FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t2}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]|Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t3}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]|Not opening job [max-number-of-jobs-limit-job-10] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=3}], because this node is full. Number of opened jobs [3], xpack.ml.max_open_jobs [3]]
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:B4CFD6BE498345BD]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testMultipleNodes(TooManyJobsIT.java:133)
REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testMultipleNodes" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=zh -Dtests.timezone=Etc/GMT-1 -Druntime.java=8

Interestingly enough, instances of this failure have coincided with the following REST test failure twice today.

REPRODUCE WITH: ./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}" -Dtests.seed=1A5A7D808B0C9B12 -Dtests.security.manager=true -Dtests.locale=de-DE -Dtests.timezone=Africa/Ouagadougou -Druntime.java=8 -Dtests.rest.blacklist=getting_started/10_monitor_cluster_health/*

org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/ml_info/Test ml info} FAILED
    java.lang.AssertionError: Failure at [ml/ml_info:21]: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
    Expected: an instance of java.lang.String
         but: null
        at __randomizedtesting.SeedInfo.seed([1A5A7D808B0C9B12:920E425A25F0F6EA]:0)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:414)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:391)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.lang.Thread.run(Thread.java:748)

        Caused by:
        java.lang.AssertionError: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
        Expected: an instance of java.lang.String
             but: null
            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
            at org.junit.Assert.assertThat(Assert.java:956)
            at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:63)
            at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
            at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:407)
            ... 37 more
@original-brownbear original-brownbear added >test-failure Triaged test failures from CI :ml Machine learning v7.10.2 labels Dec 30, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

Interestingly enough, instances of this failure have coincided with the following REST test failure twice today

Both these failures indicate a problem determining the amount of memory on the machine.

All the failures seem to happen on Debian 8. I think this is a special case of #66629.

@droberts195
Contributor

This isn't just 7.10. https://gradle-enterprise.elastic.co/s/blt6fjge3bkus is an example from 7.x and https://gradle-enterprise.elastic.co/s/xtx7u77ntetcy is an example from 7.11.

Debian 8 isn't newly added to the test matrix, so I am not sure what changed 17 days ago when #66629 was opened.

The worry is that this isn't purely a test issue and is affecting end users on Debian 8.

@droberts195
Contributor

Between https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/141/consoleFull (success) and https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/142/consoleFull (failure) the runtime JDK used on the Debian 8 CI workers was upgraded from 8u241 to 8u271 by the look of it, so maybe it’s the combination of Debian 8 with Java 8u271.

JDK 8 is no longer supported with 8.x, so this would also explain why the failures aren't seen on master.

Hopefully this will also limit the number of affected end users, as the bundled JDK is not Java 8.

@ywangd
Member

ywangd commented Jan 5, 2021

Another bunch of similar failures: https://gradle-enterprise.elastic.co/s/a4wwanx2zcwyk

This time it's 7.x with JVM 14 on debian-8. The failed tests are:

org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
org.elasticsearch.xpack.test.rest.XPackRestIT test {p0=ml/ml_info/Test ml info}
org.elasticsearch.smoketest.MlWithSecurityIT test {yaml=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT testCloseUnassignedLazyJobAndDatafeed
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testMultipleNodes
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testSingleNode

And the error messages are as follows:

org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open} FAILED
    java.lang.AssertionError: Failure at [ml/jobs_crud:169]: node didn't match expected value:
                              node: expected String [] but was String [3SSSUCb2Rn2_Q6TZcgSliA]
org.elasticsearch.xpack.test.rest.XPackRestIT > test {p0=ml/ml_info/Test ml info} FAILED
    java.lang.AssertionError: Failure at [ml/ml_info:21]: field [limits.effective_max_model_memory_limit] was expected to be of type String but is an instanceof [null]
    Expected: an instance of java.lang.String
         but: null
org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testSingleNode FAILED
    java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-7] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=6}{ml.max_jvm_size=517996544}], because this node is full. Number of opened jobs [6], xpack.ml.max_open_jobs [6]]
        at __randomizedtesting.SeedInfo.seed([DCE06580A050FE78:3C50F038CC9859C2]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:172)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:129)

A new failure message is

org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT > testCloseUnassignedLazyJobAndDatafeed FAILED
    java.lang.AssertionError: expected:<opening> but was:<opened>
        at __randomizedtesting.SeedInfo.seed([DCE06580A050FE78:52F6FE440E82148C]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:144)
        at org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testCloseUnassignedLazyJobAndDatafeed(BasicDistributedJobsIT.java:449)

It is not reproducible with:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testCloseUnassignedLazyJobAndDatafeed" -Dtests.seed=DCE06580A050FE78 -Dtests.security.manager=true -Dtests.locale=fr -Dtests.timezone=Atlantic/St_Helena -Druntime.java=8

@ywangd
Member

ywangd commented Jan 5, 2021

Another one, this time 6.8 on debian-8: https://gradle-enterprise.elastic.co/s/i73shvns4cngk

Three failures:

org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT testJobRelocationIsMemoryAware
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testMultipleNodes
org.elasticsearch.xpack.ml.integration.TooManyJobsIT testSingleNode

and the error messages are similar:

FAILURE 2.44s J11 | TooManyJobsIT.testSingleNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: Could not open job because no suitable nodes were found, allocation explanation [Not opening job [max-number-of-jobs-limit-job-13] on node [{node_t1}{ml.machine_memory=0}{ml.max_open_jobs=12}{ml.enabled=true}], because this node is full. Number of opened jobs [12], xpack.ml.max_open_jobs [12]]
   > 	at __randomizedtesting.SeedInfo.seed([24BE172061E6C688:C40E82980D2E6132]:0)
   > 	at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.verifyMaxNumberOfJobsLimit(TooManyJobsIT.java:165)
   > 	at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testSingleNode(TooManyJobsIT.java:127)
   > 	at java.lang.Thread.run(Thread.java:748)

Reproduce line:

./gradlew ':x-pack:plugin:ml:internalClusterTest' -Dtests.seed=24BE172061E6C688 -Dtests.class=org.elasticsearch.xpack.ml.integration.TooManyJobsIT -Dtests.method="testSingleNode" -Dtests.security.manager=true -Dtests.locale=cs-CZ -Dtests.timezone=America/Santo_Domingo -Dcompiler.java=11 -Druntime.java=8

Not reproducible.

@ywangd
Member

ywangd commented Jan 5, 2021

Another similar one for 7.11 on debian-8: https://gradle-enterprise.elastic.co/s/4yd24wpnfyfom

@droberts195 changed the title from "Failures in ML TooManyJobsIT on 7.10" to "Failures in ML TooManyJobsIT on Debian 8" Jan 5, 2021
@droberts195
Contributor

@ywangd just wanted to clarify when you say “not reproducible” are you trying on Debian 8? (I am not saying you should try on Debian 8, just that since every failure has happened on Debian 8 it’s probably not worth bothering on other distributions.)

@ywangd
Member

ywangd commented Jan 5, 2021

@ywangd just wanted to clarify when you say “not reproducible” are you trying on Debian 8? (I am not saying you should try on Debian 8, just that since every failure has happened on Debian 8 it’s probably not worth bothering on other distributions.)

No, I didn't try on Debian 8; it was on macOS. I should have made that explicit. The original title had 7.10 in it and other 7.x versions were also mentioned, but I was seeing a failure from 6.8, so I thought I'd give it a try locally. As you said, I didn't bother with the 7.x and 7.11 failures.

@breskeby breskeby added v7.10.3 and removed v7.10.2 labels Jan 6, 2021
@probakowski
Contributor

Another one in 7.x: https://gradle-enterprise.elastic.co/s/d3hwrnww2wbie

@droberts195
Contributor

#67089 (comment) contains the likely explanation. I'm not sure what changed in mid-December though that made this start failing.

@henningandersen
Contributor

henningandersen commented Jan 11, 2021

The JDK 8 version on Debian 8 was upgraded between December 15th and December 17th, at which time OsProbeTests started failing because memory is reported as 0.

December 15th build using jdk 8u241:
https://gradle-enterprise.elastic.co/s/p7ck5qdodvx7s/console-log?task=:x-pack:plugin:searchable-snapshots:internalClusterTest#L10996

December 17th build, first OsProbeTests failure, using 8u271:
https://gradle-enterprise.elastic.co/s/iffwufwtdxcjk/console-log?task=:rest-api-spec:yamlRestTest#L7144

This java bug was marked fixed for 8u272:
https://bugs.openjdk.java.net/browse/JDK-8251515

Checking the code of Oracle Java 8u271, it does include at least the Java parts of that change, in which a missing memory subsystem is interpreted as 0 memory.

Given that this is fixed in Java 15, and that at least in the past it was normal not to have a memory subsystem, it looks like a Java bug.
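
As a quick way to see whether a given OS/JDK combination is affected, a small diagnostic along the following lines can be run on the box. This is only a sketch, under the assumption that the zero figure surfaces through the standard com.sun.management.OperatingSystemMXBean (which is what the ml.machine_memory=0 values above suggest); the class name is made up for illustration.

import java.lang.management.ManagementFactory;

// Diagnostic sketch: on an affected JDK (e.g. 8u271 on Debian 8 without the
// memory cgroup controller mounted) this prints 0 instead of the real RAM size.
public class MemoryProbeCheck {
    public static void main(String[] args) {
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("JVM-reported physical memory: " + os.getTotalPhysicalMemorySize() + " bytes");
    }
}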

@droberts195
Contributor

in which a missing memory subsystem is interpreted as 0 memory.

I had a look at some CI servers for various supported platforms to see what this looks like in the file system.

Debian 8:

droberts195@elasticsearch-ci-immutable-debian-8-1610362485218858680:~$ grep -i cgroup /boot/config-3.16.0-11-amd64 
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-8-1610362485218858680:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 2 root root  0 Jan 11 10:58 blkio
lrwxrwxrwx 1 root root 11 Jan 11 10:56 cpu -> cpu,cpuacct
dr-xr-xr-x 2 root root  0 Jan 11 10:58 cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 11 10:56 cpuacct -> cpu,cpuacct
dr-xr-xr-x 2 root root  0 Jan 11 10:58 cpuset
dr-xr-xr-x 2 root root  0 Jan 11 10:58 devices
dr-xr-xr-x 2 root root  0 Jan 11 10:58 freezer
lrwxrwxrwx 1 root root 16 Jan 11 10:56 net_cls -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Jan 11 10:58 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 11 10:56 net_prio -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Jan 11 10:58 perf_event
dr-xr-xr-x 4 root root  0 Jan 11 10:58 systemd

Debian 9:

droberts195@elasticsearch-ci-immutable-debian-9-1610467520318032810:~$ grep -i cgroup /boot/config-4.9.0-14-amd64 
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-9-1610467520318032810:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 6 root root  0 Jan 12 16:09 blkio
lrwxrwxrwx 1 root root 11 Jan 12 16:07 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 12 16:07 cpuacct -> cpu,cpuacct
dr-xr-xr-x 6 root root  0 Jan 12 16:09 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Jan 12 16:09 cpuset
dr-xr-xr-x 6 root root  0 Jan 12 16:09 devices
dr-xr-xr-x 3 root root  0 Jan 12 16:09 freezer
dr-xr-xr-x 6 root root  0 Jan 12 16:09 memory
lrwxrwxrwx 1 root root 16 Jan 12 16:07 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:09 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 12 16:07 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:09 perf_event
dr-xr-xr-x 6 root root  0 Jan 12 16:09 pids
dr-xr-xr-x 6 root root  0 Jan 12 16:09 systemd

Debian 10:

droberts195@elasticsearch-ci-immutable-debian-10-1610467306619099145:~$ grep -i cgroup /boot/config-4.19.0-13-cloud-amd64 
CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=m
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y

droberts195@elasticsearch-ci-immutable-debian-10-1610467306619099145:~$ ls -l /sys/fs/cgroup
total 0
dr-xr-xr-x 5 root root  0 Jan 12 16:04 blkio
lrwxrwxrwx 1 root root 11 Jan 12 16:02 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Jan 12 16:02 cpuacct -> cpu,cpuacct
dr-xr-xr-x 5 root root  0 Jan 12 16:04 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Jan 12 16:04 cpuset
dr-xr-xr-x 5 root root  0 Jan 12 16:04 devices
dr-xr-xr-x 3 root root  0 Jan 12 16:04 freezer
dr-xr-xr-x 5 root root  0 Jan 12 16:04 memory
lrwxrwxrwx 1 root root 16 Jan 12 16:02 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:04 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Jan 12 16:02 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Jan 12 16:04 perf_event
dr-xr-xr-x 5 root root  0 Jan 12 16:04 pids
dr-xr-xr-x 2 root root  0 Jan 12 16:04 rdma
dr-xr-xr-x 6 root root  0 Jan 12 16:04 systemd
dr-xr-xr-x 5 root root  0 Jan 12 16:04 unified

CentOS 7:

[droberts195@elasticsearch-ci-immutable-centos-7-pkg-1610468233716122392 ~]$ grep -i cgroup /boot/config-3.10.0-1160.11.1.el7.x86_64 
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_NETPRIO_CGROUP=y

[droberts195@elasticsearch-ci-immutable-centos-7-pkg-1610468233716122392 ~]$ ls -l /sys/fs/cgroup
total 0
drwxr-xr-x. 4 root root  0 Jan 12 16:17 blkio
lrwxrwxrwx. 1 root root 11 Jan 12 16:17 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Jan 12 16:17 cpuacct -> cpu,cpuacct
drwxr-xr-x. 4 root root  0 Jan 12 16:17 cpu,cpuacct
drwxr-xr-x. 2 root root  0 Jan 12 16:17 cpuset
drwxr-xr-x. 4 root root  0 Jan 12 16:17 devices
drwxr-xr-x. 2 root root  0 Jan 12 16:17 freezer
drwxr-xr-x. 2 root root  0 Jan 12 16:17 hugetlb
drwxr-xr-x. 4 root root  0 Jan 12 16:17 memory
lrwxrwxrwx. 1 root root 16 Jan 12 16:17 net_cls -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jan 12 16:17 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Jan 12 16:17 net_prio -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jan 12 16:17 perf_event
drwxr-xr-x. 4 root root  0 Jan 12 16:17 pids

CentOS 6:

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ grep -i cgroup /boot/config-2.6.32-754.35.1.el6.x86_64 
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_PERF=y
CONFIG_NET_CLS_CGROUP=y
CONFIG_NETPRIO_CGROUP=y

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ ls -l /sys/fs/cgroup
ls: cannot access /sys/fs/cgroup: No such file or directory

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 ~]$ ls -l /cgroup
total 0

[droberts195@elasticsearch-ci-immutable-centos-6-1610467520318484491 cgroup]$ lscgroup
cgroups can't be listed: Cgroup is not mounted

There is a comment in https://stackoverflow.com/questions/21337522/trying-to-use-cgroups-in-debian-wheezy-and-no-daemons that "Debian disables the memory subsystem by default in the kernel, so you need to activate it if you need it". That dates back to Wheezy (Debian 7), so it appears that statement was still true in Jessie (8), but by Stretch (9) it was no longer the case.

CentOS 6 appears not to mount cgroups at all by default, and this doesn't appear to confuse Java 8u271, so it must be that a missing memory subsystem is only a problem when other cgroup subsystems are present; if everything is missing then that's OK.

Ubuntu is based on Debian, so I suspect the versions based on Debian 7/8 will suffer the same problem. Thankfully this doesn't impact our support matrix enormously, as the last such Ubuntu version was 15.10. Ubuntu 16.04 is based on Debian 9, and that's the oldest we support in ES 7.x. Based on this, the only currently supported combination affected apart from Debian 8 would be ES 6.8 on Ubuntu 14.04.

To summarize, this problem affects:

  1. ES 7.x and 6.8 on Debian 8
  2. ES 6.8 on Ubuntu 14.04

The workarounds, then, would be either to enable the memory subsystem (instructions in https://dawnbringer.net/blog/1033/cgroup%20support) or to upgrade Java to a fixed version.

Since failure to obtain the amount of memory on a node is really bad for ML, we will document this as a known issue for ML.

@droberts195
Contributor

The 4 affected tests will be selectively muted on Debian 8 when #67422 is merged and backported.
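
For reference, here is a minimal sketch of the kind of condition such a utility method could check. The names and the exact checks are illustrative assumptions only, not the code merged in #67422:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Illustrative sketch only -- not the actual utility method from #67422.
public class MemoryReportingCheck {

    // True when this platform is likely to report machine memory as 0:
    // Linux with cgroups mounted but no memory controller (as on Debian 8),
    // running on a Java 8 runtime (the affected updates back-ported the
    // container-awareness change).
    public static boolean machineMemoryLikelyReportedAsZero() {
        boolean onLinux = System.getProperty("os.name", "").toLowerCase(Locale.ROOT).contains("linux");
        boolean memoryCgroupMissing = onLinux
            && Files.isDirectory(Paths.get("/sys/fs/cgroup"))
            && Files.notExists(Paths.get("/sys/fs/cgroup/memory"));
        boolean java8Runtime = "1.8".equals(System.getProperty("java.specification.version"));
        return memoryCgroupMissing && java8Runtime;
    }
}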

droberts195 added a commit that referenced this issue Jan 13, 2021
The selective muting implemented for autoscaling in #67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates #66885
droberts195 added a commit that referenced this issue Jan 13, 2021
The selective muting implemented for autoscaling in #67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates #66885
Backport of #67422
@astefan
Contributor

astefan commented Jan 15, 2021

@droberts195 This morning I spotted another failure for tests already mentioned in this issue: 7.11 with Debian 8. Judging by the investigation done and the code merged two days ago, there shouldn't be any more failures. For the failing tests, should we proactively mute them using the willSufferDebian8MemoryProblem() assumption?

Build scan: https://gradle-enterprise.elastic.co/s/d4utquzs42lsm
A snippet from the error:

        java.lang.AssertionError: node didn't match expected value:
                                  node: expected String [] but was String [bqImCgWoQqST2HEEZE2T7g]
            at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:93)
            at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
            at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:409)

@droberts195
Contributor

For the failing tests, should we proactively mute them using the willSufferDebian8MemoryProblem() assumption?

The two failing tests are YAML tests. I don't know of a way to selectively mute based on a complex condition in the skip section of a YAML test. It's probably best to mute them altogether to stop the noise. I believe a fix is being worked on for the underlying problem in #66629, so hopefully they won't have to be muted forever.
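
Since the YAML skip section cannot express such a condition, one conceivable alternative, sketched here purely for illustration (it is not the mechanism of #67681), would be a JUnit assumption in the Java suite class that drives the YAML tests, reusing a check like the one sketched above:

import org.junit.Assume;
import org.junit.Before;

// Sketch only; the class name is hypothetical and MemoryReportingCheck refers
// to the helper sketched earlier in this thread.
public class SkipOnZeroMemoryReportingExample {

    @Before
    public void skipWhenMachineMemoryIsMisreported() {
        // Skips every test in the suite on platforms where the JVM reports memory as 0.
        Assume.assumeFalse("machine memory is reported as 0 on this platform",
            MemoryReportingCheck.machineMemoryLikelyReportedAsZero());
    }
}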

@martijnvg
Member

I noticed the following 3 YAML test failures today on Debian 8:

XPackRestIT » test {p0=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
MlWithSecurityIT » test {yaml=ml/jobs_crud/Test put job with model_memory_limit as string and lazy open}
XPackRestIT » test {p0=ml/ml_info/Test ml info}

These tests have assertion failures that have already been mentioned in this issue (the tests in TooManyJobsIT no longer fail, as they have been muted).

I think these failures are related to this issue. Would someone be able to confirm this?
I think #67681 would be great in order to selectively mute these tests for Debian 8.

@droberts195
Contributor

I think these failures are related to this issue. Would someone be able to confirm this?

Yes, all the test failures on Debian 8 that relate to memory being reported as zero are basically the same thing.

I think #67681 would be great in order to selectively mute these tests for Debian 8.

👍

@albertzaharovits
Contributor

Tests are still failing, e.g. https://gradle-enterprise.elastic.co/s/nmzh3f4juwn4e.
@hendrikmuhs I think it's worth pushing #67681 to completion.

@cbuescher
Member

Some tests fitting this issue still fail today on 7.10, e.g.
https://gradle-enterprise.elastic.co/s/m76jhemdg5bqq

I'm not sure whether this should still be happening now that elastic/infra#26251 has been merged; maybe @albertzaharovits can see whether this is a different configuration/issue?

@cbuescher
Member

Also 7.x just now: https://gradle-enterprise.elastic.co/s/nndh4zezzcr7i

@cbuescher
Member

Looks like at least https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.11+multijob-unix-compatibility/os=debian-8&&immutable/86/ is using oracle-8u281 now but is apparently still experiencing issues similar to those reported here.

@cbuescher
Member

I have now muted the failing YAML tests in 7.x, 7.11 and 7.10. Please remove the general skip on all platforms when #67681 makes it possible to be more selective here.

@cbuescher
Member

FYI: d975922, 4e19256 and 7bcb3fc

@hendrikmuhs
Contributor

Tests are still failing, e.g. gradle-enterprise.elastic.co/s/nmzh3f4juwn4e.
@hendrikmuhs I think it's worth pushing #67681 to completion.

#67681 is ready for review

@droberts195
Contributor

droberts195 commented Feb 8, 2021

This should be fixed by #68542.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Feb 22, 2021
Unmute the YAML tests that were muted due to the problem
of elastic#66885.

The underlying problem was fixed by elastic#68542.
droberts195 added a commit that referenced this issue Feb 22, 2021
Unmute the YAML tests that were muted due to the problem
of #66885.

The underlying problem was fixed by #68542.
alyokaz pushed a commit to alyokaz/elasticsearch that referenced this issue Mar 10, 2021
The selective muting implemented for autoscaling in elastic#67159
is extended to the ML tests that also fail when machine
memory is reported as 0.

Most of the logic to determine when memory will not be
accurately reported is now in a utility method in the
base class.

Relates elastic#66885