Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] RestSqlIT testAsyncTextPaginated failing #80089

Open
droberts195 opened this issue Oct 29, 2021 · 9 comments · May be fixed by #107955
Open

[CI] RestSqlIT testAsyncTextPaginated failing #80089

droberts195 opened this issue Oct 29, 2021 · 9 comments · May be fixed by #107955
Assignees
Labels
:Analytics/SQL SQL querying low-risk An open issue or test failure that is a low risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI

Comments

@droberts195
Copy link
Contributor

droberts195 commented Oct 29, 2021

Although the same test, this failure differs from #76785, because that was a timeout whereas this one is "all shards failed". I wonder if this is yet another manifestation of #65846?

Build scan:
https://gradle-enterprise.elastic.co/s/xehunhrnkfhhi/tests/:x-pack:plugin:sql:qa:server:security:with-ssl:integTest/org.elasticsearch.xpack.sql.qa.security.RestSqlIT/testAsyncTextPaginated

Reproduction line:
./gradlew ':x-pack:plugin:sql:qa:server:security:with-ssl:integTest' --tests "org.elasticsearch.xpack.sql.qa.security.RestSqlIT.testAsyncTextPaginated" -Dtests.seed=B1D5C515E1E4486D -Dtests.locale=hr-HR -Dtests.timezone=Asia/Irkutsk -Druntime.java=17

Applicable branches:
master, 8.0, 7.16

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.sql.qa.security.RestSqlIT&tests.test=testAsyncTextPaginated

Failure excerpt:

org.elasticsearch.client.ResponseException: method [GET], host [https://[::1]:45613], URI [/_sql/async/FkZhZEVPa2VyUmphUlRzQW5mMWYtX2cbb0U0bzB2UGlSWnVGZ2RRV2ZlU3dnQTo0MDMz?format=txt&wait_for_completion_timeout=1d], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [45]"}},"status":404}

  at __randomizedtesting.SeedInfo.seed([B1D5C515E1E4486D:FB65C4A44320DED2]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:335)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:276)
  at org.elasticsearch.xpack.sql.qa.rest.RestSqlTestCase.testAsyncTextPaginated(RestSqlTestCase.java:1344)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

@droberts195 droberts195 added :Analytics/SQL SQL querying >test-failure Triaged test failures from CI labels Oct 29, 2021
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Oct 29, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@davidkyle
Copy link
Member

There are 4 test classes with the name RestSqlIT

  • org.elasticsearch.xpack.sql.qa.security.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.multi_cluster_with_security.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.single_node.RestSqlIT
  • org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT

All have had failures in the last 7 days apart from org.elasticsearch.xpack.sql.qa.multi_node.RestSqlIT

https://build-stats.elastic.co/app/kibana#/discover?_g=h@33d9122&_a=h@19cd33a

The test is muted in the base class in #80126

@bpintea
Copy link
Contributor

bpintea commented Nov 1, 2021

The failures won't reproduce locally, but all CI failures contain the warning There are still tasks running after this test that might break subsequent tests (also not reproducible locally) after every test preceding this failing one, which I suppose has to do with the test failing.

@droberts195
Copy link
Contributor Author

The leftover task is indices:data/read/sql/async/get.

What might be happening is:

  1. The previous test leaves an async search in progress when it completes
  2. The cleanup that runs between tests deletes all indices
  3. The leftover async search causes the async search results index to be created
  4. This test runs a search that then suffers all shards failed because the async search results index exists but is not ready to search yet, i.e. the problem of Allow Ignoring failures for indices that have just been created #65846

If you agree then the fix could be to wait for pending tasks to complete at the end of the previous test. If you search for waitForPendingTasks in IntelliJ you'll find a few places where ML does this in order to stop spillover from one test to the next.

Please also upvote #65846.

@bpintea
Copy link
Contributor

bpintea commented Nov 2, 2021

because the async search results index exists but is not ready to search yet

Maybe this scenario could happen, there are failures occurring when fetching the async results; but there are also failures triggered even by fetching the async task's status (like here and here), or even for starting the async job (like here). Then all failures of this test are 404s vs 503s (as in #65846) and the reason, when available is No search context found for id.

And it also seems these types of failures for this test started past the 27th last month (there are a couple of other failures, but time out-related).

So it might still be the suggested root cause, but I'd like to poke around a bit more.

@costin costin added the low-risk An open issue or test failure that is a low risk to future releases label Oct 12, 2023
@wchaparro wchaparro removed the Team:QL (Deprecated) Meta label for query languages team label Jan 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@craigtaverner
Copy link
Contributor

I tested both the failures reported in #76785 as well as reported here for both main (8.15) and 7.17 and all passed locally. However, in one of the above comments it was pointed out that these failures only occur in CI, not locally, and they appear to be unclosed previous async queries. So I'm making a PR that unmutes the test as well as adds the suggested waitForPendingTasks(adminClient(), taskName -> taskName.startsWith("indices:data/read/sql/async/get"));.

craigtaverner added a commit to craigtaverner/elasticsearch that referenced this issue Apr 26, 2024
craigtaverner added a commit to craigtaverner/elasticsearch that referenced this issue Apr 26, 2024
@craigtaverner craigtaverner linked a pull request Apr 26, 2024 that will close this issue
@craigtaverner
Copy link
Contributor

To see both 8.15 and 7.17 running in CI, I made two PRs:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/SQL SQL querying low-risk An open issue or test failure that is a low risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants