Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Usage of NodeClient in BaseRestHandler of an ActionPlugin causes multi-node clusters to hang after version 7.9.0 #67960

Closed
davemoore- opened this issue Jan 25, 2021 · 10 comments
Labels
>bug :Distributed/Network Http and internode communication implementations >docs General docs changes Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team

Comments

@davemoore-
Copy link
Contributor

davemoore- commented Jan 25, 2021

Elasticsearch version: >= 7.9.0

Plugins installed: Any custom ActionPlugin that uses the NodeClient client from a BaseRestHandler.

JVM version: JDK 11

OS version: The test scripts (see below) used Docker CE 19.03.13 and docker-compose 1.27.4 on macOS 10.15.7

Description of the problem

When using a multi-node cluster, requests made by the NodeClient client from a BaseRestHandler in an ActionPlugin will cause Elasticsearch to hang.

Take this simple example (see entire sample plugin):

@Override
protected RestChannelConsumer prepareRequest(final RestRequest request, final NodeClient client) {
    return channel -> {
        try {
            client.prepareSearch("*").get(); // A search request made by the NodeClient.
            XContentBuilder content = XContentFactory.jsonBuilder();
            content.startObject().field("acknowledged", true).endObject();
            channel.sendResponse(new BytesRestResponse(RestStatus.OK, content));
        } catch (final Exception e) {
            channel.sendResponse(new BytesRestResponse(channel, e));
        }
    };
}

In this example, a simple search performed by the plugin's NodeClient will cause Elastisearch >= 7.9.0 to hang if that cluster has multiple nodes.

Any action, whether it's creating an index or performing a search, from a custom plugin appears to cause multi-node cluster to hang.

I've labeled this as a "bug" because the usage of the NodeClient in plugins has worked since at least Elasticsearch 5.x, and the expected behavior stopped working after 7.9.0 without an explanation in the release notes.

Steps to reproduce

I've provided a test script to simplify the reproduction of this issue.

Step 1. Download and extract my test script (elasticsearch-plugin-troubleshooting.zip) which contains:

  • ./post-7.9/docker-compose.yml - A 3-node Elasticsearch cluster at version 7.10.2 with a sample plugin installed
  • ./pre-7.9/docker-compose.yml - A 3-node Elasticsearch cluster at version 7.8.1 with a sample plugin installed
  • ./test.sh - A script to test plugin behavior in the two versions of Elasticsearch

Step 2. Test the behavior of the plugin on Elasticsearch 7.8.1. Open two terminals:

  • On Terminal 1:
    • cd pre-7.9
    • docker-compose up
  • On Terminal 2:
    • ./test.sh

Step 3. Observe the output. Notice that ./test.sh performs three successful tasks:

  • Creates and deletes an index (sample-index) using curl 50 times.
  • Creates and deletes an index (sample-index) using the custom plugin endpoint (/_sample/create) 50 times.
  • Searches an index using the custom plugin endpoint (_sample/search) 50 times.

Step 4. Test the behavior of the plugin on Elasticsearch 7.10.2. Open two terminals:

  • On Terminal 1:
    • cd post-7.9
    • docker-compose up
  • On Terminal 2:
    • ./test.sh

Step 5. Observe the output. Notice that ./test.sh performs hangs when the custom plugin is used, unlike Step 3:

  • Creates and deletes an index (sample-index) using curl 50 times.
  • Hangs when the custom plugin endpoint (/_sample/create) attempts to create an index (sample-index).
@DaveCTurner
Copy link
Contributor

If you run your cluster with assertions enabled, does your plugin throw an AssertionError?

@davemoore-
Copy link
Contributor Author

davemoore- commented Jan 26, 2021

@DaveCTurner Yes, here are the details:

  • I enabled assertions by adding -ea to ES_JAVA_OPTS for each node in my docker-compose.yml for Elasticsearch 7.10.2.
  • I ran the test script and the plugin threw this assertion error:
java.lang.AssertionError: Expected current thread [Thread[elasticsearch[es01][transport_worker][T#5],5,main]] to not be a transport thread. Reason: [Blocking operation]
es01    | 	at org.elasticsearch.transport.Transports.assertNotTransportThread(Transports.java:60)
es01    | 	at org.elasticsearch.common.util.concurrent.BaseFuture.blockingAllowed(BaseFuture.java:92)
es01    | 	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:86)
es01    | 	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
es01    | 	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
es01    | 	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
es01    | 	at org.elasticsearch.plugin.sampleplugin.SampleCreateIndexAction.lambda$prepareRequest$0(SamplePlugin.java:67)
es01    | 	at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:115)
es01    | 	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:258)
es01    | 	at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:340)
es01    | 	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:191)
[... omitted for brevity ...]
es01    | 	at java.base/java.lang.Thread.run(Thread.java:832)

Thank you for your quick response!

@davemoore-
Copy link
Contributor Author

Adding to this:

I enabled assertions for a pre-7.9 cluster, too, and the same assertion error was thrown. But without assertions enabled, the plugin worked as expected on that version of Elasticsearch.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Jan 26, 2021

Ok that makes sense. Blocking calls in REST handlers, or transport threads more generally, aren't allowed as they have severe performance implications. That has been true for a very long time. The assertion that is tripping was added back in 1.5.0 (#9164) to warn you away from doing so. You should use the async API wherever possible, and fork to a different thread if async isn't possible for some reason.

I believe the reason this changed in 7.9 was #46346; you ought to be able to demonstrate this by setting http.netty.worker_count and observing that there is no longer a deadlock, but please note that you should only do this to confirm the cause of the change in behaviour and not to consider this a fix: a blocking call in a REST handler is a serious bug, as is tripping an assertion. You need to avoid that blocking call and to ensure that your tests all pass even with assertions enabled.

I'm not sure why the change in 7.9 wasn't mentioned in the release notes, I'll mark this as a doc bug to get that addressed.

@DaveCTurner DaveCTurner added :Distributed/Network Http and internode communication implementations >docs General docs changes and removed needs:triage Requires assignment of a team area label labels Jan 26, 2021
@elasticmachine elasticmachine added Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team labels Jan 26, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@davemoore-
Copy link
Contributor Author

davemoore- commented Jan 26, 2021

Thanks @DaveCTurner. I supposed the next question, then, is how to properly implement a REST handler that must wait for the results from one or more Elasticsearch requests before returning a response to the user. Should I field that question to discuss.elastic.co?

You need to avoid that blocking call and to ensure that your tests all pass even with assertions enabled.

++ Good advice, I wasn't aware of the ability to enable assertions in Elasticsearch.

@DaveCTurner
Copy link
Contributor

how to properly implement a REST handler that must wait for the results from one or more Elasticsearch requests before returning a response to the user. Should I field that question to discuss.elastic.co?

Yes the forums.would be a better forum for that question.

@davemoore-
Copy link
Contributor Author

++ I've submitted my follow-up question to the discussion boards. Thank you for your help.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jan 27, 2021
In elastic#46346 we changed the distribution of work across event loops but
this was marked as a `>non-issue` so did not get a mention in the
release notes. However, plugin authors might need to be aware of this
change, so this commit records it in the release notes as an enhancement
instead.

Closes elastic#67960
DaveCTurner added a commit that referenced this issue Feb 8, 2021
In #46346 we changed the distribution of work across event loops but
this was marked as a `>non-issue` so did not get a mention in the
release notes. However, plugin authors might need to be aware of this
change, so this commit records it in the release notes as an enhancement
instead.

Closes #67960
DaveCTurner added a commit that referenced this issue Feb 8, 2021
In #46346 we changed the distribution of work across event loops but
this was marked as a `>non-issue` so did not get a mention in the
release notes. However, plugin authors might need to be aware of this
change, so this commit records it in the release notes as an enhancement
instead.

Closes #67960
DaveCTurner added a commit that referenced this issue Feb 8, 2021
In #46346 we changed the distribution of work across event loops but
this was marked as a `>non-issue` so did not get a mention in the
release notes. However, plugin authors might need to be aware of this
change, so this commit records it in the release notes as an enhancement
instead.

Closes #67960
DaveCTurner added a commit that referenced this issue Feb 8, 2021
In #46346 we changed the distribution of work across event loops but
this was marked as a `>non-issue` so did not get a mention in the
release notes. However, plugin authors might need to be aware of this
change, so this commit records it in the release notes as an enhancement
instead.

Closes #67960
@DaveCTurner
Copy link
Contributor

Closed by #68073.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Network Http and internode communication implementations >docs General docs changes Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team
Projects
None yet
Development

No branches or pull requests

3 participants