
ElasticsearchContainer is not waiting for container being ready #2982

Closed
ghost opened this issue Jul 8, 2020 · 11 comments


ghost commented Jul 8, 2020

In /modules/elasticsearch/src/main/java/org/testcontainers/elasticsearch/ElasticsearchContainer.java#L55, the ElasticsearchContainer waits for either of two status codes:

.forStatusCodeMatching(response -> response == HTTP_OK || response == HTTP_UNAUTHORIZED)

But an UNAUTHORIZED response from Elasticsearch means it is not fully ready yet. As a result, a simple test that relies on the default ElasticsearchContainer without any changes and tries to use its API fails in roughly 10-20% of invocations with errors like

org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=security_exception, reason=missing authentication token for REST request 

or, if one adds the default ES user credentials:

  org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=security_exception, reason=failed to authenticate user [elastic]]
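
For context, a minimal sketch of the kind of flaky test described here (a sketch, not the exact failing test; getHttpHostAddress() is the container's accessor for its mapped host:port):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

ElasticsearchContainer container = new ElasticsearchContainer();
container.start(); // may return before Elasticsearch is fully ready (the bug)

try (RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(HttpHost.create(container.getHttpHostAddress())))) {
    // Intermittently fails with the security_exception shown above
    client.info(RequestOptions.DEFAULT);
}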

The current workaround is (Scala code):

setWaitStrategy(
  getWaitStrategy.asInstanceOf[HttpWaitStrategy] // Get the current HttpWaitStrategy
    .forStatusCodeMatching(null) // Clear its StatusCodeMatching predicate.
    .forStatusCode(HTTP_OK)) // Set a single HTTP_OK StatusCode

This fixes the flaky test, and I believe it should be the default in ElasticsearchContainer.
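
For reference, a rough Java equivalent of the same workaround (a sketch assuming, as in the Scala snippet above, that passing null to forStatusCodeMatching clears the predicate and that getWaitStrategy()/setWaitStrategy are accessible from your code):

import static java.net.HttpURLConnection.HTTP_OK;

import org.testcontainers.containers.wait.strategy.HttpWaitStrategy;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

ElasticsearchContainer container = new ElasticsearchContainer();
container.setWaitStrategy(
    ((HttpWaitStrategy) container.getWaitStrategy()) // the current HttpWaitStrategy
        .forStatusCodeMatching(null)                 // clear the OK-or-401 predicate
        .forStatusCode(HTTP_OK));                    // wait for a plain 200 only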


ghost commented Jul 9, 2020

It is worth pointing out that this line was originally copy-pasted from the Couchbase module, and allowing the UNAUTHORIZED response here was not deliberate or driven by Elasticsearch implementation details.

See the conversation which accompanied the change: #826 (comment)


raynigon commented Jul 9, 2020

@lvorona I think this will not work if Elasticsearch security is enabled. The best approach would be to call the cluster health API with credentials.
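
A sketch of that suggestion, using existing HttpWaitStrategy methods (the password here is a hypothetical placeholder):

import static java.net.HttpURLConnection.HTTP_OK;

import org.testcontainers.containers.wait.strategy.HttpWaitStrategy;

// Wait on the cluster health API, authenticating as the built-in user.
new HttpWaitStrategy()
    .forPort(9200)
    .forPath("/_cluster/health")
    .withBasicCredentials("elastic", "secret") // hypothetical credentials
    .forStatusCode(HTTP_OK);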


ghost commented Jul 9, 2020

A user who enables Elasticsearch security can still override the defaults, similar to what I posted in the initial comment. This change is only meant to make the defaults work.

Currently, a user with a simple single-container test has to add extra code. It would make sense to make the defaults work out of the box and document the changes needed when security is enabled.

Perhaps I could change the health check to take security into account as well. Let me take a look.


rnorth commented Jul 13, 2020

Thanks for the PR, @slovdahl. I'm not sure if the original line was just a straight copy-and-paste from the Couchbase module (at least, that's not my interpretation of #826 (comment)).

@dadoonet, could you perhaps lend your opinion on the suggested approach in this ticket and the PR (#2983)?

slovdahl commented

Just for the record, I did not submit the PR 😄 I just added a proper cross-reference.


ghost commented Jul 14, 2020

Thank you for fixing the PR.

I have pushed one more commit that points the HttpWaitStrategy at the dedicated Elasticsearch health endpoint: /_cluster/health

It took me a while to figure out the container's behaviour, but I tested various settings, and waiting for a 200 on the dedicated health check endpoint behaved best.

Some important notes:

  • The 401 response mentioned in the ticket is not reproducible with the ES version this module uses by default (6.4.1).
  • The 401 response is reproducible with both 6.2.4 (used in our real code) and 6.8.0, a slightly more recent version; both support authentication.
  • To reproduce the issue, the system must be under noticeable load. Even then, it takes several iterations of the loop while mvn package; do echo -e "\n\nAnother one...\n\n"; done for the test to fail.
  • A test with security enabled according to the Elastic blog worked; the health check endpoint itself does not require authentication.
  • Simulating network slowness (sudo tc qdisc add dev docker0 root netem delay 500ms) may help make the issue easier to reproduce. I did it, though I'm not sure it helped much.

As a reminder, this issue came out of us tracking down the root cause of a flaky unit test. It does not fail every time, but I have a small project (a pom.xml and a single unit test) that allowed me to reproduce the issue. Let me know if you want it and I'll share the code.


ghost commented Jul 14, 2020

Actually, I was not enabling authentication properly. The cluster health check requires authentication, and I had to add the following for the test to pass:

// To properly enable security
container.withEnv("ELASTIC_PASSWORD", "secret");
container.withEnv("xpack.security.enabled", "true");

// Add the matching credentials to the wait strategy
.withBasicCredentials("elastic", "secret")

I feel like this is still an improvement over the test being flaky. I could add this snippet to the documentation (https://www.testcontainers.org/modules/elasticsearch/), but I need help figuring out which repository to update for that.
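
Putting the pieces together, the working configuration looks roughly like this (a sketch; "secret" is the placeholder password from the snippet above and must match on both sides):

import static java.net.HttpURLConnection.HTTP_OK;

import org.testcontainers.containers.wait.strategy.HttpWaitStrategy;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

ElasticsearchContainer container = new ElasticsearchContainer()
    .withEnv("ELASTIC_PASSWORD", "secret")      // properly enables security...
    .withEnv("xpack.security.enabled", "true"); // ...together with this flag
container.setWaitStrategy(
    new HttpWaitStrategy()
        .forPort(9200)
        .forPath("/_cluster/health")
        .withBasicCredentials("elastic", "secret") // must match ELASTIC_PASSWORD
        .forStatusCode(HTTP_OK));
container.start();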

dadoonet commented

> I feel like this is still an improvement over the test being flaky. I could add this snippet to the documentation (https://www.testcontainers.org/modules/elasticsearch/), but I need help figuring out which repository to update for that.

Hey. I think this is hopefully what I'm bringing with #2320.


ghost commented Jul 15, 2020

Great, then the user will not have to override the wait strategy, since your container already knows whether credentials are needed.

Feel free to incorporate this change into your PR if that would make it easier for you to manage the change.


stale bot commented Oct 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you believe this is a mistake, please reply to this comment to keep it open. If there isn't one already, a PR to fix or at least reproduce the problem in a test case will always help us get back on track to tackle this.

stale bot added the stale label Oct 17, 2020

ghost commented Oct 18, 2020

This has been resolved (see comments above).

ghost closed this as completed Oct 18, 2020