all shards failed [type=search_phase_execution_exception]\nSearch service in Jaeger Query UI #2976

Open
raz08 opened this issue May 3, 2021 · 26 comments

Comments

@raz08

raz08 commented May 3, 2021

Describe the bug
The Jaeger UI shows an "all shards failed" error and does not load traces.
To Reproduce
Steps to reproduce the behavior:

  1. Deploy Elasticsearch and Jaeger
  2. Restart the Elasticsearch cluster
  3. After the restart, the UI shows the "all shards failed" error

Expected behavior
The Jaeger Query UI should show an appropriate message, and it should recover automatically once Elasticsearch is back in a green state.
Screenshots
{"level":"error","ts":1616554296.4920769,"caller":"app/http_handler.go:410","msg":"HTTP handler, Internal Server Error","error":"Search service failed: elastic: Error 503 (Service Unaviled [type=search_phase_execution_exception]","errorVerbose":"elastic: Error 503 (Service Unavailable): all shards failed [type=search_phase_execution_exception]\nSearch service

Version (please complete the following information):

  • OS: [e.g. Linux]
  • Jaeger version: 1.15
  • Deployment: Kubernetes
  • Elasticsearch: 7.3.2

What troubleshooting steps did you try?
Try to follow https://www.jaegertracing.io/docs/latest/troubleshooting/ and describe how far you were able to progress and/or which steps did not work.

Additional context
Does upgrading Jaeger help fix this issue? If not, how can it be solved?

raz08 added the bug label May 3, 2021
raz08 changed the title May 3, 2021
  from: Error 503 (Service Unaviled [type=search_phase_execution_exception]","errorVerbose":"elastic: Error 503 (Service Unavailable): all shards failed [type=search_phase_execution_exception]\nSearch service
  to: all shards failed [type=search_phase_execution_exception]\nSearch service in Jaeger Query UI
@yurishkuro
Member

Similar to #2718

@pavolloffay
Member

This should not happen since we added creation of index templates to Jaeger startup.

Did this happen with a clean Jaeger installation or after an upgrade?

@pavolloffay
Member

@raz08 if this is resolved, please add details on how it was resolved and close the issue.

@meilihao

I got the same error after upgrading Jaeger from 1.22 to 1.25; the failing request is curl http://openhello.net:16686/api/services.

It works after cleaning out the old data with curl -X DELETE 'http://localhost:9200/_all'.

@pavolloffay
Member

@meilihao are these two Jaeger versions using the same ES cluster?

@meilihao

@pavolloffay I'm not sure. I used ES via Docker with Jaeger 1.22 before; today I removed Docker and installed ES via apt with Jaeger 1.25. I don't remember whether /var/lib/elasticsearch was mounted in Docker.

@nbari

nbari commented Nov 3, 2021

I have this issue with version 1.27.0 and OpenSearch. Is there any flag that can be passed to the collector or the UI to initialize elk/OpenSearch?

@henderjm

henderjm commented Dec 6, 2021

> I got the same error after upgrading Jaeger from 1.22 to 1.25; the failing request is curl http://openhello.net:16686/api/services.
>
> It works after cleaning out the old data with curl -X DELETE 'http://localhost:9200/_all'.

This was the only way I was able to get our deployment working again. Sadly it costs data, but we were lucky it was just our dev environment.

@pavolloffay
Member

> is there any flag that can be passed to the collector or the UI to initialize elk/OpenSearch?

On each Jaeger start, the collector makes a request to ES to create the index templates.

@Ankitchandre

We are facing the same issue. Are there any suggestions or solutions? We tried deleting the indexes and redeploying the Jaeger operator, but it didn't work.

@korenlev

Facing the same issue with 1.28.

@cablunar

Having the same issue with 1.38.0 after upgrading to a new OpenSearch instance.
Reverting doesn't help... :/

@ihatemodels

Facing the same issue on all versions from 1.30.0 to 1.38.0.
Deleting and reverting the mapping doesn't work ...

@PaulFlorea

I had to roll back to v1.21 to get it to work again.
Every Jaeger version from v1.22 onward is broken with the ES backend: if any of them writes to ES, it throws that error until the indices are deleted.

@v-sag

v-sag commented Jan 24, 2023

Is there any update on this? I get the same error.

@dolgovas

dolgovas commented Aug 8, 2023

> Is there any update on this? I get the same error.

Fixed by kubegems/kubegems#413 (comment)

@K3ndu

K3ndu commented Aug 15, 2023

Any updates on this?
We get this error the moment I create templates for the Jaeger indexes.
OpenSearch version is 2.9.0.
Jaeger is 1.45.0.

@stmlange

stmlange commented Sep 18, 2023

Still happening with Jaeger 1.49.

Edit: The only option I found that seems to fix it was mentioned in kubegems/kubegems#413:

> Resolved by shutting down all Jaeger collectors, deleting the bad indexes, then restarting the collectors.
> So old trace data will be lost.

@gjshao44

gjshao44 commented Nov 9, 2023

In my case, I have Jaeger connecting to AWS OpenSearch. I hit this issue first: #3571 (comment), so I added the ES_CREATE_INDEX_TEMPLATES = "false" flag. Then I experienced the same Jaeger query issue mentioned in this thread. I manually uploaded both jaeger-span-7.json and jaeger-service-7.json (with a minor modification from this repo to remove the micros) to OpenSearch, following the OpenSearch documentation: https://opensearch.org/docs/latest/im-plugin/index-templates/. Next I applied the workaround mentioned above, which finally made it work. I don't think either issue is fixed.

@pip25

pip25 commented Jan 3, 2024

Just occurred on 1.50 as well, right after we deployed Jaeger to production a month ago. Incredibly embarrassing (for us). I get the feeling that most "fixes" mentioned here and in related issues merely mask the problem, since they tend to involve deleting the previous data. Unfortunately, deleting the traces is not an option for us, since they also serve auditing purposes.

I'll look at the source to see what kind of queries Jaeger sends towards ES, because the indexes can be queried from Kibana just fine, both new and old. The new traces also continue to be written successfully.

@pip25

pip25 commented Jan 3, 2024

Got it. It's the service and operation queries that fail on the UI. The service query is the following:

GET jaeger-service-2024-01-03/_search?ignore_unavailable=true&rest_total_hits_as_int=true
{
  "aggregations": {
    "distinct_services": {
      "terms": { "field": "serviceName", "size": 1000 }
    }
  },
  "size": 0
}

And the error is:

"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [serviceName] in order to load field data by uninverting the inverted index. Note that this can use significant memory."

Index templates created by Jaeger seem to be applied normally to the indexes; the only deviation we have from the default config is an additional index template that adds a custom ILM policy (since we don't use aliases).

As for the error message itself, there seems to be little point in adding "fielddata" to the template, since we already have a keyword, and after changing "field":"serviceName" to "field":"serviceName.keyword", the call succeeds. But then I can't help but wonder: how did this query ever work in the first place? Why does it work for others after deleting the previous data?

EDIT: This is unbelievable. On our test environment, where the Jaeger UI still works, the same query succeeds without issues. The index mappings seem identical. The data itself seems structurally identical as well. And if I replace "field":"serviceName" with "field":"serviceName.keyword" on that environment, the query result is empty! Could this be an ES issue? Does something significant change in the way aggregations are handled when one is using an ILM policy? (Which is basically the sole difference between the two environments.)
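
For anyone who wants to repeat this comparison, here is a minimal Python sketch that builds the two request bodies discussed above. The function name is hypothetical; only the JSON shape comes from the query shown in this comment.

```python
def distinct_services_query(field="serviceName", size=1000):
    """Build the terms aggregation body that lists distinct service names."""
    return {
        "aggregations": {
            "distinct_services": {"terms": {"field": field, "size": size}}
        },
        "size": 0,  # no hits are requested, only the aggregation buckets
    }


# Body of the failing request, as captured above.
original = distinct_services_query()

# Variant that targets the "keyword" sub-field instead.
keyword_variant = distinct_services_query(field="serviceName.keyword")
```

Either body can be sent to the _search endpoint of one index at a time (for example from Kibana Dev Tools) to see which indexes accept which variant.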

@pip25

pip25 commented Jan 4, 2024

I managed to fix the problem. Here's what has been the cause for us; I hope it'll be useful for others as well in the future.

First of all, this is apparently not a bug in Jaeger, but an ES configuration issue.

Elasticsearch applies only a single composable index template to a newly created index, chosen by the templates' priority. The templates created by Jaeger are put into the "legacy templates" category in ES 7, which unfortunately means that they lose to any matching composable template, whatever its priority. If you define any other template with an index pattern that overlaps these Jaeger templates, that will be used instead, and you will likely end up with an index with missing or incorrect mapping settings.
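
The precedence rule can be sketched as a simplified Python model. This is not Elasticsearch's implementation: it only covers pattern matching, composable priority, and legacy order. The template names and index patterns in the sample data are assumptions made up for illustration.

```python
from fnmatch import fnmatchcase


def pick_templates(index_name, composable, legacy):
    """Return the names of the templates that would shape a new index.

    composable: dicts with "name", "index_patterns", "priority"
    legacy:     dicts with "name", "index_patterns", "order"
    """
    def matches(template):
        return any(fnmatchcase(index_name, p) for p in template["index_patterns"])

    candidates = [t for t in composable if matches(t)]
    if candidates:
        # A matching composable template wins outright; legacy ones are ignored.
        best = max(candidates, key=lambda t: t.get("priority", 0))
        return [best["name"]]

    # With no composable match, all matching legacy templates are merged,
    # lowest order first (higher order overrides lower).
    merged = sorted((t for t in legacy if matches(t)), key=lambda t: t.get("order", 0))
    return [t["name"] for t in merged]


# Hypothetical templates: one legacy template standing in for Jaeger's,
# and one custom composable template with an overlapping pattern.
legacy = [{"name": "jaeger-service", "index_patterns": ["*jaeger-service-*"], "order": 0}]
composable = [{"name": "custom-ilm", "index_patterns": ["jaeger-*"], "priority": 100}]

print(pick_templates("jaeger-service-2024-01-03", composable, legacy))
print(pick_templates("jaeger-service-2024-01-03", [], legacy))
```

In this model the custom template shadows the Jaeger one as soon as its pattern overlaps, which is the situation described in this comment.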

In our case, the template that was used instead of Jaeger's had dynamic mapping enabled, which means that ES autocreates mapping definitions from the incoming data. This is how our service definition indexes had a mapping for the serviceName field mentioned above. But unfortunately, the autocreated mapping differs from the format Jaeger expects the serviceName to be present in: it wants the field to be a keyword, while in the autocreated mapping, serviceName is a text field, with a sub-property named (and typed as) keyword. The aggregation ES query Jaeger uses to get the service names requires a keyword field, which is what causes the UI to report an error.
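
To find the affected indexes, one option is to fetch the mappings (for example with GET jaeger-service-*/_mapping) and check how serviceName is typed in each. The sketch below assumes the typeless mapping response shape of ES 7 and later; the function name and the sample data are hypothetical.

```python
def indexes_with_wrong_mapping(mapping_response, field="serviceName"):
    """Return the names of indexes where `field` is not mapped directly as keyword."""
    bad = []
    for index, body in mapping_response.items():
        props = body.get("mappings", {}).get("properties", {})
        if props.get(field, {}).get("type") != "keyword":
            bad.append(index)
    return sorted(bad)


# Hypothetical _mapping response: the first index got the expected mapping,
# the second one got its mapping from dynamic mapping.
sample = {
    "jaeger-service-2024-01-02": {
        "mappings": {"properties": {"serviceName": {"type": "keyword"}}}
    },
    "jaeger-service-2024-01-03": {
        "mappings": {
            "properties": {
                "serviceName": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                }
            }
        }
    },
}

print(indexes_with_wrong_mapping(sample))
```

Only the second sample index is reported, because its serviceName is a text field with a keyword sub-field rather than a keyword field.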

What is truly insidious about this issue is that, if you introduce the index template that overrides Jaeger's while some Jaeger indexes already exist, the problem does not manifest itself immediately. This is because Jaeger usually queries multiple indexes at once, based on the value of the es.max-span-age parameter. As long as even one index in Jaeger's "query window" contains the expected mappings, the UI will seemingly function as normal; in the background, part of the service/operation queries will fail, but as long as at least one index returns meaningful results, Jaeger will not complain. If there is one thing Jaeger could perhaps do better in such a situation, it's to at least report a warning if some shards return an error during the query, to let users know that something is amiss. This would let them find the issue much more easily than having the Jaeger UI suddenly break days, or even weeks, after the problematic index template was introduced.
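
The delayed failure can be illustrated with a rough Python model. It assumes daily indexes named like jaeger-service-2024-01-03 (as in the query above) and a 72-hour window chosen purely for illustration; how Jaeger actually computes its query window is not shown in this thread, so treat both functions as hypothetical.

```python
from datetime import datetime, timedelta, timezone


def daily_indices(prefix, now, max_span_age):
    """List the daily index names that fall inside the query window."""
    start = (now - max_span_age).date()
    days = (now.date() - start).days
    return [f"{prefix}{(start + timedelta(days=i)).isoformat()}" for i in range(days + 1)]


def ui_appears_healthy(window, bad_indices):
    """The UI looks fine while at least one index in the window is still good."""
    return any(name not in bad_indices for name in window)


now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
window = daily_indices("jaeger-service-", now, timedelta(hours=72))

# Only the oldest index in the window still has the expected mapping.
print(ui_appears_healthy(window, set(window[1:])))
# Once that index ages out of the window, every queried index is bad.
print(ui_appears_healthy(window, set(window)))
```

With a longer window, the gap between introducing the overriding template and seeing the UI break grows accordingly.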

TL;DR: Make sure Jaeger's index templates are not overridden. If they are, the UI won't fail straight away, but it will eventually.

@yurishkuro
Member

yurishkuro commented Jan 4, 2024

@pip25 thanks for a great analysis and write-up. Note that recently we added an ability to specify which priorities to use when creating Jaeger indices (ESv8 only):

 --es.prioirity-dependencies-template | 0 | Priority of jaeger-dependecies index template (ESv8 only) 
 --es.prioirity-service-template      | 0 | Priority of jaeger-service index template (ESv8 only) 
 --es.prioirity-span-template         | 0 | Priority of jaeger-span index template (ESv8 only) 

Is there something else we could add to alleviate this specific issue?

@pip25

pip25 commented Jan 4, 2024

@yurishkuro Thanks, we're currently stuck on ESv7, but that is good to know.

> Is there something else we could add to alleviate this specific issue?

As I wrote in the above wall of text :), it may make this configuration problem easier to spot if Jaeger did not silently swallow query errors in cases when only some shards fail (and thus some meaningful result is in fact returned). In such cases, some kind of warning message would be useful.
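
As a rough illustration of such a check (written in Python for brevity, and not taken from Jaeger's code), the _shards section of an Elasticsearch search response already reports partial failures. The logger name, the function name, and the sample response below are hypothetical.

```python
import logging

log = logging.getLogger("es-partial-failures")


def count_failed_shards(response):
    """Return the number of failed shards and log a warning if there are any."""
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        # Collect the distinct failure reasons reported by the shards.
        reasons = sorted({
            f.get("reason", {}).get("reason", "unknown")
            for f in shards.get("failures", [])
        })
        log.warning("%d of %d shards failed: %s",
                    failed, shards.get("total", 0), "; ".join(reasons))
    return failed


# Hypothetical response where the query succeeded on some shards only.
partial = {
    "_shards": {
        "total": 5,
        "successful": 3,
        "failed": 2,
        "failures": [
            {"index": "jaeger-service-2024-01-03",
             "reason": {"type": "illegal_argument_exception",
                        "reason": "Text fields are not optimised for aggregations"}},
        ],
    },
    "aggregations": {"distinct_services": {"buckets": [{"key": "frontend"}]}},
}

print(count_failed_shards(partial))
```

The response still carries usable aggregation buckets, so without a check like this the partial failure goes unnoticed.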

@asus4you

Hi @pip25,
thanks for the great analysis, but what is the final solution to get it working? We have an index mapping file on the ES side, and the service field is already set to "keyword" in the ES index mapping, but we are still getting the same error.

@pip25

pip25 commented Mar 21, 2024

@ksai2389 If your issue is what I've described above, you need to delete the problematic ES indexes with the wrong mappings set, then disable/modify the index templates that introduced the wrong mappings in the first place. If only Jaeger's templates can be applied to Jaeger's indexes, from then on your queries should be working.
