TraceID listed in server logs cannot be found in Grafana Tempo #594

Closed · 4 tasks

pdeubel opened this issue Jul 12, 2023 · 6 comments
pdeubel commented Jul 12, 2023

System Info

HF-TGI server running on Kubernetes; I executed text-generation-launcher --env inside the pod:

2023-07-12T12:58:48.739266Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 31b36cca21fcd0e6b7db477a7545063e1b860156
Docker label: sha-31b36cc
nvidia-smi:
Wed Jul 12 12:58:48 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100 80G...  On   | 00000001:00:00.0 Off |                    0 |
   | N/A   35C    P0    71W / 300W |  46936MiB / 81920MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-07-12T12:58:48.739312Z  INFO text_generation_launcher: Args { model_id: "OpenAssistant/falcon-40b-sft-mix-1226", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 512, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "production-hf-text-generation-inference-6594cb8f5d-z4mdf", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: Some("tempo.monitoring:4317"), cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: true }

Model being used:

{
  "model_id": "OpenAssistant/falcon-40b-sft-mix-1226",
  "model_sha": "9ac6b7846fabe144646213cf1c6ee048b88272a7",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 512,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 16000,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.9.1",
  "sha": "31b36cca21fcd0e6b7db477a7545063e1b860156",
  "docker_label": "sha-31b36cc"
}

Hardware used (GPUs, how many, on which cloud): see the nvidia-smi output above; runs on an Azure Kubernetes Service VM of type Standard_NC24ads_A100_v4

Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): runs on AKS and is installed through a Helm chart
The current version being used: 0.9.1

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I installed the HF TGI server, Prometheus, Grafana, Loki, and Grafana Tempo on Kubernetes. The latter four are in the namespace monitoring and the HF TGI server is in the namespace hf-tgi. HF-TGI is created with the following environment variable set: OTLP_ENDPOINT: "tempo.monitoring:4317", i.e. it references the service tempo in the monitoring namespace on port 4317. The service is up and running.
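
(To make the setup concrete, the relevant part of the rendered deployment looks roughly like the sketch below. This is illustrative only; the actual container name, image tag, and Helm values differ.)

```yaml
# Illustrative sketch of the container spec rendered by the Helm chart.
# OTLP_ENDPOINT points text-generation-launcher at the Tempo service
# ("tempo" in namespace "monitoring", OTLP gRPC port 4317).
containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:0.9.1
    env:
      - name: OTLP_ENDPOINT
        value: "tempo.monitoring:4317"  # later changed to "http://tempo.monitoring:4317", see the first reply below
```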

So far this works fine: in Grafana under "Explore", I can select "Tempo", click on "Search", and run the query. It then finds a lot of traces, mostly from the target /Health, sometimes /Decode:
[screenshot]

Now, when I go to "Explore", select "Loki", and then query the logs from the HF TGI pod, I can see the same info messages as on the server's stdout. In these messages there is a JSON field called "spans[0].trace_id". When I take the value from that field and search for it under "Explore" -> "Tempo" -> TraceQL, I get an error message that the trace was not found:
failed to get trace with id: XXXX Status: 404 Not Found Body: trace not found

[screenshot]

Expected behavior

My expected behavior would be that the TraceIDs listed in the info messages on the server point to traces that actually exist in Tempo.

However, I am new to tracing (and the Prometheus/Grafana stack), so my question is also whether I am misconfiguring something here. I think it is a bug because I can see some traces, but the TraceIDs from the info logs cannot be found.

OlivierDehaene (Member) commented:

It's possible that you are just missing the http:// scheme: OTLP_ENDPOINT: http://tempo.monitoring:4317.
The traces you see are from the Python server; Tempo doesn't seem to be receiving the traces from the Rust webserver (which is the one that displays the trace ID in the logs).

pdeubel (Author) commented Jul 12, 2023

Many thanks for the fast response, that worked: I now get the traces from the Rust webserver. However, when I search Tempo for the TraceIDs from the logs, I still get 404 errors. On the other hand, when exploring all traces I can find the POST /generate span and a trace for it, but the TraceIDs there are different from the ones in the logs.

For example, this is from the log of a request:

[screenshot]

That TraceID cannot be found:

[screenshot]

However, the corresponding trace that I found under "Explore" -> "Tempo" has a different ID (the second one is from another request):

[screenshot]

Is that still a misconfiguration on my part?

pdeubel (Author) commented Aug 7, 2023

Sorry to bother you again, but I have not resolved this. I still get 404 errors for the Trace IDs that I find in the logs. Do you have any advice on what I could try?

pdeubel (Author) commented Sep 5, 2023

So I did some more digging: by enabling log_received_spans and log_received_spans.include_all_attributes in Grafana Tempo (see the config sketch after the log excerpt below), I logged which spans arrive in Tempo. I extracted the corresponding log entry for a call to /generate, and it seems there are three IDs, span_trace_id, spanid, and traceid:

level=info ts=2023-09-05T12:49:30.2719584Z caller=distributor.go:591
span_service_name=text-generation-inference.router span_http_target=/health
span_http_user_agent=kube-probe/1.26 span_trace_id=977145f925d16e8235839a5b6ed18e83
span_http_status_code=200 span_busy_ns=403196 span_code_lineno=161
span_thread_name=tokio-runtime-worker span_http_route=/health span_idle_ns=2811510
span_http_flavor=1.1
span_code_filepath=/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/axum-tracing-opentelemetry-0.10.0/src/middleware/trace_extractor.rs
span_http_method=GET span_code_namespace=axum_tracing_opentelemetry::middleware::trace_extractor
span_http_host=X span_http_scheme=HTTP span_thread_id=14 span_http_client_ip=
span_name="GET /health" span_duration_seconds=0.003230504 span_kind=SPAN_KIND_SERVER
span_status=STATUS_CODE_OK msg=received spanid=5e65f3ea6125c531
traceid=8b6a7418f11718fbc86cff5cbdf60e81
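
(For reference, the span logging used to capture the line above is configured in Tempo's distributor block; a minimal sketch, assuming a standard Tempo YAML config file, where exact key placement may vary between Tempo versions:)

```yaml
# Sketch of the Tempo settings that produce the "msg=received" log lines above.
distributor:
  log_received_spans:
    enabled: true                 # log every span the distributor receives
    include_all_attributes: true  # also log span attributes such as span_trace_id
```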

span_trace_id is the only one that is displayed in the TGI logs, and it is the one that I cannot find in Grafana when I search the Tempo datasource. spanid and traceid are visible in Grafana, so they work, but they have no connection to span_trace_id.

I tried adding some formatting, or rather extraction, to your router code to include spanid and traceid in the logs, but I did not get very far since I have never programmed in Rust. I would be happy if you could implement something that explicitly includes spanid and traceid in the TGI logs. I found some people who have already tried doing that with tracing and OpenTelemetry: tokio-rs/tracing#1531
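
(For illustration, reading the OpenTelemetry IDs from the current span usually looks something like the sketch below. This is a hypothetical helper, not existing TGI code, and it assumes the tracing, tracing-opentelemetry, and opentelemetry crates the router already depends on; exact APIs vary slightly between crate versions.)

```rust
// Hypothetical helper (not part of the TGI router): read the OpenTelemetry
// trace id attached to the current tracing span so it can be added to a log line.
use opentelemetry::trace::TraceContextExt;       // for Context::span()
use tracing_opentelemetry::OpenTelemetrySpanExt; // for Span::context()

fn current_otel_trace_id() -> Option<String> {
    let context = tracing::Span::current().context();
    let span_context = context.span().span_context().clone();
    // Only return an id once a valid OpenTelemetry span context exists; this
    // should correspond to the `traceid` field that Tempo indexes, rather than
    // the `span_trace_id` attribute shown in the distributor log above.
    span_context
        .is_valid()
        .then(|| span_context.trace_id().to_string())
}
```

Logging the returned value next to the existing fields would, in principle, give an ID that can actually be looked up in Tempo.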

Other than that, I will try to figure out whether Grafana and/or Tempo can be configured to extract the correct IDs, since Tempo receives all three IDs as seen in the log, or I will switch from Tempo to Jaeger, which might not have this problem.

OlivierDehaene (Member) commented:

That's very interesting, thanks for sharing. It seems to be a bug in the code that generates the traceID. We never run into this issue on our side (since we always receive a traceID inside the request headers), so I'm not sure where the bug is exactly. I will create a PR with a fix once I find it.

github-actions bot commented May 2, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 2, 2024
github-actions bot closed this as not planned on May 7, 2024