
Introduces per-message structured GenAI events #980

Open · wants to merge 7 commits into main

Conversation

lmolkova (Contributor)

Fixes #834

Changes

Defines gen-ai specific events along with their structure.
Related to #954, #829

Merge requirement checklist

| Attribute | Type | Description | Examples | Requirement Level | Stability |
|---|---|---|---|---|---|
| [`gen_ai.usage.completion_tokens`](../attributes-registry/llm.md) | int | The number of tokens used in the LLM response (completion). | `180` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.usage.prompt_tokens`](../attributes-registry/llm.md) | int | The number of tokens used in the LLM prompt. | `100` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** The name of the LLM a request is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model requested. If the LLM is a fine-tuned custom model, the value should have a more specific name than the base model that's been fine-tuned.

**[2]:** If not using a vendor-supplied model, provide a custom friendly name, such as a name of the company or project. If the instrumentation reports any attributes specific to a custom model, the value provided in the `gen_ai.system` SHOULD match the custom attribute namespace segment. For example, if `gen_ai.system` is set to `the_best_llm`, custom attributes should be added in the `gen_ai.the_best_llm.*` namespace. If none of the above options apply, the instrumentation should set `_OTHER`.

**[3]:** If available. The name of the LLM serving a response. If the LLM is supplied by a vendor, then the value must be the exact name of the model actually used. If the LLM is a fine-tuned custom model, the value should have a more specific name than the base model that's been fine-tuned.
**[3]:** If there is more than one finish reason in the response, the last one should be reported.
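
For illustration only, here is a minimal sketch of how an instrumentation might record the usage attributes above with the OpenTelemetry Python API; the span name, model, and values are assumptions, not part of these definitions:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical client span around a single GenAI call.
with tracer.start_as_current_span("chat gpt-4") as span:
    span.set_attribute("gen_ai.system", "openai")
    # Token usage taken from the model response; the numbers are illustrative.
    span.set_attribute("gen_ai.usage.prompt_tokens", 100)
    span.set_attribute("gen_ai.usage.completion_tokens", 180)
    # Per note [2], attributes specific to a custom system would go under a
    # matching namespace, e.g. gen_ai.the_best_llm.* when gen_ai.system=the_best_llm.
```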
Contributor:
Does this recommend reporting only one finish reason even when there are multiple response candidates returned by the model?

lmolkova (Contributor, Author), May 8, 2024:

The individual finish reasons are reported in individual events for each choice.

I decided to change finish reason to a single string since:

  • It seems only(?) OpenAI returns more than one choice and it's not quite a popular scenario (most samples are based on the assumption that n=1)
  • I could not really simulate a case when different finish reasons would be returned

I.e. in most cases there will be one choice or one reason on the span.
What if it's not enough?

The alternatives are:

  1. remove finish reason from the span altogether, but then it won't even be available on metrics
  2. keep an array attribute. It won't be usable on metrics and is hard to use in custom queries
  3. join all finish reasons to a single string and keep a comma-separated list sorted by the choice index.

I think options 1 and 2 are bad since they don't allow using the finish reason on metrics.

Option 3 seems reasonable though.
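
A rough sketch of option 3, assuming an OpenAI-style choices array (the response shape and resulting attribute value are illustrative):

```python
# Hypothetical response with multiple choices and different finish reasons.
choices = [
    {"index": 1, "finish_reason": "length"},
    {"index": 0, "finish_reason": "stop"},
]

# Join finish reasons into one comma-separated string, sorted by choice index,
# so a single span attribute stays usable on metrics and in custom queries.
finish_reasons = ",".join(
    choice["finish_reason"]
    for choice in sorted(choices, key=lambda c: c["index"])
)
print(finish_reasons)  # "stop,length"
```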

<!-- endsemconv -->

## Events

In the lifetime of an LLM span, an event for prompts sent and completions received MAY be created, depending on the configuration of the instrumentation.
In the lifetime of a GenAI span, an event for each message the application sends to GenAI and receives in response from it MAY be created, depending on the configuration of the instrumentation.
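
As a rough illustration of the per-message approach (span events are used here for brevity, whereas the proposal records log-based events, and the event and attribute names below are only indicative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical conversation sent to the model.
messages = [
    {"role": "system", "content": "You're a friendly bot that answers questions about OpenTelemetry."},
    {"role": "user", "content": "What is a span?"},
]

with tracer.start_as_current_span("chat gpt-4") as span:
    span.set_attribute("gen_ai.system", "openai")
    # One event per message the application sends to the model...
    for message in messages:
        span.add_event(
            f"gen_ai.{message['role']}.message",
            attributes={"gen_ai.system": "openai", "content": message["content"]},
        )
    # ...and one event per choice received back (content is illustrative).
    span.add_event(
        "gen_ai.choice",
        attributes={"gen_ai.system": "openai", "content": "A span represents a single unit of work..."},
    )
```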
Contributor:
Have we considered the debugging experience when an app leverages long running chats? If I have a chat that already has 100 iterations between the user and the AI, from that point of time, each invocation to the model will have 100+ events attached to the span.

lmolkova (Contributor, Author):

I think we chatted about it offline, sharing thoughts here for visibility:
100 iterations as a single huge event will be quite problematic:

  • it will be hard to use when debugging
  • it's much more likely to hit the backend limit on event payload size

Having 100 events (not attached to spans, as log-based events are not) is not a great experience either, but

  • it's easier to read 100 independent messages than 1 huge message
  • each event will be smaller and less likely to hit limits
  • we can give events different severities, allowing users to enable e.g. only responses, but not system or tool messages
  • we can support configuring a limit to record only the last X events (both knobs are sketched below)
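
A rough sketch of what those two knobs could look like inside an instrumentation; the severity mapping, event names, and defaults are hypothetical, not part of this proposal:

```python
# Hypothetical severity per event type: verbose system/tool messages vs. user/choice events.
EVENT_SEVERITY = {
    "gen_ai.system.message": "DEBUG",
    "gen_ai.tool.message": "DEBUG",
    "gen_ai.user.message": "INFO",
    "gen_ai.assistant.message": "INFO",
    "gen_ai.choice": "INFO",
}
_SEVERITY_ORDER = {"DEBUG": 0, "INFO": 1}

def select_events(events, min_severity="INFO", last_n=20):
    """Drop events below min_severity, then keep only the last N remaining ones."""
    kept = [
        event for event in events
        if _SEVERITY_ORDER[EVENT_SEVERITY[event["name"]]] >= _SEVERITY_ORDER[min_severity]
    ]
    return kept[-last_n:]
```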

Contributor:

I think that chat sessions run into the same problem we have today with using OTel to model browser sessions, data processing jobs, etc., where there are meaningful operations that could each be a trace, but the end-user experience is really a collection of many traces and could be unbounded in length.

To that end, I prefer that we model spans (and, well, traces) for some...kind...of....err..."meaningful" group of operations:

  • A single chat request, whether it results in one LLM call or in an entire network of agents or chains making N calls, can be one trace
  • A "user presses a button to produce some output" is a trace of N spans, where N is the number of operations involved to produce that output, be it a complex RAG pipeline, multiple LLM calls, or just a basic pass-through to an LLM to produce a response in text directly

What I think is missing in all of this though is a better correlation ID that ties traces together. This would have to be domain-specific, though, and I don't really know how (or if) we want to consider modeling that right now.

lmolkova (Contributor, Author), May 29, 2024:

In general, I agree - we'll need to add a context identifier (e.g. gen-ai.thread.id) to correlate calls within the same conversation.

For this specific case: the conversation is carried over in each request. So the instrumentation can't do a lot - it can only record messages in the provided prompt (optionally allowing them to be filtered or limited).

It seems to be an API problem though. I might be wrong, but it feels like the OpenAI Assistants API solves this problem and takes care of storing the context (the whole conversation). Given how fast APIs are evolving, we might not need to solve the message duplication problem.

Contributor:

Yeah, I think maybe for now even we could punt on the issue, since we technically do have that kind of causality already with span links.
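
For reference, a minimal sketch of that span-link causality between chat turns; the helper that retrieves the previous turn's span context is hypothetical and application-specific:

```python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer(__name__)

# Span context saved from the previous turn of the conversation
# (how it is stored and retrieved is up to the application).
previous_turn_context = load_previous_turn_span_context()  # hypothetical helper

with tracer.start_as_current_span("chat gpt-4", links=[Link(previous_turn_context)]) as span:
    ...  # make the next GenAI call for this conversation turn
```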


#### `Message` object

The message structure matches one of the messages defined in this document depending on the role:

Comment:

Hi! Shouldn't it only be an assistant message? AFAIK LLMs do not output tool (result) messages, only tool calls, but those are located inside the assistant message.

lmolkova (Contributor, Author):

good catch, thanks!

| Property | Value |
|---------------------|-------------------------------------------------------|
| `gen_ai.system` | `"openai"` |
| Event payload | `{"content":"You're a friendly bot that answers questions about OpenTelemetry."}` |
Contributor:

A thought I had here is how we might go about giving these messages more structure over time. I recognize that a lot of this really just comes down to the system you're using, as different model APIs will return more or less structure than others, different keys for values, etc. I don't know how we might address this over time but I'd love to get into a world where the experience for most people involves nice, system-readable, structured information.

lmolkova (Contributor, Author):

This is an example that shows the payload as JSON - I did add a bare-minimum structure in the event definitions above, and I agree that we need to keep extending it.

Projects status: In Discussions

Development: Successfully merging this pull request may close these issues: LLM: define common/system-specific event structure

6 participants