
Introduces per-message structured GenAI events #980

Open · wants to merge 7 commits into main

Conversation

lmolkova (Contributor)

Fixes #834

Changes

Defines gen-ai specific events along with their structure.
Related to #954, #829

Merge requirement checklist

| Attribute | Type | Description | Examples | Requirement Level | Stability |
|---|---|---|---|---|---|
| [`gen_ai.usage.completion_tokens`](../attributes-registry/llm.md) | int | The number of tokens used in the LLM response (completion). | `180` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| [`gen_ai.usage.prompt_tokens`](../attributes-registry/llm.md) | int | The number of tokens used in the LLM prompt. | `100` | `Recommended` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** The name of the LLM a request is being made to. If the LLM is supplied by a vendor, then the value must be the exact name of the model requested. If the LLM is a fine-tuned custom model, the value should have a more specific name than the base model that's been fine-tuned.

**[2]:** If not using a vendor-supplied model, provide a custom friendly name, such as a name of the company or project. If the instrumentation reports any attributes specific to a custom model, the value provided in the `gen_ai.system` SHOULD match the custom attribute namespace segment. For example, if `gen_ai.system` is set to `the_best_llm`, custom attributes should be added in the `gen_ai.the_best_llm.*` namespace. If none of the above options apply, the instrumentation should set `_OTHER`.

**[3]:** If available. The name of the LLM serving a response. If the LLM is supplied by a vendor, then the value must be the exact name of the model actually used. If the LLM is a fine-tuned custom model, the value should have a more specific name than the base model that's been fine-tuned.
**[3]:** If there is more than one finish reason in the response, the last one should be reported.
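
For illustration only, here is a minimal sketch of how an instrumentation might record the usage attributes above with the OpenTelemetry Python API; the span name, model, and values are assumptions, not part of these definitions:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical client span around a single GenAI call.
with tracer.start_as_current_span("chat gpt-4") as span:
    span.set_attribute("gen_ai.system", "openai")
    # Token usage taken from the model response; the numbers are illustrative.
    span.set_attribute("gen_ai.usage.prompt_tokens", 100)
    span.set_attribute("gen_ai.usage.completion_tokens", 180)
    # Per note [2], attributes specific to a custom system would go under a
    # matching namespace, e.g. gen_ai.the_best_llm.* when gen_ai.system=the_best_llm.
```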
Contributor:
Does this recommend reporting only one finish reason even when there are multiple response candidates returned by the model?

lmolkova (Contributor, Author), May 8, 2024:

The individual finish reasons are reported in individual events for each choice.

I decided to change finish reason to a single string since:

  • It seems only(?) OpenAI returns more than one choice and it's not quite a popular scenario (most samples are based on the assumption that n=1)
  • I could not really simulate a case when different finish reasons would be returned

I.e. in most cases there will be one choice or one reason on the span.
What if it's not enough?

The alternatives are:

  1. remove finish reason from the span altogether, but then it won't even be available on metrics
  2. keep an array attribute. It won't be usable on metrics and is hard to use in custom queries
  3. join all finish reasons to a single string and keep a comma-separated list sorted by the choice index.

I think options 1 and 2 are bad since they don't allow using the finish reason on metrics.

Option 3 seems reasonable though.
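
A rough sketch of option 3, assuming an OpenAI-style choices array (the response shape and resulting attribute value are illustrative):

```python
# Hypothetical response with multiple choices and different finish reasons.
choices = [
    {"index": 1, "finish_reason": "length"},
    {"index": 0, "finish_reason": "stop"},
]

# Join finish reasons into one comma-separated string, sorted by choice index,
# so a single span attribute stays usable on metrics and in custom queries.
finish_reasons = ",".join(
    choice["finish_reason"]
    for choice in sorted(choices, key=lambda c: c["index"])
)
print(finish_reasons)  # "stop,length"
```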

<!-- endsemconv -->

## Events

In the lifetime of an LLM span, an event for prompts sent and completions received MAY be created, depending on the configuration of the instrumentation.
In the lifetime of a GenAI span, an event for each message the application sends to GenAI and receives in response from it MAY be created, depending on the configuration of the instrumentation.
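
As a rough illustration of the per-message approach (span events are used here for brevity, whereas the proposal records log-based events, and the event and attribute names below are only indicative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Hypothetical conversation sent to the model.
messages = [
    {"role": "system", "content": "You're a friendly bot that answers questions about OpenTelemetry."},
    {"role": "user", "content": "What is a span?"},
]

with tracer.start_as_current_span("chat gpt-4") as span:
    span.set_attribute("gen_ai.system", "openai")
    # One event per message the application sends to the model...
    for message in messages:
        span.add_event(
            f"gen_ai.{message['role']}.message",
            attributes={"gen_ai.system": "openai", "content": message["content"]},
        )
    # ...and one event per choice received back (content is illustrative).
    span.add_event(
        "gen_ai.choice",
        attributes={"gen_ai.system": "openai", "content": "A span represents a single unit of work..."},
    )
```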
Contributor:
Have we considered the debugging experience when an app leverages long running chats? If I have a chat that already has 100 iterations between the user and the AI, from that point of time, each invocation to the model will have 100+ events attached to the span.

lmolkova (Contributor, Author):

I think we chatted about it offline, sharing thoughts here for visibility:
100 iterations as a single huge event will be quite problematic:

  • it will be hard to use when debugging
  • it's much more likely to hit the backend limit on event payload size

Having 100 events (not attached to spans, as log-based events are not) is not a great experience either, but

  • it's easier to read 100 independent messages than 1 huge message
  • each event will be smaller and less likely to hit limits
  • we can give events different severities, allowing users to enable e.g. only responses, but not system or tool messages
  • we can support configuring a limit to record only the last X events (both knobs are sketched below)
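
A rough sketch of what those two knobs could look like inside an instrumentation; the severity mapping, event names, and defaults are hypothetical, not part of this proposal:

```python
# Hypothetical severity per event type: verbose system/tool messages vs. user/choice events.
EVENT_SEVERITY = {
    "gen_ai.system.message": "DEBUG",
    "gen_ai.tool.message": "DEBUG",
    "gen_ai.user.message": "INFO",
    "gen_ai.assistant.message": "INFO",
    "gen_ai.choice": "INFO",
}
_SEVERITY_ORDER = {"DEBUG": 0, "INFO": 1}

def select_events(events, min_severity="INFO", last_n=20):
    """Drop events below min_severity, then keep only the last N remaining ones."""
    kept = [
        event for event in events
        if _SEVERITY_ORDER[EVENT_SEVERITY[event["name"]]] >= _SEVERITY_ORDER[min_severity]
    ]
    return kept[-last_n:]
```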

Contributor:

I think that chat sessions run into the same problem we have today with using OTel to model browser sessions, data processing jobs, etc., where there are meaningful operations that could each be a trace, but the end-user experience is really a collection of many traces and could be unbounded in length.

To that end, I prefer that we model spans (and, well, traces) for some...kind...of....err..."meaningful" group of operations:

  • A single chat request, whether it results in one LLM call or in an entire network of agents or chains making N calls, can be one trace
  • A "user presses a button to produce some output" is a trace of N spans, where N is the number of operations involved to produce that output, be it a complex RAG pipeline, multiple LLM calls, or just a basic pass-through to an LLM to produce a response in text directly

What I think is missing in all of this though is a better correlation ID that ties traces together. This would have to be domain-specific, though, and I don't really know how (or if) we want to consider modeling that right now.

lmolkova (Contributor, Author), May 29, 2024:

In general, I agree - we'll need to add a context identifier (e.g. gen-ai.thread.id) to correlate calls within the same conversation.

For this specific case: the conversation is carried over in each request. So the instrumentation can't do a lot - it can only record messages in the provided prompt (optionally allowing them to be filtered or limited).

It seems to be an API problem though. I might be wrong, but it feels like the OpenAI Assistants API solves this problem and takes care of storing the context (the whole conversation). Given how fast APIs are evolving, we might not need to solve the message duplication problem.

Contributor:

Yeah, I think maybe for now even we could punt on the issue, since we technically do have that kind of causality already with span links.
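
For reference, a minimal sketch of that span-link causality between chat turns; the helper that retrieves the previous turn's span context is hypothetical and application-specific:

```python
from opentelemetry import trace
from opentelemetry.trace import Link

tracer = trace.get_tracer(__name__)

# Span context saved from the previous turn of the conversation
# (how it is stored and retrieved is up to the application).
previous_turn_context = load_previous_turn_span_context()  # hypothetical helper

with tracer.start_as_current_span("chat gpt-4", links=[Link(previous_turn_context)]) as span:
    ...  # make the next GenAI call for this conversation turn
```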


#### `Message` object

The message structure matches one of the messages defined in this document depending on the role:

Comment:

Hi! Shouldn't it only be an assistant message? AFAIK LLMs do not output tool (result) messages, only tool calls, but those are located inside the assistant message.

lmolkova (Contributor, Author):

good catch, thanks!

| Property | Value |
|---------------------|-------------------------------------------------------|
| `gen_ai.system` | `"openai"` |
| Event payload | `{"content":"You're a friendly bot that answers questions about OpenTelemetry."}` |
Contributor:

A thought I had here is how we might go about giving these messages more structure over time. I recognize that a lot of this really just comes down to the system you're using, as different model APIs will return more or less structure than others, different keys for values, etc. I don't know how we might address this over time but I'd love to get into a world where the experience for most people involves nice, system-readable, structured information.

lmolkova (Contributor, Author):

This is an example that shows the payload as JSON - I did add a bare-minimum structure in the event definitions above, and I agree that we need to keep extending it.

Projects status: In Discussions

Development: Successfully merging this pull request may close these issues: LLM: define common/system-specific event structure

6 participants