Capture GenAI prompts and completions as events or attributes #2010
Proposal: We don't capture prompt/completion contents by default - that's already the case, so we don't need to pick a new default. If the user explicitly enables contents, let's stamp them on spans as attributes - let's define the format for them and put them as JSON strings (see the sketch below). We'll still need to address the large/sensitive-content-by-reference problem in the future:
When that comes along, we'll provide a new way to opt into it, which might replace or coexist with attributes. Stamping them as attributes now would give us a simple solution for existing tooling and less mature applications, and would not block us from making progress on the proper solution for the large-content problem.
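A minimal sketch of what "JSON strings as span attributes" could look like with the OpenTelemetry Python API. The attribute names (`gen_ai.input.messages`, `gen_ai.output.messages`) and the opt-in environment variable are illustrative assumptions, not settled conventions; content is only recorded when explicitly enabled, as the proposal suggests:

```python
import json
import os

from opentelemetry import trace

tracer = trace.get_tracer("example-genai-instrumentation")

# Hypothetical opt-in flag; content is never captured unless the user enables it.
CAPTURE_CONTENT = os.environ.get("EXAMPLE_GENAI_CAPTURE_CONTENT", "false").lower() == "true"


def chat(call_model, messages):
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        if CAPTURE_CONTENT:
            # Prompts recorded as a single JSON-encoded span attribute.
            span.set_attribute("gen_ai.input.messages", json.dumps(messages))
        completion = call_model(messages)  # placeholder for the real client call
        if CAPTURE_CONTENT:
            span.set_attribute("gen_ai.output.messages", json.dumps(completion))
        return completion
```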
So happy to see this proposal!
Another concern is the long-term memory cost for in-process telemetry tools.
Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural if the evaluation result could be easily linked to a span or trace, because that's what a complete GenAI session looks like in the real world.
Yes! In fact, this is what I think is ideal. But I believe we still have a lot of work to do before that day arrives:
Another way is to store a preview of the prompts/completions, say the first 1000 tokens. That would make troubleshooting easier for the user; at least they would know what the prompts/completions are about (see the sketch below).
Another way to solve this is to keep the original data and use the OTel Collector to remove the sensitive content if the user wants to.
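A rough sketch of the preview idea above. It uses a crude character-based cutoff rather than real tokenization (a tokenizer could be substituted), and the limit and attribute name are illustrative assumptions:

```python
import json

# Illustrative limit; a real implementation would count tokens, not characters.
PREVIEW_LIMIT_CHARS = 4000


def preview(text, limit=PREVIEW_LIMIT_CHARS):
    """Return a truncated preview of the content, marking it as truncated."""
    if len(text) <= limit:
        return text
    return text[:limit] + " ...[truncated]"


def record_prompt_preview(span, messages):
    # Store only a preview of each message's content on the span.
    truncated = [{**m, "content": preview(m.get("content", ""))} for m in messages]
    span.set_attribute("gen_ai.input.messages", json.dumps(truncated))
```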
I like this proposal! Coming from my experience with Spring AI, I see the value in having prompts and completions contextually available as part of a span (it's also the current behavior in the framework). It's a bit unfortunate span events have been deprecated, as that would have been my primary suggestion instead of span attributes.
I agree this is a really nice idea, but it is very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?
👍 I do want to add a few concerns I don't think we've discussed yet
Thank you for being open to revisiting this decision. The initial experiment with log events was indeed both unpopular and expensive. Span attributes are the way to meet the ecosystem where it is, and allow us to focus on the problems of GenAI instead of obscure infrastructure. It also allows not just the existing vendor ecosystem, but also systems like Jaeger, to be first-class spec implementations again. More details below, since not everyone has had as long a history with the prior state as I have. I hope it helps:

The High Cost of Log Events
The current approach to events has demanded significant effort with limited payoff. Elastic alone has invested close to, or possibly more than, a person-year on this topic. This effort spanned:
UX made more difficult
I've personally worked with projects like otel-tui and Zipkin to add log support specifically for this spec. The experience was more navigation than before, with no benefit. Since OTel only has a few GenAI instrumentations, you end up relying on third parties like langtrace, openinference, or openllmetry to fill in the gaps. Most use span attributes, so the full trace gets very confusing where some parts are in attributes and others need clicking or typing around to figure out which logs are attached to what.

Focus imperative
I'm not alone in needing a couple of hours a day just to keep up with GenAI ecosystem change. We have to make decisions that keep the focus on what's important. A deeply troubled technical choice hurts our ability to stay relevant, as these problems steal time from what is important. This is a primary reason so few vendors adopted the log events. In personal comments to me, many said they simply cannot afford to redo their architecture and UX just to satisfy the log events. It would come at the cost of time spent in service to customers, so it just didn't happen.

We have options, but they are far fewer with log events
Since this started, we looked at many ways to get things to work. While most didn't pan out and/or were theoretical (a collector processor can in theory do anything), we have a lot of options if we flip the tables back as suggested:
We don't have to boil the spec ocean in this decision
This decision is about chat completions and similar APIs such as responses. It does not have to be the same decision for an as-yet unspecified semantic mapping for realtime APIs. We shouldn't hold this choice hostage to unexplored areas which would vary wildly per vendor. Chat completions is a high-leverage API, and many inference platforms support it the same way, by emulating OpenAI. Let's not get distracted by potential other APIs which might not fit.

Conclusion
In summary, the experiment with events, especially logs, taught us valuable lessons but proved too costly and unpopular to sustain. By focusing on span attributes, we can reduce complexity, improve UX, and align with the ecosystem's strengths, paving the way for a spec the community will embrace. I'm excited to see this revisited and look forward to refining OTel together, as a group, not just those who have log ingress like Elastic.
Just popping in here to say that I think this is the right proposal. Whether it's the ideal way or not, most tools for evals (and arguably production monitoring of AI behavior) treat traces as the source of truth instead of connective tissue between other things.
If storing large and sensitive content in telemetry is a problem (it is), we'll have to solve it. Companies that work with enterprise customers would need to find a solution to this regardless of the signal that large and sensitive data is recorded on.
Our current events don't work for the realtime API and don't even attempt to cover multi-modal content. If we need to record point-in-time telemetry, events would be the right choice.
OTel assumes you always create client- and server-side spans for each network call. We don't document how to model GenAI proxy server spans and events, and I don't believe they should repeat the client ones anyway. The TL;DR:
I've been thinking about these two extreme cases, and a spectrum of options between them:
Given that we're not dealing with real events (it's not a point-in-time signal, nor one independent from spans), it's weird to make the local and early-days development experience harder - installing extra components, applying extra config. So I consider prompts and completions (and also tool definitions) to be verbose attributes.
Let's also consider the different verbosity profiles GenAI data could have (in ascending order):
We can provide two config options:
It should be possible to configure them independently (e.g. no full content, but an event per chunk) - see the sketch below.
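A sketch of how two independently configurable knobs could be read by an instrumentation. The environment variable names are hypothetical, not an agreed convention; the point is only that content capture and per-chunk events are toggled separately, matching the example above:

```python
import os
from dataclasses import dataclass


@dataclass
class GenAICaptureConfig:
    capture_content: bool   # record full prompts/completions as span attributes
    event_per_chunk: bool   # emit per-chunk telemetry for streaming responses


def load_config():
    # Hypothetical variable names, used only to illustrate independent toggles.
    return GenAICaptureConfig(
        capture_content=os.environ.get("EXAMPLE_GENAI_CAPTURE_CONTENT", "false").lower() == "true",
        event_per_chunk=os.environ.get("EXAMPLE_GENAI_EVENT_PER_CHUNK", "false").lower() == "true",
    )


# EXAMPLE_GENAI_CAPTURE_CONTENT=false EXAMPLE_GENAI_EVENT_PER_CHUNK=true
# gives the "no full content, but an event per chunk" combination.
config = load_config()
```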
The GenAI SIG has been discussing how to capture prompts and completions for a while, and there are several issues that are blocked on this discussion (#1913, #1883, #1556).
What we have today in OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation to use events was to:
It turns out that:
So, the GenAI SIG is re-litigating this decision, taking into account backends' feedback and other relevant issues: #1621, #1912, open-telemetry/opentelemetry-specification#4414
The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.
How it can be useful without a span:
To be useful without a span, events should probably duplicate some of the span attributes - endpoint, model used, input parameters, etc. - which is not the case today.
Are prompts/completions point-in-time telemetry?
Arguably, from what we've seen so far, GenAI prompts and completions are used along with spans, and there is no great use case for standalone events.
Another fundamental question is how, and whether, to capture unbounded data (text, video, audio, etc.) in telemetry.
It's problematic because of:
Imagine we had a solution that allowed us to store chat history somewhere and add a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely have added this link as an attribute on the span.
Arguably, the long-term solution to this problem is having this data stored separately from telemetry but recorded by reference (e.g. a URL on the span that points to the chat history).
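A rough sketch of what content-by-reference could look like, assuming a hypothetical `store_chat_history` helper that persists the conversation outside of telemetry and returns a deep link; the storage URL and the `gen_ai.conversation.url` attribute name are illustrative assumptions, not part of the conventions:

```python
import uuid

from opentelemetry import trace

tracer = trace.get_tracer("example-genai-instrumentation")


def store_chat_history(messages):
    """Hypothetical helper: persist the conversation outside of telemetry
    (e.g. in object storage) and return a deep link to it."""
    conversation_id = str(uuid.uuid4())
    # upload(messages, key=conversation_id)  # placeholder for the actual upload
    return f"https://chat-store.example.com/conversations/{conversation_id}"


def chat_with_reference(call_model, messages):
    with tracer.start_as_current_span("chat") as span:
        # Only the reference goes on the span; the content stays out of telemetry.
        span.set_attribute("gen_ai.conversation.url", store_chat_history(messages))
        return call_model(messages)  # placeholder for the real client call
```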
TL;DR: