
Capture GenAI prompts and completions as events or attributes #2010

Open
lmolkova opened this issue Mar 19, 2025 · 10 comments

Comments

@lmolkova
Contributor

lmolkova commented Mar 19, 2025

The GenAI SIG has been discussing how to capture prompts and completions for a while, and several issues are blocked on this discussion (#1913, #1883, #1556).

What we have today in OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation to use events was to

  • overcome size limits on attribute values by using event body
  • use a signal that supports structured body and attributes
  • have a clear 1:1 relationship between event name and structure (as opposed to polymorphic types or arrays of heterogeneous objects)
  • make it possible and easy to consume individual events and prompts/completions without spans
  • have verbosity controls

It turns out that:

  • after ~9 months, events have still not been adopted by GenAI-focused tracing tools or their external instrumentation libraries, including Arize, Traceloop, and Langtrace - all of these providers use span attributes to capture prompts and completions.
  • these backends consume prompts and completions along with spans and don't envision separating them - they store and visualize this data together

So, the GenAI SIG is re-litigating this decision, taking into account backends' feedback and other relevant issues: #1621, #1912, open-telemetry/opentelemetry-specification#4414


The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.

How it can be useful without a span:

To be useful without a span, events should probably duplicate some of the span attributes - endpoint, model used, input parameters, etc. - which is not the case today

Are prompts/completions point-in-time telemetry?

Arguably, from what we've seen so far, GenAI prompts and completions are consumed along with spans, and there is no compelling use case for standalone events


Another fundamental question is whether and how to capture unbounded data (text, video, audio, etc.) on telemetry

It's problematic because of:

  • privacy - prompts can contain health concerns, SSNs, addresses, names, etc. Apps that must remain compliant with various regulators cannot share this data with a broad audience of DevOps humans. The data should be accessible for evaluations and audit, but access should be restricted
  • size - non-GenAI-specific backends are not optimized for this, and it's expensive to store such data in hot storage.

Imagine we had a solution that stored chat history somewhere and added a deep link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely add the link as an attribute on the span.

Arguably, the long-term solution to this problem is to store this data separately from telemetry, but record it by reference (e.g. a URL on the span that points to the chat history)
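A minimal sketch of this by-reference idea, assuming a hypothetical `upload_chat_history` helper and an illustrative attribute name (neither exists in semconv today; a dict stands in for span attributes):

```python
import json
import uuid

def upload_chat_history(messages):
    """Hypothetical helper: persist the conversation to separate,
    access-controlled storage and return a deep link to it. A real
    implementation might write to a blob store with its own ACLs
    and retention policy."""
    conversation_id = uuid.uuid4().hex
    _payload = json.dumps(messages)  # what would actually be uploaded
    return f"https://chat-store.example.com/conversations/{conversation_id}"

def record_by_reference(messages):
    """Record content by reference: upload it, then stamp only the link.
    The dict stands in for span attributes; with the OTel SDK this would
    be span.set_attribute(...) on the active span."""
    attributes = {}
    # Attribute name is illustrative; semconv has not defined one yet.
    attributes["gen_ai.input.messages.ref"] = upload_chat_history(messages)
    return attributes

attrs = record_by_reference([{"role": "user", "content": "hi"}])
```

The sensitive content never enters the telemetry pipeline; only a link that the backend can resolve (subject to access control) does.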


TL;DR:

  • the current approach doesn't work; we're blocked and need to find a path forward.
  • GenAI-focused backends, inner-loop scenarios, and non-production apps would benefit from having prompts/completions stamped directly on spans
  • general-purpose observability backends and high-scale applications would have a problem with sensitive/large/binary data coming from end users on telemetry anyway
@lmolkova
Contributor Author

lmolkova commented Mar 19, 2025

Proposal

We don't capture prompts/completion contents by default - that's already the case, so we don't need to pick a new default.

If the user explicitly enables contents, let's stamp them on spans as attributes - let's define the format for them and record them as JSON strings. Maybe we need gen_ai.inputs, gen_ai.outputs, gen_ai.tool.calls, gen_ai.tool.definitions, etc. - we'll figure it out.
(Related: open-telemetry/opentelemetry-specification#4446)
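A hedged sketch of what this by-value approach could look like; `gen_ai.inputs` / `gen_ai.outputs` are the placeholder names from the paragraph above, not finalized semconv:

```python
import json

def content_attributes(input_messages, output_messages):
    """Serialize prompt and completion into JSON strings so they fit
    flat span attributes. Attribute names are placeholders from the
    proposal, not finalized semconv."""
    return {
        "gen_ai.inputs": json.dumps(input_messages),
        "gen_ai.outputs": json.dumps(output_messages),
    }

attrs = content_attributes(
    [{"role": "user", "content": "What is OpenTelemetry?"}],
    [{"role": "assistant", "content": "An observability framework."}],
)
# With the OTel SDK, these would be stamped on the LLM span:
#   for key, value in attrs.items():
#       span.set_attribute(key, value)
```

Keeping the values as JSON strings sidesteps the lack of nested-structure support in span attributes while still giving backends a single, parseable format.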

We'll still need to address the large|sensitive-content-by-reference in the future:

  • it will involve a separate set of attributes to record references
  • the content will likely be stored separately from spans or logs

When this comes along, we'll provide a new way to opt into the new solution, which might replace or coexist with attributes.

Stamping contents as attributes now would give us a simple solution for existing tooling and some less mature applications, and would not block progress on a proper solution to the large-content problem.

@Cirilla-zmh
Member

Cirilla-zmh commented Mar 20, 2025

So happy to see this proposal!

Streaming chunks, if captured at all, would have timestamps (#1964)

Another concern is the long-term memory cost for in-process telemetry tools.

Arguably, from what we've seen so far, GenAI prompts and completion are used along with the spans and there is no great use-case for standalone events.

Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural for the evaluation result to be easily linked to a span or trace, because that reflects the full picture of a complete GenAI session.

Imagine, we had a solution that allowed us to store chat history somewhere and added a deep-link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely have added this link as attribute on the span.

Yes! In fact, this is what I think is ideal. But I believe we still have a lot of work to do before that day arrives:

  • We still need to define a format for the data to prevent it from becoming heterogeneous.
  • We may also need to provide some reference implementations/best practices showing how observability backends/evaluators can consume this data properly.

@ralf0131

Proposal

We don't capture prompts/completion contents by default - that's already the case, so we don't need to pick a new default.

Another way is to store a preview of the prompts/completions - say, the first 1000 tokens - which would make troubleshooting easier for the user; at least they would know what the prompts/completions are about.
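A sketch of that truncation idea, using characters as a rough stand-in for tokens (the limit and the truncation marker are illustrative):

```python
def content_preview(text, max_chars=1000):
    """Keep only a bounded prefix of the prompt/completion so operators
    can tell what it was about without shipping the full (possibly
    sensitive or huge) content."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[truncated]"

preview = content_preview("tell me about " + "x" * 5000)
```

A real token-based version would need the model's tokenizer; character truncation is a cheap approximation that still bounds attribute size.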

If the user explicitly enables contents, let's stamp them on spans as attributes - let's define the format for them and record them as JSON strings. Maybe we need gen_ai.inputs, gen_ai.outputs, gen_ai.tool.calls, gen_ai.tool.definitions, etc. - we'll figure it out. (Related: open-telemetry/opentelemetry-specification#4446)

We'll still need to address the large|sensitive-content-by-reference in the future:

  • it will involve a separate set of attributes to record references
  • the content will likely be stored separately from spans or logs

Another way to solve this is to keep the original data and use the OTel Collector to strip sensitive content if the user wants to.
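For instance (a config sketch, assuming the contrib `transform` processor with OTTL; the attribute name pattern is illustrative and would need to match whatever names semconv settles on):

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Drop content-bearing attributes before export so sensitive
          # prompt/completion text never leaves the pipeline.
          - delete_matching_keys(attributes, "gen_ai\\.(inputs|outputs)")
```

This keeps full content available to in-cluster tooling upstream of the processor while redacting it from what the backend stores.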

@ThomasVitale

I like this proposal! Coming from the experience with Spring AI, I see the value in having prompts and completions contextually available as part of a span (it's also the current behavior in the framework). It's a bit unfortunate that span events have been deprecated, as that would have been my primary suggestion instead of span attributes.

@aabmass
Member

aabmass commented Mar 20, 2025

Arguably, the long term solution to this problem is having this data stored separately from telemetry, but recorded by reference (e.g. URL on span that points to the chat history)

I agree this is a really nice idea, but it's very forward-thinking with respect to data portability. Do we have any good signal that this is something that will eventually happen?

Agree in part. Standalone events do have some use cases, such as evaluations that only need the input/output text. But it would be natural for the evaluation result to be easily linked to a span or trace, because that reflects the full picture of a complete GenAI session.

👍

I do want to add a few concerns I don't think we've discussed yet:

  1. Bidi streaming completions. Both the OpenAI Realtime API and the Vertex AI Live API support this, where the user and model communicate over an ordered channel like a WebSocket. It's not clear how to model this as spans, but events seem like a more obvious fit. I imagine this will eventually be in scope.
  2. Server-side prompt/response capturing. A proxy (like Traceloop Hub and others) or the inference engine itself (like self-hosted vLLM or an LLM vendor) could emit prompts/responses. Since you can't modify remote spans, they would need to create server-side spans as well to stamp the data onto. With events, by contrast, you can attach a log to any span ID. Maybe the "by reference" approach would solve this.

@codefromthecrypt
Contributor

codefromthecrypt commented Mar 21, 2025

Thank you for being open to revisiting this decision. The initial experiment with log events was indeed both unpopular and expensive. Span attributes are the way to meet the ecosystem where it is and let us focus on GenAI problems instead of obscure infrastructure. They also allow not just the existing vendor ecosystem, but also systems like Jaeger, to be first-class spec implementations again.

More details below, since not everyone has had as long a history with the prior state as I have. I hope it helps:


The High Cost of Log Events

The current approach to events has demanded significant effort with limited payoff. Elastic alone has invested close to - or possibly more than - a person-year on this topic. This effort spanned:

  • Debates: high-volume discussions, made longer because they concerned an unimplemented feature of OTel
  • Community struggles: angst created by committing to a spec change that no eval company could adopt
  • Implementation challenges: implementing anyway, then hitting version lock-up or a lack of per-language feature clarity
  • Infrastructure challenges: discovering that technology like OTTL can't bridge log events back to the span
  • Portability challenges: knowingly preventing systems like Jaeger from being full-featured due to required log support

UX made more difficult

I’ve personally worked with projects like otel-tui and Zipkin to add log support specifically for this spec. The experience involved more navigating around than before, with no benefit. Since OTel has only a few GenAI instrumentations, you end up relying on third parties like Langtrace, OpenInference, or OpenLLMetry to fill in the gaps. Most use span attributes, so the full trace gets very confusing: some parts are in attributes, while others require clicking or typing around to figure out which logs are attached to what.

Focus imperative

I'm not alone in needing a couple of hours a day just to keep up with GenAI ecosystem change. We have to make decisions that protect that focus. A deeply troubled technical choice hurts our ability to stay relevant, as these problems steal time from the work that matters. This is a primary reason so few vendors adopted the log events. In personal comments to me, many said they simply cannot afford to redo their architecture and UX just to satisfy the log events; it would come at the cost of time spent serving customers, so it just didn't happen.

We have options, but far fewer with log events

Since this started, we have looked at many ways to make things work. While most didn't pan out and/or were theoretical (a collector processor can in theory do anything), we have a lot of options if we flip the tables back as suggested:

  • language SDKs can provide hooks to control data policy and mapping (e.g. to span events)
  • OTTL can do the same when everything is in the same span
  • There's a blob-uploader API in progress, which could optionally be used by sites with size thresholds where links do more good than harm.

We don't have to boil the spec ocean in this decision

This decision is about chat completions and similar APIs, such as responses. It does not have to be the same decision for an as-yet-unspecified semantic mapping for realtime APIs. We shouldn't hold this choice hostage to unexplored areas, which would vary wildly per vendor. Chat completions is a high-leverage API, and many inference platforms support it the same way, by emulating OpenAI. Let's not get distracted by other potential APIs that might not fit.

Conclusion

In summary, the experiment with events, especially logs, taught us valuable lessons but proved too costly and unpopular to sustain. By focusing on span attributes, we can reduce complexity, improve UX, and align with the ecosystem’s strengths—paving the way for a spec the community will embrace. I’m excited to see this revisited and look forward to refining OTel together, as a group, not just those who have log ingress like Elastic.

@cartermp
Contributor

cartermp commented Mar 21, 2025

Just popping in here to say that I think this is the right proposal. Whether it's the ideal way or not, most tools for evals (and arguably production monitoring of AI behavior) treat traces as the source of truth instead of connective tissue between other things.

@lmolkova
Contributor Author

lmolkova commented Mar 22, 2025

@aabmass

I agree this is really nice idea, but is very forward thinking with data portability. Do we have any good signal that this is something that will eventually happen?

If storing large and sensitive content on telemetry is a problem (it is), we'll have to solve it. Companies that work with enterprise customers will need to find a solution regardless of which signal large and sensitive data is recorded on.

I do want to add a few concerns I don't think we've discussed yet

  1. Bidi streaming completions. Both OpenAI Realtime API and VertexAI Live API support this, where the user and model communicate over an ordered channel like a WebSocket. It's not clear how to model this as spans but events seem like a more obvious fit. I imagine this will eventually be in scope.

Our current events don't work for realtime APIs and don't even attempt to cover multi-modal content. If we need to record point-in-time telemetry, events would be the right choice.
So streaming chunks are likely good candidates for events. But prompts and buffered completions are not point-in-time - they don't happen at a specific time beyond what the span start/end timestamps already capture.

  1. Server side prompt/response capturing. A proxy (like TraceLoop Hub and others) or the inference engine itself (like self hosted vLLM or an LLM vendor) could emit prompt/responses. Since you can't modify remote spans, they would need to create server side spans as well to stamp the data onto. Vs with events, you can attach a log to any span ID. Maybe the "by reference" thing would solve it.

OTel assumes you always create client- and server-side spans for each network call. We don't document how to model GenAI proxy server spans and events, and I don't believe they should repeat the client ones anyway.

The TL;DR:
This proposal affects the current events that describe prompts and fully buffered completions. If we came up with criteria for what should be reported as an event, we'd say it should have a meaningful timestamp and be useful without a span. Neither seems true for the events we have today.
It doesn't mean we should always use span attributes for all future GenAI content - absolutely not.

@lmolkova
Contributor Author

I've been thinking about these two extreme cases, and a spectrum of options between them:

  1. Inner-loop/local experiences, non-production applications, or applications that can meet their production observability needs with existing GenAI-focused backends/tooling. Their needs:

    • easy setup, easy to use on any backend and in local observability tools (local Jaeger, otel-tui, Aspire, etc)
    • verbose data, telemetry volume is not a huge concern
    • don't require compliance with regulators

    Prompt and buffered completion content passed as span attributes fits nicely:

    • prompts and full completions don't have a timestamp
    • easy OTel setup
  2. Enterprise applications:

    • need different access permissions for prompts/completions vs regular performance/usage telemetry
      • sensitive data should be annotated and potentially forwarded to a separate storage/tenant
    • telemetry pipelines need to be tuned for long chats and multi-modal content
      • it's not typical to use otel pipelines with large data, need different batching strategies
      • congestion caused by large content may affect other data and the basic monitoring capabilities
    • audit/compliance logs
      • pipelines that can provide necessary delivery guarantees for a subset of events and/or spans

    These apps can tolerate additional configuration - the OTel setup is usually complicated enough.

    Regardless of how prompts and full completions are recorded (events or attributes), they need special handling around privacy and size. Events don't solve this on their own, and GenAI telemetry still needs a lot of special processing to meet enterprise app needs.

    Content stamped by reference on spans, and potentially uploaded via a special channel for such data, would satisfy most of these needs without requiring changes to spans and logs or to backends that don't like large content.
    And of course we can explore allowing content to be passed by value.

Given that we're not dealing with real events (the data is neither point-in-time nor an independent-from-spans signal), it's odd to make the local and early-days development experience harder - installing extra components, applying extra config. So I consider prompts and completions (and also tool definitions) to be verbose attributes.

@lmolkova
Contributor Author

lmolkova commented Mar 22, 2025

Let's also consider the different verbosity profiles GenAI data could have (in ascending order):

  1. [Default] Spans, but no contents
    • Best for performance, the most frugal in terms of telemetry volume, no sensitive data on telemetry
  2. Spans have a reference to the full contents, which is uploaded somewhere accessible to the tooling
    • Traditional telemetry pipelines and volume are not affected. Sensitive data is reported on a different channel.
    • This could be a safe-ish default if we set aside the perf impact
  3. Spans/events contain full (buffered) content, unified across models/providers
    • High span/event volume, sensitive data in the general telemetry stream
  4. Spans/events contain full content that captures the model request and response as-is (useful for audit/compliance logs and record/replay features)
  5. Events with streaming chunks and their content: the maximum level of observability and volume. Note: the basic event envelope may be 100x bigger than the actual content it carries.

We can provide two config options:

  1. choose how to report full content:
    • don't
    • report by value (attributes)
    • report by reference (export/upload via a separate channel)
  2. opt into low-level data:
    • exact model request and response (audit and replay) - it should go on the channel that handles sensitive/large content
    • per-chunk events

It should be possible to configure them independently (e.g. no full content, but an event per chunk).
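In instrumentation code, the two independent knobs might look like this (a sketch; the option names and values are illustrative, not spec):

```python
from enum import Enum

class ContentCaptureMode(Enum):
    """How to report full content; names are illustrative, not spec."""
    NONE = "none"                 # default: spans only, no content
    BY_VALUE = "span_attributes"  # stamp JSON content on span attributes
    BY_REFERENCE = "reference"    # upload content, record a link on the span

class LowLevelData(Enum):
    """Opt-in low-level data; names are illustrative, not spec."""
    RAW_REQUEST_RESPONSE = "raw"  # exact payloads for audit/replay
    PER_CHUNK_EVENTS = "chunks"   # event per streaming chunk

def capture_config(mode, extras=frozenset()):
    """The two options compose independently: e.g. no full content,
    but still an event per streaming chunk."""
    return {"content_mode": mode, "low_level": set(extras)}

cfg = capture_config(ContentCaptureMode.NONE, {LowLevelData.PER_CHUNK_EVENTS})
```

Modeling the knobs as independent settings rather than a single verbosity level keeps combinations like "no full content, but per-chunk events" expressible.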
