fix(llmobs): openai-java payload mapping for responses, tool metadata, and prompt tracking #10644
Benchmarks
Startup
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 61 metrics, 10 unstable metrics.
[Gantt charts: startup time reports for petclinic (sections tracing, appsec, iast, profiling) and insecure-bank (sections tracing, iast), each with global startup overhead and breakdown per module; candidate=1.60.0-SNAPSHOT~d7d4866358, baseline=1.61.0-SNAPSHOT~5580c61ac4]
Load
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 20 metrics, 16 unstable metrics.
[Gantt charts: request duration reports [CI 0.99] for insecure-bank and petclinic; candidate=1.60.0-SNAPSHOT~d7d4866358, baseline=1.61.0-SNAPSHOT~5580c61ac4]
Dacapo
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.
[Gantt charts: execution time [CI 0.99] for tomcat and biojava; candidate=1.60.0-SNAPSHOT~d7d4866358, baseline=1.61.0-SNAPSHOT~5580c61ac4]
5cd257e to cbd6226
…wthTestOpenAiLlmInteractions::test_completion
…teractions::test_chat_completion_tool_call
…d with python openai instrumentation and system-tests
… with variables + chat_template, longest-first overlap handling) and support map-based LLM input serialization (messages + prompt) in LLMObs mapper. Also filter empty instruction messages to match system-test expectations.
…st and return [image] (not empty) when stripped input_image URLs are missing, aligning mixed-input chat_template output with expected behavior.
…output.messages from request params so existing error-span tests pass.
…ol_definitions tags
…JSON argument parsing and remove duplicate manual parsing logic from ResponseDecorator.
Kyle-Verhoog left a comment
LLMObs Team Review
Nice work aligning the Java SDK payloads with the intake schema — this is a big step for system test compliance. A few items to address/clarify below (inline), plus some overall notes:
Test Coverage Notes
What's well-covered: LLMObsSpanMapperTest expansion is great — covers _dd map, nested meta.error, map-based input with prompt/chat_template, tool definitions, tool calls + tool results. The decorator tests verify the new tags (source, integration, error, ddtrace.version).
Gaps to consider:
- Error paths: No test exercises the error-path defaults (`model_name` and empty output set during `withResponseCreateParams` when the HTTP call fails). A test where the response errors out, verifying that the span still has `model_name` and placeholder output, would be valuable.
- Prompt tracking: `enrichInputWithPromptTracking()`, `extractChatTemplate()`, `extractPromptFromParams()`, and `normalizePromptVariable()` have no unit tests. Template variable replacement edge cases (overlapping values, empty variables, image/file fallbacks) would increase confidence.
- Custom/MCP tool calls: `ToolCallExtractor.getToolCall(ResponseCustomToolCall)` and `getToolCall(McpCall)` are new with no unit tests.
- `JsonValueUtils`: New utility class with no dedicated tests for recursive JSON-to-Object conversion.
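The overlapping-values edge case named under prompt tracking can be illustrated with a small sketch. It assumes the chat template is rebuilt by substituting each variable's value in the rendered text with a `{{name}}` placeholder, longest values first, so that a short value contained in a longer one does not split it. The class `ChatTemplateSketch`, the method `toTemplate`, and the `{{name}}` placeholder syntax are hypothetical; this is not the PR's `extractChatTemplate()` implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChatTemplateSketch {

  // Hypothetical helper: replaces variable values found in the rendered text
  // with {{name}} placeholders, processing the longest values first.
  static String toTemplate(String rendered, Map<String, String> variables) {
    List<Map.Entry<String, String>> entries = new ArrayList<>();
    for (Map.Entry<String, String> e : variables.entrySet()) {
      // Skip null/empty values: an empty string would match at every position.
      if (e.getValue() != null && !e.getValue().isEmpty()) {
        entries.add(e);
      }
    }
    // Longest value first, so "New York City" is replaced before "New York".
    entries.sort(
        Comparator.comparingInt((Map.Entry<String, String> e) -> e.getValue().length())
            .reversed());

    String result = rendered;
    for (Map.Entry<String, String> e : entries) {
      // Note: this simple version does not guard against a value that also
      // occurs inside an already inserted placeholder name.
      result = result.replace(e.getValue(), "{{" + e.getKey() + "}}");
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> vars = new LinkedHashMap<>();
    vars.put("origin", "New York");
    vars.put("destination", "New York City");
    vars.put("unused", "");
    System.out.println(toTemplate("Flights from New York to New York City", vars));
  }
}
```

Replacing in insertion order instead would turn the longer value into `{{origin}} City`, which is the kind of overlap a unit test for this logic would want to pin down.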
Questions
- The min version bump from `3.0.0` to `3.0.1`: what API was missing in `3.0.0`? This affects which customer versions get instrumented.
- For the `_dd` map: does the intake expect `apm_trace_id` to equal `trace_id`? In other SDKs these can differ (APM trace ID vs LLMObs ID).
dd-trace-core/src/main/java/datadog/trace/llmobs/writer/ddintake/LLMObsSpanMapper.java
boolean errored = span.getError() == 1;
writable.writeUTF8(STATUS);
writable.writeString(span.getError() == 0 ? "ok" : "error", null);
The top-level error: 0/1 integer field has been removed and replaced with status: "ok"/"error" + error details nested under meta.error. Can you confirm no downstream consumers (EvP remapper, indexer facets, etc.) read error from the top level? This is a payload shape change that could be breaking if anything depends on the old field.
This change is dictated by the TestOpenAiLlmInteractions::test_chat_completion assertion. I assume that the system test assertions are correct. Have they been verified as being compliant with the requirements of downstream consumers?
If I leave the top-level error field, the system test will fail.
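The payload shape discussed in this thread can be sketched as follows. This is a minimal illustration that models a span event as a nested `Map`; the class `SpanStatusSketch`, the method `toEvent`, and the `message`/`type` key names under `meta.error` are assumptions made for the example and do not reproduce the msgpack writer in `LLMObsSpanMapper`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpanStatusSketch {

  // Builds a simplified span event: a top-level "status" string replaces the
  // old integer "error" field, and error details are nested under meta.error.
  static Map<String, Object> toEvent(int errorFlag, String errorMessage, String errorType) {
    boolean errored = errorFlag != 0;

    Map<String, Object> event = new LinkedHashMap<>();
    event.put("status", errored ? "error" : "ok");

    Map<String, Object> meta = new LinkedHashMap<>();
    if (errored) {
      Map<String, Object> error = new LinkedHashMap<>();
      error.put("message", errorMessage); // assumed key name
      error.put("type", errorType); // assumed key name
      meta.put("error", error);
    }
    event.put("meta", meta);
    return event;
  }

  public static void main(String[] args) {
    System.out.println(toEvent(0, null, null));
    System.out.println(toEvent(1, "boom", "java.io.IOException"));
  }
}
```

A consumer that previously read a top-level `error` integer would find no such key in either event, which is the compatibility concern raised above.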
...enai-java-3.0/src/main/java/datadog/trace/instrumentation/openai_java/ResponseDecorator.java
dd-trace-core/src/main/java/datadog/trace/llmobs/writer/ddintake/LLMObsSpanMapper.java
6dcdaf4 to 717a8f0
apply from: "$rootDir/gradle/java.gradle"

-def minVer = '3.0.0'
+def minVer = '3.0.1'
ResponseTextConfig fun verbosity(): Optional<Verbosity> was added in 3.0.1 openai/openai-java@c1de354#diff-6b385fb153d457757ba112e6117593cb59da6af308cce0f9b6f26e3885befc6cR73
This is aligned with dd-trace-py https://github.com/DataDog/dd-trace-py/blob/876c5f1ce4d173815537798a6a7b0ac15b0a4ede/ddtrace/llmobs/_llmobs.py#L618-L622.
…and placeholder output set by withResponseCreateParams.
…f enrichInputWithPromptTracking(), extractChatTemplate(), extractPromptFromParams(), and normalizePromptVariable()
… format. Test cover extractPromptFromParams and related methods
amarziali left a comment
For apm-java, only the TagAssert file is concerned, so overall I am delegating the review to llmops / idm.
What Does This Do
Aligns OpenAI Java LLMObs span payloads with expected intake/system-test schema by:
- `_ml_obs_tag.integration`
- `_ml_obs_tag.source`
- `_ml_obs_tag.ddtrace.version`
- `_ml_obs_tag.error`
- `_ml_obs_tag.error_type`
- `model_name` (and stable placeholder output where applicable) is set on error paths for chat/completions/embeddings/responses.
- … (`input.prompt`, `variables`, `chat_template`)
- … (`tool_definitions`)
- … (`stream`, `tool_choice`, `text.verbosity`, etc.)
- … `_dd` map with span/trace ids
- … `meta.error`
- … `input` serialization (`messages` + `prompt`)
- … `tool_definitions` into `meta`.
Motivation
OpenAI/LLMObs system tests exposed schema and tag mismatches in Java payloads (especially response spans, tool metadata, error mapping, and prompt tracking structure). This change brings Java output in line with expected LLMObs intake contract and behavior.
Additional Notes
`openai-java-3.0` min version updated from `3.0.0` to `3.0.1`. `ResponseTextConfig` `fun verbosity(): Optional<Verbosity>` was added in 3.0.1 openai/openai-java@c1de354#diff-6b385fb153d457757ba112e6117593cb59da6af308cce0f9b6f26e3885befc6cR73
DataDog/dd-apm-test-agent#280
DataDog/system-tests#6364