
Integrate LangSmith Observability with Evaluation and Optimization#1593

Merged
rapids-bot[bot] merged 15 commits into NVIDIA:develop from pastorsj:spastoriza/langsmith_optimizations
Feb 23, 2026
Conversation


@pastorsj pastorsj commented Feb 12, 2026

Description

  • Add LangSmith as an observability backend for eval and optimization — creates datasets, links OTEL traces to examples, attaches evaluator scores as feedback, and pushes optimized prompts to LangSmith prompt management
  • Refactor callback creation to use a registry-based plugin pattern (@register_eval_callback, @register_optimizer_callback) so core has no hardcoded backend references
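The registry-based plugin pattern can be sketched roughly as follows. The global dict and the `build_eval_callbacks` helper here are illustrative simplifications, not the toolkit's actual API (the real registry lives in nat.cli.type_registry and keys on exporter config types):

```python
from typing import Any, Callable

# Illustrative global registry mapping an exporter config type to a
# callback factory; the real toolkit registry is richer than this.
_EVAL_CALLBACKS: dict[type, Callable[..., Any]] = {}


def register_eval_callback(config_type: type) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that maps an exporter config type to a callback factory."""
    def register_inner(fn: Callable[..., Any]) -> Callable[..., Any]:
        _EVAL_CALLBACKS[config_type] = fn
        return fn
    return register_inner


def build_eval_callbacks(configs: list[Any]) -> list[Any]:
    """Instantiate a callback for every config that has a registered factory.

    Configs without a registered factory are simply skipped, so core code
    never needs a hardcoded reference to any specific backend.
    """
    callbacks = []
    for cfg in configs:
        factory = _EVAL_CALLBACKS.get(type(cfg))
        if factory is not None:
            callbacks.append(factory(cfg))
    return callbacks
```

A backend plugin (such as the LangSmith one) would then decorate its factory with `@register_eval_callback(config_type=...)`, and core code discovers it purely through the registry.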

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Examples in LangSmith

Evaluations

  • When running evaluations with LangSmith observability enabled, developers can also review the evaluation results directly in LangSmith. See the following screenshots for examples.
(screenshots: evaluation results in LangSmith)

Optimizations

  • When running optimizations on agents with LangSmith observability enabled, developers can also review the results of those optimization runs directly in LangSmith. See the following screenshots for examples.
(screenshots: optimization runs in LangSmith)

Summary by CodeRabbit

  • New Features

    • LangSmith integration for evaluation and optimization (project/dataset creation, run linking, prompts).
    • Pluggable callback framework for evaluation and optimizer workflows.
    • Prompt-format support for optimization (f-string, jinja2, mustache).
    • Optional per-trial experiment routing and richer trial/result metadata emissions.
  • Bug Fixes / Behavior

    • Exporter lifecycle is now reentrant—nested start/stop contexts share a single lifecycle.
  • Documentation

    • Added LangSmith-enabled example configs and run instructions.
  • Tests

    • Extensive unit and integration tests for LangSmith and callback behaviors.
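The reentrant exporter lifecycle mentioned under Bug Fixes can be sketched with a simple reference count. This is a hypothetical simplification, not the actual ExporterManager implementation:

```python
import asyncio
from contextlib import asynccontextmanager


class RefCountedExporterManager:
    """Sketch of a reentrant start/stop lifecycle guarded by a reference count."""

    def __init__(self) -> None:
        self._ref_count = 0
        self._lock = asyncio.Lock()
        self.running = False

    @asynccontextmanager
    async def start(self):
        async with self._lock:
            self._ref_count += 1
            if self._ref_count == 1:
                # Real code would start exporters here, exactly once.
                self.running = True
        try:
            yield self
        finally:
            async with self._lock:
                self._ref_count -= 1
                if self._ref_count == 0:
                    # Real code would flush and stop exporters here.
                    self.running = False
```

With this shape, nested `async with mgr.start():` contexts share one lifecycle: only the first entry starts the exporters and only the last exit stops them.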


copy-pr-bot Bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds pluggable evaluation and optimizer callback frameworks, wires callbacks into evaluation and numeric/prompt optimization flows with per-trial tracing, implements LangSmith evaluation/optimization callbacks and registration, updates exporter lifecycle and data models, and includes extensive unit and integration tests.

Changes

  • Path & packaging (ci/scripts/path_checks.py, examples/RAG/simple_rag/pyproject.toml, packages/nvidia_nat_langchain/pyproject.toml): Allowlist entry added; example package discovery exclude added; langsmith>=0.6,<1.0 dependency added.
  • Example LangSmith configs (examples/.../configs/config-langsmith-*.yml): Added LangSmith evaluation and optimization YAML configs for examples (email_phishing_analyzer, simple_calculator_observability).
  • Eval callback infra (packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py, packages/nvidia_nat_core/tests/eval/*): New EvalResult/EvalResultItem dataclasses, EvalCallback Protocol and EvalCallbackManager; unit tests for manager behavior.
  • Evaluation flow integration (packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py, packages/nvidia_nat_core/src/nat/eval/evaluate.py): New _build_eval_callback_manager; EvaluationRun accepts an optional callback_manager and invokes on_dataset_loaded/on_eval_complete; exporter lifecycle kept alive across items.
  • Type registry & registration decorators (packages/nvidia_nat_core/src/nat/cli/type_registry.py, packages/nvidia_nat_core/src/nat/cli/register_workflow.py): Added RegisteredEvalCallback/RegisteredOptimizerCallback models, registries and getters, plus the register_eval_callback and register_optimizer_callback decorators.
  • Optimizer callback infra (packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py, .../optimizer_runtime.py, .../parameter_optimizer.py, .../prompt_optimizer.py, packages/nvidia_nat_core/tests/profiler/*): New TrialResult dataclass, OptimizerCallback Protocol and OptimizerCallbackManager; wiring into numeric/prompt optimizers, per-trial OTEL routing, updated return shapes and lifecycle callback emissions; tests added/updated.
  • LangSmith plugin & registration (packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/*, packages/nvidia_nat_langchain/src/nat/plugins/langchain/register.py, packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/__init__.py): New LangSmith plugin package and register module; LangSmithEvaluationCallback and LangSmithOptimizationCallback implemented; builders registered to the registry.
  • LangSmith tests (packages/nvidia_nat_langchain/tests/langsmith/*): Extensive unit and async integration tests for dataset creation, OTEL run matching/linking, prompt handling, and optimization flows.
  • ExporterManager & tests (packages/nvidia_nat_core/src/nat/observability/exporter_manager.py, packages/nvidia_nat_core/tests/nat/observability/test_exporter_manager.py): ExporterManager made reentrant with reference counting (_ref_count); tests updated to assert reentrant/shared lifecycle semantics.
  • Data models & tests (packages/nvidia_nat_core/src/nat/data_models/optimizable.py, packages/nvidia_nat_core/tests/nat/data_models/test_optimizable.py): Added SearchSpace.prompt_format field (Literal of template types); tests for valid/invalid values.
  • Test utilities & small updates (packages/nvidia_nat_core/tests/nat/eval/test_evaluate.py, packages/nvidia_nat_core/tests/nat/profiler/test_optimizer_runtime_extra.py, packages/nvidia_nat_core/tests/nat/profiler/test_parameter_optimizer.py): Added _MockExporterManager to tests; tests adjusted for the new optimize_parameters return tuple (tuned_cfg, best_params, n_trials).
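The eval callback infrastructure summarized above can be sketched as a Protocol plus a fan-out manager. The class and field names below mirror the summary, but the bodies are illustrative, not the real nat.eval.eval_callbacks code:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class EvalResultItem:
    """Per-item evaluation outcome (illustrative fields)."""
    item_id: str
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class EvalResult:
    """Aggregate evaluation outcome passed to callbacks."""
    metric_scores: dict[str, float] = field(default_factory=dict)
    items: list[EvalResultItem] = field(default_factory=list)


class EvalCallback(Protocol):
    """Structural interface a backend callback must satisfy."""
    def on_dataset_loaded(self, dataset: str, items: list[Any]) -> None: ...
    def on_eval_complete(self, result: EvalResult) -> None: ...


class EvalCallbackManager:
    """Fans each lifecycle event out to every registered callback."""

    def __init__(self) -> None:
        self._callbacks: list[EvalCallback] = []

    def register(self, cb: EvalCallback) -> None:
        self._callbacks.append(cb)

    def on_dataset_loaded(self, dataset: str, items: list[Any]) -> None:
        for cb in self._callbacks:
            cb.on_dataset_loaded(dataset, items)

    def on_eval_complete(self, result: EvalResult) -> None:
        for cb in self._callbacks:
            cb.on_eval_complete(result)
```

Because `EvalCallback` is a Protocol, a backend like the LangSmith plugin only needs to implement the matching method signatures; no inheritance from a core base class is required.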

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Runner
    participant Registry as GlobalTypeRegistry
    participant CallbackMgr as EvalCallbackManager
    participant EvalRun as EvaluationRun
    participant Exporter as TelemetryExporter
    participant LangSmith as LangSmith API

    CLI->>Registry: query eval callback factories
    Registry-->>CallbackMgr: instantiate callbacks
    CLI->>EvalRun: create with callback_manager
    EvalRun->>Exporter: start (context)
    EvalRun->>CallbackMgr: on_dataset_loaded(dataset, items)
    CallbackMgr->>LangSmith: create dataset/examples
    EvalRun->>EvalRun: run evaluators -> produce EvalResult
    EvalRun->>CallbackMgr: on_eval_complete(EvalResult)
    CallbackMgr->>LangSmith: match OTEL runs, link runs, attach feedback
    Exporter->>Exporter: stop
sequenceDiagram
    participant CLI as Optimizer CLI
    participant Registry as GlobalTypeRegistry
    participant OptMgr as OptimizerCallbackManager
    participant ParamOpt as ParameterOptimizer
    participant PromptOpt as PromptOptimizer
    participant LangSmith as LangSmith API

    CLI->>Registry: query optimizer callback factories
    Registry-->>OptMgr: instantiate callbacks
    CLI->>ParamOpt: optimize_parameters(callback_manager=OptMgr)
    ParamOpt->>OptMgr: pre_create_experiment(dataset_items)
    loop per trial
      ParamOpt->>OptMgr: on_trial_end(TrialResult)
      OptMgr->>LangSmith: push prompts / link runs
    end
    ParamOpt->>OptMgr: on_study_end(best_trial, total_trials)
    ParamOpt->>PromptOpt: optimize_prompts(callback_manager=OptMgr, trial_offset)
    PromptOpt->>OptMgr: on_trial_end(TrialResult)
    OptMgr->>LangSmith: push prompts / tag best

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 34.12%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title concisely and descriptively captures the main change: integrating LangSmith observability with evaluation and optimization workflows. It uses the imperative mood and, at 66 characters, is well within the 72-character guideline.
  • Merge Conflict Detection (✅ Passed): No merge conflicts detected when merging into develop.


@pastorsj pastorsj marked this pull request as ready for review February 13, 2026 04:44
@pastorsj pastorsj requested review from a team as code owners February 13, 2026 04:44
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)

199-220: 🛠️ Refactor suggestion | 🟠 Major

pick_trial is called twice with identical arguments.

pick_trial(study, mode=..., weights=...) is invoked on line 199 for best_params and again on line 207 for the callback. Depending on study state this is at minimum wasteful, and for non-deterministic modes it could even select a different trial. Reuse the result.

Proposed fix
     best_trial_obj = pick_trial(
         study=study,
         mode=optimizer_config.multi_objective_combination_mode,
         weights=weights,
     )
     best_params = best_trial_obj.params

     if callback_manager is not None:
         from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-        best_trial_obj = pick_trial(
-            study=study,
-            mode=optimizer_config.multi_objective_combination_mode,
-            weights=weights,
-        )
         callback_manager.on_study_end(
             best_trial=TrialResult(
                 trial_number=best_trial_obj.number,
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)

146-165: ⚠️ Potential issue | 🔴 Critical

NameError: _numeric_trial_count is undefined when numeric optimization is disabled.

If base_cfg.optimizer.numeric.enabled is False, the assignment at line 148 is skipped, but line 164 still references _numeric_trial_count. This will crash with a NameError when only prompt optimization is enabled.

Proposed fix
     # ---------------- 3. numeric / enum tuning ----------- #
     tuned_cfg = base_cfg
     best_numeric_params: dict = {}
+    _numeric_trial_count: int = 0
     if base_cfg.optimizer.numeric.enabled:
         tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-eval.yml`:
- Around line 43-44: The multi-line prompt string contains an extra leading
indentation on the line starting with "suspicious signals..." which introduces
unintended spaces into the prompt; open the YAML value that begins with "Examine
the following email content..." and remove the four leading spaces from the
"suspicious signals..." line so it lines up with the other lines in that block
scalar, ensuring the block's indentation remains consistent and the prompt text
has no stray leading whitespace.
- Around line 82-88: The YAML under the "dataset" key has incorrect indentation:
keys like _type, file_path, id_key and the nested structure/question_key and
structure/answer_key are indented too far; fix by reducing their indentation so
they are two spaces deeper than the "dataset:" parent (i.e., align _type,
file_path, id_key at one extra indent level under dataset and indent structure's
children question_key and answer_key one level further), ensuring consistent
2-space YAML nesting for dataset, _type, file_path, id_key, structure,
question_key, and answer_key.

In `@packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py`:
- Around line 175-189: The loop currently calls manager.get_eval_project_name()
and sets exporter_config.project and config.config_file inside each iteration
(within the for _name, exporter_config in tracing loop), which prevents
later-registered callbacks from contributing the project name and repeatedly
patches config; move the eval project resolution and config patching out of the
loop: keep the registration steps (registry.get_eval_callback(...),
registered.factory_fn(...), manager.register(cb)) inside the loop, then after
the loop call eval_project = manager.get_eval_project_name(), and if
eval_project and hasattr(exporter_config, 'project') set exporter_config.project
= eval_project and set config.config_file = loaded exactly once; use the same
identifiers (tracing, registry.get_eval_callback, registered.factory_fn,
manager.register, manager.get_eval_project_name, exporter_config.project,
config.config_file, loaded) to locate and change the code.

In `@packages/nvidia_nat_core/src/nat/cli/register_workflow.py`:
- Around line 786-813: Add explicit type hints to both inner decorator functions
by annotating the fn parameter and return type (e.g., change def
register_inner(fn): to def register_inner(fn: Callable[..., Any]) ->
Callable[..., Any]:) in register_eval_callback and register_optimizer_callback;
also add the necessary import (from typing import Any, Callable) at the top of
the file if not present so static typing requirements for public APIs are
satisfied, and ensure you apply the same annotation pattern to both
register_inner closures (the ones used with RegisteredEvalCallback and
RegisteredOptimizerCallback).

In `@packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py`:
- Around line 82-91: The get_eval_project_name method currently swallows all
exceptions from callbacks (self._callbacks) which hides real errors; update the
except block in get_eval_project_name to log the exception details (including
which callback instance raised it) at debug level instead of silently passing —
e.g., capture the exception as e and call a logger.debug with a clear message
and the exception (use an existing logger on the module/class or
logging.getLogger(__name__)), then continue to the next callback.

In `@packages/nvidia_nat_core/src/nat/eval/evaluate.py`:
- Line 632: Remove the redundant local import "from pathlib import Path" in
evaluate.py (the duplicate of the module-level import of Path); locate the
duplicate import near the evaluate logic and delete that line so the
module-level Path import (already present) is used everywhere.

In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py`:
- Around line 44-51: The try/except around resolving opt_dataset_name silently
swallows errors; update the except block in optimizer_runtime.py (the block that
reads base_cfg.eval.general.dataset/ds_cfg and file_path to set
opt_dataset_name) to log the exception instead of pass—use an existing module
logger (e.g., logger.exception(...) or logger.debug(...) with traceback) so
failures in retrieving ds_cfg or Path(file_path).stem are recorded and aid
debugging.
- Around line 83-96: The file reading in optimizer_runtime.py (the block that
builds EvalInputItem entries when checking fp.suffix == '.csv' and fp.suffix ==
'.json') uses open(fp) without an explicit encoding; update both open calls to
specify encoding="utf-8" (for CSV also include newline='' if using
csv.DictReader) so that reading into rows and creating EvalInputItem (using
id_key, q_key, a_key, dataset_items) is consistent and locale-independent.
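The suggested explicit-encoding reads can be sketched as follows. The `load_rows` helper is hypothetical; the real code builds EvalInputItem objects inline:

```python
import csv
import json
from pathlib import Path


def load_rows(fp: Path) -> list[dict]:
    """Read dataset rows with an explicit, locale-independent encoding."""
    if fp.suffix == ".csv":
        # newline="" is the csv-module-recommended mode for reading,
        # and encoding="utf-8" avoids depending on the system locale.
        with open(fp, encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    if fp.suffix == ".json":
        with open(fp, encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"Unsupported dataset format: {fp.suffix}")
```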

In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py`:
- Line 52: The function is currently annotated as "-> Config" but actually returns
a 3-tuple (tuned_cfg, dict(best_params), optimizer_config.numeric.n_trials);
update the function signature to a matching type annotation (e.g., ->
Tuple[Config, Dict[str, Any], int]) and add the necessary typing imports (Tuple,
Dict, Any) at top, or alternatively change the return to a single Config if that
matches the intended API; reference the returned symbols tuned_cfg, best_params,
and optimizer_config.numeric.n_trials when making the change so the annotation
matches the actual return value.
- Line 141: The comment "Persist raw per‑repetition scores so they appear in
`trials_dataframe`." contains a NON-BREAKING HYPHEN (U+2011); replace it with a
regular hyphen-minus (-) so the comment reads "Persist raw per-repetition scores
so they appear in `trials_dataframe`." and save the change in
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py
ensuring the file remains UTF-8 encoded.

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py`:
- Around line 97-100: The loop currently sleeps at the top of each retry
(time.sleep(retry_delay)) causing an unnecessary delay before the first check;
modify the retry logic in langsmith_evaluation_callback.py (inside the loop that
uses total_matched, attempt, max_retries, retry_delay) so that the sleep is
skipped on the first iteration (e.g., only call time.sleep(retry_delay) when
attempt > 1) or move the sleep to the end of the loop so the code checks for
runs immediately on the initial attempt and only waits between subsequent
retries.
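The retry shape this comment asks for, checking immediately and sleeping only between attempts, can be sketched as follows. The `wait_for_runs` helper and its parameters are illustrative, not the plugin's actual code:

```python
import time
from typing import Callable


def wait_for_runs(check: Callable[[], int],
                  expected: int,
                  max_retries: int = 5,
                  retry_delay: float = 2.0) -> int:
    """Poll check() until it reports the expected count.

    The sleep is skipped on the first iteration so the initial check runs
    with no delay; retry_delay only applies between subsequent attempts.
    """
    total_matched = 0
    for attempt in range(1, max_retries + 1):
        if attempt > 1:
            time.sleep(retry_delay)
        total_matched = check()
        if total_matched >= expected:
            break
    return total_matched
```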

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 107-128: Both _ensure_dataset and pre_create_experiment duplicate
dataset creation and example population logic; extract a shared helper (e.g.,
_create_dataset_with_examples) that accepts the iterable of items and a small
adapter/caller to normalize item fields (id vs item_id and expected_output vs
expected_output_obj) so both methods call the same routine. The helper should
check self._dataset_id, call self._get_run_number and _humanize_dataset_name,
invoke self._client.create_dataset and loop items to call
self._client.create_example while populating self._example_ids and setting
self._dataset_id and self._dataset_name; update _ensure_dataset and
pre_create_experiment to pass the appropriate adapter or lambda to map each item
to (item_id_str, input_obj_str, expected_output_str) before delegating to the
new helper.
- Around line 253-259: Replace the silent broad except in the prompt-listing
loop with explicit exception handling: catch Exception as e and log a warning
that includes the exception and context (e.g., the `base` query) before
continuing, so failures in `self._client.list_prompts(...)` won’t be silently
swallowed; reference `self._client.list_prompts`, `pattern.match`,
`prompt.repo_handle`, and `max_run` when adding the message (use
`self._logger.warning(...)` or the module logger if `self._logger` is not
available) and include the exception info/stacktrace.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py`:
- Around line 144-183: The test test_on_trial_end_links_otel_runs is slow
because _link_otel_runs -> _match_and_link_otel_runs calls time.sleep(10.0);
patch time.sleep in the test to a no-op to avoid the 10s delay. Fix by adding a
unittest.mock.patch (or pytest monkeypatch) around the call to
opt_cb.on_trial_end to replace time.sleep with a lambda or function that returns
immediately (e.g., with patch("time.sleep", return_value=None) or
patch.object(the_module, "time.sleep", return_value=None) if imported locally),
ensuring the patch targets the same time.sleep used by _match_and_link_otel_runs
so the test runs fast while still asserting
update_run/create_feedback/create_run behavior.
- Around line 84-119: The test is slow because _match_and_link_otel_runs (in
langsmith_evaluation_callback.py) calls time.sleep with the default retry_delay;
patch time.sleep in the test to a no-op so retries don't block. In the
test_on_eval_complete_links_otel_runs test, mock or monkeypatch time.sleep
(e.g., monkeypatch.setattr(time, "sleep", lambda _: None) or use
unittest.mock.patch for time.sleep) before calling eval_cb.on_eval_complete so
the retries run instantly and the assertions still execute; keep the patch
scoped to the test so other tests are unaffected.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 130-181: The optimizer integration tests
(test_optimizer_callback_creates_trial_runs_and_summary and its sibling test)
create LangSmith projects/runs via LangSmithOptimizationCallback and
langsmith_client but do not clean them up, causing orphaned resources; wrap the
test body in a try/finally (or add a teardown fixture) that calls
langsmith_client.delete_project(langsmith_project_name) or otherwise deletes the
created runs/projects in the finally block to mirror the cleanup in
test_eval_callback_creates_dataset_runs_and_feedback; locate the setup using
LangSmithOptimizationCallback and the assertions that poll
langsmith_client.list_runs(...) and ensure deletion uses the same
langsmith_project_name and langsmith_client symbols.
🧹 Nitpick comments (21)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/register.py (1)

22-32: Add type hints to callback factory functions.

Per coding guidelines, all parameters and return values should have type hints. Both config and dataset_name are untyped, and neither function has a return type annotation.

♻️ Proposed fix
-@register_eval_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_eval_callback(config, **kwargs):
-    from .langsmith_evaluation_callback import LangSmithEvaluationCallback
-
-    return LangSmithEvaluationCallback(project=config.project)
-
-
-@register_optimizer_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_optimizer_callback(config, *, dataset_name=None, **kwargs):
-    from .langsmith_optimization_callback import LangSmithOptimizationCallback
-
-    return LangSmithOptimizationCallback(project=config.project, dataset_name=dataset_name)
+@register_eval_callback(config_type=LangsmithTelemetryExporter)
+def _build_langsmith_eval_callback(config: LangsmithTelemetryExporter, **kwargs) -> "LangSmithEvaluationCallback":
+    from .langsmith_evaluation_callback import LangSmithEvaluationCallback
+
+    return LangSmithEvaluationCallback(project=config.project)
+
+
+@register_optimizer_callback(config_type=LangsmithTelemetryExporter)
+def _build_langsmith_optimizer_callback(
+    config: LangsmithTelemetryExporter, *, dataset_name: str | None = None, **kwargs
+) -> "LangSmithOptimizationCallback":
+    from .langsmith_optimization_callback import LangSmithOptimizationCallback
+
+    return LangSmithOptimizationCallback(project=config.project, dataset_name=dataset_name)

As per coding guidelines: "Python methods should use type hints for all parameters and return values."

examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-optimize.yml (1)

89-95: Inconsistent indentation under dataset:.

The dataset: children use a 4-space indent step (8 spaces total) while the rest of the file consistently uses 2-space steps. This won't break YAML parsing but deviates from the repo convention.

♻️ Proposed fix
     dataset:
-        _type: csv
-        file_path: examples/evaluation_and_profiling/email_phishing_analyzer/data/smaller_test.csv
-        id_key: "subject"
-        structure:
-          question_key: body
-          answer_key: label
+      _type: csv
+      file_path: examples/evaluation_and_profiling/email_phishing_analyzer/data/smaller_test.csv
+      id_key: "subject"
+      structure:
+        question_key: body
+        answer_key: label

Based on learnings: "In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (3)

54-56: Swallowed exceptions lose stack trace context.

When catching exceptions without re-raising, the coding guidelines require logger.exception() (or at least exc_info=True) to capture the full stack trace. Lines 55 and 66 use bare logger.debug(), which discards the exception details and makes debugging API failures harder.

♻️ Proposed fix
     except Exception:
-        logger.debug("Could not link run %s to example %s", run.id, example_id)
+        logger.debug("Could not link run %s to example %s", run.id, example_id, exc_info=True)
         return False
     for metric_name, score in item.scores.items():
         try:
             ...
         except Exception:
-            logger.debug("Could not attach feedback %s to run %s", metric_name, run.id)
+            logger.debug("Could not attach feedback %s to run %s", metric_name, run.id, exc_info=True)

As per coding guidelines: "When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information."

Also applies to: 65-66


192-207: get_eval_project_name iterates all LangSmith projects to find the next run number.

self._client.list_projects() fetches every project in the workspace. For accounts with many projects, this will be slow and wasteful. Consider filtering by name prefix if the LangSmith SDK supports it, or using a search/query parameter.


138-147: Bidirectional substring matching can produce false positives with short or common input texts.

The condition item_input in run_input or run_input in item_input will spuriously match if one string is short (e.g., "1" would match any run containing "1"). Given that evaluation inputs are typically full questions this may be unlikely in practice, but consider adding a minimum-length guard or preferring exact matching first and falling back to substring only for unmatched items.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (1)

86-90: Silent exception swallowing hides failures in get_trial_project_name.

The except Exception: pass discards all context, making it impossible to diagnose why a callback failed to return a project name. A logger.debug call would preserve troubleshooting ability while keeping the "try next callback" flow.

♻️ Proposed fix
                 try:
                     return fn(trial_number)
                 except Exception:
-                    pass
+                    logger.debug("Callback %s.get_trial_project_name failed", type(cb).__name__, exc_info=True)
         return None

Note: The analogous get_eval_project_name in EvalCallbackManager uses the same silent pass pattern — consider updating both for consistency.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)

179-179: Add strict=True to zip() calls for defensive length checking.

eval_metrics and avg_scores / best_trial_obj.values should always be the same length, but strict=True converts a silent data-loss bug into an immediate ValueError if they ever diverge.

Example for line 179
-                eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores)), items=cb_items)
+                eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores, strict=True)), items=cb_items)

Apply analogously to lines 187 and 216.

Also applies to: 187-187, 216-216


180-181: Use logger.exception() when catching without re-raising.

Per coding guidelines, when catching and logging exceptions without re-raising, use logger.exception() to capture the full stack trace instead of logger.warning(..., exc_info=True).

Proposed fix
             except Exception:
-                logger.warning("Failed to build EvalResult for optimizer callback", exc_info=True)
+                logger.exception("Failed to build EvalResult for optimizer callback")

As per coding guidelines: "When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (3)

164-172: _link_otel_runs re-invokes get_trial_project_name which attempts to re-create the project.

get_trial_project_name on line 172 calls self._client.create_project(...) if self._dataset_id is set, relying on LangSmithConflictError to deduplicate. This is harmless but wasteful — consider computing the project name string directly (it's deterministic) or caching the result per trial.


331-348: Missing Google-style docstrings on public callback methods.

on_trial_end and on_study_end are public interface methods (part of the OptimizerCallback protocol). Per coding guidelines, all public functions require Google-style docstrings with a concise first line ending with a period.

Also applies to: 350-371


312-321: Reliance on private API _create_commit_tags for retroactive commit tagging is fragile.

self._client._create_commit_tags(...) is an internal/private method of the LangSmith client. While LangSmith v0.6.1+ provides a public API for tagging commits at creation time via create_commit(..., tags=...), the code here uses the private method to retroactively tag existing commits (when a prompt is unchanged or to mark the best trial's commit). This approach could break without notice on a minor version bump of the langsmith package. Consider pinning/bounding the langsmith version in your dependency spec to guard against such breakage, or check if a stable public API exists for retroactive commit tagging.

packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py (1)

29-30: Fixture functions should use the fixture_ prefix and define the name argument.

Per coding guidelines, pytest fixtures should define the name argument and use the fixture_ prefix. This applies to all four fixtures in this file.

Example for the `eval_cb` fixture
-    @pytest.fixture
-    def eval_cb(self):
+    @pytest.fixture(name="eval_cb")
+    def fixture_eval_cb(self):
         from nat.plugins.langchain.langsmith.langsmith_evaluation_callback import LangSmithEvaluationCallback
         return LangSmithEvaluationCallback(project="test-proj")

Apply analogously to mock_langsmith and opt_cb fixtures.

As per coding guidelines: "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix."

Also applies to: 38-41, 130-131, 139-142

packages/nvidia_nat_core/tests/nat/profiler/test_parameter_optimizer.py (1)

181-201: Consider extracting the repeated _DummyEvalRun into a module-level helper or fixture.

The same _DummyEvalRun class (with identical logic) is defined inside every test method. Extracting it to a module-level helper or a conftest.py fixture would reduce ~200 lines of duplication. As per coding guidelines, frequently repeated code should be extracted into pytest fixtures.

Also applies to: 291-300, 345-353, 402-411, 458-467, 512-521, 559-568, 613-621, 668-676, 717-726, 775-784, 834-843, 892-901, 946-955, 1004-1013, 1063-1072

packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)

190-190: Avoid accessing the private _callbacks attribute directly.

manager._callbacks breaks encapsulation. Consider adding a public has_callbacks property or __bool__ / __len__ to EvalCallbackManager.

Example

In EvalCallbackManager:

def __bool__(self) -> bool:
    return bool(self._callbacks)

Then here:

-        return manager if manager._callbacks else None
+        return manager if manager else None

159-159: Add return type annotation.

Per coding guidelines, type hints are required on parameters and return values.

Fix
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:

Note: since EvalCallbackManager is imported inside the function, you could use a from __future__ import annotations or a string literal "EvalCallbackManager | None".

packages/nvidia_nat_core/src/nat/eval/evaluate.py (2)

61-69: Add type annotation for callback_manager.

The parameter lacks a type hint. Per coding guidelines, all parameters should have type hints.

Fix
-    def __init__(self, config: EvaluationRunConfig, callback_manager=None):
+    def __init__(self, config: EvaluationRunConfig, callback_manager: Any | None = None):

Or, if you want to avoid the Any, use a forward reference or TYPE_CHECKING import for EvalCallbackManager:

+from __future__ import annotations
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from nat.eval.eval_callbacks import EvalCallbackManager
 ...
-    def __init__(self, config: EvaluationRunConfig, callback_manager=None):
+    def __init__(self, config: EvaluationRunConfig, callback_manager: EvalCallbackManager | None = None):

694-729: Duplicated EvalResultItem construction logic — also appears in prompt_optimizer.py.

The nested loop that correlates eval_output_items to eval_input_items and builds EvalResultItem objects is repeated almost verbatim in prompt_optimizer.py (lines 424-451). Consider extracting a shared helper (e.g., build_eval_result_from_output(evaluation_results, eval_input, usage_stats) -> EvalResult) to eliminate this duplication and reduce the risk of drift.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)

30-30: Add type hints for parameter and return value.

Per coding guidelines, all public APIs (and internal helpers in src/) should have type hints on parameters and return values.

Fix
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> OptimizerCallbackManager | None:

Use TYPE_CHECKING for the return type if needed to avoid circular imports.


64-65: Accessing private manager._callbacks — same issue as in evaluate.py.

Consider adding a public __bool__ or has_callbacks property to OptimizerCallbackManager and using it here instead.

packages/nvidia_nat_core/src/nat/cli/type_registry.py (1)

579-608: register_eval_callback / register_optimizer_callback don't call _registration_changed().

Every other register_* method in TypeRegistry calls self._registration_changed() after mutation. These two do not. This is likely fine since eval/optimizer callbacks don't affect Config model annotations, but the inconsistency could confuse future contributors. Consider adding the call for consistency, or adding a comment explaining why it's omitted.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)

416-465: Duplicated EvalResultItem construction — same logic as evaluate.py lines 694-729.

As noted in the evaluate.py review, this per-item score correlation and EvalResultItem building is copy-pasted. A shared helper would prevent divergence and simplify both call sites.

Additionally, the loop variable idx on line 420 is unused (Ruff B007). Use _ or remove enumerate.

Fix for unused variable
-                for idx, ind in enumerate(population):
+                for ind in population:

@willkill07 willkill07 added feature request New feature or request non-breaking Non-breaking change labels Feb 13, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)

156-166: ⚠️ Potential issue | 🔴 Critical

Bug: _numeric_trial_count is unbound when numeric optimization is disabled.

If base_cfg.optimizer.numeric.enabled is False, the optimize_parameters call at line 148 is skipped, so _numeric_trial_count is never assigned. When prompt optimization runs at line 164, this produces a NameError.

Proposed fix
     # ---------------- 3. numeric / enum tuning ----------- #
     tuned_cfg = base_cfg
     best_numeric_params: dict = {}
+    _numeric_trial_count = 0
     if base_cfg.optimizer.numeric.enabled:
         tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)

199-220: 🛠️ Refactor suggestion | 🟠 Major

Redundant pick_trial call.

pick_trial is called at line 199 to compute best_params, and then called again identically at line 207 inside the if callback_manager block. Reuse the trial object from the first call.

Proposed fix
-    best_params = pick_trial(
+    best_trial_obj = pick_trial(
         study=study,
         mode=optimizer_config.multi_objective_combination_mode,
         weights=weights,
-    ).params
+    )
+    best_params = best_trial_obj.params

     if callback_manager is not None:
         from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-        best_trial_obj = pick_trial(
-            study=study,
-            mode=optimizer_config.multi_objective_combination_mode,
-            weights=weights,
-        )
         callback_manager.on_study_end(
             best_trial=TrialResult(
🤖 Fix all issues with AI agents
In `@packages/nvidia_nat_core/src/nat/cli/register_workflow.py`:
- Around line 782-815: The two public decorator factories register_eval_callback
and register_optimizer_callback lack explicit return type annotations; update
their signatures to include explicit return types such as
Callable[[Callable[..., Any]], Callable[..., Any]] (and also annotate the inner
function register_inner similarly) so the public API has explicit typing. Modify
the signatures of register_eval_callback, register_optimizer_callback and their
inner register_inner functions to add these return type hints and import
typing.Callable/Any if not already imported.

In `@packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py`:
- Around line 82-91: Replace the current exception logging in
get_eval_project_name to match the other handlers: catch the exception from fn()
and call logger.exception with a descriptive message (e.g.,
"get_eval_project_name failed for %s" and type(cb).__name__) instead of
logger.debug(..., exc_info=True); this keeps behavior consistent with
on_dataset_loaded and on_eval_complete and ensures the full traceback is
captured when get_eval_project_name (method in callback objects accessed via fn
= getattr(cb, "get_eval_project_name", None)) raises.

In `@packages/nvidia_nat_core/src/nat/eval/evaluate.py`:
- Around line 61-70: The public constructor EvaluationRun.__init__ is missing a
Python 3.11+ type annotation for the callback_manager parameter; update the
signature to add an appropriate type hint (e.g., CallbackManager | None or the
concrete callback manager protocol/type used across the package) so
callback_manager: CallbackManager | None (or the exact type) is declared, and
ensure imports/forward-references are added if needed; keep the rest of the
constructor (self.callback_manager assignment and EvalConfig typing) unchanged.
- Around line 638-646: The except block around the callback dispatch should use
logger.exception to record the full stack trace instead of logger.warning(...,
exc_info=True); locate the try/except that calls
self.callback_manager.on_dataset_loaded (and the other analogous callback
try/except in the same module where logger.warning is used) and replace the
logger.warning(...) calls with logger.exception("Failed to fire
on_dataset_loaded callback") (or the corresponding message for the other
callback) so the exception and its traceback are captured.

In `@packages/nvidia_nat_core/src/nat/observability/exporter_manager.py`:
- Around line 257-259: The except block currently calls logger.exception("Error
during isolated exporter cleanup: %s", e) which redundantly passes the exception
object; change it to call logger.exception("Error during isolated exporter
cleanup") so logger.exception captures the current exception context
automatically—update the handler around the await
self._cleanup_isolated_exporters() call (referencing the same except block,
logger.exception, and the _cleanup_isolated_exporters method).

In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py`:
- Around line 82-91: The get_trial_project_name method currently swallows all
exceptions silently; update its except block to log the full exception (use
logger.exception) including context like the callback object (cb) and
trial_number, then continue to the next callback instead of silently passing;
locate the loop over self._callbacks and the local fn = getattr(cb,
"get_trial_project_name", None) in get_trial_project_name and replace the bare
except Exception: pass with a logger.exception(...) call that records the
callback and trial_number before proceeding, preserving stack traces and
avoiding duplicate re-raises.

In `@packages/nvidia_nat_core/tests/nat/data_models/test_optimizable.py`:
- Around line 375-377: Replace the broad exception assertion in
test_rejects_invalid_format to assert the specific Pydantic error: change the
pytest.raises(Exception) to pytest.raises(pydantic.ValidationError) (or from
pydantic import ValidationError and use ValidationError) so the test explicitly
expects the ValidationError raised by the SearchSpace model when created with
is_prompt=True, prompt="test", prompt_format="invalid"; ensure the test file
imports ValidationError from pydantic if not already present.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py`:
- Around line 29-41: The pytest fixtures in this file are not using the required
naming convention or decorator name argument: rename functions mock_langsmith ->
fixture_mock_langsmith and eval_cb -> fixture_eval_cb (and similarly cb_cls,
opt_cb) and update their decorators to include the name parameter (e.g.,
`@pytest.fixture(name="mock_langsmith", autouse=True)` for fixture_mock_langsmith
and `@pytest.fixture(name="eval_cb")` for fixture_eval_cb) so the fixture retains
its public name while the function uses the fixture_ prefix; ensure you update
all references/imports if any tests refer to the fixtures by function name.
🧹 Nitpick comments (12)
packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (2)

94-96: Use nat (or “NeMo Agent Toolkit”) instead of “NAT” in docstrings.

Keeps naming consistent with repo conventions.

♻️ Proposed fix
-def get_tracing_configs(config: Any) -> dict[str, Any]:
-    """Extract tracing configs from a loaded NAT config object."""
+def get_tracing_configs(config: Any) -> dict[str, Any]:
+    """Extract tracing configs from a loaded `nat` config object."""

As per coding guidelines: "Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names."


30-95: Consider upgrading public API docstrings to Google style (Args/Returns).

The new public classes/functions are missing Google-style sections.

As per coding guidelines: "Provide Google-style docstrings for every public module, class, function and CLI command."
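
For reference, a docstring of the shape that guideline asks for. The class shell and method description below are assumptions about the callback interface, not copied from the module:

```python
# Illustrative Google-style docstring: concise first line ending with a
# period, followed by Args and Returns sections.
class LangSmithEvaluationCallback:  # stand-in shell, not the real class
    def on_eval_complete(self, eval_result: object) -> None:
        """Attach evaluator scores to the linked LangSmith examples.

        Args:
            eval_result: Aggregated evaluation output for the current run.

        Returns:
            None. Errors are logged rather than raised so evaluation proceeds.
        """
```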

packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (1)

159-195: Add return type annotation and avoid accessing private _callbacks.

  1. Missing return type hint on _build_eval_callback_manager — per coding guidelines, all functions under packages/ require type hints on parameters and return values.
  2. Line 192 accesses manager._callbacks directly. Prefer exposing a public API on EvalCallbackManager (e.g., a __bool__ or has_callbacks property).
Proposed fix

In eval_callbacks.py, add a property to EvalCallbackManager:

@property
def has_callbacks(self) -> bool:
    return bool(self._callbacks)

Then in this file:

-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> "EvalCallbackManager | None":
     ...
-    return manager if manager._callbacks else None
+    return manager if manager.has_callbacks else None

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (3)

224-239: list_projects() without filtering may be slow in large workspaces.

self._client.list_projects() at line 234 iterates over every project in the workspace to find the max run number. In large LangSmith workspaces with thousands of projects, this could be a significant bottleneck. Consider using a name-contains filter if the LangSmith SDK supports it (similar to list_datasets(dataset_name_contains=...) used in the optimization callback).

Also, import re at line 230 should be a module-level import rather than a lazy import inside a method — re is a stdlib module with no cost concern.
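
The module-level `re` usage could look like this sketch. The `-run-<n>` suffix pattern is an assumption about how the project names are formed:

```python
# Sketch: parse the max run number from a list of project names with a
# module-level re import, instead of importing re inside the method.
import re

_RUN_RE = re.compile(r"-run-(\d+)$")


def max_run_number(project_names: list[str]) -> int:
    """Return the highest run number found in names like '<project>-run-<n>'."""
    best = 0
    for name in project_names:
        match = _RUN_RE.search(name)
        if match:
            best = max(best, int(match.group(1)))
    return best
```

If the SDK supports a name-contains filter, passing it to `list_projects` would shrink the input list before this scan even runs.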


83-205: Matching logic is solid; minor optimization possible.

The retry-based OTEL run matching with longest-substring preference and cross-retry deduplication (matched_item_ids, processed_run_ids) is well-designed. One minor optimization: line 184 re-scans eval_result.items linearly for each match — a pre-built dict[str, item] keyed by item_id would be O(1).


241-276: on_dataset_loaded: missing docstring for public method.

Per coding guidelines, public methods should have Google-style docstrings. on_dataset_loaded and on_eval_complete are part of the callback interface but lack docstrings.

As per coding guidelines: "Provide Google-style docstrings for every public module, class, function and CLI command."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (1)

196-202: Add exc_info=True to debug log for full stack trace.

Line 202 logs at debug level but omits exc_info=True, so the exception details are lost.

Proposed fix
             except Exception:
-                logger.debug("Could not update project metadata for '%s'", trial_project)
+                logger.debug("Could not update project metadata for '%s'", trial_project, exc_info=True)

As per coding guidelines: "When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information."

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)

30-116: Add return type annotation; avoid private attribute access.

  1. _build_optimizer_callback_manager at line 30 has no return type hint.
  2. Line 64 accesses manager._callbacks directly — same concern as in evaluate.py. Use a public API (e.g., has_callbacks property) on OptimizerCallbackManager.
Proposed fix
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> "OptimizerCallbackManager | None":

And for the private access:

-    if not manager._callbacks:
+    if not manager.has_callbacks:

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)

145-191: EvalResult construction logic is duplicated in prompt_optimizer.py.

The block at lines 152-178 (building cb_items from eval_input_items with per-item scores, reasoning, and usage stats) is nearly identical to the same block in prompt_optimizer.py lines 425-452. Consider extracting a shared helper like build_eval_result_from_output(eval_output, eval_metrics, avg_scores) -> EvalResult to eliminate this duplication.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)

417-470: Unused loop variable and duplicated EvalResult construction.

  1. Line 421: idx is unused in the loop body — rename to _ to signal intent.
  2. The EvalResult construction block (lines 425-452) is nearly identical to the one in parameter_optimizer.py (lines 152-178). Consider extracting a shared helper as mentioned in the parameter_optimizer review.
Proposed fix for unused variable
-                for idx, ind in enumerate(population):
+                for _idx, ind in enumerate(population):

Or simply:

-                for idx, ind in enumerate(population):
+                for ind in population:
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (2)

42-51: OptimizerCallback Protocol is incomplete — missing duck-typed methods.

get_trial_project_name and set_prompt_param_names are invoked on callbacks via getattr duck-typing (lines 64, 85), but they're not declared in the OptimizerCallback Protocol. This makes the interface contract unclear for implementors.

Consider either adding these as optional protocol methods (with default implementations) or documenting that they're opt-in extensions.


30-39: Missing docstrings on TrialResult, OptimizerCallback, and OptimizerCallbackManager.

Per coding guidelines, public classes should have Google-style docstrings. TrialResult, OptimizerCallback, and OptimizerCallbackManager all lack them.

As per coding guidelines: "Provide Google-style docstrings for every public module, class, function and CLI command."

@pastorsj pastorsj force-pushed the spastoriza/langsmith_optimizations branch from 827eaa5 to 4ecda16 Compare February 13, 2026 18:36

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)

144-166: ⚠️ Potential issue | 🔴 Critical

_numeric_trial_count is undefined when numeric optimization is disabled — NameError at runtime.

If base_cfg.optimizer.numeric.enabled is False, the if block on lines 147–154 is skipped and _numeric_trial_count is never assigned. Line 164 then raises a NameError.

Proposed fix
     # ---------------- 3. numeric / enum tuning ----------- #
     tuned_cfg = base_cfg
     best_numeric_params: dict = {}
+    _numeric_trial_count = 0
     if base_cfg.optimizer.numeric.enabled:
         tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-optimize.yml`:
- Around line 89-95: The YAML under the dataset key uses 4-space relative
indentation; update it to 2-space relative indentation to match the file style.
Specifically, re-indent the block keys file_path, id_key, and the nested
structure mapping (question_key, answer_key) so they are each two spaces further
than dataset: (use 2-space indents for file_path, id_key, structure and another
2-space indent for question_key and answer_key). After adjusting, validate the
YAML to ensure no syntax changes beyond indentation.

In `@packages/nvidia_nat_core/src/nat/observability/exporter_manager.py`:
- Around line 249-261: The ref-count decrement block has a TOCTOU race: after
releasing self._lock you call _cleanup_isolated_exporters() and stop(), allowing
another task calling start() to observe _running True and increment _ref_count
while you tear down; fix by performing lifecycle transition under the lock —
acquire self._lock, decrement self._ref_count, set should_stop = self._ref_count
== 0 and set self._running = False if should_stop, then (still under the lock)
prepare any data needed for cleanup and release the lock before awaiting actual
shutdown; to avoid deadlock, either call a new helper _stop_unlocked() (move
stop() internals into it) or inline the task-cancellation/cleanup logic so that
only non-awaiting state changes happen inside the lock and awaits
(_cleanup_isolated_exporters(), actual stop work) occur after releasing the
lock, ensuring no new callers can piggy-back on a doomed manager.
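
The lock-held transition described above can be sketched as follows, assuming an asyncio-based manager with `_lock`, `_ref_count`, and `_running` (names from the review; the logic is illustrative):

```python
# Sketch: flip lifecycle state under the lock, then await cleanup outside it,
# so no concurrent start() can piggy-back on a manager that is shutting down.
import asyncio


class ExporterManager:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._ref_count = 0
        self._running = False

    async def start(self) -> None:
        async with self._lock:
            self._ref_count += 1
            self._running = True

    async def stop(self) -> None:
        should_stop = False
        async with self._lock:
            self._ref_count -= 1
            if self._ref_count == 0:
                # State change happens while the lock is held; only the
                # awaiting teardown runs after release.
                self._running = False
                should_stop = True
        if should_stop:
            await self._shutdown()

    async def _shutdown(self) -> None:
        await asyncio.sleep(0)  # placeholder for the real exporter cleanup
```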

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 196-202: The exception handler that wraps
self._client.update_project / self._client.read_project currently swallows
errors by calling logger.debug without exception info; update the except block
to log the full stack trace (e.g., replace logger.debug("Could not update
project metadata for '%s'", trial_project) with a logging call that includes
exception info such as logger.exception("Could not update project metadata for
'%s'", trial_project) or logger.debug(..., exc_info=True)) so the caught
exception details are preserved.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 42-127: The test function
test_eval_callback_creates_dataset_runs_and_feedback leaks the LangSmith dataset
if an assertion fails; wrap the dataset creation and subsequent assertions in a
try/finally so cleanup always runs: assign ds = None before creating the
dataset, then after creating/using it put the verification and eval steps inside
a try block and call langsmith_client.delete_dataset(dataset_id=ds.id) in the
finally block only if ds is not None. Keep references to the existing symbols:
test_eval_callback_creates_dataset_runs_and_feedback, ds,
langsmith_client.delete_dataset, and the
EvalCallbackManager/LangSmithEvaluationCallback usage.
🧹 Nitpick comments (13)
packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (1)

94-96: Chained getattr is fragile and hard to read.

The triple-nested getattr works but is brittle. Consider using a safer traversal:

♻️ Proposed improvement
 def get_tracing_configs(config: Any) -> dict[str, Any]:
     """Extract tracing configs from a loaded NAT config object."""
-    return getattr(getattr(getattr(config, 'general', None), 'telemetry', None), 'tracing', None) or {}
+    general = getattr(config, 'general', None)
+    telemetry = getattr(general, 'telemetry', None) if general else None
+    tracing = getattr(telemetry, 'tracing', None) if telemetry else None
+    return tracing or {}
packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)

159-159: Add return type hint.

The private helper is missing a return type annotation.

♻️ Proposed fix
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:

Since EvalCallbackManager is imported inside the function body, you can use a string literal or add the import at module level under TYPE_CHECKING:

+from __future__ import annotations

or

-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> "EvalCallbackManager | None":

192-192: Accessing private _callbacks from outside the class is a code smell.

manager._callbacks breaks encapsulation. Consider adding a public property or method to EvalCallbackManager:

♻️ Proposed fix in eval_callbacks.py
 class EvalCallbackManager:
 
     def __init__(self) -> None:
         self._callbacks: list[EvalCallback] = []
 
+    @property
+    def has_callbacks(self) -> bool:
+        return bool(self._callbacks)
+
     def register(self, callback: EvalCallback) -> None:

Then in evaluate.py:

-        return manager if manager._callbacks else None
+        return manager if manager.has_callbacks else None
packages/nvidia_nat_core/src/nat/eval/evaluate.py (1)

701-736: O(N × M × N) nested loop for building callback items.

The triple-nested iteration (input items × evaluators × output items) performs a linear scan to match items by ID. For large datasets this could be slow.

Consider pre-indexing output items by ID per evaluator for O(1) lookups:

♻️ Suggested optimization
                             cb_items = []
+                            # Pre-index evaluator outputs by item ID for O(1) lookup
+                            eval_output_index: dict[str, dict[str, Any]] = {}
+                            eval_reasoning_index: dict[str, dict[str, Any]] = {}
+                            for eval_name, eval_output in self.evaluation_results:
+                                score_map = {}
+                                reasoning_map = {}
+                                for output_item in eval_output.eval_output_items:
+                                    sid = str(output_item.id)
+                                    if isinstance(output_item.score, (int, float)):
+                                        score_map[sid] = float(output_item.score)
+                                    reasoning_map[sid] = output_item.reasoning
+                                eval_output_index[eval_name] = score_map
+                                eval_reasoning_index[eval_name] = reasoning_map
+
                             for input_item in self.eval_input.eval_input_items:
                                 per_item_scores = {}
                                 per_item_reasoning = {}
-                                for eval_name, eval_output in self.evaluation_results:
-                                    for output_item in eval_output.eval_output_items:
-                                        if str(output_item.id) == str(input_item.id):
-                                            score_val = output_item.score
-                                            if isinstance(score_val, (int, float)):
-                                                per_item_scores[eval_name] = float(score_val)
-                                            per_item_reasoning[eval_name] = output_item.reasoning
-                                            break
+                                item_sid = str(input_item.id)
+                                for eval_name in eval_output_index:
+                                    if item_sid in eval_output_index[eval_name]:
+                                        per_item_scores[eval_name] = eval_output_index[eval_name][item_sid]
+                                    if item_sid in eval_reasoning_index[eval_name]:
+                                        per_item_reasoning[eval_name] = eval_reasoning_index[eval_name][item_sid]
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/register.py (1)

21-32: Add type hints and docstrings to the builder functions.

Per coding guidelines, {src,packages}/**/*.py files require type hints on parameters and return values, and Google-style docstrings for public modules/functions. While these are private helpers, they serve as plugin entry points and would benefit from proper annotations for maintainability.

Suggested improvement
+"""LangSmith callback registration for evaluation and optimization workflows."""
+
 from nat.cli.register_workflow import register_eval_callback
 from nat.cli.register_workflow import register_optimizer_callback
 from nat.plugins.opentelemetry.register import LangsmithTelemetryExporter
 
 
 @register_eval_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_eval_callback(config, **kwargs):
+def _build_langsmith_eval_callback(config: LangsmithTelemetryExporter, **kwargs: Any) -> Any:
+    """Build a LangSmith evaluation callback from the telemetry exporter config."""
     from .langsmith_evaluation_callback import LangSmithEvaluationCallback
 
     return LangSmithEvaluationCallback(project=config.project)
 
 
 @register_optimizer_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_optimizer_callback(config, *, dataset_name=None, **kwargs):
+def _build_langsmith_optimizer_callback(config: LangsmithTelemetryExporter, *, dataset_name: str | None = None, **kwargs: Any) -> Any:
+    """Build a LangSmith optimization callback from the telemetry exporter config."""
     from .langsmith_optimization_callback import LangSmithOptimizationCallback
 
     return LangSmithOptimizationCallback(project=config.project, dataset_name=dataset_name)

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values" and "Provide Google-style docstrings for every public module, class, function and CLI command."

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)

30-31: Add a return type annotation to _build_optimizer_callback_manager.

Per coding guidelines, all functions in {src,packages}/**/*.py require type hints on parameters and return values.

Suggested fix
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> Any:

Or more precisely, once the import is available:

+from nat.profiler.parameter_optimization.optimizer_callbacks import OptimizerCallbackManager
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> OptimizerCallbackManager | None:

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."


64-65: Accessing the private _callbacks attribute of OptimizerCallbackManager.

Consider adding a public method (e.g., has_callbacks property or __bool__) to OptimizerCallbackManager instead of reaching into its internals.

packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (1)

183-187: Silent except Exception: pass in cleanup blocks — consider logging.

Per coding guidelines, when catching exceptions without re-raising, use logger.exception() or logger.debug() to capture the stack trace. This aids debugging when cleanup unexpectedly fails.

Suggested fix
         except Exception:
-            pass
+            logger.debug("Cleanup failed", exc_info=True)

As per coding guidelines: "When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information."

Also applies to: 233-237

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)

199-220: pick_trial is called twice with the same arguments.

pick_trial is invoked at line 199 to get best_params and again at line 207 for the callback. Depending on the cost of pick_trial, this is redundant — reuse the first result.

Suggested fix
-    best_params = pick_trial(
+    best_trial_obj = pick_trial(
         study=study,
         mode=optimizer_config.multi_objective_combination_mode,
         weights=weights,
-    ).params
+    )
+    best_params = best_trial_obj.params
+    best_trial_obj = pick_trial(
 
     if callback_manager is not None:
         from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-        best_trial_obj = pick_trial(
-            study=study,
-            mode=optimizer_config.multi_objective_combination_mode,
-            weights=weights,
-        )
         callback_manager.on_study_end(

145-191: EvalResult construction block is near-identical in prompt_optimizer.py.

The logic to build EvalResult/EvalResultItem from eval_output (lines 150–179) is duplicated in prompt_optimizer.py (lines 422–453). Consider extracting a shared helper (e.g., build_eval_result_from_output) to reduce duplication and ease future maintenance.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)

417-470: Unused loop variable idx on line 421.

Ruff flags idx (B007). Since it's not used in the loop body, prefix it with _.

Fix
-                for idx, ind in enumerate(population):
+                for _idx, ind in enumerate(population):
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (1)

224-239: get_eval_project_name iterates all projects to find the max Run #.

self._client.list_projects() may return a large number of projects in production environments. This could become a latency bottleneck. Consider filtering by name prefix if the LangSmith API supports it, or caching the result.

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (1)

282-330: Template format detection is well-reasoned.

The detection priority is clearly documented, and the mustache-before-{# ordering correctly avoids the substring ambiguity. The intentional default to jinja2 for bare {{ is reasonable for the Python/LangChain ecosystem.

One note: lines 323–327 could be slightly simplified since both branches return "jinja2":

Optional simplification
         # 4. {{ }} present — disambiguate via expression keywords
         if "{{" in text:
-            if any(kw in text for kw in cls._JINJA2_EXPR_KEYWORDS):
-                return "jinja2"
-            # Plain {{ }} — default to jinja2 (more common in Python)
             return "jinja2"

However, the current form does serve as documentation of the detection reasoning, so keeping it explicit is also fine.
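The full detection priority can be sketched as a standalone function. This is a simplified reconstruction from the description above, not the PR's actual code — the keyword set and the `"f-string"` fallback are assumptions:

```python
def detect_template_format(text: str) -> str:
    """Sketch of the detection priority described above.

    Checks are ordered so mustache section tags ``{{#``/``{{/`` are
    recognised before the jinja2 comment marker ``{#``, avoiding the
    substring ambiguity the review highlights.
    """
    if "{%" in text:                       # 1. jinja2 statement blocks
        return "jinja2"
    if "{{#" in text or "{{/" in text:     # 2. mustache sections — before '{#'
        return "mustache"
    if "{#" in text:                       # 3. jinja2 comments
        return "jinja2"
    if "{{" in text:                       # 4. bare {{ }} — default to jinja2
        return "jinja2"
    return "f-string"                      # fallback format (assumed here)
```

Note that because steps 2 and 3 both contain the substring `{#`, reordering them would misclassify every mustache section as a jinja2 comment.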

Comment thread packages/nvidia_nat_core/src/nat/observability/exporter_manager.py Outdated

@coderabbitai (bot) left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py`:
- Around line 66-70: The set_prompt_param_names method currently iterates
callbacks and calls set_prompt_param_names without protection, so an exception
in one callback will abort the loop; update set_prompt_param_names in the
callback manager to mirror other dispatchers by wrapping each per-callback
invocation in a try/except that catches exceptions from fn(names), logs or
handles the error consistently with existing patterns (use the same
logger/behavior used in other methods), and continues to the next callback
(refer to self._callbacks iteration and the getattr(cb,
"set_prompt_param_names", None) lookup to locate the exact spot to modify).

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 461-479: on_study_end currently fetches the most recent commit
(limit=1, offset=0) and may tag the wrong commit; instead locate the exact
commit for the best trial using the stored mapping
self._prompt_commit_urls[(param_name, best_trial.trial_number)] and call
self._client._create_commit_tags with that commit identifier (or URL) to add the
"best" tag; if the mapping is missing, re-push the prompt for that param/trial
via the existing _push_prompt(param_name, prompt_text, tags=["best"]) (or
equivalent) so the correct commit is created and tagged.
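The exception-isolating dispatch requested in the first instruction can be sketched as follows. The class and method names mirror the review comment; the real manager in the PR may differ:

```python
import logging

logger = logging.getLogger(__name__)


class OptimizerCallbackManager:
    """Sketch: dispatch to every callback even if one of them raises."""

    def __init__(self, callbacks: list) -> None:
        self._callbacks = list(callbacks)

    def set_prompt_param_names(self, names: list[str]) -> None:
        """One failing callback must not abort dispatch to the rest."""
        for cb in self._callbacks:
            fn = getattr(cb, "set_prompt_param_names", None)
            if fn is None:
                continue
            try:
                fn(names)
            except Exception:
                logger.exception("Callback %r failed in set_prompt_param_names", cb)
```

With this shape, an exception in the first callback is logged with its stack trace and later callbacks still receive the parameter names.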
🧹 Nitpick comments (13)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (3)

322-327: Dead branch — keyword check has no effect on the outcome.

When {{ is present, both the keyword-match path and the fallback return "jinja2", making the any(kw in text ...) check superfluous. If the intent is to distinguish mustache {{ }} from jinja2, the keyword-absent path should return "mustache" (or at least document why both resolve to jinja2). As-is, it's misleading to future readers.

Proposed simplification (if both cases really are jinja2)
         # 4. {{ }} present — disambiguate via expression keywords
         if "{{" in text:
-            if any(kw in text for kw in cls._JINJA2_EXPR_KEYWORDS):
-                return "jinja2"
-            # Plain {{ }} — default to jinja2 (more common in Python)
             return "jinja2"

414-435: Inconsistent value types in prompt_urls return dict.

On success, prompt_urls[param_name] is a URL string (line 414). On conflict, it's the repo_name slug (line 418). On failure, it's the raw prompt_text (line 435). Downstream consumers of this dict cannot reliably distinguish these cases. Consider using a consistent type or a richer return structure (e.g., a dataclass/named tuple with status).


53-54: Missing docstrings on public methods.

set_prompt_param_names, on_trial_end, and on_study_end are public API methods but lack Google-style docstrings. As per coding guidelines, every public function requires a docstring with a concise first-line description ending with a period.

packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (2)

51-57: Protocol methods lack docstrings.

Per coding guidelines, public APIs should have Google-style docstrings. The EvalCallback protocol and its methods would benefit from brief docstrings describing when each hook is invoked and what the caller guarantees.


98-100: Deeply chained getattr is fragile and hard to read.

The triple-nested getattr works but is opaque. A small helper or sequential approach improves readability and debuggability.

♻️ Suggested refactor
 def get_tracing_configs(config: Any) -> dict[str, Any]:
     """Extract tracing configs from a loaded NAT config object."""
-    return getattr(getattr(getattr(config, 'general', None), 'telemetry', None), 'tracing', None) or {}
+    general = getattr(config, 'general', None)
+    telemetry = getattr(general, 'telemetry', None)
+    return getattr(telemetry, 'tracing', None) or {}
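If this pattern recurs, a small generic helper is another option — a sketch, with `SimpleNamespace` standing in for the loaded NAT config object:

```python
from types import SimpleNamespace
from typing import Any


def chain_getattr(obj: Any, *names: str, default: Any = None) -> Any:
    """Resolve an attribute chain, stopping at the first missing or None link."""
    for name in names:
        obj = getattr(obj, name, None)
        if obj is None:
            return default
    return obj


# SimpleNamespace stands in for the loaded NAT config:
cfg = SimpleNamespace(general=SimpleNamespace(telemetry=SimpleNamespace(tracing={"x": 1})))
```

`get_tracing_configs` would then reduce to `chain_getattr(config, 'general', 'telemetry', 'tracing', default={})`, and each hop is individually debuggable.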
packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)

159-195: Missing return type annotation.

Per coding guidelines, all functions should use type hints for parameters and return values. _build_eval_callback_manager is missing a return type.

♻️ Proposed fix
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:

You'll need a TYPE_CHECKING guard for the import:

from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from nat.eval.eval_callbacks import EvalCallbackManager

186-190: Side-effect mutation of config.config_file is subtle.

config.config_file = loaded (line 190) silently replaces the caller's Path with a loaded model object. This works because run_and_evaluate passes config to EvaluationRun, but it's a non-obvious side-effect. A brief inline comment explaining why this mutation is needed would help future maintainers.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)

30-31: Missing type annotations on _build_optimizer_callback_manager.

Per coding guidelines, all Python methods should have type hints for parameters and return values. The base_cfg parameter and the return type are untyped.

♻️ Proposed fix
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Config) -> OptimizerCallbackManager | None:

Use TYPE_CHECKING guards as needed to avoid circular imports.


44-51: Duplicate extraction of ds_cfg / file_path.

The dataset config path is resolved identically at lines 46–49 and again at lines 75–78. Extract this into a local helper or resolve once before the loop.

♻️ Suggested refactor
+        # Resolve dataset config once
+        ds_cfg = None
+        file_path = None
+        try:
+            ds_cfg = base_cfg.eval.general.dataset
+            file_path = getattr(ds_cfg, 'file_path', None)
+        except Exception:
+            logger.debug("Could not extract dataset config", exc_info=True)
+
         # Extract dataset name from eval config (runtime concern, not plugin-specific)
         opt_dataset_name = None
-        try:
-            ds_cfg = base_cfg.eval.general.dataset
-            file_path = getattr(ds_cfg, 'file_path', None)
-            if file_path:
-                opt_dataset_name = Path(file_path).stem
-        except Exception:
-            logger.debug("Could not extract dataset name from config", exc_info=True)
+        if file_path:
+            opt_dataset_name = Path(file_path).stem
         ...
         # Pre-create experiments ...
         try:
-            ds_cfg = base_cfg.eval.general.dataset
-            file_path = getattr(ds_cfg, 'file_path', None)
             if file_path:
                 fp = Path(file_path)

Also applies to: 75-78

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)

145-191: Use strict=True in zip() calls to catch length mismatches early.

eval_metrics and avg_scores should always have the same length, but zip(..., strict=True) makes this assumption explicit and fails fast if violated. Applies to lines 179, 187, and 212.

♻️ Proposed fix (line 179 example; apply similarly to 187 and 212)
-                eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores)), items=cb_items)
+                eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores, strict=True)), items=cb_items)

152-178: O(n × m) item-to-score matching via nested loops.

For each input_item, the inner loop scans all eval_output_items of every evaluator to find a matching ID. This is quadratic in the number of items × evaluators. For typical eval datasets this is fine, but if datasets grow, pre-indexing eval_output_items by ID into a dict would be more efficient.
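The pre-indexing idea, sketched with a hypothetical item shape (`EvalOutputItem` with `id`/`score` fields is assumed, not the PR's actual type):

```python
from dataclasses import dataclass


@dataclass
class EvalOutputItem:
    """Hypothetical shape of one evaluator output row."""
    id: str
    score: float


def index_scores_by_id(
    evaluator_outputs: dict[str, list[EvalOutputItem]],
) -> dict[str, dict[str, float]]:
    """Build {evaluator_name: {item_id: score}} once, so per-item lookups are O(1)."""
    return {
        name: {item.id: item.score for item in items}
        for name, items in evaluator_outputs.items()
    }
```

Building the index once up front makes the overall matching O(n + m) instead of O(n × m).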

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (2)

42-51: set_prompt_param_names and get_trial_project_name are duck-typed but absent from the Protocol.

The OptimizerCallback protocol defines pre_create_experiment, on_trial_end, and on_study_end, but the manager also dispatches set_prompt_param_names and get_trial_project_name via getattr. This makes the callback contract incomplete — implementors won't know about these optional hooks from the Protocol alone. Consider either adding them to the Protocol (with default no-op implementations or as optional methods) or documenting the duck-typed extension points in the class/module docstring.

Also applies to: 66-70, 86-95
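One way to fold the duck-typed hooks into the contract — a sketch with assumed signatures (the real hook parameters may differ). Explicit subclasses of a `Protocol` inherit any method bodies it defines, so optional hooks can carry no-op defaults:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class OptimizerCallback(Protocol):
    """Callback contract including the previously duck-typed extension hooks."""

    def pre_create_experiment(self, dataset_name) -> None: ...

    def on_trial_end(self, result) -> None: ...

    def on_study_end(self, result) -> None: ...

    # Optional extension hook — the no-op default means explicit subclasses
    # can simply ignore it, while the Protocol still documents its existence.
    def set_prompt_param_names(self, names: list[str]) -> None:
        return None
```

Structural (non-subclassing) implementors would still need to define the optional hook themselves, so documenting the extension points in the module docstring remains worthwhile either way.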


30-39: TrialResult and OptimizerCallback lack Google-style docstrings.

Per coding guidelines, public classes should have docstrings. TrialResult fields are well-commented inline, but a class-level docstring describing its purpose would complete the documentation.

@pastorsj force-pushed the spastoriza/langsmith_optimizations branch 2 times, most recently from 5853f04 to 4f597d6 on February 15, 2026 14:19
Comment thread packages/nvidia_nat_vanna/uv.lock
…callbacks

Add LangSmith integration for NAT evaluation and optimization flows:

- Eager trace linking: pre-generate OTEL root span_ids in the eval loop
  so LangSmith run_ids can be derived deterministically (span_id → UUID
  mapping), enabling immediate update_run()/create_feedback() calls
  without waiting for list_runs() indexing (10-40s+ delay eliminated).
  Falls back to substring matching when eager linking fails.

- Evaluation callback: creates LangSmith datasets/experiments, links
  OTEL traces to dataset examples, attaches evaluator feedback scores.

- Optimization callback: per-trial experiment projects with OTEL trace
  linking, prompt version management via LangSmith prompt repos, and
  dynamic retry budget for the substring-matching fallback path.

- Scoped via needs_root_span_ids opt-in so non-LangSmith exporters
  are completely unaffected by the core package changes.

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@pastorsj force-pushed the spastoriza/langsmith_optimizations branch from c36fc35 to 7e8b324 on February 20, 2026 06:50
Comment thread packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py Outdated
Comment thread packages/nvidia_nat_core/src/nat/observability/exporter_manager.py
pastorsj and others added 3 commits February 20, 2026 13:40
Revert ExporterManager to per-item isolated lifecycle. Each eval item
starts/stops its own exporter independently. Removes ref counting,
shared Subject forwarding, and the eval loop pre-start wrapper.
Restores unconditional wait_for_all_export_tasks_local before callbacks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Tracing config extraction is an observability concern, not eval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
…ntation

Signed-off-by: Santiago Pastoriza <spastoriza@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Comment thread packages/nvidia_nat_core/src/nat/parameter_optimization/parameter_optimizer.py Outdated
Comment thread packages/nvidia_nat_core/src/nat/parameter_optimization/parameter_optimizer.py Outdated
Comment thread packages/nvidia_nat_core/src/nat/parameter_optimization/prompt_optimizer.py Outdated
Comment thread packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py Outdated
@pastorsj force-pushed the spastoriza/langsmith_optimizations branch from 6ed3e23 to 9b4ad9a on February 21, 2026 20:45
pastorsj and others added 3 commits February 21, 2026 16:44
Address PR review "extract to own method" nits by pulling inline
callback code into dedicated functions:

- build_eval_result() in eval_callbacks.py — single helper replacing
  the ~25-line EvalResult construction block duplicated in 3 callers
- _fire_on_eval_complete() on EvaluationRun in evaluate.py
- _fire_numeric_trial_end() / _fire_numeric_study_end() in
  parameter_optimizer.py
- _fire_prompt_trial_end_callbacks() / _fire_prompt_study_end() in
  prompt_optimizer.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@pastorsj force-pushed the spastoriza/langsmith_optimizations branch from aa1f798 to 31e5615 on February 22, 2026 15:27
Contributor

@dnandakumar-nv left a comment


Approve pending resolution of minor nits below. Also need a review from @NVIDIA/nat-dep-approvers due to pyproject.toml change.

Alternatively, @pastorsj I think you can remove the langsmith library dependency from the pyproject.toml as it is already implicitly downloaded given our current dependencies.

Comment thread packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py Outdated
dnandakumar-nv and others added 2 commits February 22, 2026 16:33
- Add type hint for callback_manager param in EvaluationRun.__init__
- Deduplicate _ensure_dataset and pre_create_experiment into shared
  _create_dataset_with_examples helper
- Remove langsmith from explicit dependencies (transitive via langchain-core)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test 0abb128

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test 11a9bfd

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test 1248bab

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test 3e3ca58

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test f6ead01

pastorsj and others added 2 commits February 23, 2026 12:36
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
@dnandakumar-nv
Contributor

/ok to test 962d799

@dnandakumar-nv
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit bd6f4cf into NVIDIA:develop Feb 23, 2026
20 of 21 checks passed

Labels

feature request New feature or request non-breaking Non-breaking change

4 participants