Integrate LangSmith Observability with Evaluation and Optimization#1593
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the repository settings.
Walkthrough

Adds pluggable evaluation and optimizer callback frameworks, wires callbacks into evaluation and numeric/prompt optimization flows with per-trial tracing, implements LangSmith evaluation/optimization callbacks and registration, updates exporter lifecycle and data models, and includes extensive unit and integration tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/Runner
    participant Registry as GlobalTypeRegistry
    participant CallbackMgr as EvalCallbackManager
    participant EvalRun as EvaluationRun
    participant Exporter as TelemetryExporter
    participant LangSmith as LangSmith API
    CLI->>Registry: query eval callback factories
    Registry-->>CallbackMgr: instantiate callbacks
    CLI->>EvalRun: create with callback_manager
    EvalRun->>Exporter: start (context)
    EvalRun->>CallbackMgr: on_dataset_loaded(dataset, items)
    CallbackMgr->>LangSmith: create dataset/examples
    EvalRun->>EvalRun: run evaluators -> produce EvalResult
    EvalRun->>CallbackMgr: on_eval_complete(EvalResult)
    CallbackMgr->>LangSmith: match OTEL runs, link runs, attach feedback
    Exporter->>Exporter: stop
```
```mermaid
sequenceDiagram
    participant CLI as Optimizer CLI
    participant Registry as GlobalTypeRegistry
    participant OptMgr as OptimizerCallbackManager
    participant ParamOpt as ParameterOptimizer
    participant PromptOpt as PromptOptimizer
    participant LangSmith as LangSmith API
    CLI->>Registry: query optimizer callback factories
    Registry-->>OptMgr: instantiate callbacks
    CLI->>ParamOpt: optimize_parameters(callback_manager=OptMgr)
    ParamOpt->>OptMgr: pre_create_experiment(dataset_items)
    loop per trial
        ParamOpt->>OptMgr: on_trial_end(TrialResult)
        OptMgr->>LangSmith: push prompts / link runs
    end
    ParamOpt->>OptMgr: on_study_end(best_trial, total_trials)
    ParamOpt->>PromptOpt: optimize_prompts(callback_manager=OptMgr, trial_offset)
    PromptOpt->>OptMgr: on_trial_end(TrialResult)
    OptMgr->>LangSmith: push prompts / tag best
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)
199-220: 🛠️ Refactor suggestion | 🟠 Major
`pick_trial` is called twice with identical arguments.

`pick_trial(study, mode=..., weights=...)` is invoked on line 199 for `best_params` and again on line 207 for the callback. Depending on study state this is at minimum wasteful, and for non-deterministic modes it could even select a different trial. Reuse the result.

Proposed fix:

```diff
 best_trial_obj = pick_trial(
     study=study,
     mode=optimizer_config.multi_objective_combination_mode,
     weights=weights,
 )
+best_params = best_trial_obj.params
 if callback_manager is not None:
     from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-    best_trial_obj = pick_trial(
-        study=study,
-        mode=optimizer_config.multi_objective_combination_mode,
-        weights=weights,
-    )
     callback_manager.on_study_end(
         best_trial=TrialResult(
             trial_number=best_trial_obj.number,
```

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)
146-165: ⚠️ Potential issue | 🔴 Critical

NameError: `_numeric_trial_count` is undefined when numeric optimization is disabled.

If `base_cfg.optimizer.numeric.enabled` is `False`, the assignment at line 148 is skipped, but line 164 still references `_numeric_trial_count`. This will crash with a `NameError` when only prompt optimization is enabled.

Proposed fix:

```diff
 # ---------------- 3. numeric / enum tuning ----------- #
 tuned_cfg = base_cfg
 best_numeric_params: dict = {}
+_numeric_trial_count: int = 0
 if base_cfg.optimizer.numeric.enabled:
     tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
```
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-eval.yml`:
- Around line 43-44: The multi-line prompt string contains an extra leading
indentation on the line starting with "suspicious signals..." which introduces
unintended spaces into the prompt; open the YAML value that begins with "Examine
the following email content..." and remove the four leading spaces from the
"suspicious signals..." line so it lines up with the other lines in that block
scalar, ensuring the block's indentation remains consistent and the prompt text
has no stray leading whitespace.
- Around line 82-88: The YAML under the "dataset" key has incorrect indentation:
keys like _type, file_path, id_key and the nested structure/question_key and
structure/answer_key are indented too far; fix by reducing their indentation so
they are two spaces deeper than the "dataset:" parent (i.e., align _type,
file_path, id_key at one extra indent level under dataset and indent structure's
children question_key and answer_key one level further), ensuring consistent
2-space YAML nesting for dataset, _type, file_path, id_key, structure,
question_key, and answer_key.
In `@packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py`:
- Around line 175-189: The loop currently calls manager.get_eval_project_name()
and sets exporter_config.project and config.config_file inside each iteration
(within the for _name, exporter_config in tracing loop), which prevents
later-registered callbacks from contributing the project name and repeatedly
patches config; move the eval project resolution and config patching out of the
loop: keep the registration steps (registry.get_eval_callback(...),
registered.factory_fn(...), manager.register(cb)) inside the loop, then after
the loop call eval_project = manager.get_eval_project_name(), and if
eval_project and hasattr(exporter_config, 'project') set exporter_config.project
= eval_project and set config.config_file = loaded exactly once; use the same
identifiers (tracing, registry.get_eval_callback, registered.factory_fn,
manager.register, manager.get_eval_project_name, exporter_config.project,
config.config_file, loaded) to locate and change the code.
In `@packages/nvidia_nat_core/src/nat/cli/register_workflow.py`:
- Around line 786-813: Add explicit type hints to both inner decorator functions
by annotating the fn parameter and return type (e.g., change def
register_inner(fn): to def register_inner(fn: Callable[..., Any]) ->
Callable[..., Any]:) in register_eval_callback and register_optimizer_callback;
also add the necessary import (from typing import Any, Callable) at the top of
the file if not present so static typing requirements for public APIs are
satisfied, and ensure you apply the same annotation pattern to both
register_inner closures (the ones used with RegisteredEvalCallback and
RegisteredOptimizerCallback).
In `@packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py`:
- Around line 82-91: The get_eval_project_name method currently swallows all
exceptions from callbacks (self._callbacks) which hides real errors; update the
except block in get_eval_project_name to log the exception details (including
which callback instance raised it) at debug level instead of silently passing —
e.g., capture the exception as e and call a logger.debug with a clear message
and the exception (use an existing logger on the module/class or
logging.getLogger(__name__)), then continue to the next callback.
In `@packages/nvidia_nat_core/src/nat/eval/evaluate.py`:
- Line 632: Remove the redundant local import "from pathlib import Path" in
evaluate.py (the duplicate of the module-level import of Path); locate the
duplicate import near the evaluate logic and delete that line so the
module-level Path import (already present) is used everywhere.
In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py`:
- Around line 44-51: The try/except around resolving opt_dataset_name silently
swallows errors; update the except block in optimizer_runtime.py (the block that
reads base_cfg.eval.general.dataset/ds_cfg and file_path to set
opt_dataset_name) to log the exception instead of pass—use an existing module
logger (e.g., logger.exception(...) or logger.debug(...) with traceback) so
failures in retrieving ds_cfg or Path(file_path).stem are recorded and aid
debugging.
- Around line 83-96: The file reading in optimizer_runtime.py (the block that
builds EvalInputItem entries when checking fp.suffix == '.csv' and fp.suffix ==
'.json') uses open(fp) without an explicit encoding; update both open calls to
specify encoding="utf-8" (for CSV also include newline='' if using
csv.DictReader) so that reading into rows and creating EvalInputItem (using
id_key, q_key, a_key, dataset_items) is consistent and locale-independent.
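The encoding fix described above can be shown with a small self-contained sketch (temporary files stand in for the real dataset paths). Per the `csv` module documentation, files passed to `csv.DictReader` should be opened with `newline=""` so quoted fields containing newlines are handled by the csv module itself.

```python
# Sketch of explicit-encoding reads for CSV and JSON dataset files.
import csv
import json
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
csv_path = tmp / "items.csv"
csv_path.write_text("id,question,answer\n1,What is 2+2?,4\n", encoding="utf-8")
json_path = tmp / "items.json"
json_path.write_text('[{"id": "1"}]', encoding="utf-8")

# newline="" defers newline handling to the csv module, per its docs.
with open(csv_path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

with open(json_path, encoding="utf-8") as f:
    data = json.load(f)
```

Without the explicit `encoding`, `open()` falls back to the locale's preferred encoding, which varies across machines and CI environments.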
In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py`:
- Line 52: The function is currently annotated as "-> Config" but actually returns
a 3-tuple (tuned_cfg, dict(best_params), optimizer_config.numeric.n_trials);
update the function signature to a matching type annotation (e.g., ->
Tuple[Config, Dict[str, Any], int]) and add the necessary typing imports (Tuple,
Dict, Any) at top, or alternatively change the return to a single Config if that
matches the intended API; reference the returned symbols tuned_cfg, best_params,
and optimizer_config.numeric.n_trials when making the change so the annotation
matches the actual return value.
- Line 141: The comment "Persist raw per‑repetition scores so they appear in
`trials_dataframe`." contains a NON-BREAKING HYPHEN (U+2011); replace it with a
regular hyphen-minus (-) so the comment reads "Persist raw per-repetition scores
so they appear in `trials_dataframe`." and save the change in
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py
ensuring the file remains UTF-8 encoded.
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py`:
- Around line 97-100: The loop currently sleeps at the top of each retry
(time.sleep(retry_delay)) causing an unnecessary delay before the first check;
modify the retry logic in langsmith_evaluation_callback.py (inside the loop that
uses total_matched, attempt, max_retries, retry_delay) so that the sleep is
skipped on the first iteration (e.g., only call time.sleep(retry_delay) when
attempt > 1) or move the sleep to the end of the loop so the code checks for
runs immediately on the initial attempt and only waits between subsequent
retries.
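The retry shape the prompt describes can be sketched generically: check first, sleep only between attempts. `probe` and the counters below are hypothetical stand-ins for the run-matching logic.

```python
# Sketch of a retry loop that skips the delay before the first check.
import time


def wait_for_runs(probe, max_retries: int = 5, retry_delay: float = 2.0) -> bool:
    for attempt in range(1, max_retries + 1):
        if attempt > 1:  # no sleep before the very first check
            time.sleep(retry_delay)
        if probe():
            return True
    return False


calls = []


def probe():
    calls.append(1)
    return True  # matched on the first attempt: no sleep happens at all


ok = wait_for_runs(probe, retry_delay=0.0)
```

Moving the sleep behind the `attempt > 1` guard (or to the bottom of the loop) means a fast-path match pays zero latency.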
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 107-128: Both _ensure_dataset and pre_create_experiment duplicate
dataset creation and example population logic; extract a shared helper (e.g.,
_create_dataset_with_examples) that accepts the iterable of items and a small
adapter/caller to normalize item fields (id vs item_id and expected_output vs
expected_output_obj) so both methods call the same routine. The helper should
check self._dataset_id, call self._get_run_number and _humanize_dataset_name,
invoke self._client.create_dataset and loop items to call
self._client.create_example while populating self._example_ids and setting
self._dataset_id and self._dataset_name; update _ensure_dataset and
pre_create_experiment to pass the appropriate adapter or lambda to map each item
to (item_id_str, input_obj_str, expected_output_str) before delegating to the
new helper.
- Around line 253-259: Replace the silent broad except in the prompt-listing
loop with explicit exception handling: catch Exception as e and log a warning
that includes the exception and context (e.g., the `base` query) before
continuing, so failures in `self._client.list_prompts(...)` won’t be silently
swallowed; reference `self._client.list_prompts`, `pattern.match`,
`prompt.repo_handle`, and `max_run` when adding the message (use
`self._logger.warning(...)` or the module logger if `self._logger` is not
available) and include the exception info/stacktrace.
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py`:
- Around line 144-183: The test test_on_trial_end_links_otel_runs is slow
because _link_otel_runs -> _match_and_link_otel_runs calls time.sleep(10.0);
patch time.sleep in the test to a no-op to avoid the 10s delay. Fix by adding a
unittest.mock.patch (or pytest monkeypatch) around the call to
opt_cb.on_trial_end to replace time.sleep with a lambda or function that returns
immediately (e.g., with patch("time.sleep", return_value=None) or
patch.object(the_module, "time.sleep", return_value=None) if imported locally),
ensuring the patch targets the same time.sleep used by _match_and_link_otel_runs
so the test runs fast while still asserting
update_run/create_feedback/create_run behavior.
- Around line 84-119: The test is slow because _match_and_link_otel_runs (in
langsmith_evaluation_callback.py) calls time.sleep with the default retry_delay;
patch time.sleep in the test to a no-op so retries don't block. In the
test_on_eval_complete_links_otel_runs test, mock or monkeypatch time.sleep
(e.g., monkeypatch.setattr(time, "sleep", lambda _: None) or use
unittest.mock.patch for time.sleep) before calling eval_cb.on_eval_complete so
the retries run instantly and the assertions still execute; keep the patch
scoped to the test so other tests are unaffected.
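A minimal, runnable sketch of the patching pattern (the `slow_retry` function is a hypothetical stand-in for the callback's retry loop). The patch target must be the same `time.sleep` the code under test resolves; patching the `time` module's attribute covers code that does `import time; time.sleep(...)`.

```python
# Sketch of neutralizing time.sleep in a test via unittest.mock.patch.
import time
from unittest.mock import patch


def slow_retry() -> str:
    for _ in range(3):
        time.sleep(10.0)  # would cost 30 s without the patch
    return "done"


with patch("time.sleep", return_value=None) as mock_sleep:
    result = slow_retry()
```

The mock also records call counts, so the test can still assert that the retry loop attempted the expected number of waits.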
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 130-181: The optimizer integration tests
(test_optimizer_callback_creates_trial_runs_and_summary and its sibling test)
create LangSmith projects/runs via LangSmithOptimizationCallback and
langsmith_client but do not clean them up, causing orphaned resources; wrap the
test body in a try/finally (or add a teardown fixture) that calls
langsmith_client.delete_project(langsmith_project_name) or otherwise deletes the
created runs/projects in the finally block to mirror the cleanup in
test_eval_callback_creates_dataset_runs_and_feedback; locate the setup using
LangSmithOptimizationCallback and the assertions that poll
langsmith_client.list_runs(...) and ensure deletion uses the same
langsmith_project_name and langsmith_client symbols.
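The try/finally cleanup shape can be sketched with an in-memory stand-in for the LangSmith client (the `FakeClient` and method bodies below are hypothetical; only `create_project`/`delete_project` mirror names used in the review).

```python
# Sketch of guaranteed project cleanup around an integration-test body.
class FakeClient:
    def __init__(self):
        self.projects = set()

    def create_project(self, name):
        self.projects.add(name)

    def delete_project(self, project_name):
        self.projects.discard(project_name)


def run_optimizer_test(client, project_name):
    client.create_project(project_name)
    try:
        # ... exercise the callback and poll client.list_runs(...) here ...
        return True
    finally:
        # Runs even if the assertions above raise, so no orphaned projects.
        client.delete_project(project_name)


client = FakeClient()
passed = run_optimizer_test(client, "nat-opt-integration")
```

A pytest fixture with a `yield` followed by the deletion achieves the same guarantee and keeps the test body free of cleanup code.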
🧹 Nitpick comments (21)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/register.py (1)
22-32: Add type hints to callback factory functions.

Per coding guidelines, all parameters and return values should have type hints. Both `config` and `dataset_name` are untyped, and neither function has a return type annotation.

♻️ Proposed fix:

```diff
 @register_eval_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_eval_callback(config, **kwargs):
+def _build_langsmith_eval_callback(config: LangsmithTelemetryExporter, **kwargs) -> "LangSmithEvaluationCallback":
     from .langsmith_evaluation_callback import LangSmithEvaluationCallback

     return LangSmithEvaluationCallback(project=config.project)


 @register_optimizer_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_optimizer_callback(config, *, dataset_name=None, **kwargs):
+def _build_langsmith_optimizer_callback(
+    config: LangsmithTelemetryExporter, *, dataset_name: str | None = None, **kwargs
+) -> "LangSmithOptimizationCallback":
     from .langsmith_optimization_callback import LangSmithOptimizationCallback

     return LangSmithOptimizationCallback(project=config.project, dataset_name=dataset_name)
```

As per coding guidelines: "Python methods should use type hints for all parameters and return values."
examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-optimize.yml (1)
89-95: Inconsistent indentation under `dataset:`.

The `dataset:` children use a 4-space indent step (8 spaces total) while the rest of the file consistently uses 2-space steps. This won't break YAML parsing but deviates from the repo convention.

♻️ Proposed fix:

```diff
   dataset:
-      _type: csv
-      file_path: examples/evaluation_and_profiling/email_phishing_analyzer/data/smaller_test.csv
-      id_key: "subject"
-      structure:
-          question_key: body
-          answer_key: label
+    _type: csv
+    file_path: examples/evaluation_and_profiling/email_phishing_analyzer/data/smaller_test.csv
+    id_key: "subject"
+    structure:
+      question_key: body
+      answer_key: label
```

Based on learnings: "In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation."
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (3)
54-56: Swallowed exceptions lose stack trace context.

When catching exceptions without re-raising, the coding guidelines require `logger.exception()` (or at least `exc_info=True`) to capture the full stack trace. Lines 55 and 66 use bare `logger.debug()`, which discards the exception details and makes debugging API failures harder.

♻️ Proposed fix:

```diff
 except Exception:
-    logger.debug("Could not link run %s to example %s", run.id, example_id)
+    logger.debug("Could not link run %s to example %s", run.id, example_id, exc_info=True)
     return False
 for metric_name, score in item.scores.items():
     try:
         ...
     except Exception:
-        logger.debug("Could not attach feedback %s to run %s", metric_name, run.id)
+        logger.debug("Could not attach feedback %s to run %s", metric_name, run.id, exc_info=True)
```

As per coding guidelines: "When catching and logging exceptions without re-raising: always use `logger.exception()` to capture the full stack trace information."

Also applies to: 65-66
192-207: `get_eval_project_name` iterates all LangSmith projects to find the next run number.

`self._client.list_projects()` fetches every project in the workspace. For accounts with many projects, this will be slow and wasteful. Consider filtering by name prefix if the LangSmith SDK supports it, or using a search/query parameter.
138-147: Bidirectional substring matching can produce false positives with short or common input texts.

The condition `item_input in run_input or run_input in item_input` will spuriously match if one string is short (e.g., `"1"` would match any run containing `"1"`). Given that evaluation inputs are typically full questions this may be unlikely in practice, but consider adding a minimum-length guard or preferring exact matching first and falling back to substring only for unmatched items.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (1)
86-90: Silent exception swallowing hides failures in `get_trial_project_name`.

The `except Exception: pass` discards all context, making it impossible to diagnose why a callback failed to return a project name. A `logger.debug` call would preserve troubleshooting ability while keeping the "try next callback" flow.

♻️ Proposed fix:

```diff
 try:
     return fn(trial_number)
 except Exception:
-    pass
+    logger.debug("Callback %s.get_trial_project_name failed", type(cb).__name__, exc_info=True)
 return None
```

Note: The analogous `get_eval_project_name` in `EvalCallbackManager` uses the same silent `pass` pattern — consider updating both for consistency.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)
179-179: Add `strict=True` to `zip()` calls for defensive length checking.

`eval_metrics` and `avg_scores`/`best_trial_obj.values` should always be the same length, but `strict=True` converts a silent data-loss bug into an immediate `ValueError` if they ever diverge.

Example for line 179:

```diff
-eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores)), items=cb_items)
+eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores, strict=True)), items=cb_items)
```

Apply analogously to lines 187 and 216.

Also applies to: 187-187, 216-216
180-181: Use `logger.exception()` when catching without re-raising.

Per coding guidelines, when catching and logging exceptions without re-raising, use `logger.exception()` to capture the full stack trace instead of `logger.warning(..., exc_info=True)`.

Proposed fix:

```diff
 except Exception:
-    logger.warning("Failed to build EvalResult for optimizer callback", exc_info=True)
+    logger.exception("Failed to build EvalResult for optimizer callback")
```

As per coding guidelines: "When catching and logging exceptions without re-raising: always use `logger.exception()` to capture the full stack trace information."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (3)
164-172: `_link_otel_runs` re-invokes `get_trial_project_name`, which attempts to re-create the project.

`get_trial_project_name` on line 172 calls `self._client.create_project(...)` if `self._dataset_id` is set, relying on `LangSmithConflictError` to deduplicate. This is harmless but wasteful — consider computing the project name string directly (it's deterministic) or caching the result per trial.
331-348: Missing Google-style docstrings on public callback methods.

`on_trial_end` and `on_study_end` are public interface methods (part of the `OptimizerCallback` protocol). Per coding guidelines, all public functions require Google-style docstrings with a concise first line ending with a period.

Also applies to: 350-371
312-321: Reliance on the private API `_create_commit_tags` for retroactive commit tagging is fragile.

`self._client._create_commit_tags(...)` is an internal/private method of the LangSmith client. While LangSmith v0.6.1+ provides a public API for tagging commits at creation time via `create_commit(..., tags=...)`, the code here uses the private method to retroactively tag existing commits (when a prompt is unchanged, or to mark the best trial's commit). This approach could break without notice on a minor version bump of the `langsmith` package. Consider pinning/bounding the `langsmith` version in your dependency spec to guard against such breakage, or check whether a stable public API exists for retroactive commit tagging.

packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py (1)
29-30: Fixture functions should use the `fixture_` prefix and define the `name` argument.

Per coding guidelines, pytest fixtures should define the `name` argument and use the `fixture_` prefix. This applies to all four fixtures in this file.

Example for the `eval_cb` fixture:

```diff
-@pytest.fixture
-def eval_cb(self):
+@pytest.fixture(name="eval_cb")
+def fixture_eval_cb(self):
     from nat.plugins.langchain.langsmith.langsmith_evaluation_callback import LangSmithEvaluationCallback
     return LangSmithEvaluationCallback(project="test-proj")
```

Apply analogously to the `mock_langsmith` and `opt_cb` fixtures.

As per coding guidelines: "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the `fixture_` prefix."

Also applies to: 38-41, 130-131, 139-142
packages/nvidia_nat_core/tests/nat/profiler/test_parameter_optimizer.py (1)
181-201: Consider extracting the repeated `_DummyEvalRun` into a module-level helper or fixture.

The same `_DummyEvalRun` class (with identical logic) is defined inside every test method. Extracting it to a module-level helper or a `conftest.py` fixture would reduce ~200 lines of duplication. As per coding guidelines, frequently repeated code should be extracted into pytest fixtures.

Also applies to: 291-300, 345-353, 402-411, 458-467, 512-521, 559-568, 613-621, 668-676, 717-726, 775-784, 834-843, 892-901, 946-955, 1004-1013, 1063-1072
packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)
190-190: Avoid accessing the private `_callbacks` attribute directly.

`manager._callbacks` breaks encapsulation. Consider adding a public `has_callbacks` property or `__bool__`/`__len__` to `EvalCallbackManager`.

Example (in `EvalCallbackManager`):

```python
def __bool__(self) -> bool:
    return bool(self._callbacks)
```

Then here:

```diff
-return manager if manager._callbacks else None
+return manager if manager else None
```
159-159: Add return type annotation.

Per coding guidelines, type hints are required on parameters and return values.

Fix:

```diff
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:
```

Note: since `EvalCallbackManager` is imported inside the function, you could use `from __future__ import annotations` or a string literal `"EvalCallbackManager | None"`.

packages/nvidia_nat_core/src/nat/eval/evaluate.py (2)
61-69: Add a type annotation for `callback_manager`.

The parameter lacks a type hint. Per coding guidelines, all parameters should have type hints.

Fix:

```diff
-    def __init__(self, config: EvaluationRunConfig, callback_manager=None):
+    def __init__(self, config: EvaluationRunConfig, callback_manager: Any | None = None):
```

Or, if you want to avoid the `Any`, use a forward reference or a `TYPE_CHECKING` import for `EvalCallbackManager`:

```diff
+from __future__ import annotations
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from nat.eval.eval_callbacks import EvalCallbackManager
 ...
-    def __init__(self, config: EvaluationRunConfig, callback_manager=None):
+    def __init__(self, config: EvaluationRunConfig, callback_manager: EvalCallbackManager | None = None):
```
694-729: Duplicated `EvalResultItem` construction logic — also appears in `prompt_optimizer.py`.

The nested loop that correlates `eval_output_items` to `eval_input_items` and builds `EvalResultItem` objects is repeated almost verbatim in `prompt_optimizer.py` (lines 424-451). Consider extracting a shared helper (e.g., `build_eval_result_from_output(evaluation_results, eval_input, usage_stats) -> EvalResult`) to eliminate this duplication and reduce the risk of drift.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)
30-30: Add type hints for the parameter and return value.

Per coding guidelines, all public APIs (and internal helpers in `src/`) should have type hints on parameters and return values.

Fix:

```diff
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> OptimizerCallbackManager | None:
```

Use `TYPE_CHECKING` for the return type if needed to avoid circular imports.
64-65: Accessing private `manager._callbacks` — same issue as in `evaluate.py`.

Consider adding a public `__bool__` or `has_callbacks` property to `OptimizerCallbackManager` and using it here instead.

packages/nvidia_nat_core/src/nat/cli/type_registry.py (1)
579-608: `register_eval_callback`/`register_optimizer_callback` don't call `_registration_changed()`.

Every other `register_*` method in `TypeRegistry` calls `self._registration_changed()` after mutation. This is likely fine since eval/optimizer callbacks don't affect `Config` model annotations, but the inconsistency could confuse future contributors. Consider adding the call for consistency, or adding a comment explaining why it's omitted.

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)
416-465: Duplicated `EvalResultItem` construction — same logic as `evaluate.py` lines 694-729.

As noted in the `evaluate.py` review, this per-item score correlation and `EvalResultItem` building is copy-pasted. A shared helper would prevent divergence and simplify both call sites.

Additionally, the loop variable `idx` on line 420 is unused (Ruff B007). Use `_` or remove `enumerate`.

Fix for the unused variable:

```diff
-for idx, ind in enumerate(population):
+for ind in population:
```
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)
156-166: ⚠️ Potential issue | 🔴 Critical

Bug: `_numeric_trial_count` is unbound when numeric optimization is disabled.

If `base_cfg.optimizer.numeric.enabled` is `False`, the `optimize_parameters` call at line 148 is skipped, so `_numeric_trial_count` is never assigned. When prompt optimization runs at line 164, this produces a `NameError`.

Proposed fix:

```diff
 # ---------------- 3. numeric / enum tuning ----------- #
 tuned_cfg = base_cfg
 best_numeric_params: dict = {}
+_numeric_trial_count = 0
 if base_cfg.optimizer.numeric.enabled:
     tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
```

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)
199-220: 🛠️ Refactor suggestion | 🟠 Major

Redundant `pick_trial` call.

`pick_trial` is called at line 199 to compute `best_params`, and then called again identically at line 207 inside the `if callback_manager` block. Reuse the trial object from the first call.

Proposed fix:

```diff
-best_params = pick_trial(
+best_trial_obj = pick_trial(
     study=study,
     mode=optimizer_config.multi_objective_combination_mode,
     weights=weights,
-).params
+)
+best_params = best_trial_obj.params
 if callback_manager is not None:
     from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-    best_trial_obj = pick_trial(
-        study=study,
-        mode=optimizer_config.multi_objective_combination_mode,
-        weights=weights,
-    )
     callback_manager.on_study_end(
         best_trial=TrialResult(
```
🤖 Fix all issues with AI agents
In `@packages/nvidia_nat_core/src/nat/cli/register_workflow.py`:
- Around line 782-815: The two public decorator factories register_eval_callback
and register_optimizer_callback lack explicit return type annotations; update
their signatures to include explicit return types such as
Callable[[Callable[..., Any]], Callable[..., Any]] (and also annotate the inner
function register_inner similarly) so the public API has explicit typing. Modify
the signatures of register_eval_callback, register_optimizer_callback and their
inner register_inner functions to add these return type hints and import
typing.Callable/Any if not already imported.
In `@packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py`:
- Around line 82-91: Replace the current exception logging in
get_eval_project_name to match the other handlers: catch the exception from fn()
and call logger.exception with a descriptive message (e.g.,
"get_eval_project_name failed for %s" and type(cb).__name__) instead of
logger.debug(..., exc_info=True); this keeps behavior consistent with
on_dataset_loaded and on_eval_complete and ensures the full traceback is
captured when get_eval_project_name (method in callback objects accessed via fn
= getattr(cb, "get_eval_project_name", None)) raises.
In `@packages/nvidia_nat_core/src/nat/eval/evaluate.py`:
- Around line 61-70: The public constructor EvaluationRun.__init__ is missing a
Python 3.11+ type annotation for the callback_manager parameter; update the
signature to add an appropriate type hint (e.g., CallbackManager | None or the
concrete callback manager protocol/type used across the package) so
callback_manager: CallbackManager | None (or the exact type) is declared, and
ensure imports/forward-references are added if needed; keep the rest of the
constructor (self.callback_manager assignment and EvalConfig typing) unchanged.
- Around line 638-646: The except block around the callback dispatch should use
logger.exception to record the full stack trace instead of logger.warning(...,
exc_info=True); locate the try/except that calls
self.callback_manager.on_dataset_loaded (and the other analogous callback
try/except in the same module where logger.warning is used) and replace the
logger.warning(...) calls with logger.exception("Failed to fire
on_dataset_loaded callback") (or the corresponding message for the other
callback) so the exception and its traceback are captured.
In `@packages/nvidia_nat_core/src/nat/observability/exporter_manager.py`:
- Around line 257-259: The except block currently calls logger.exception("Error
during isolated exporter cleanup: %s", e) which redundantly passes the exception
object; change it to call logger.exception("Error during isolated exporter
cleanup") so logger.exception captures the current exception context
automatically—update the handler around the await
self._cleanup_isolated_exporters() call (referencing the same except block,
logger.exception, and the _cleanup_isolated_exporters method).
In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py`:
- Around line 82-91: The get_trial_project_name method currently swallows all
exceptions silently; update its except block to log the full exception (use
logger.exception) including context like the callback object (cb) and
trial_number, then continue to the next callback instead of silently passing;
locate the loop over self._callbacks and the local fn = getattr(cb,
"get_trial_project_name", None) in get_trial_project_name and replace the bare
except Exception: pass with a logger.exception(...) call that records the
callback and trial_number before proceeding, preserving stack traces and
avoiding duplicate re-raises.
In `@packages/nvidia_nat_core/tests/nat/data_models/test_optimizable.py`:
- Around line 375-377: Replace the broad exception assertion in
test_rejects_invalid_format to assert the specific Pydantic error: change the
pytest.raises(Exception) to pytest.raises(pydantic.ValidationError) (or from
pydantic import ValidationError and use ValidationError) so the test explicitly
expects the ValidationError raised by the SearchSpace model when created with
is_prompt=True, prompt="test", prompt_format="invalid"; ensure the test file
imports ValidationError from pydantic if not already present.
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py`:
- Around line 29-41: The pytest fixtures in this file are not using the required
naming convention or decorator name argument: rename functions mock_langsmith ->
fixture_mock_langsmith and eval_cb -> fixture_eval_cb (and similarly cb_cls,
opt_cb) and update their decorators to include the name parameter (e.g.,
`@pytest.fixture`(name="mock_langsmith", autouse=True) for fixture_mock_langsmith
and `@pytest.fixture`(name="eval_cb") for fixture_eval_cb) so the fixture retains
its public name while the function uses the fixture_ prefix; ensure you update
all references/imports if any tests refer to the fixtures by function name.
🧹 Nitpick comments (12)
packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (2)
94-96: Use `nat` (or “NeMo Agent Toolkit”) instead of “NAT” in docstrings. Keeps naming consistent with repo conventions.
♻️ Proposed fix

```diff
-def get_tracing_configs(config: Any) -> dict[str, Any]:
-    """Extract tracing configs from a loaded NAT config object."""
+def get_tracing_configs(config: Any) -> dict[str, Any]:
+    """Extract tracing configs from a loaded `nat` config object."""
```

As per coding guidelines: "Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names."
30-95: Consider upgrading public API docstrings to Google style (Args/Returns). The new public classes/functions are missing Google-style sections.
As per coding guidelines: "Provide Google-style docstrings for every public module, class, function and CLI command."
packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (1)
159-195: Add return type annotation and avoid accessing private `_callbacks`.

- Missing return type hint on `_build_eval_callback_manager` — per coding guidelines, all functions under `packages/` require type hints on parameters and return values.
- Line 192 accesses `manager._callbacks` directly. Prefer exposing a public API on `EvalCallbackManager` (e.g., a `__bool__` or `has_callbacks` property).

Proposed fix

In `eval_callbacks.py`, add a property to `EvalCallbackManager`:

```python
@property
def has_callbacks(self) -> bool:
    return bool(self._callbacks)
```

Then in this file:

```diff
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> "EvalCallbackManager | None":
 ...
-    return manager if manager._callbacks else None
+    return manager if manager.has_callbacks else None
```

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (3)
224-239: `list_projects()` without filtering may be slow in large workspaces.

`self._client.list_projects()` at line 234 iterates over every project in the workspace to find the max run number. In large LangSmith workspaces with thousands of projects, this could be a significant bottleneck. Consider using a name-contains filter if the LangSmith SDK supports it (similar to `list_datasets(dataset_name_contains=...)` used in the optimization callback).

Also, `import re` at line 230 should be a module-level import rather than a lazy import inside a method — `re` is a stdlib module with no cost concern.
83-205: Matching logic is solid; minor optimization possible.

The retry-based OTEL run matching with longest-substring preference and cross-retry deduplication (`matched_item_ids`, `processed_run_ids`) is well-designed. One minor optimization: line 184 re-scans `eval_result.items` linearly for each match — a pre-built `dict[str, item]` keyed by `item_id` would be O(1).
241-276: `on_dataset_loaded`: missing docstring for public method.

Per coding guidelines, public methods should have Google-style docstrings. `on_dataset_loaded` and `on_eval_complete` are part of the callback interface but lack docstrings.

As per coding guidelines: "Provide Google-style docstrings for every public module, class, function and CLI command."
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (1)
196-202: Add `exc_info=True` to debug log for full stack trace.

Line 202 logs at debug level but omits `exc_info=True`, so the exception details are lost.

Proposed fix

```diff
 except Exception:
-    logger.debug("Could not update project metadata for '%s'", trial_project)
+    logger.debug("Could not update project metadata for '%s'", trial_project, exc_info=True)
```

As per coding guidelines: "When catching and logging exceptions without re-raising: always use `logger.exception()` to capture the full stack trace information."
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)
30-116: Add return type annotation; avoid private attribute access.

- `_build_optimizer_callback_manager` at line 30 has no return type hint.
- Line 64 accesses `manager._callbacks` directly — same concern as in `evaluate.py`. Use a public API (e.g., a `has_callbacks` property) on `OptimizerCallbackManager`.

Proposed fix

```diff
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> "OptimizerCallbackManager | None":
```

And for the private access:

```diff
-    if not manager._callbacks:
+    if not manager.has_callbacks:
```

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (1)
145-191: EvalResult construction logic is duplicated in `prompt_optimizer.py`.

The block at lines 152-178 (building `cb_items` from `eval_input_items` with per-item scores, reasoning, and usage stats) is nearly identical to the same block in `prompt_optimizer.py` lines 425-452. Consider extracting a shared helper like `build_eval_result_from_output(eval_output, eval_metrics, avg_scores) -> EvalResult` to eliminate this duplication.
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)
417-470: Unused loop variable and duplicated EvalResult construction.

- Line 421: `idx` is unused in the loop body — rename to `_` to signal intent.
- The EvalResult construction block (lines 425-452) is nearly identical to the one in `parameter_optimizer.py` (lines 152-178). Consider extracting a shared helper as mentioned in the parameter_optimizer review.

Proposed fix for unused variable

```diff
-    for idx, ind in enumerate(population):
+    for _idx, ind in enumerate(population):
```

Or simply:

```diff
-    for idx, ind in enumerate(population):
+    for ind in population:
```

packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (2)
42-51: `OptimizerCallback` Protocol is incomplete — missing duck-typed methods.

`get_trial_project_name` and `set_prompt_param_names` are invoked on callbacks via `getattr` duck-typing (lines 64, 85), but they're not declared in the `OptimizerCallback` Protocol. This makes the interface contract unclear for implementors. Consider either adding these as optional protocol methods (with default implementations) or documenting that they're opt-in extensions.

30-39: Missing docstrings on `TrialResult`, `OptimizerCallback`, and `OptimizerCallbackManager`.

Per coding guidelines, public classes should have Google-style docstrings. `TrialResult`, `OptimizerCallback`, and `OptimizerCallbackManager` all lack them.
827eaa5 to 4ecda16
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (1)
144-166: ⚠️ Potential issue | 🔴 Critical

`_numeric_trial_count` is undefined when numeric optimization is disabled — `NameError` at runtime.

If `base_cfg.optimizer.numeric.enabled` is `False`, the `if` block on lines 147–154 is skipped and `_numeric_trial_count` is never assigned. Line 164 then raises a `NameError`.

Proposed fix

```diff
 # ---------------- 3. numeric / enum tuning -----------
 tuned_cfg = base_cfg
 best_numeric_params: dict = {}
+_numeric_trial_count = 0
 if base_cfg.optimizer.numeric.enabled:
     tuned_cfg, best_numeric_params, _numeric_trial_count = optimize_parameters(
```
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/email_phishing_analyzer/src/nat_email_phishing_analyzer/configs/config-langsmith-optimize.yml`:
- Around line 89-95: The YAML under the dataset key uses 4-space relative
indentation; update it to 2-space relative indentation to match the file style.
Specifically, re-indent the block keys file_path, id_key, and the nested
structure mapping (question_key, answer_key) so they are each two spaces further
than dataset: (use 2-space indents for file_path, id_key, structure and another
2-space indent for question_key and answer_key). After adjusting, validate the
YAML to ensure no syntax changes beyond indentation.
In `@packages/nvidia_nat_core/src/nat/observability/exporter_manager.py`:
- Around line 249-261: The ref-count decrement block has a TOCTOU race: after
releasing self._lock you call _cleanup_isolated_exporters() and stop(), allowing
another task calling start() to observe _running True and increment _ref_count
while you tear down; fix by performing lifecycle transition under the lock —
acquire self._lock, decrement self._ref_count, set should_stop = self._ref_count
== 0 and set self._running = False if should_stop, then (still under the lock)
prepare any data needed for cleanup and release the lock before awaiting actual
shutdown; to avoid deadlock, either call a new helper _stop_unlocked() (move
stop() internals into it) or inline the task-cancellation/cleanup logic so that
only non-awaiting state changes happen inside the lock and awaits
(_cleanup_isolated_exporters(), actual stop work) occur after releasing the
lock, ensuring no new callers can piggy-back on a doomed manager.
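A minimal sketch of the decrement-under-lock shape described above (attribute names like `_ref_count`, `_running`, and `_cleanup_isolated_exporters` follow the review text; everything else is assumed):

```python
import asyncio


class ExporterManagerSketch:
    """Sketch: lifecycle transition happens under the lock; awaits happen after."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._ref_count = 0
        self._running = False

    async def start(self) -> None:
        async with self._lock:
            self._ref_count += 1
            self._running = True

    async def stop(self) -> None:
        async with self._lock:
            self._ref_count -= 1
            should_stop = self._ref_count == 0
            if should_stop:
                # Flip the flag while still holding the lock so a racing
                # start() cannot piggy-back on a manager being torn down.
                self._running = False
        # Await the actual teardown only after releasing the lock.
        if should_stop:
            await self._cleanup_isolated_exporters()

    async def _cleanup_isolated_exporters(self) -> None:
        await asyncio.sleep(0)  # placeholder for real exporter teardown
```

Because `_running` is cleared inside the critical section, no other task can observe a "running" manager whose teardown has already been decided.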
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 196-202: The exception handler that wraps
self._client.update_project / self._client.read_project currently swallows
errors by calling logger.debug without exception info; update the except block
to log the full stack trace (e.g., replace logger.debug("Could not update
project metadata for '%s'", trial_project) with a logging call that includes
exception info such as logger.exception("Could not update project metadata for
'%s'", trial_project) or logger.debug(..., exc_info=True)) so the caught
exception details are preserved.
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 42-127: The test function
test_eval_callback_creates_dataset_runs_and_feedback leaks the LangSmith dataset
if an assertion fails; wrap the dataset creation and subsequent assertions in a
try/finally so cleanup always runs: assign ds = None before creating the
dataset, then after creating/using it put the verification and eval steps inside
a try block and call langsmith_client.delete_dataset(dataset_id=ds.id) in the
finally block only if ds is not None. Keep references to the existing symbols:
test_eval_callback_creates_dataset_runs_and_feedback, ds,
langsmith_client.delete_dataset, and the
EvalCallbackManager/LangSmithEvaluationCallback usage.
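The try/finally shape being requested can be sketched like this (the `client` here is a stand-in; `create_dataset`/`delete_dataset` mirror the LangSmith SDK method names used in the test, but the body is illustrative):

```python
def run_dataset_check(client, dataset_name: str) -> None:
    """Create a dataset, run checks, and always clean up."""
    ds = None
    try:
        ds = client.create_dataset(dataset_name)
        # ... verification and eval steps go here; an assertion failure
        # below no longer leaks the dataset, because cleanup is in finally.
        assert ds is not None
    finally:
        if ds is not None:
            client.delete_dataset(dataset_id=ds.id)
```

The `ds = None` sentinel before the try block is what makes the finally clause safe even when `create_dataset` itself raises.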
🧹 Nitpick comments (13)
packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (1)
94-96: Chained `getattr` is fragile and hard to read.

The triple-nested `getattr` works but is brittle. Consider using a safer traversal:

♻️ Proposed improvement

```diff
 def get_tracing_configs(config: Any) -> dict[str, Any]:
     """Extract tracing configs from a loaded NAT config object."""
-    return getattr(getattr(getattr(config, 'general', None), 'telemetry', None), 'tracing', None) or {}
+    general = getattr(config, 'general', None)
+    telemetry = getattr(general, 'telemetry', None) if general else None
+    tracing = getattr(telemetry, 'tracing', None) if telemetry else None
+    return tracing or {}
```

packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)
159-159: Add return type hint.

The private helper is missing a return type annotation.

♻️ Proposed fix

```diff
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:
```

Since `EvalCallbackManager` is imported inside the function body, you can use a string literal or add the import at module level under `TYPE_CHECKING`:

```diff
+from __future__ import annotations
```

or

```diff
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> "EvalCallbackManager | None":
```
192-192: Accessing private `_callbacks` from outside the class is a code smell.

`manager._callbacks` breaks encapsulation. Consider adding a public property or method to `EvalCallbackManager`:

♻️ Proposed fix in eval_callbacks.py

```diff
 class EvalCallbackManager:
     def __init__(self) -> None:
         self._callbacks: list[EvalCallback] = []

+    @property
+    def has_callbacks(self) -> bool:
+        return bool(self._callbacks)
+
     def register(self, callback: EvalCallback) -> None:
```

Then in evaluate.py:

```diff
-    return manager if manager._callbacks else None
+    return manager if manager.has_callbacks else None
```

packages/nvidia_nat_core/src/nat/eval/evaluate.py (1)
701-736: O(N × M × N) nested loop for building callback items.

The triple-nested iteration (input items × evaluators × output items) performs a linear scan to match items by ID. For large datasets this could be slow.

Consider pre-indexing output items by ID per evaluator for O(1) lookups:

♻️ Suggested optimization

```diff
 cb_items = []
+# Pre-index evaluator outputs by item ID for O(1) lookup
+eval_output_index: dict[str, dict[str, Any]] = {}
+eval_reasoning_index: dict[str, dict[str, Any]] = {}
+for eval_name, eval_output in self.evaluation_results:
+    score_map = {}
+    reasoning_map = {}
+    for output_item in eval_output.eval_output_items:
+        sid = str(output_item.id)
+        if isinstance(output_item.score, (int, float)):
+            score_map[sid] = float(output_item.score)
+        reasoning_map[sid] = output_item.reasoning
+    eval_output_index[eval_name] = score_map
+    eval_reasoning_index[eval_name] = reasoning_map
+
 for input_item in self.eval_input.eval_input_items:
     per_item_scores = {}
     per_item_reasoning = {}
-    for eval_name, eval_output in self.evaluation_results:
-        for output_item in eval_output.eval_output_items:
-            if str(output_item.id) == str(input_item.id):
-                score_val = output_item.score
-                if isinstance(score_val, (int, float)):
-                    per_item_scores[eval_name] = float(score_val)
-                per_item_reasoning[eval_name] = output_item.reasoning
-                break
+    item_sid = str(input_item.id)
+    for eval_name in eval_output_index:
+        if item_sid in eval_output_index[eval_name]:
+            per_item_scores[eval_name] = eval_output_index[eval_name][item_sid]
+        if item_sid in eval_reasoning_index[eval_name]:
+            per_item_reasoning[eval_name] = eval_reasoning_index[eval_name][item_sid]
```

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/register.py (1)
21-32: Add type hints and docstrings to the builder functions.

Per coding guidelines, `{src,packages}/**/*.py` files require type hints on parameters and return values, and Google-style docstrings for public modules/functions. While these are private helpers, they serve as plugin entry points and would benefit from proper annotations for maintainability.

Suggested improvement

```diff
+"""LangSmith callback registration for evaluation and optimization workflows."""
+
 from nat.cli.register_workflow import register_eval_callback
 from nat.cli.register_workflow import register_optimizer_callback
 from nat.plugins.opentelemetry.register import LangsmithTelemetryExporter

 @register_eval_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_eval_callback(config, **kwargs):
+def _build_langsmith_eval_callback(config: LangsmithTelemetryExporter, **kwargs: Any) -> Any:
+    """Build a LangSmith evaluation callback from the telemetry exporter config."""
     from .langsmith_evaluation_callback import LangSmithEvaluationCallback
     return LangSmithEvaluationCallback(project=config.project)

 @register_optimizer_callback(config_type=LangsmithTelemetryExporter)
-def _build_langsmith_optimizer_callback(config, *, dataset_name=None, **kwargs):
+def _build_langsmith_optimizer_callback(config: LangsmithTelemetryExporter, *, dataset_name: str | None = None, **kwargs: Any) -> Any:
+    """Build a LangSmith optimization callback from the telemetry exporter config."""
     from .langsmith_optimization_callback import LangSmithOptimizationCallback
     return LangSmithOptimizationCallback(project=config.project, dataset_name=dataset_name)
```

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values" and "Provide Google-style docstrings for every public module, class, function and CLI command."
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)
30-31: Add a return type annotation to `_build_optimizer_callback_manager`.

Per coding guidelines, all functions in `{src,packages}/**/*.py` require type hints on parameters and return values.

Suggested fix

```diff
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> Any:
```

Or more precisely, once the import is available:

```diff
+from nat.profiler.parameter_optimization.optimizer_callbacks import OptimizerCallbackManager
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Any) -> OptimizerCallbackManager | None:
```

As per coding guidelines: "All public APIs require Python 3.11+ type hints on parameters and return values."
64-65: Accessing the private `_callbacks` attribute of `OptimizerCallbackManager`.

Consider adding a public method (e.g., a `has_callbacks` property or `__bool__`) to `OptimizerCallbackManager` instead of reaching into its internals.
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (1)
183-187: Silent `except Exception: pass` in cleanup blocks — consider logging.

Per coding guidelines, when catching exceptions without re-raising, use `logger.exception()` or `logger.debug()` to capture the stack trace. This aids debugging when cleanup unexpectedly fails.

Suggested fix

```diff
 except Exception:
-    pass
+    logger.debug("Cleanup failed", exc_info=True)
```

As per coding guidelines: "When catching and logging exceptions without re-raising: always use `logger.exception()` to capture the full stack trace information."

Also applies to: 233-237
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)
199-220: `pick_trial` is called twice with the same arguments.

`pick_trial` is invoked at line 199 to get `best_params` and again at line 207 for the callback. Depending on the cost of `pick_trial`, this is redundant — reuse the first result.

Suggested fix

```diff
-    best_params = pick_trial(
+    best_trial_obj = pick_trial(
         study=study,
         mode=optimizer_config.multi_objective_combination_mode,
         weights=weights,
-    ).params
+    )
+    best_params = best_trial_obj.params

     if callback_manager is not None:
         from nat.profiler.parameter_optimization.optimizer_callbacks import TrialResult
-        best_trial_obj = pick_trial(
-            study=study,
-            mode=optimizer_config.multi_objective_combination_mode,
-            weights=weights,
-        )
         callback_manager.on_study_end(
```
145-191: EvalResult construction block is near-identical in `prompt_optimizer.py`.

The logic to build `EvalResult`/`EvalResultItem` from `eval_output` (lines 150–179) is duplicated in `prompt_optimizer.py` (lines 422–453). Consider extracting a shared helper (e.g., `build_eval_result_from_output`) to reduce duplication and ease future maintenance.
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/prompt_optimizer.py (1)
417-470: Unused loop variable `idx` on line 421.

Ruff flags `idx` (B007). Since it's not used in the loop body, prefix it with `_`.

Fix

```diff
-    for idx, ind in enumerate(population):
+    for _idx, ind in enumerate(population):
```

packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py (1)
224-239: `get_eval_project_name` iterates all projects to find the max Run #.

`self._client.list_projects()` may return a large number of projects in production environments. This could become a latency bottleneck. Consider filtering by name prefix if the LangSmith API supports it, or caching the result.
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (1)
282-330: Template format detection is well-reasoned.

The detection priority is clearly documented, and the mustache-before-`{#` ordering correctly avoids the substring ambiguity. The intentional default to jinja2 for bare `{{` is reasonable for the Python/LangChain ecosystem.

One note: lines 323–327 could be slightly simplified since both branches return `"jinja2"`:

Optional simplification

```diff
 # 4. {{ }} present — disambiguate via expression keywords
 if "{{" in text:
-    if any(kw in text for kw in cls._JINJA2_EXPR_KEYWORDS):
-        return "jinja2"
-    # Plain {{ }} — default to jinja2 (more common in Python)
     return "jinja2"
```

However, the current form does serve as documentation of the detection reasoning, so keeping it explicit is also fine.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py`:
- Around line 66-70: The set_prompt_param_names method currently iterates
callbacks and calls set_prompt_param_names without protection, so an exception
in one callback will abort the loop; update set_prompt_param_names in the
callback manager to mirror other dispatchers by wrapping each per-callback
invocation in a try/except that catches exceptions from fn(names), logs or
handles the error consistently with existing patterns (use the same
logger/behavior used in other methods), and continues to the next callback
(refer to self._callbacks iteration and the getattr(cb,
"set_prompt_param_names", None) lookup to locate the exact spot to modify).
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py`:
- Around line 461-479: on_study_end currently fetches the most recent commit
(limit=1, offset=0) and may tag the wrong commit; instead locate the exact
commit for the best trial using the stored mapping
self._prompt_commit_urls[(param_name, best_trial.trial_number)] and call
self._client._create_commit_tags with that commit identifier (or URL) to add the
"best" tag; if the mapping is missing, re-push the prompt for that param/trial
via the existing _push_prompt(param_name, prompt_text, tags=["best"]) (or
equivalent) so the correct commit is created and tagged.
🧹 Nitpick comments (13)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py (3)
322-327: Dead branch — keyword check has no effect on the outcome.

When `{{` is present, both the keyword-match path and the fallback return `"jinja2"`, making the `any(kw in text ...)` check superfluous. If the intent is to distinguish mustache `{{ }}` from jinja2, the keyword-absent path should return `"mustache"` (or at least document why both resolve to jinja2). As-is, it's misleading to future readers.

Proposed simplification (if both cases really are jinja2)

```diff
 # 4. {{ }} present — disambiguate via expression keywords
 if "{{" in text:
-    if any(kw in text for kw in cls._JINJA2_EXPR_KEYWORDS):
-        return "jinja2"
-    # Plain {{ }} — default to jinja2 (more common in Python)
     return "jinja2"
```
414-435: Inconsistent value types in `prompt_urls` return dict.

On success, `prompt_urls[param_name]` is a URL string (line 414). On conflict, it's the `repo_name` slug (line 418). On failure, it's the raw `prompt_text` (line 435). Downstream consumers of this dict cannot reliably distinguish these cases. Consider using a consistent type or a richer return structure (e.g., a dataclass/named tuple with status).
53-54: Missing docstrings on public methods.

`set_prompt_param_names`, `on_trial_end`, and `on_study_end` are public API methods but lack Google-style docstrings. As per coding guidelines, every public function requires a docstring with a concise first-line description ending with a period.
packages/nvidia_nat_core/src/nat/eval/eval_callbacks.py (2)
51-57: Protocol methods lack docstrings.

Per coding guidelines, public APIs should have Google-style docstrings. The `EvalCallback` protocol and its methods would benefit from brief docstrings describing when each hook is invoked and what the caller guarantees.
98-100: Deeply chained `getattr` is fragile and hard to read.

The triple-nested `getattr` works but is opaque. A small helper or sequential approach improves readability and debuggability.

♻️ Suggested refactor

```diff
 def get_tracing_configs(config: Any) -> dict[str, Any]:
     """Extract tracing configs from a loaded NAT config object."""
-    return getattr(getattr(getattr(config, 'general', None), 'telemetry', None), 'tracing', None) or {}
+    general = getattr(config, 'general', None)
+    telemetry = getattr(general, 'telemetry', None)
+    return getattr(telemetry, 'tracing', None) or {}
```

packages/nvidia_nat_core/src/nat/cli/commands/evaluate.py (2)
159-195: Missing return type annotation.

Per coding guidelines, all functions should use type hints for parameters and return values. `_build_eval_callback_manager` is missing a return type.

♻️ Proposed fix

```diff
-def _build_eval_callback_manager(config: EvaluationRunConfig):
+def _build_eval_callback_manager(config: EvaluationRunConfig) -> EvalCallbackManager | None:
```

You'll need a `TYPE_CHECKING` guard for the import:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from nat.eval.eval_callbacks import EvalCallbackManager
```
186-190: Side-effect mutation of `config.config_file` is subtle.

`config.config_file = loaded` (line 190) silently replaces the caller's `Path` with a loaded model object. This works because `run_and_evaluate` passes `config` to `EvaluationRun`, but it's a non-obvious side-effect. A brief inline comment explaining why this mutation is needed would help future maintainers.
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_runtime.py (2)
30-31: Missing type annotations on `_build_optimizer_callback_manager`.

Per coding guidelines, all Python methods should have type hints for parameters and return values. The `base_cfg` parameter and the return type are untyped.

♻️ Proposed fix

```diff
-def _build_optimizer_callback_manager(base_cfg):
+def _build_optimizer_callback_manager(base_cfg: Config) -> OptimizerCallbackManager | None:
```

Use `TYPE_CHECKING` guards as needed to avoid circular imports.

44-51: Duplicate extraction of `ds_cfg`/`file_path`.

The dataset config path is resolved identically at lines 46–49 and again at lines 75–78. Extract this into a local helper or resolve once before the loop.

♻️ Suggested refactor

```diff
+# Resolve dataset config once
+ds_cfg = None
+file_path = None
+try:
+    ds_cfg = base_cfg.eval.general.dataset
+    file_path = getattr(ds_cfg, 'file_path', None)
+except Exception:
+    logger.debug("Could not extract dataset config", exc_info=True)
+
 # Extract dataset name from eval config (runtime concern, not plugin-specific)
 opt_dataset_name = None
-try:
-    ds_cfg = base_cfg.eval.general.dataset
-    file_path = getattr(ds_cfg, 'file_path', None)
-    if file_path:
-        opt_dataset_name = Path(file_path).stem
-except Exception:
-    logger.debug("Could not extract dataset name from config", exc_info=True)
+if file_path:
+    opt_dataset_name = Path(file_path).stem
 ...
 # Pre-create experiments
 ...
     try:
-        ds_cfg = base_cfg.eval.general.dataset
-        file_path = getattr(ds_cfg, 'file_path', None)
         if file_path:
             fp = Path(file_path)
```

Also applies to: 75-78
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/parameter_optimizer.py (2)
145-191: Use `strict=True` in `zip()` calls to catch length mismatches early.

`eval_metrics` and `avg_scores` should always have the same length, but `zip(..., strict=True)` makes this assumption explicit and fails fast if violated. Applies to lines 179, 187, and 212.

♻️ Proposed fix (line 179 example; apply similarly to 187 and 212)

```diff
-    eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores)), items=cb_items)
+    eval_result = EvalResult(metric_scores=dict(zip(eval_metrics, avg_scores, strict=True)), items=cb_items)
```

152-178: O(n × m) item-to-score matching via nested loops.

For each `input_item`, the inner loop scans all `eval_output_items` of every evaluator to find a matching ID. This is quadratic in the number of items × evaluators. For typical eval datasets this is fine, but if datasets grow, pre-indexing `eval_output_items` by ID into a dict would be more efficient.
packages/nvidia_nat_core/src/nat/profiler/parameter_optimization/optimizer_callbacks.py (2)
42-51: `set_prompt_param_names` and `get_trial_project_name` are duck-typed but absent from the Protocol.

The `OptimizerCallback` protocol defines `pre_create_experiment`, `on_trial_end`, and `on_study_end`, but the manager also dispatches `set_prompt_param_names` and `get_trial_project_name` via `getattr`. This makes the callback contract incomplete — implementors won't know about these optional hooks from the Protocol alone. Consider either adding them to the Protocol (with default no-op implementations or as optional methods) or documenting the duck-typed extension points in the class/module docstring.

Also applies to: 66-70, 86-95

30-39: `TrialResult` and `OptimizerCallback` lack Google-style docstrings.

Per coding guidelines, public classes should have docstrings. `TrialResult` fields are well-commented inline, but a class-level docstring describing its purpose would complete the documentation.
5853f04 to 4f597d6
…callbacks

Add LangSmith integration for NAT evaluation and optimization flows:

- Eager trace linking: pre-generate OTEL root span_ids in the eval loop so LangSmith run_ids can be derived deterministically (span_id → UUID mapping), enabling immediate update_run()/create_feedback() calls without waiting for list_runs() indexing (10-40s+ delay eliminated). Falls back to substring matching when eager linking fails.
- Evaluation callback: creates LangSmith datasets/experiments, links OTEL traces to dataset examples, attaches evaluator feedback scores.
- Optimization callback: per-trial experiment projects with OTEL trace linking, prompt version management via LangSmith prompt repos, and dynamic retry budget for the substring-matching fallback path.
- Scoped via needs_root_span_ids opt-in so non-LangSmith exporters are completely unaffected by the core package changes.

Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
force-pushed c36fc35 to 7e8b324
Revert ExporterManager to per-item isolated lifecycle.

Each eval item starts/stops its own exporter independently. Removes ref counting, shared Subject forwarding, and the eval loop pre-start wrapper. Restores unconditional wait_for_all_export_tasks_local before callbacks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
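The per-item lifecycle described here can be sketched with an async context manager; the `Exporter` class and method names are illustrative stand-ins, not the real ExporterManager API:

```python
# Sketch of the "per-item isolated lifecycle": each eval item starts its
# own exporter, drains it, and stops it, with no ref counting or shared
# state between items. Names are illustrative, not the NAT API.
import asyncio
import contextlib


class Exporter:
    def __init__(self) -> None:
        self.started = False
        self.exported: list[str] = []

    async def start(self) -> None:
        self.started = True

    async def export(self, item: str) -> None:
        self.exported.append(item)

    async def stop(self) -> None:
        # Stand-in for waiting on all export tasks before shutdown.
        self.started = False


@contextlib.asynccontextmanager
async def per_item_exporter():
    exporter = Exporter()
    await exporter.start()
    try:
        yield exporter
    finally:
        # Runs unconditionally before any callbacks fire, mirroring the
        # restored wait_for_all_export_tasks_local behavior.
        await exporter.stop()


async def run_item(item: str) -> list[str]:
    async with per_item_exporter() as exporter:
        await exporter.export(item)
        return list(exporter.exported)


print(asyncio.run(run_item("q1")))
```

Isolation per item trades some startup overhead for the guarantee that one item's pending exports can never leak into another item's callbacks.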
Tracing config extraction is an observability concern, not eval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
…ntation

Signed-off-by: Santiago Pastoriza <spastoriza@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
force-pushed 6ed3e23 to 9b4ad9a
Address PR review "extract to own method" nits by pulling inline callback code into dedicated functions:

- build_eval_result() in eval_callbacks.py — single helper replacing the ~25-line EvalResult construction block duplicated in 3 callers
- _fire_on_eval_complete() on EvaluationRun in evaluate.py
- _fire_numeric_trial_end() / _fire_numeric_study_end() in parameter_optimizer.py
- _fire_prompt_trial_end_callbacks() / _fire_prompt_study_end() in prompt_optimizer.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
force-pushed aa1f798 to 31e5615
dnandakumar-nv left a comment
Approve pending resolution of minor nits below. Also need a review from @NVIDIA/nat-dep-approvers due to pyproject.toml change.

Alternatively, @pastorsj I think you can remove the langsmith library dependency from the pyproject.toml as it is already implicitly downloaded given our current dependencies.
- Add type hint for callback_manager param in EvaluationRun.__init__
- Deduplicate _ensure_dataset and pre_create_experiment into shared _create_dataset_with_examples helper
- Remove langsmith from explicit dependencies (transitive via langchain-core)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
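The deduplication described in this commit follows a standard shape: both former call sites delegate to one shared helper. A minimal sketch, using a fake client rather than the langsmith SDK:

```python
# Sketch of folding two near-identical dataset-creation paths into one
# shared helper, as the commit describes for _ensure_dataset /
# pre_create_experiment. FakeClient is a stand-in, not the langsmith SDK.
from dataclasses import dataclass, field


@dataclass
class FakeClient:
    datasets: dict[str, list[dict]] = field(default_factory=dict)

    def create_dataset(self, name: str) -> str:
        self.datasets.setdefault(name, [])
        return name

    def create_examples(self, dataset_id: str, examples: list[dict]) -> None:
        self.datasets[dataset_id].extend(examples)


def _create_dataset_with_examples(client: FakeClient, name: str, items: list[dict]) -> str:
    """Single code path shared by both the eval and optimizer callbacks."""
    dataset_id = client.create_dataset(name)
    client.create_examples(dataset_id, [{"inputs": item} for item in items])
    return dataset_id


client = FakeClient()
_create_dataset_with_examples(client, "nat-eval", [{"q": "hello"}])
print(len(client.datasets["nat-eval"]))  # one example recorded
```

The helper keeps dataset naming and example construction in one place, so a change to the example schema no longer has to be mirrored across two callbacks.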
/ok to test 0abb128
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
/ok to test 11a9bfd
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
/ok to test 1248bab
Signed-off-by: Sam Pastoriza <spastoriza@nvidia.com>
/ok to test 3e3ca58
/ok to test f6ead01
/ok to test 962d799
/merge
Description
By Submitting this PR I confirm:
Examples in LangSmith
Evaluations
Optimizations
Summary by CodeRabbit
New Features
Bug Fixes / Behavior
Documentation
Tests