feat: add extract_reasoning_content option to LLM columns #285

Merged
eric-tramel merged 5 commits into main from ewt/extract-reasoning-content on Feb 3, 2026

@eric-tramel
Contributor

Summary

  • Adds opt-in extract_reasoning_content: bool = False field to LLMTextColumnConfig (and all derived LLM configs)
  • When enabled, creates a {name}__reasoning_content side-effect column containing only the reasoning content from the final assistant response
  • Extracts and strips the reasoning_content field from the last assistant message in the trace, normalizing whitespace-only values to None
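
The extraction rule described in the last bullet can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the `Message` dataclass is a hypothetical stand-in for the library's trace message type, and `extract_reasoning_content` is written here as a free function rather than the generator method.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Message:
    """Hypothetical stand-in for a trace message (not the library's own class)."""

    role: str
    content: Optional[str] = None
    reasoning_content: Optional[str] = None


def extract_reasoning_content(trace: Sequence[Message]) -> Optional[str]:
    """Return the stripped reasoning of the last assistant message, or None."""
    # Walk the trace backwards so the first assistant message found is the final one.
    for message in reversed(trace):
        if message.role == "assistant":
            reasoning = message.reasoning_content
            if reasoning is None:
                return None
            # Whitespace-only reasoning collapses to an empty string, which becomes None.
            return reasoning.strip() or None
    # No assistant message in the trace at all.
    return None
```

Under these assumptions, only the final assistant message is consulted: reasoning attached to earlier assistant turns (for example, in a tool-use trace) is ignored.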

Usage

import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="response",
        prompt="Solve this problem: {{ problem }}",
        model_alias="reasoning-model",
        extract_reasoning_content=True,  # Creates response__reasoning_content column
    )
)

This is useful for models that expose chain-of-thought reasoning separately from the main response (e.g., models with extended thinking capabilities).

Test plan

  • Config tests verify side_effect_columns behavior with/without extract_reasoning_content
  • Generator tests verify reasoning content extraction from various trace types
  • All existing tests continue to pass

🤖 Generated with Claude Code

@eric-tramel eric-tramel requested a review from a team as a code owner February 2, 2026 23:51
@eric-tramel eric-tramel marked this pull request as draft February 2, 2026 23:52
@eric-tramel eric-tramel self-assigned this Feb 2, 2026
@greptile-apps
greptile-apps bot commented Feb 2, 2026

Greptile Overview

Greptile Summary

This PR adds an opt-in extract_reasoning_content boolean field to all LLM column configurations, enabling separate capture of chain-of-thought reasoning from models that expose it via the reasoning_content field. When enabled, a {column_name}__reasoning_content side-effect column is created containing the stripped reasoning content from the final assistant message in the trace.

Key changes:

  • Added extract_reasoning_content: bool = False field to LLMTextColumnConfig (inherited by LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig)
  • Updated side_effect_columns property to conditionally include the reasoning content column based on feature flags (with_trace and extract_reasoning_content)
  • Implemented _extract_reasoning_content() method in ColumnGeneratorWithModelChatCompletion that extracts reasoning from the last assistant message, strips whitespace, and normalizes empty values to None
  • Added comprehensive tests covering various scenarios: extraction with content, missing content, tool-use traces with multiple assistant messages, disabled feature, and whitespace-only content
  • Updated documentation in columns.md and traces.md with usage examples and a comparison table
  • Demonstrated the feature in the pdf_qa.py recipe
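
To illustrate how the two flags might feed `side_effect_columns` independently, here is a minimal sketch. It is not the actual `LLMTextColumnConfig`: the `LLMColumnSketch` dataclass, the reduced `TraceType` enum, and the `{name}__trace` column name are all assumptions made for the example; only the `{name}__reasoning_content` name comes from this PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TraceType(Enum):
    """Reduced stand-in enum; the real TraceType may have other members."""

    NONE = "none"
    LAST_MESSAGE = "last_message"


@dataclass
class LLMColumnSketch:
    """Hypothetical, simplified config used only to show the flag logic."""

    name: str
    with_trace: TraceType = TraceType.NONE
    extract_reasoning_content: bool = False

    @property
    def side_effect_columns(self) -> List[str]:
        columns: List[str] = []
        if self.with_trace is not TraceType.NONE:
            # Assumed trace column name; not confirmed by this PR's description.
            columns.append(f"{self.name}__trace")
        if self.extract_reasoning_content:
            columns.append(f"{self.name}__reasoning_content")
        return columns
```

In this sketch each flag contributes its own column, so enabling one has no effect on whether the other column appears.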

Issues found:

  • Minor: The test at line 143 in test_columns.py is missing an assertion that side_effect_columns includes the trace column when with_trace=TraceType.LAST_MESSAGE

The implementation is well-designed, correctly handles edge cases (whitespace-only, missing reasoning, multi-assistant traces), and maintains independence between trace and reasoning content features.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is clean, well-tested with comprehensive edge case coverage, properly documented, and follows the project's architecture patterns. The only issue is a minor missing test assertion that doesn't affect functionality.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/column_configs.py | Added extract_reasoning_content boolean field to LLMTextColumnConfig and updated side_effect_columns property to conditionally include reasoning content column based on this field |
| packages/data-designer-config/tests/config/test_columns.py | Updated test expectations to correctly handle side effect columns based on feature flags; added comprehensive test for extract_reasoning_content feature, but missing assertion for config_last.side_effect_columns |
| packages/data-designer-engine/src/data_designer/engine/column_generators/generators/llm_completion.py | Added _extract_reasoning_content method that extracts and strips reasoning content from final assistant message; integrated into generate method to populate reasoning column when enabled |

Sequence Diagram

sequenceDiagram
    participant User
    participant ConfigBuilder
    participant LLMTextColumnConfig
    participant Generator as LLMTextCellGenerator
    participant Model as ModelFacade
    participant Output as DataFrame

    User->>ConfigBuilder: add_column(LLMTextColumnConfig(..., extract_reasoning_content=True))
    ConfigBuilder->>LLMTextColumnConfig: Create config with extract_reasoning_content=True
    LLMTextColumnConfig->>LLMTextColumnConfig: side_effect_columns property returns ["col__reasoning_content"]
    
    Note over User,Output: Dataset Generation Phase
    
    User->>Generator: generate(data)
    Generator->>Model: generate(prompt, ...)
    Model-->>Generator: (output, trace)
    
    Note over Generator: trace = [<br/>  ChatMessage(role="user", ...),<br/>  ChatMessage(role="assistant", content="...", reasoning_content="..."),<br/>]
    
    Generator->>Generator: _extract_reasoning_content(trace)
    Generator->>Generator: Find last assistant message in reversed trace
    Generator->>Generator: Extract reasoning_content field
    Generator->>Generator: Strip whitespace, normalize empty to None
    
    Generator->>Output: data[col] = output
    Generator->>Output: data[col__reasoning_content] = extracted_reasoning
    Output-->>User: DataFrame with main column and reasoning column

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@eric-tramel eric-tramel force-pushed the ewt/extract-reasoning-content branch from 4d55df0 to 18700b7 Compare February 3, 2026 01:23
Add an opt-in way for LLM generation columns to persist only the model's
reasoning_content (when the provider exposes it) into a dedicated
side-effect column.

When `extract_reasoning_content=True` is set on an LLM column config:
- A new `{name}__reasoning_content` column is created
- Only the reasoning_content from the final assistant message is stored
- Whitespace is trimmed, empty values become None

This is independent of the existing `with_trace` option which stores
the full conversation history.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@eric-tramel eric-tramel force-pushed the ewt/extract-reasoning-content branch from 18700b7 to 04130e9 Compare February 3, 2026 01:42
@eric-tramel eric-tramel marked this pull request as ready for review February 3, 2026 01:42
@eric-tramel eric-tramel added the enhancement New feature or request label Feb 3, 2026
@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 3 comments

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment

@eric-tramel eric-tramel merged commit 532d21a into main Feb 3, 2026
46 checks passed
Labels

enhancement New feature or request

2 participants