UN-3322 [FIX] Optimize dashboard metrics aggregation task#1837
Conversation
Reduce DB queries per org from ~42 to ~19 and add self-healing lock:
- Combine 4 LLM metric queries into 1 using conditional aggregation
- Cache org identifier lookup for PageUsage-based metrics
- Merge daily+monthly queries into single source query (backfill pattern)
- Tighten active org filter from 2 months to 7 days
- Add self-healing Redis lock to prevent stuck-lock incidents
- Remove unused org_identifier param from get_failed_pages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
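The 4→1 consolidation relies on conditional aggregation: a single pass over the source table computes all four LLM metrics at once. The PR's actual models are not shown here, so the sqlite sketch below uses a hypothetical `usage` table with illustrative column names purely to show the pattern (in Django ORM the same shape would be `Sum(..., filter=Q(...))` or `Sum(Case(When(...)))`).

```python
import sqlite3

# Hypothetical stand-in schema: the real Usage model and columns are not
# shown in this PR, so table and column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (org_id TEXT, kind TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?)",
    [
        ("org1", "llm_call", 50),
        ("org1", "challenge", 20),
        ("org1", "summarize", 10),
        ("org1", "llm_call", 40),
    ],
)

# One conditional-aggregation query replaces four separate per-metric queries.
row = conn.execute(
    """
    SELECT
        SUM(CASE WHEN kind = 'llm_call'  THEN 1 ELSE 0 END) AS llm_calls,
        SUM(CASE WHEN kind = 'challenge' THEN 1 ELSE 0 END) AS challenges,
        SUM(CASE WHEN kind = 'summarize' THEN 1 ELSE 0 END) AS summarization_calls,
        SUM(cost)                                           AS llm_usage_cost
    FROM usage
    WHERE org_id = 'org1'
    """
).fetchone()
print(row)  # (2, 1, 1, 120)
```

The database scans the table once instead of four times, which is where the per-org query reduction comes from.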
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
Walkthrough

Resolves org identifiers once, consolidates LLM metric queries into a combined/split flow, adds a self-healing timestamp-based aggregation lock, and introduces helper primitives to aggregate hourly, daily, and monthly metrics with fewer queries.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as AggregationTask
    participant Lock as LockStore
    participant Service as MetricsQueryService
    participant DB as Database
    Task->>Lock: _acquire_aggregation_lock()
    Lock->>DB: read/update lock record (timestamp)
    alt lock acquired
        Task->>Service: _resolve_org_identifier(org_id)
        Service->>DB: fetch Organization.organization_id (if needed)
        Task->>Service: get_llm_metrics_split(org_id, start/end, granularity)
        Service->>DB: execute combined LLM query (single query per granularity)
        DB-->>Service: combined LLM rows
        Service-->>Task: split into per-metric series (llm_calls, challenges, ...)
        Task->>Service: _aggregate_single_metric(...) for each non-LLM metric
        Service->>DB: metric-specific queries (hourly + daily/monthly)
        DB-->>Service: metric rows
        Service-->>Task: aggregated buckets (hourly/daily/monthly)
        Task->>DB: bulk upsert aggregated metrics
        Task->>Lock: release lock
    else lock held or stale reclaimed
        Task-->>Task: skip or reclaim then proceed
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/dashboard_metrics/tasks.py (1)
507-519: ⚠️ Potential issue | 🟠 Major

`WorkflowExecution` is too narrow for the new active-org prefilter.

In cloud deployments, `hitl_completions` is aggregated from `HITLQueue.approved_at`, so an organization can have fresh completions inside the 7-day window even when its latest `WorkflowExecution.created_at` is older. This filter skips that org entirely and leaves those daily/monthly buckets stale. Please include the other source tables that can emit fresh metrics in the window, or exempt those metrics from the prefilter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 507 - 519, The active_org_ids prefilter currently gathers orgs only from WorkflowExecution.created_at (see variable active_org_ids and daily_start), which misses orgs with recent HITL completions recorded via HITLQueue.approved_at and therefore causes hitl_completions buckets to be skipped; modify the prefilter to also include organization IDs from HITLQueue (use approved_at >= daily_start) and any other recent-metric source tables that can emit fresh metrics in the 7-day window, or alternatively exclude hitl_completions (and other affected metric calculations) from using active_org_ids so those metrics always scan their source tables without the prefilter. Ensure you update the set construction that builds active_org_ids to union in these additional org IDs (or branch the metric path to bypass the prefilter) so daily/monthly buckets for HITL and similar sources are not left stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/services.py`:
- Around line 246-250: The wrapper helpers get_llm_calls, get_challenges,
get_summarization_calls, and get_llm_usage_cost each call
get_llm_metrics_combined separately, causing the heavy combined aggregation to
run multiple times; update the hot callers (notably
backfill_metrics.Command._collect_metrics and views.live_series) to call
MetricsQueryService.get_llm_metrics_combined once per organization/date
range/granularity and then derive the four values from that single result
instead of invoking the single-metric wrappers repeatedly; leave the wrapper
helpers for convenience but change the callers to fetch the combined result and
map to
{"period","llm_calls","challenges","summarization_calls","llm_usage_cost"} (or
extract the specific fields needed) to avoid quadruple aggregation.
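To make the combined-then-derive suggestion concrete, here is a minimal sketch of splitting one combined result set into the four per-metric series. The row and series shapes are assumptions for illustration, not the project's actual return types.

```python
def split_llm_combined(rows: list[dict]) -> dict[str, list[dict]]:
    """Fan one combined result set out into per-metric series (sketch).

    Assumes each row carries all four metric fields plus a "period" key;
    the real get_llm_metrics_split may use different shapes.
    """
    metrics = ("llm_calls", "challenges", "summarization_calls", "llm_usage_cost")
    split: dict[str, list[dict]] = {m: [] for m in metrics}
    for row in rows:
        for m in metrics:
            split[m].append({"period": row["period"], "value": row[m]})
    return split


combined = [
    {"period": "2024-01-01", "llm_calls": 3, "challenges": 1,
     "summarization_calls": 2, "llm_usage_cost": 0.75},
]
series = split_llm_combined(combined)
print(series["llm_calls"])  # [{'period': '2024-01-01', 'value': 3}]
```

The heavy aggregation runs once; the four wrappers become cheap dictionary lookups over its result.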
In `@backend/dashboard_metrics/tasks.py`:
- Around line 338-340: The daily cutoff comparison uses a timeful daily_start
against midnight-truncated day_ts, which excludes the boundary day; normalize
the cutoff to a day boundary by comparing dates (e.g., day_ts.date() >=
daily_start.date()) or by truncating daily_start to midnight before comparisons
so the bucket exactly 7 days ago is included; apply the same change to the other
occurrence that uses day_ts and daily_start (the block that builds the key with
(org_id, day_ts.date().isoformat(), metric_name, "default", "") and the similar
logic around the second helper).
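The boundary bug described above comes down to comparing a timeful cutoff against midnight-truncated buckets; a stdlib-only sketch with made-up dates:

```python
from datetime import datetime, timedelta

now = datetime(2024, 6, 15, 14, 30)
daily_start = now - timedelta(days=7)  # timeful cutoff: 2024-06-08 14:30
day_ts = datetime(2024, 6, 8)          # midnight-truncated daily bucket

# The timeful comparison wrongly drops the bucket exactly 7 days back...
assert not day_ts >= daily_start
# ...while comparing day boundaries includes it, as the fix suggests.
assert day_ts.date() >= daily_start.date()
print("boundary day included:", day_ts.date() >= daily_start.date())
```

Truncating `daily_start` to midnight before any comparison achieves the same result.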
---
Outside diff comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 507-519: The active_org_ids prefilter currently gathers orgs only
from WorkflowExecution.created_at (see variable active_org_ids and daily_start),
which misses orgs with recent HITL completions recorded via
HITLQueue.approved_at and therefore causes hitl_completions buckets to be
skipped; modify the prefilter to also include organization IDs from HITLQueue
(use approved_at >= daily_start) and any other recent-metric source tables that
can emit fresh metrics in the 7-day window, or alternatively exclude
hitl_completions (and other affected metric calculations) from using
active_org_ids so those metrics always scan their source tables without the
prefilter. Ensure you update the set construction that builds active_org_ids to
union in these additional org IDs (or branch the metric path to bypass the
prefilter) so daily/monthly buckets for HITL and similar sources are not left
stale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1bb907c1-14fd-4aaa-be13-db97326b34e6
📒 Files selected for processing (3)
- backend/dashboard_metrics/management/commands/backfill_metrics.py
- backend/dashboard_metrics/services.py
- backend/dashboard_metrics/tasks.py
- Replace 4 thin LLM wrapper methods with get_llm_metrics_split() that fetches combined data once and splits into per-metric series
- Update live_series view to use single combined query (4→1 DB queries)
- Update backfill_metrics command to use combined query (8→2 DB queries)
- Fix daily_start boundary day exclusion by truncating to midnight
- Rename resolved → page_usage_org_id for clarity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🧹 Nitpick comments (4)
backend/dashboard_metrics/views.py (1)
840-843: Dead code: `metric_name == "llm_usage"` check is now unreachable.

Since LLM metrics (including `llm_usage`) are handled in the earlier block (lines 782-822) and excluded from `metric_queries` via `LLM_METRIC_KEYS`, this condition at line 842 will never be true for non-LLM metrics processed here.

🧹 Suggested simplification

```diff
         series.append(
             {
                 "metric_name": metric_name,
-                "metric_type": MetricType.HISTOGRAM
-                if metric_name == "llm_usage"
-                else MetricType.COUNTER,
+                "metric_type": MetricType.COUNTER,
                 "data": [
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 840 - 843, The conditional that sets MetricType based on metric_name == "llm_usage" is dead code because LLM metrics (including "llm_usage") are handled earlier and excluded from metric_queries via LLM_METRIC_KEYS; update the block that constructs the metric dict (the code using "metric_name" and MetricType) to always assign MetricType.COUNTER (remove the if/else branch) so the metric creation is simplified and no unreachable check remains.

backend/dashboard_metrics/management/commands/backfill_metrics.py (1)
61-67: Class attribute type hint for mutable defaults.

Static analysis flags `LLM_METRIC_TYPES` as a mutable class attribute. While this is safe since it's read-only in practice, you could add `ClassVar` for explicit typing if desired.

🧹 Optional type hint improvement

```diff
+from typing import Any, ClassVar
 ...
-    LLM_METRIC_TYPES: dict[str, bool] = {
+    LLM_METRIC_TYPES: ClassVar[dict[str, bool]] = {
         "llm_calls": False,
         "challenges": False,
         "summarization_calls": False,
         "llm_usage": True,
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 61 - 67, LLM_METRIC_TYPES is a mutable class attribute flagged by static analysis; annotate it as a class-level constant by adding a ClassVar type hint (e.g., from typing import ClassVar) so its declaration becomes ClassVar[dict[str, bool]] and makes the intent explicit; locate the LLM_METRIC_TYPES attribute in the class in backfill_metrics.py and update its type annotation accordingly while keeping the same value.

backend/dashboard_metrics/tasks.py (2)
285-359: Cognitive complexity is elevated but the structure is clear.

SonarCloud flags this function for complexity (16 vs 15 allowed). The logic is straightforward (hourly loop + daily/monthly loop with split). Consider extracting the daily/monthly bucket loop into a small helper if you want to address the warning, but it's not critical.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 285 - 359, The function _aggregate_single_metric exceeds allowed cognitive complexity; extract the daily+monthly bucketing logic (the second query loop that builds daily_agg and monthly_buckets and the subsequent monthly_agg population) into a small helper function (e.g., _bucket_daily_and_monthly or similar) that accepts rows/params (org_id, daily_start, metric_name, metric_type, daily_agg, monthly_agg) and returns/updates monthly_buckets, then call that helper from _aggregate_single_metric and keep the hourly loop as-is to reduce complexity.
362-440: Cognitive complexity warning for `_aggregate_llm_combined`.

SonarCloud reports complexity of 26 (vs 15 allowed). The nested loops over `llm_combined_fields` inside both the daily and monthly paths contribute significantly. Extracting the field iteration into a helper could reduce nesting.

♻️ Suggested refactor to reduce complexity

```python
def _ingest_llm_row_to_agg(
    row: dict,
    ts_str: str,
    org_id: str,
    agg_dict: dict,
    llm_combined_fields: dict,
) -> None:
    """Ingest a single LLM combined row into an aggregation dict."""
    for field, (metric_name, metric_type) in llm_combined_fields.items():
        value = row[field] or 0
        key = (org_id, ts_str, metric_name, "default", "")
        if key not in agg_dict:
            agg_dict[key] = {"metric_type": metric_type, "value": 0, "count": 0}
        agg_dict[key]["value"] += value
        agg_dict[key]["count"] += 1
```

Then call this helper in both hourly and daily loops.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 362 - 440, The _aggregate_llm_combined function is over-complex due to duplicated nested loops over llm_combined_fields; extract the inner per-field logic into a helper (e.g., _ingest_llm_row_to_agg) that accepts row, ts_str, org_id, agg_dict, and llm_combined_fields and performs the value extraction, key construction, metric_type init, and accumulation, then call that helper from the hourly loop and from the daily branch to replace the duplicated code; for the monthly bucketing extract a small helper (e.g., _ingest_llm_row_to_monthly_bucket) that builds the (month_key_str, metric_name) bkey, initializes bucket entries and accumulates value/count, and use it in the monthly loop—this reduces nesting and cognitive complexity while preserving existing behavior in _aggregate_llm_combined.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 203-243: _acquire_aggregation_lock has a race between
cache.delete(...) and cache.add(...) when reclaiming stale/corrupted locks;
replace the non-atomic delete+add with atomic Redis operations using the raw
client from django_redis (get_redis_connection) so you perform SET NX EX for the
initial grab and use GET/GETSET (or GETSET + check old value and set TTL) when
reclaiming stale locks to ensure you only replace the lock if it still contains
the stale timestamp you observed; update references to AGGREGATION_LOCK_KEY and
the cache.add/cache.delete usage in _acquire_aggregation_lock to use
redis.set(..., nx=True, ex=...) and redis.getset / expire logic with a
float-check of the previous value to avoid the race.
---
Nitpick comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 61-67: LLM_METRIC_TYPES is a mutable class attribute flagged by
static analysis; annotate it as a class-level constant by adding a ClassVar type
hint (e.g., from typing import ClassVar) so its declaration becomes
ClassVar[dict[str, bool]] and makes the intent explicit; locate the
LLM_METRIC_TYPES attribute in the class in backfill_metrics.py and update its
type annotation accordingly while keeping the same value.
In `@backend/dashboard_metrics/tasks.py`:
- Around line 285-359: The function _aggregate_single_metric exceeds allowed
cognitive complexity; extract the daily+monthly bucketing logic (the second
query loop that builds daily_agg and monthly_buckets and the subsequent
monthly_agg population) into a small helper function (e.g.,
_bucket_daily_and_monthly or similar) that accepts rows/params (org_id,
daily_start, metric_name, metric_type, daily_agg, monthly_agg) and
returns/updates monthly_buckets, then call that helper from
_aggregate_single_metric and keep the hourly loop as-is to reduce complexity.
- Around line 362-440: The _aggregate_llm_combined function is over-complex due
to duplicated nested loops over llm_combined_fields; extract the inner per-field
logic into a helper (e.g., _ingest_llm_row_to_agg) that accepts row, ts_str,
org_id, agg_dict, and llm_combined_fields and performs the value extraction, key
construction, metric_type init, and accumulation, then call that helper from the
hourly loop and from the daily branch to replace the duplicated code; for the
monthly bucketing extract a small helper (e.g.,
_ingest_llm_row_to_monthly_bucket) that builds the (month_key_str, metric_name)
bkey, initializes bucket entries and accumulates value/count, and use it in the
monthly loop—this reduces nesting and cognitive complexity while preserving
existing behavior in _aggregate_llm_combined.
In `@backend/dashboard_metrics/views.py`:
- Around line 840-843: The conditional that sets MetricType based on metric_name
== "llm_usage" is dead code because LLM metrics (including "llm_usage") are
handled earlier and excluded from metric_queries via LLM_METRIC_KEYS; update the
block that constructs the metric dict (the code using "metric_name" and
MetricType) to always assign MetricType.COUNTER (remove the if/else branch) so
the metric creation is simplified and no unreachable check remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 30788ac3-29a5-450c-9105-71430a26b833
📒 Files selected for processing (4)
- backend/dashboard_metrics/management/commands/backfill_metrics.py
- backend/dashboard_metrics/services.py
- backend/dashboard_metrics/tasks.py
- backend/dashboard_metrics/views.py
- Extract _upsert_agg helper in tasks.py to eliminate repeated dict-init-and-increment pattern in both aggregation functions
- Extract _build_series_entry and _build_error_entry helpers in views.py to deduplicate series construction in live_series endpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
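The `_upsert_agg` helper mentioned above presumably encapsulates the dict-init-and-increment pattern both aggregation functions repeated; this standalone sketch (signature and key shape assumed, not taken from the PR) shows the idea:

```python
def _upsert_agg(agg: dict, key: tuple, metric_type: str, value: float) -> None:
    """Init-or-increment one bucket in an aggregation dict (sketch).

    Key shape and metric_type representation are illustrative assumptions.
    """
    bucket = agg.setdefault(key, {"metric_type": metric_type, "value": 0, "count": 0})
    bucket["value"] += value
    bucket["count"] += 1


agg: dict = {}
_upsert_agg(agg, ("org1", "2024-01-01", "llm_calls"), "counter", 3)
_upsert_agg(agg, ("org1", "2024-01-01", "llm_calls"), "counter", 2)
print(agg[("org1", "2024-01-01", "llm_calls")])
# {'metric_type': 'counter', 'value': 5, 'count': 2}
```

`dict.setdefault` collapses the if-not-in-dict-initialize branch that was previously duplicated at every call site.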
Move LLM and non-LLM metric fetching logic out of the view method into a standalone function, bringing the view's cognitive complexity well under the Sonar threshold of 15.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/dashboard_metrics/tasks.py (1)
450-462: ⚠️ Potential issue | 🔴 Critical

WorkflowExecution prefilter will silently skip orgs with recent PageUsage, Usage, or HITLQueue data.

The prefilter (lines 450-462) uses only `WorkflowExecution` with a 7-day window to select which organizations to process. However, the following metrics query their data directly and independently:

- `pages_processed` → queries `PageUsage` directly
- `llm_calls`, `challenges`, `summarization_calls`, `llm_usage` → query `Usage` directly
- `hitl_reviews`, `hitl_completions` → query `HITLQueue` directly

If an org has recent activity in any of these tables but no `WorkflowExecution` in the past 7 days, the entire org is skipped and all its metrics are silently not aggregated. This is a data integrity issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 450 - 462, The prefilter active_org_ids currently only pulls org IDs from WorkflowExecution using daily_start; update it to include org IDs with recent activity from PageUsage, Usage, and HITLQueue as well (using the same daily_start window) so that pages_processed, llm_calls/challenges/summarization_calls/llm_usage, and hitl_reviews/hitl_completions aren’t skipped; implement this by computing the union of distinct organization IDs from WorkflowExecution.values_list("workflow__organization_id"), PageUsage.values_list("organization_id"), Usage.values_list("organization_id"), and HITLQueue.values_list("organization_id") into the existing active_org_ids set (or replace active_org_ids with a combined queryset/set) to ensure all orgs with recent activity in any of those tables are processed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 360-383: The combined-LLM loops are upserting zeroed metrics for
fields that weren’t actually present in the bucket; modify the fan-out in the
loops that iterate over MetricsQueryService.get_llm_metrics_combined so you only
call _upsert_agg for a given (field, metric_name) when the field was truly
present in the row: first prefer a per-field presence flag if the query exposes
one (e.g. row.get(f"{field}_present") is truthy), otherwise require row[field]
is not None (and skip when row[field] is missing/None) before computing value
and calling _upsert_agg for hourly_agg, daily_agg, and monthly_agg; keep the
surrounding timestamp handling (_truncate_to_hour/_day/_month,
llm_combined_fields) unchanged.
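A minimal sketch of the presence check the prompt above asks for: skip fields whose value is missing or `None` instead of upserting zeros. All names here are illustrative, not the PR's actual helpers.

```python
def ingest_row(row: dict, fields: dict, upsert) -> None:
    """Fan a combined row out per field, skipping absent metrics (sketch)."""
    for field, metric_name in fields.items():
        if row.get(field) is None:  # absent/None: do not write a zeroed bucket
            continue
        upsert(metric_name, row[field])


written: dict = {}
ingest_row(
    {"llm_calls": 4, "challenges": None},
    {"llm_calls": "llm_calls", "challenges": "challenges"},
    lambda name, value: written.setdefault(name, value),
)
print(written)  # {'llm_calls': 4}
```

Only genuinely present fields reach the upsert, so empty buckets are never materialized.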
In `@backend/dashboard_metrics/views.py`:
- Around line 49-73: Replace the implicit llm_usage-vs-everything-else logic
with an explicit metric type map (e.g., METRIC_TYPE_MAP = {"llm_usage":
MetricType.HISTOGRAM, "pages_processed": MetricType.HISTOGRAM, ...}) and use
that map inside _build_series_entry and _build_error_entry to determine
metric_type (falling back to MetricType.COUNTER if not listed); update
_build_series_entry to read metric_type = METRIC_TYPE_MAP.get(metric_name,
MetricType.COUNTER) and include it in the returned dict, and update
_build_error_entry to also look up and include the same metric_type so partial
failures preserve the correct type.
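A sketch of the explicit metric type map suggested above; `MetricType` here is a local stand-in enum and the map entries are illustrative, not the project's actual definitions.

```python
from enum import Enum


class MetricType(str, Enum):
    """Local stand-in for the project's MetricType."""
    COUNTER = "counter"
    HISTOGRAM = "histogram"


# Explicit mapping replaces the implicit llm_usage-vs-everything-else branch.
METRIC_TYPE_MAP = {
    "llm_usage": MetricType.HISTOGRAM,
    "pages_processed": MetricType.HISTOGRAM,
}


def metric_type_for(name: str) -> MetricType:
    """Look up a metric's type, defaulting to COUNTER when unlisted."""
    return METRIC_TYPE_MAP.get(name, MetricType.COUNTER)


print(metric_type_for("llm_usage"))  # MetricType.HISTOGRAM
```

Both the success-entry and error-entry builders can call the same lookup, which keeps partial-failure responses shape-consistent.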
- Around line 792-829: The metric mapping in metric_queries dropped the
"failed_pages" entry so requests to GET /live-series?metric_name=failed_pages
return empty series; restore the mapping by adding "failed_pages":
MetricsQueryService.get_failed_pages to the metric_queries dict (in the same
block where documents_processed, pages_processed, etc. are defined) so
_build_series_entry and the aggregation in backend/dashboard_metrics/tasks.py
continue to work, or alternatively add explicit validation at the start of the
handler to reject unknown metrics (using MetricsQueryService.LLM_METRIC_KEYS and
the metric_queries keys) so a missing metric is surfaced instead of returning an
empty series.
---
Outside diff comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 450-462: The prefilter active_org_ids currently only pulls org IDs
from WorkflowExecution using daily_start; update it to include org IDs with
recent activity from PageUsage, Usage, and HITLQueue as well (using the same
daily_start window) so that pages_processed,
llm_calls/challenges/summarization_calls/llm_usage, and
hitl_reviews/hitl_completions aren’t skipped; implement this by computing the
union of distinct organization IDs from
WorkflowExecution.values_list("workflow__organization_id"),
PageUsage.values_list("organization_id"), Usage.values_list("organization_id"),
and HITLQueue.values_list("organization_id") into the existing active_org_ids
set (or replace active_org_ids with a combined queryset/set) to ensure all orgs
with recent activity in any of those tables are processed.
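The union described above can be sketched with sqlite and a SQL `UNION` over the activity tables (table and column names are hypothetical stand-ins; in Django this would be a union of distinct `values_list` querysets filtered on the same window):

```python
import sqlite3

# Hypothetical stand-in tables; the real models live in the Django app.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow_execution (org_id TEXT, created_at TEXT);
CREATE TABLE page_usage (org_id TEXT, created_at TEXT);
CREATE TABLE hitl_queue (org_id TEXT, approved_at TEXT);
INSERT INTO workflow_execution VALUES ('org_a', '2024-06-14');
INSERT INTO page_usage VALUES ('org_b', '2024-06-13');
INSERT INTO hitl_queue VALUES ('org_c', '2024-06-12');
""")

daily_start = "2024-06-08"  # 7-day window cutoff (ISO dates compare lexically)
active_org_ids = {
    row[0]
    for row in conn.execute(
        """
        SELECT org_id FROM workflow_execution WHERE created_at >= ?
        UNION
        SELECT org_id FROM page_usage WHERE created_at >= ?
        UNION
        SELECT org_id FROM hitl_queue WHERE approved_at >= ?
        """,
        (daily_start, daily_start, daily_start),
    )
}
print(sorted(active_org_ids))  # ['org_a', 'org_b', 'org_c']
```

An org active in any one source table survives the prefilter, so no metric's buckets go stale.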
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 269487ef-bace-4842-8c97-b0e70bda04d4
📒 Files selected for processing (2)
- backend/dashboard_metrics/tasks.py
- backend/dashboard_metrics/views.py
Add _HISTOGRAM_METRICS set and _metric_type helper to correctly classify pages_processed and failed_pages as histograms, consistent with tasks.py and backfill_metrics.py. Previously only llm_usage was marked as histogram in the views layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (1)
backend/dashboard_metrics/views.py (1)
76-127: Cognitive complexity exceeds threshold (22 vs 15 allowed).

SonarCloud flags this function for high cognitive complexity. Consider extracting the LLM metrics handling (lines 94-110) into a separate helper function to reduce nesting.
Suggested refactor
```diff
+def _fetch_llm_series(
+    org_id: str,
+    start_date: datetime,
+    end_date: datetime,
+    granularity: str,
+    requested_metric: str | None,
+    llm_metric_keys: dict,
+) -> tuple[list[dict], list[str]]:
+    """Fetch LLM metrics series."""
+    series: list[dict] = []
+    errors: list[str] = []
+    try:
+        llm_split = MetricsQueryService.get_llm_metrics_split(
+            org_id, start_date, end_date, granularity
+        )
+        for name, data in llm_split.items():
+            if not requested_metric or name == requested_metric:
+                series.append(_build_series_entry(name, data))
+    except Exception:
+        logger.exception("Failed to fetch LLM metrics")
+        for name in llm_metric_keys:
+            if not requested_metric or name == requested_metric:
+                errors.append(name)
+                series.append(_build_error_entry(name))
+    return series, errors
+
+
 def _fetch_live_series(
     org_id: str,
     start_date: datetime,
     end_date: datetime,
     granularity: str,
     metric_queries: dict,
     requested_metric: str | None,
 ) -> tuple[list[dict], list[str]]:
-    """Fetch all metric series (LLM combined + individual).
-
-    Returns:
-        Tuple of (series list, error names list).
-    """
-    series: list[dict] = []
-    errors: list[str] = []
+    """Fetch all metric series (LLM combined + individual)."""
     llm_metric_keys = MetricsQueryService.LLM_METRIC_KEYS
-    # Fetch all 4 LLM metrics in a single query
+    series: list[dict] = []
+    errors: list[str] = []
+
     if not requested_metric or requested_metric in llm_metric_keys:
-        try:
-            llm_split = MetricsQueryService.get_llm_metrics_split(
-                org_id,
-                start_date,
-                end_date,
-                granularity,
-            )
-            for name, data in llm_split.items():
-                if not requested_metric or name == requested_metric:
-                    series.append(_build_series_entry(name, data))
-        except Exception:
-            logger.exception("Failed to fetch LLM metrics")
-            for name in llm_metric_keys:
-                if not requested_metric or name == requested_metric:
-                    errors.append(name)
-                    series.append(_build_error_entry(name))
+        llm_series, llm_errors = _fetch_llm_series(
+            org_id, start_date, end_date, granularity, requested_metric, llm_metric_keys
+        )
+        series.extend(llm_series)
+        errors.extend(llm_errors)

     # Filter non-LLM metrics if a specific metric was requested
     if requested_metric:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 76 - 127, The function _fetch_live_series is too complex; extract the LLM-specific block (the try/except that calls MetricsQueryService.get_llm_metrics_split, iterates llm_split, appends _build_series_entry or on error logs via logger.exception and appends _build_error_entry for each MetricsQueryService.LLM_METRIC_KEYS item) into a new helper function (e.g., _fetch_llm_series) that accepts org_id, start_date, end_date, granularity, requested_metric and returns (list_of_series, list_of_error_names); then replace the original LLM block in _fetch_live_series with a single call to that helper and merge its returned series/errors with the rest of the function, keeping existing use of _build_series_entry, _build_error_entry and logger.exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 76-127: The function _fetch_live_series is too complex; extract
the LLM-specific block (the try/except that calls
MetricsQueryService.get_llm_metrics_split, iterates llm_split, appends
_build_series_entry or on error logs via logger.exception and appends
_build_error_entry for each MetricsQueryService.LLM_METRIC_KEYS item) into a new
helper function (e.g., _fetch_llm_series) that accepts org_id, start_date,
end_date, granularity, requested_metric and returns (list_of_series,
list_of_error_names); then replace the original LLM block in _fetch_live_series
with a single call to that helper and merge its returned series/errors with the
rest of the function, keeping existing use of _build_series_entry,
_build_error_entry and logger.exception.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0bd16704-6ee8-43fe-a52c-ca5340d7c29e
📒 Files selected for processing (2)
- backend/dashboard_metrics/tasks.py
- backend/dashboard_metrics/views.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/dashboard_metrics/views.py (1)
72-79: ⚠️ Potential issue | 🟡 Minor

Add `metric_type` to error entries for response consistency.

The error entry is missing the `metric_type` field. When partial failures occur, the frontend receives inconsistent response shapes: successful entries have `metric_type` while error entries don't. This was noted in a previous review.

Proposed fix

```diff
 def _build_error_entry(metric_name: str) -> dict:
     """Build an error series entry for a failed metric."""
     return {
         "metric_name": metric_name,
+        "metric_type": _metric_type(metric_name),
         "error": "unavailable",
         "data": [],
         "total_value": 0,
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 72 - 79, The error entry returned by _build_error_entry is missing the metric_type field, causing inconsistent response shapes; update _build_error_entry (function name: _build_error_entry) to accept an optional metric_type parameter (e.g., metric_type: Optional[str] = None) and include "metric_type": metric_type (or a default like "unavailable"/None) in the returned dict so error entries match the successful entries' shape.
🧹 Nitpick comments (1)
backend/dashboard_metrics/views.py (1)
82-134: Consider extracting helpers to reduce cognitive complexity.

SonarCloud flags this function with cognitive complexity 22 (limit 15). The logic is correct, but the nested conditionals for LLM vs non-LLM paths and per-metric error handling contribute to the complexity.
Suggested refactor: extract LLM fetching
```diff
+def _fetch_llm_series(
+    org_id: str,
+    start_date: datetime,
+    end_date: datetime,
+    granularity: str,
+    requested_metric: str | None,
+) -> tuple[list[dict], list[str]]:
+    """Fetch LLM metrics series."""
+    series: list[dict] = []
+    errors: list[str] = []
+    llm_keys = MetricsQueryService.LLM_METRIC_KEYS
+
+    try:
+        llm_split = MetricsQueryService.get_llm_metrics_split(
+            org_id, start_date, end_date, granularity
+        )
+        for name, data in llm_split.items():
+            if not requested_metric or name == requested_metric:
+                series.append(_build_series_entry(name, data))
+    except Exception:
+        logger.exception("Failed to fetch LLM metrics")
+        for name in llm_keys:
+            if not requested_metric or name == requested_metric:
+                errors.append(name)
+                series.append(_build_error_entry(name))
+
+    return series, errors
+
+
 def _fetch_live_series(
     org_id: str,
     start_date: datetime,
     end_date: datetime,
     granularity: str,
     metric_queries: dict,
     requested_metric: str | None,
 ) -> tuple[list[dict], list[str]]:
-    """Fetch all metric series (LLM combined + individual).
-
-    Returns:
-        Tuple of (series list, error names list).
-    """
+    """Fetch all metric series (LLM combined + individual)."""
     series: list[dict] = []
     errors: list[str] = []
-    llm_metric_keys = MetricsQueryService.LLM_METRIC_KEYS
-    # Fetch all 4 LLM metrics in a single query
-    if not requested_metric or requested_metric in llm_metric_keys:
-        try:
-            llm_split = MetricsQueryService.get_llm_metrics_split(
-                org_id,
-                start_date,
-                end_date,
-                granularity,
-            )
-            for name, data in llm_split.items():
-                if not requested_metric or name == requested_metric:
-                    series.append(_build_series_entry(name, data))
-        except Exception:
-            logger.exception("Failed to fetch LLM metrics")
-            for name in llm_metric_keys:
-                if not requested_metric or name == requested_metric:
-                    errors.append(name)
-                    series.append(_build_error_entry(name))
+    # Fetch LLM metrics
+    if not requested_metric or requested_metric in MetricsQueryService.LLM_METRIC_KEYS:
+        llm_series, llm_errors = _fetch_llm_series(
+            org_id, start_date, end_date, granularity, requested_metric
+        )
+        series.extend(llm_series)
+        errors.extend(llm_errors)

     # Filter non-LLM metrics if a specific metric was requested
     ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 82 - 134, The _fetch_live_series function is too complex; extract the LLM-specific fetch and error handling into a helper to reduce nesting. Create a helper (e.g., fetch_llm_series(org_id, start_date, end_date, granularity, requested_metric)) that calls MetricsQueryService.get_llm_metrics_split, builds series entries with _build_series_entry, and on exception logs and returns error names and error entries using _build_error_entry; then simplify _fetch_live_series to call that helper when appropriate and only handle the remaining metric_queries loop (calling each query_fn and using _build_series_entry/_build_error_entry) so the main function becomes a simple orchestration of fetch_llm_series(...) + the per-metric loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 851-860: The metric mapping in metric_queries is missing the
failed_pages entry, causing live-series queries for "failed_pages" to return
empty; add "failed_pages": MetricsQueryService.get_failed_pages to the
metric_queries dict so that the failed_pages histogram is routed to
MetricsQueryService.get_failed_pages (update the metric_queries definition where
it lists documents_processed, pages_processed, etc., and ensure the key matches
"failed_pages").
---
Duplicate comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 72-79: The error entry returned by _build_error_entry is missing
the metric_type field, causing inconsistent response shapes; update
_build_error_entry (function name: _build_error_entry) to accept an optional
metric_type parameter (e.g., metric_type: Optional[str] = None) and include
"metric_type": metric_type (or a default like "unavailable"/None) in the
returned dict so error entries match the successful entries' shape.
---
Nitpick comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 82-134: The _fetch_live_series function is too complex; extract
the LLM-specific fetch and error handling into a helper to reduce nesting.
Create a helper (e.g., fetch_llm_series(org_id, start_date, end_date,
granularity, requested_metric)) that calls
MetricsQueryService.get_llm_metrics_split, builds series entries with
_build_series_entry, and on exception logs and returns error names and error
entries using _build_error_entry; then simplify _fetch_live_series to call that
helper when appropriate and only handle the remaining metric_queries loop
(calling each query_fn and using _build_series_entry/_build_error_entry) so the
main function becomes a simple orchestration of fetch_llm_series(...) + the
per-metric loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 04a9559c-afd2-425d-9cd7-053321776b6d
📒 Files selected for processing (1)
backend/dashboard_metrics/views.py
muhammad-ali-e
left a comment
Code Review Findings
Critical Issues
1. Active org prefilter is too narrow (data integrity risk)
tasks.py ~line 460
The active_org_ids prefilter queries only WorkflowExecution.created_at >= daily_start (7 days). But several metrics query independent tables:
- `pages_processed` → `PageUsage`
- `llm_*` metrics → `Usage`
- `hitl_reviews`, `hitl_completions` → `HITLQueue`
An org with recent HITL completions or LLM usage but no WorkflowExecution in the last 7 days gets silently skipped. Monthly totals go stale. The PR description acknowledges the 7-day window is intentional for dormant orgs, but the dormancy signal (WorkflowExecution) doesn't cover all metric sources.
Suggestion: Union org IDs from all source tables, or at minimum add Usage and HITLQueue to the prefilter query.
2. Race condition in _acquire_aggregation_lock()
tasks.py ~lines 223-240
The cache.delete() + cache.add() sequence is non-atomic. Between delete and add, another worker can grab the lock, leading to two concurrent aggregation runs.
Suggestion: Use Redis GETSET or SET ... NX EX via get_redis_connection() for an atomic compare-and-swap. Given this runs every 15 minutes and the window is small, this is low-probability but worth fixing since the whole point of the lock is correctness.
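The atomic acquire the reviewer suggests can be sketched with a thread-safe in-memory stand-in for Redis (the class name `InMemoryLockStore` is hypothetical; a real fix would be a single redis-py call, `client.set(key, value, nx=True, ex=timeout)`, which checks and sets in one server-side operation):

```python
import threading
import time


class InMemoryLockStore:
    """Toy stand-in for Redis, showing why SET NX EX must be one step."""

    def __init__(self) -> None:
        # key -> (value, expires_at)
        self._data: dict[str, tuple[float, float]] = {}
        self._mu = threading.Lock()

    def set_nx_ex(self, key: str, value: float, ttl: int) -> bool:
        # One atomic operation: acquire only if the key is absent or expired.
        # No delete-then-add gap for a second worker to sneak through.
        with self._mu:
            now = time.time()
            current = self._data.get(key)
            if current is not None and current[1] > now:
                return False  # lock held and not yet expired
            self._data[key] = (value, now + ttl)
            return True


store = InMemoryLockStore()
assert store.set_nx_ex("agg_lock", time.time(), 900) is True   # first worker wins
assert store.set_nx_ex("agg_lock", time.time(), 900) is False  # second worker skips
```

The point of the sketch: because check and write happen under one lock (one Redis command in the real system), two workers can never both acquire.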
Medium Issues
3. failed_pages missing from metric_queries in views.py
views.py ~line 792-829
The metric_queries dict in live_series doesn't include "failed_pages": MetricsQueryService.get_failed_pages. This means GET /live-series?metric_name=failed_pages returns empty series. The tasks.py aggregation still handles it, but the live fallback path is broken.
4. Zeroed metrics for absent LLM data
tasks.py ~lines 360-383
The LLM combined query always returns all 4 fields per row. When a period has LLM calls but no challenges, challenges is 0 (not NULL). _upsert_agg still creates a record with value=0. This writes zero-value rows that didn't exist before. Not incorrect per se, but inflates row counts and changes behavior from the original code which only wrote rows when data existed.
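If the row-inflation behavior change matters, one option is to skip zero values before upserting. A minimal sketch with hypothetical row shapes (the real rows come from the combined LLM query; `split_nonzero` is an illustrative helper, not code from the PR):

```python
LLM_FIELDS = ("llm_calls", "challenges", "summarization_calls", "llm_usage")


def split_nonzero(rows, metric_fields=LLM_FIELDS):
    """Yield (metric_name, period, value) triples, skipping zeros.

    Dropping zero values means the upsert step never writes a zero-value
    row that the original per-metric queries would not have produced.
    """
    for row in rows:
        for field in metric_fields:
            value = row.get(field) or 0
            if value:
                yield field, row["period"], value


rows = [
    {"period": "2024-01-01T00", "llm_calls": 3, "challenges": 0,
     "summarization_calls": 0, "llm_usage": 12},
]
assert list(split_nonzero(rows)) == [
    ("llm_calls", "2024-01-01T00", 3),
    ("llm_usage", "2024-01-01T00", 12),
]
```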
Minor / Nitpicks
5. _metric_type() in views.py is incomplete
The _HISTOGRAM_METRICS frozenset covers llm_usage, pages_processed, failed_pages. This is correct today but is a second source of truth alongside LLM_METRIC_TYPES in backfill and llm_combined_fields in tasks. Consider a single canonical mapping.
6. ClassVar annotation for LLM_METRIC_TYPES
backfill_metrics.py line 61: LLM_METRIC_TYPES: dict[str, bool] is a mutable class attribute. Adding ClassVar would make the intent explicit for static analyzers.
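For illustration, a minimal sketch of the suggested annotation (class name and dict contents are abbreviated stand-ins for the backfill command's actual attribute):

```python
from typing import ClassVar


class BackfillCommand:
    # ClassVar marks this dict as shared class-level state rather than an
    # instance attribute, which is what static analyzers (and e.g.
    # dataclasses) need to treat it correctly.
    LLM_METRIC_TYPES: ClassVar[dict[str, bool]] = {
        "llm_calls": False,  # count-style metric (illustrative value)
        "llm_usage": True,   # histogram-style metric (illustrative value)
    }


assert BackfillCommand.LLM_METRIC_TYPES["llm_usage"] is True
```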
7. Cognitive complexity
_aggregate_single_metric (16) and _aggregate_llm_combined (26) exceed typical thresholds. The shared _upsert_agg helper was a good step — extracting the daily+monthly fan-out into a similar helper would reduce nesting further.
What's Good
- Merging 4 LLM queries into 1 via Django conditional aggregation is clean and well-proven
- The daily+monthly query merge (reusing DAY granularity for both) is a sound optimization
- `_resolve_org_identifier` with pre-resolution avoids N+1 lookups on PageUsage
- Self-healing lock concept is the right approach for handling OOM-killed workers
- PR description is exemplary — trade-offs, testing plan, and backwards compatibility are all addressed
- The `_upsert_agg` extraction reduces code duplication nicely
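The daily+monthly merge praised in the reviews can be sketched in pure Python: query once at DAY granularity, then split the rows into the two tiers. Row shapes here are hypothetical simplifications (the real task also keys rows by org and metric):

```python
from collections import defaultdict
from datetime import datetime


def split_day_rows(rows):
    """Split DAY-granularity rows into daily and monthly aggregates.

    One DB query serves both tiers: daily rows pass through unchanged,
    while monthly values are summed after truncating each period to the
    start of its month.
    """
    daily = {r["period"]: r["value"] for r in rows}
    monthly: dict[datetime, int] = defaultdict(int)
    for r in rows:
        month_start = r["period"].replace(
            day=1, hour=0, minute=0, second=0, microsecond=0
        )
        monthly[month_start] += r["value"]
    return daily, dict(monthly)


rows = [
    {"period": datetime(2024, 1, 30), "value": 2},
    {"period": datetime(2024, 1, 31), "value": 3},
    {"period": datetime(2024, 2, 1), "value": 5},
]
daily, monthly = split_day_rows(rows)
assert monthly[datetime(2024, 1, 1)] == 5  # Jan rows summed
assert monthly[datetime(2024, 2, 1)] == 5
```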
| Severity | Count | Key Items |
|---|---|---|
| Critical | 2 | Narrow org prefilter skips metrics; lock race condition |
| Medium | 2 | Missing failed_pages in views; zero-value row inflation |
| Minor | 3 | Duplicate metric type sources; ClassVar; complexity |
Test Results
Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
Greptile Summary

This PR optimizes the periodic aggregate_metrics_from_sources task.

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Celery as Celery Scheduler
    participant Task as aggregate_metrics_from_sources
    participant Lock as _acquire_aggregation_lock
    participant Redis as Redis Cache
    participant DB as PostgreSQL
    participant Agg as Aggregation Tables
    Celery->>Task: trigger (every 15 min)
    Task->>Lock: _acquire_aggregation_lock()
    Lock->>Redis: cache.add(key, timestamp, 900s)
    alt Lock free
        Redis-->>Lock: True (acquired)
        Lock-->>Task: True
    else Lock held
        Redis-->>Lock: False
        Lock->>Redis: cache.get(key)
        alt Value is None (expired in race)
            Redis-->>Lock: None
            Lock-->>Task: False (skip — missed re-acquire opportunity)
        else Value is old "running" string
            Lock->>Redis: cache.delete + cache.add
            Lock-->>Task: True (reclaimed)
        else Timestamp value, age > 900s
            Lock->>Redis: cache.delete + cache.add (non-atomic)
            Lock-->>Task: True (stale reclaim)
        else Lock is legitimate
            Lock-->>Task: False (skip)
        end
    end
    Task->>DB: WorkflowExecution filter (last 7 days) — active org IDs
    DB-->>Task: active_org_ids set
    loop For each active org
        Task->>DB: get_llm_metrics_combined (HOUR granularity)
        DB-->>Task: {period, llm_calls, challenges, summarization_calls, llm_usage}
        Task->>DB: get_llm_metrics_combined (DAY granularity, from monthly_start)
        DB-->>Task: rows split in Python → daily_agg + monthly_agg
        loop For each non-LLM metric
            Task->>DB: query_method (HOUR granularity)
            DB-->>Task: hourly rows
            Task->>DB: query_method (DAY granularity, from monthly_start)
            DB-->>Task: rows split in Python → daily_agg + monthly_agg
        end
        Task->>Agg: bulk_create (EventMetricsHourly, ON CONFLICT UPDATE)
        Task->>Agg: bulk_create (EventMetricsDaily, ON CONFLICT UPDATE)
        Task->>Agg: bulk_create (EventMetricsMonthly, ON CONFLICT UPDATE)
    end
    Task->>Redis: cache.delete(lock_key)
```
```python
lock_value = cache.get(AGGREGATION_LOCK_KEY)
if lock_value is None:
    # Expired between our check and get — race is fine, next run will pick up
    return False
```
Missed re-acquisition on expired lock
When cache.add fails (lock held) and the subsequent cache.get returns None, the lock expired naturally between those two Redis calls — meaning the lock is now free. Returning False here causes the current invocation to skip aggregation entirely, even though there is no competing run. The next scheduled invocation 15 minutes later is the first that will successfully acquire.
Since the lock is demonstrably free at this point, the function should retry cache.add rather than aborting:
```diff
 lock_value = cache.get(AGGREGATION_LOCK_KEY)
 if lock_value is None:
-    # Expired between our check and get — race is fine, next run will pick up
-    return False
+    # Expired between our check and get — lock is now free, try to acquire it
+    return cache.add(AGGREGATION_LOCK_KEY, str(now), AGGREGATION_LOCK_TIMEOUT)
```
The code already has an inline comment acknowledging this: "race is fine, next run will pick up". The task runs every 15 minutes with idempotent upserts — skipping one cycle has zero impact.
```python
age = now - lock_time
if age > AGGREGATION_LOCK_TIMEOUT:
    logger.warning(
        "Reclaiming stale aggregation lock (age=%.0fs, timeout=%ds)",
        age,
        AGGREGATION_LOCK_TIMEOUT,
    )
    cache.delete(AGGREGATION_LOCK_KEY)
    return cache.add(AGGREGATION_LOCK_KEY, str(now), AGGREGATION_LOCK_TIMEOUT)
```
TOCTOU race in stale lock reclaim may allow duplicate concurrent runs
The stale-lock reclaim uses a non-atomic cache.delete + cache.add pattern combined with a wall-clock age check. This creates a real race window:
1. The stale detection compares `now - lock_time` (both from application-server clocks) against `AGGREGATION_LOCK_TIMEOUT`.
2. If two Celery workers run on hosts whose clocks differ by even a few seconds, both workers can independently compute `age > AGGREGATION_LOCK_TIMEOUT` while the legitimate holder is still running.
3. Worker A deletes the lock and wins the `cache.add`. Worker B's `add` fails and skips (good). But the original holder (Worker C, whose task is still running) no longer has a valid lock — Worker A is now running in parallel.
The delete → add two-step is not atomic in Redis. A correct approach would use a Lua CAS script:
```python
# Atomic CAS: only delete and re-set if the current value matches what we read
LUA_RECLAIM = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('set', KEYS[1], ARGV[2], 'EX', ARGV[3])
else
    return nil
end
"""
```

If using Django's `cache` abstraction (which wraps Redis), calling `cache.client.eval(LUA_RECLAIM, ...)` or exposing the raw Redis client for this operation would make the reclaim atomic and eliminate the TOCTOU window.
Can you elaborate on what TOCTOU race condition is?
TOCTOU = Time-of-Check, Time-of-Use race condition.
It happens when you check a condition, then act on it later, but the condition can change between the check and the action.
In your code:
Time-of-Check (lines 234-240):

```python
lock_time = float(lock_value)
age = now - lock_time
if age > AGGREGATION_LOCK_TIMEOUT:  # ← CHECK: "is this stale?"
```

Time-of-Use (lines 241-249):

```python
cache.delete(AGGREGATION_LOCK_KEY)  # ← USE: delete the "stale" lock
return cache.add(AGGREGATION_LOCK_KEY, str(now), AGGREGATION_LOCK_TIMEOUT)
```

The race window:
Two workers (A and B) both read the same stale lock_value at roughly the same time. Both compute age > AGGREGATION_LOCK_TIMEOUT and decide "this is stale, I should reclaim it." Then:
- Worker A deletes the lock
- Worker A adds its own lock (succeeds)
- Worker B deletes the lock (deletes A's new lock!)
- Worker B adds its own lock (succeeds)
Now both A and B think they own the lock and run the aggregation in parallel, which can corrupt your metrics.
The fix is to make the check-and-swap atomic using a Lua script (like a database compare-and-swap), so only the worker that sees the exact stale value can reclaim it. If the value changed (because another worker already reclaimed it), the operation fails.
Even if two workers run concurrently, all writes are INSERT...ON CONFLICT upserts, so results are correct (just redundant work). The suggested fix (Lua CAS script) is over-engineered for a 15-minute batch job with idempotent operations.
muhammad-ali-e
left a comment
Code Review: UN-3322 [FIX] Optimize dashboard metrics aggregation task
Summary
Well-structured optimization PR — consolidates 4 LLM queries into 1 via Django conditional aggregation, merges daily+monthly queries, adds self-healing Redis lock, and tightens the active-org window from 2 months to 7 days. PR description is exemplary with explicit trade-off documentation. Findings below went through adversarial debate + fact-checking to filter false positives.
Automated Checks
- ruff check: FAIL — `services.py:11` unsorted imports (I001)
Findings
Medium
| # | File | Finding | Suggestion |
|---|---|---|---|
| 1 | tasks.py ~line 463 | Active-org prefilter narrowed to 7 days may skip HITL-only orgs. The WorkflowExecution-based filter (same pattern as main) misses orgs with recent HITLQueue activity but no WorkflowExecution in the past 7 days. Stale hitl_reviews/hitl_completions metrics for those orgs. | Consider unioning HITLQueue org IDs into active_org_ids for completeness. Low urgency — same base pattern exists on main. |
| 2 | tasks.py ~line 240 | Lock reclaim uses non-atomic cache.delete() + cache.add(). Narrow race window where two workers both detect staleness. Worst case is redundant idempotent upserts (last-writer-wins), not data corruption. Strictly better than main (which has zero stale-lock recovery). | Consider cache.set() directly after staleness check instead of delete+add, or use django-redis lock primitive. Follow-up, not urgent. |
Minor (pre-existing patterns, not introduced by this PR)
| # | File | Finding | Suggestion |
|---|---|---|---|
| 3 | services.py:219,227-228 | Hardcoded strings "llm", "challenge", "summarize" — UsageType and LLMUsageReason enums exist in usage_v2.models. Pre-existing from old code, carried into combined query. | Use existing enums when convenient. |
| 4 | services.py:337,625 | Hardcoded "ETL" string — Pipeline.PipelineType.ETL enum exists. Pre-existing. | Use the enum. |
| 5 | tasks.py + backfill_metrics.py | Three _bulk_upsert_* functions duplicated (~90 lines) between tasks.py and backfill command. Also _truncate_to_hour duplicated. | Extract to shared dashboard_metrics/aggregation.py module. |
| 6 | tasks.py:563 | logger.exception(f"...{e}") anti-pattern — redundant since logger.exception already captures traceback. Bypasses lazy formatting. | Use logger.exception("Error processing org %s", org_id). |
| 7 | views.py:698 | Stale docstring: "all 9 metrics" — there are now 12. | Update count or drop the number. |
What's Good
- Merging 4 LLM queries into 1 via `Count(filter=Q(...))` conditional aggregation is clean
- Daily+monthly query merge (reusing DAY granularity for both tiers) is a sound optimization
- `_resolve_org_identifier` with pre-resolution avoids N+1 lookups on PageUsage
- Self-healing lock is the right approach for OOM-killed workers — strictly better than main
- `_upsert_agg` extraction reduces code duplication nicely
- PR description documents trade-offs, testing plan, and backward compatibility thoroughly
Debated & Dismissed (false positives)
- UUID vs string mismatch in backfill — Organization.id is a BigAutoField (integer), not UUID. The `isdigit()` + `int()` conversion handles org_identifiers correctly.
- `failed_pages` missing from `live_series` — was never in `live_series` on main either. Not a regression.
What

- Optimize the `aggregate_metrics_from_sources` Celery task to reduce DB queries by ~55%

Why

- A stuck aggregation lock required manual intervention (`redis-cli DEL`) when workers were OOM-killed

How

- Combine 4 LLM metric queries into 1 using conditional aggregation (`Count(filter=Q(...))`, `Sum`) — saves 9 queries/org
- Resolve the `Organization.organization_id` string once per org for PageUsage queries instead of a per-metric DB lookup — saves 5 lookups/org
- Tighten the active-org filter: `WorkflowExecution.created_at__gte=daily_start` (7 days) instead of `monthly_start` (2 months). Dormant orgs re-enter the window when activity resumes; monthly totals were already persisted by prior runs
- Self-healing lock: store `time.time()` as the lock value. On contention, check age > timeout → reclaim the stale lock. Handles migration from the old `"running"` string value via a `TypeError`/`ValueError` branch
- Remove the unused `org_identifier` parameter from `get_failed_pages` (joins through the workflow FK, not PageUsage)

Trade-off: Performance vs Stability
This PR trades marginal stability for significant performance gains. Specifically:
- If the `usage` table has a schema issue (e.g., a corrupted `llm_usage_reason` column), all 4 LLM metrics fail together instead of independently. Mitigation: the outer `try/except` per org catches this and logs the error — other orgs and non-LLM metrics are unaffected.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- The public query methods (`get_llm_calls`, `get_challenges`, etc.) preserve backward compatibility for views.py and the backfill command
- The lock migration branch handles the old `"running"` string in Redis
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
- Watch the log line `Aggregation: X active orgs out of Y total` — X will be lower than before (7-day vs 2-month window), this is expected
- Simulate a stuck lock with `redis-cli SET "dashboard_metrics:aggregation_lock" "running"`, then verify the next run reclaims it and logs a warning
- Dry-run the backfill: `python manage.py backfill_metrics --days=1 --dry-run`
N/A (backend-only change)
Checklist
I have read and understood the Contribution Guidelines.