Conversation
(force-pushed c5e302a to 312d056)
(force-pushed 312d056 to aa2b198)
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: Unconditional import of optional dependency breaks metrics package
  - I wrapped the `RapidataMetric` import in `metrics/__init__.py` with a `ModuleNotFoundError` guard for missing `rapidata` and only export it when available.
- ✅ Fixed: Missing newline between concatenated warning message strings
  - I added the missing newline separator in the warning string so the message now renders as separate sentences.
- ✅ Fixed: Missing benchmark validation in the `compute()` method
  - I added `self._require_benchmark()` at the start of `compute()` so missing benchmark state raises the intended `ValueError`.
Or push these changes by commenting:
@cursor push fc62a066b6
Preview (fc62a066b6)
diff --git a/src/pruna/evaluation/metrics/__init__.py b/src/pruna/evaluation/metrics/__init__.py
--- a/src/pruna/evaluation/metrics/__init__.py
+++ b/src/pruna/evaluation/metrics/__init__.py
@@ -22,10 +22,15 @@
from pruna.evaluation.metrics.metric_memory import DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric
from pruna.evaluation.metrics.metric_model_architecture import TotalMACsMetric, TotalParamsMetric
from pruna.evaluation.metrics.metric_pairwise_clip import PairwiseClipScore
-from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric
from pruna.evaluation.metrics.metric_sharpness import SharpnessMetric
from pruna.evaluation.metrics.metric_torch import TorchMetricWrapper
+try:
+ from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric
+except ModuleNotFoundError as e:
+ if e.name != "rapidata":
+ raise
+
__all__ = [
"MetricRegistry",
"TorchMetricWrapper",
@@ -44,5 +49,7 @@
"DinoScore",
"SharpnessMetric",
"AestheticLAION",
- "RapidataMetric",
]
+
+if "RapidataMetric" in globals():
+ __all__.append("RapidataMetric")
diff --git a/src/pruna/evaluation/metrics/metric_rapiddata.py b/src/pruna/evaluation/metrics/metric_rapiddata.py
--- a/src/pruna/evaluation/metrics/metric_rapiddata.py
+++ b/src/pruna/evaluation/metrics/metric_rapiddata.py
@@ -299,6 +299,7 @@
:meth:`retrieve_granular_results` once enough votes have been
collected.
"""
+ self._require_benchmark()
self._require_model()
if not self.media_cache:
raise ValueError("No data accumulated. Call update() before compute().")
@@ -348,7 +349,7 @@
if "ValidationError" in type(e).__name__:
pruna_logger.warning(
"The benchmark hasn't finished yet.\n "
- "Please wait for more votes and try again."
+ "Please wait for more votes and try again.\n "
"Skipping."
)
return None

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
from pruna.evaluation.metrics.metric_memory import DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric
from pruna.evaluation.metrics.metric_model_architecture import TotalMACsMetric, TotalParamsMetric
from pruna.evaluation.metrics.metric_pairwise_clip import PairwiseClipScore
from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric
Unconditional import of optional dependency breaks metrics package
High Severity
RapidataMetric is unconditionally imported in __init__.py, but metric_rapiddata.py has top-level imports of rapidata (an optional dependency under [project.optional-dependencies]). This causes an ImportError for any user who hasn't installed the rapidata extra, breaking the entire pruna.evaluation.metrics package, including unrelated metrics like MetricRegistry, CMMD, DinoScore, etc. Other modules like benchmarks.py also import from this package and would break.
"The benchmark hasn't finished yet.\n "
"Please wait for more votes and try again."
"Skipping."
)
Missing newline between concatenated warning message strings
Low Severity
Adjacent string literals on lines 351–352 are implicitly concatenated without a separator, producing "Please wait for more votes and try again.Skipping.". A \n is likely intended before "Skipping." to match the formatting pattern used elsewhere in this message and throughout the file.
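The bug is easy to reproduce: Python concatenates adjacent string literals at compile time with no separator, so the fix is simply to end the first literal with `\n`. A minimal sketch (strings shortened from the ones in the warning above):

```python
# Adjacent string literals are implicitly concatenated with no separator.
broken = (
    "Please wait for more votes and try again."
    "Skipping."
)

# Ending the first literal with "\n" yields two separate lines, matching
# the formatting pattern used elsewhere in the warning message.
fixed = (
    "Please wait for more votes and try again.\n"
    "Skipping."
)

print(broken)
print(fixed)
```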
"https://app.rapidata.ai/mri/benchmarks/%s",
self.current_benchmarked_model,
self.benchmark.id,
)
Missing benchmark validation in compute() method
Low Severity
The compute() method accesses self.benchmark.evaluate_model(...) and self.benchmark.id without calling _require_benchmark() first. Every other public method that accesses self.benchmark — create_request(), update(), retrieve_results(), retrieve_granular_results() — properly calls _require_benchmark(). This inconsistency means a user who calls compute() without a benchmark configured gets a confusing AttributeError on NoneType instead of the clear ValueError produced by _require_benchmark().
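A minimal sketch of the guard pattern, with an illustrative class whose names mirror the discussion above (the real implementation in pruna may differ):

```python
class MetricSketch:
    """Illustrative stand-in for a metric with an optional benchmark."""

    def __init__(self, benchmark=None):
        self.benchmark = benchmark  # normally set via create_benchmark(...)

    def _require_benchmark(self):
        # Fail fast with an actionable message instead of letting an
        # AttributeError on NoneType surface later.
        if self.benchmark is None:
            raise ValueError("No benchmark configured. Create one before calling compute().")

    def compute(self):
        self._require_benchmark()
        # Without the guard, this line would raise
        # AttributeError: 'NoneType' object has no attribute 'id'.
        return self.benchmark.id
```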
davidberenstein1957 left a comment
Thanks for the PR, I left some comments. I feel that some documentation is required for this too, right? Or is this generated automatically?
"""

media_cache: List[torch.Tensor | PIL.Image.Image | str]
prompt_cache: List[str]
we don't do image editing yet?
hmm maybe I don't understand the question, but rapidata doesn't support image editing models, so the input can only be "prompts", which makes it not possible to support image editing!
So, currently it seems we can pass prompts and media as generations, but I understood they also support image-editing tasks where we'd pass custom formatting, but this is not something we want to support for now?
default_call_type: str = "x_y"
higher_is_better: bool = True
metric_name: str = METRIC_RAPIDATA
runs_on: List[str] = ["cpu", "cuda"]
why do we need cuda?
We don't need it but we support it! Do you think it should only be supported on CPU?
It can be both in this case, or should it require CUDA? I assumed it was a recommendation, but was not sure.
does it make sense to add something like runs_on "any"
oh you are so correct, because by default the runs_on in StatefulMetric is configured to run on everything. So I could actually remove this parameter altogether!
vbench = [
    "vbench-pruna; sys_platform != 'darwin'",
]
rapidata = [
alright, so we already start with the extra separation, nice!
@begumcig I know that we can also do something like:
evaluation = [
    rapidata,
    vbench,
]
could be nice to already start structuring like this, right?
Hmm, I don't think vbench and rapidata have a lot of shared dependencies, so it doesn't really make sense to me to group them together.
I assumed it could be a shared dependency group for all evaluation metrics. You can add extras to extras, but perhaps you'd like to keep them separate?
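For reference, the "extras to extras" idea can be expressed with a self-referential extra in pyproject.toml; this is a hypothetical sketch (the package is assumed to be named pruna), not what the PR currently does:

```toml
[project.optional-dependencies]
rapidata = ["rapidata"]
vbench = ["vbench-pruna; sys_platform != 'darwin'"]
# Umbrella extra that pulls in the package's own extras:
evaluation = ["pruna[rapidata,vbench]"]
```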
self.benchmark.id,
)

def retrieve_results(self, *args, **kwargs) -> CompositeMetricResult | None:
It could be nice to add the functionality to wait till done, WDYT? It feels a bit harsh to fail outright without giving the option to do so.
Hmm, do you think it makes sense to wait 15 mins - 1 hour?
If we do not have any results yet, I am catching the error that rapidata is throwing, suppressing it, and returning None as a result. I thought it to be a middle ground between failing and waiting indefinitely. The user also gets some sort of notification from the platform when the benchmark is finished anyway; does it make sense to wait for a long time?
I'm not sure, but currently, you have to check, then get an error, then wait 15 minutes, and then manually check again.
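A middle ground between failing outright and blocking indefinitely could be a small opt-in polling helper. This is a generic sketch, not part of the rapidata API; `fetch` here is any zero-argument callable (e.g. a lambda wrapping the metric's retrieve_results()) that returns None while votes are still pending:

```python
import time

def wait_for_results(fetch, timeout_s=3600.0, poll_interval_s=60.0):
    """Call fetch() until it returns a non-None result or timeout_s elapses.

    Returns the first non-None result, or None on timeout, so callers can
    still fall back to checking manually later.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_s)
```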
Applied via @cursor push command
simlang left a comment
just a first quick pass, david already covered most of my comments, so waiting for the second iteration!
but already looks super amazing 💅
    The keyword arguments to pass to the metric.
    """

def set_current_context(self, *args, **kwargs) -> None:
yes agree, maybe also with a mixin, as you did for the async metric. some kind of MultiStateMetricMixin or something
----------
call_type : str
    How to extract inputs from (x, gt, outputs). Default is "single".
client : RapidataClient | None
similar to what we spoke about with benchmark and benchmark_id
maybe a single rapidata client, which can be RapidataClient | str | None? and then based on the type do different things
self.benchmark = self.client.mri.create_new_benchmark(name, prompts=data, **kwargs)

def create_request(
agree that the naming is not optimal, as we already have requests which are something different - maybe something which includes async?
self._require_benchmark()
self.benchmark.create_leaderboard(name, instruction, show_prompt, **kwargs)

def set_current_context(self, model_name: str, **kwargs) -> None:
i guess rapidata only supports comparing two models, right?
i had a multistate (better name maybe multimodel, idk) mixin comment before. does it make sense to have a max_number of models in there? so we keep all contexts and if a user tries to add a third model we say nope?
You can actually compare as many models as you like!
self._cleanup_temp_media()

pruna_logger.info(
i think i disagree, but just from the naming.
it is an info, but it probably should be printed by default - so like a warning, hahahaha
(force-pushed 2f66795 to a9f0e40)



Description
This PR introduces a new stateful, asynchronous metric that submits generative model outputs (images, videos, etc.) to the Rapidata platform for human evaluation. Raters compare outputs across models on configurable criteria (e.g. image quality, prompt alignment), and results are retrieved later once enough votes are collected.
New Stuff Implemented
- New AsyncEvaluationMixin: an abstract mixin defining the create_request() / retrieve_results() contract for metrics that delegate evaluation to external services asynchronously.
- New CompositeMetricResult: a result type for metrics that return multiple labeled scores (e.g. one score per model), alongside a MetricResultProtocol to unify the interface between MetricResult and CompositeMetricResult.
- EvaluationAgent updates: the agent now calls set_current_context(model_name=...) on all stateful metrics before each evaluation run, accepts a model_name parameter in evaluate(), and handles None returns from compute() (for async metrics that don't produce immediate results).
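The AsyncEvaluationMixin contract described above can be sketched as follows; method names are taken from this PR description, but the bodies and the toy subclass are illustrative assumptions, not the actual pruna code:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

class AsyncEvaluationMixin(ABC):
    """Contract for metrics that delegate evaluation to an external service."""

    @abstractmethod
    def create_request(self, *args: Any, **kwargs: Any) -> None:
        """Submit accumulated outputs to the external service."""

    @abstractmethod
    def retrieve_results(self, *args: Any, **kwargs: Any) -> Optional[dict]:
        """Fetch results later; returns None while they are not ready yet."""

class ToyAsyncMetric(AsyncEvaluationMixin):
    """Illustrative subclass: 'results' become available after submission."""

    def __init__(self) -> None:
        self._submitted = False

    def create_request(self, *args: Any, **kwargs: Any) -> None:
        self._submitted = True

    def retrieve_results(self, *args: Any, **kwargs: Any) -> Optional[dict]:
        return {"model_a": 0.7, "model_b": 0.3} if self._submitted else None
```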
Details
RapidataMetric lifecycle:
Other changes:
- Added set_current_context() as a no-op hook on StatefulMetric so the agent can uniformly notify all metrics of the current model without changing the structure of update() and compute()
- Typed EvaluationAgent result lists as MetricResultProtocol instead of concrete MetricResult
- Added rapidata as an optional dependency extra in pyproject.toml
- Updated CI to install the rapidata extra
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
Usage