feat: initial implementation for rapidata #581

Open
begumcig wants to merge 4 commits into main from feat/rapiddata-metric

Conversation


@begumcig begumcig commented Mar 19, 2026

Description

This PR introduces a new stateful, asynchronous metric that submits generative model outputs (images, videos, etc.) to the Rapidata platform for human evaluation. Raters compare outputs across models on configurable criteria (e.g. image quality, prompt alignment), and results are retrieved later once enough votes are collected.

New Features Implemented

  • New AsyncEvaluationMixin: an abstract mixin defining the create_request() / retrieve_results() contract for metrics that delegate evaluation to external services asynchronously.
  • New CompositeMetricResult: a result type for metrics that return multiple labeled scores (e.g. one score per model), alongside a MetricResultProtocol to unify the interface between MetricResult and CompositeMetricResult.
  • EvaluationAgent updates: the agent now calls set_current_context(model_name=...) on all stateful metrics before each evaluation run, accepts a model_name parameter in evaluate(), and handles None returns from compute() (for async metrics that don't produce immediate results).
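The pieces above can be sketched roughly as follows. This is a simplified sketch: the names AsyncEvaluationMixin, CompositeMetricResult, create_request(), and retrieve_results() come from the PR, but the exact signatures and attributes here are assumptions, not the PR's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AsyncEvaluationMixin(ABC):
    """Contract for metrics that hand evaluation off to an external service."""

    @abstractmethod
    def create_request(self, *args: Any, **kwargs: Any) -> None:
        """Register an evaluation task (e.g. a leaderboard) with the service."""

    @abstractmethod
    def retrieve_results(self, *args: Any, **kwargs: Any) -> Any:
        """Fetch results once the external evaluation has completed."""


class CompositeMetricResult:
    """Result type holding one labeled score per model."""

    def __init__(self, name: str, scores: Dict[str, float]) -> None:
        self.name = name
        self.scores = scores
```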

Details

RapidataMetric lifecycle:

  • Authenticate via client ID/secret or interactive browser login
  • Create a benchmark from a prompt list or PrunaDataModule (or attach an existing one via from_benchmark() / from_benchmark_id())
  • Create one or more leaderboards, each with a different evaluation instruction
  • For each model: accumulate outputs via update(), submit via compute()
  • Retrieve aggregated or per-leaderboard results once human evaluation completes
  • Media handling supports str (URLs/paths), PIL.Image, and torch.Tensor: tensors and PIL images are saved to a temp directory for upload, then cleaned up.
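The media-handling step in the last bullet can be sketched like this. The helper name and the save logic are hypothetical, not the PR's actual implementation; the torch.Tensor branch is elided to keep the sketch dependency-free.

```python
import os
import uuid


def to_uploadable_path(media, tmp_dir: str) -> str:
    """Normalize one output (URL/path string, PIL image, or tensor) to a path.

    Strings pass through unchanged; anything with a .save() method (PIL
    images) is written to tmp_dir for upload and cleaned up later by the
    caller.
    """
    if isinstance(media, str):  # URL or local path: upload as-is
        return media
    path = os.path.join(tmp_dir, f"media_{uuid.uuid4().hex}.png")
    if hasattr(media, "save"):  # PIL.Image duck-typing
        media.save(path)
        return path
    # a torch.Tensor would be converted and saved here, e.g. via
    # torchvision.utils.save_image(media, path); elided in this sketch
    raise TypeError(f"unsupported media type: {type(media)!r}")
```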

Other changes:

  • Added set_current_context() as a no-op hook on StatefulMetric so the agent can uniformly notify all metrics of the current model without changing the structure of update() and compute()
  • Typed EvaluationAgent result lists as MetricResultProtocol instead of concrete MetricResult
  • Added rapidata as an optional dependency extra in pyproject.toml
  • Updated CI to install the rapidata extra
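The no-op hook pattern looks roughly like this (a sketch; ModelAwareMetric is a hypothetical subclass, not part of the PR):

```python
class StatefulMetric:
    """Simplified base class: the hook defaults to a no-op, so existing
    stateful metrics need no changes."""

    def set_current_context(self, *args, **kwargs) -> None:
        return None  # no-op by default


class ModelAwareMetric(StatefulMetric):
    """Hypothetical metric that tracks which model produced its inputs."""

    def __init__(self) -> None:
        self.current_model = None

    def set_current_context(self, model_name: str, **kwargs) -> None:
        self.current_model = model_name
```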

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

Usage

from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric

# Initialization
rdm = RapidataMetric()
rdm.create_benchmark("my_bench_standalone", prompt_to_test)
rdm.create_request("Quality", instruction="Which video looks better?")

# Use with EvaluationAgent
agent = EvaluationAgent(
    request=[rdm],
    datamodule=datamodule,
)
results_a = agent.evaluate(model_a, model_name="model_a")
results_b = agent.evaluate(model_b, model_name="model_b")

# Use standalone
rdm.set_current_context("model_a")
rdm.update(prompts, model_a_gt, model_a_outputs)
rdm.compute()

rdm.set_current_context("model_b")
rdm.update(prompts, model_b_gt, model_b_outputs)
rdm.compute()

# Retrieve rankings once human evaluation has completed
rdm.retrieve_results()

@begumcig begumcig force-pushed the feat/rapiddata-metric branch 2 times, most recently from c5e302a to 312d056 on March 20, 2026 10:48
@begumcig begumcig force-pushed the feat/rapiddata-metric branch from 312d056 to aa2b198 on March 20, 2026 11:01
@begumcig begumcig marked this pull request as ready for review on March 20, 2026 11:03

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Unconditional import of optional dependency breaks metrics package
    • I wrapped the RapidataMetric import in metrics/__init__.py with a ModuleNotFoundError guard for missing rapidata and only export it when available.
  • ✅ Fixed: Missing newline between concatenated warning message strings
    • I added the missing newline separator in the warning string so the message now renders as separate sentences.
  • ✅ Fixed: Missing benchmark validation in compute() method
    • I added self._require_benchmark() at the start of compute() so missing benchmark state raises the intended ValueError.


Or push these changes by commenting:

@cursor push fc62a066b6
Preview (fc62a066b6)
diff --git a/src/pruna/evaluation/metrics/__init__.py b/src/pruna/evaluation/metrics/__init__.py
--- a/src/pruna/evaluation/metrics/__init__.py
+++ b/src/pruna/evaluation/metrics/__init__.py
@@ -22,10 +22,15 @@
 from pruna.evaluation.metrics.metric_memory import DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric
 from pruna.evaluation.metrics.metric_model_architecture import TotalMACsMetric, TotalParamsMetric
 from pruna.evaluation.metrics.metric_pairwise_clip import PairwiseClipScore
-from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric
 from pruna.evaluation.metrics.metric_sharpness import SharpnessMetric
 from pruna.evaluation.metrics.metric_torch import TorchMetricWrapper
 
+try:
+    from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric
+except ModuleNotFoundError as e:
+    if e.name != "rapidata":
+        raise
+
 __all__ = [
     "MetricRegistry",
     "TorchMetricWrapper",
@@ -44,5 +49,7 @@
     "DinoScore",
     "SharpnessMetric",
     "AestheticLAION",
-    "RapidataMetric",
 ]
+
+if "RapidataMetric" in globals():
+    __all__.append("RapidataMetric")

diff --git a/src/pruna/evaluation/metrics/metric_rapiddata.py b/src/pruna/evaluation/metrics/metric_rapiddata.py
--- a/src/pruna/evaluation/metrics/metric_rapiddata.py
+++ b/src/pruna/evaluation/metrics/metric_rapiddata.py
@@ -299,6 +299,7 @@
         :meth:`retrieve_granular_results` once enough votes have been
         collected.
         """
+        self._require_benchmark()
         self._require_model()
         if not self.media_cache:
             raise ValueError("No data accumulated. Call update() before compute().")
@@ -348,7 +349,7 @@
             if "ValidationError" in type(e).__name__:
                 pruna_logger.warning(
                     "The benchmark hasn't finished yet.\n "
-                    "Please wait for more votes and try again."
+                    "Please wait for more votes and try again.\n "
                     "Skipping."
                 )
                 return None


Comment @cursor review or bugbot run to trigger another review on this PR

from pruna.evaluation.metrics.metric_memory import DiskMemoryMetric, InferenceMemoryMetric, TrainingMemoryMetric
from pruna.evaluation.metrics.metric_model_architecture import TotalMACsMetric, TotalParamsMetric
from pruna.evaluation.metrics.metric_pairwise_clip import PairwiseClipScore
from pruna.evaluation.metrics.metric_rapiddata import RapidataMetric


Unconditional import of optional dependency breaks metrics package

High Severity

RapidataMetric is unconditionally imported in __init__.py, but metric_rapidata.py has top-level imports of rapidata (an optional dependency under [project.optional-dependencies]). This causes an ImportError for any user who hasn't installed the rapidata extra, breaking the entire pruna.evaluation.metrics package — including unrelated metrics like MetricRegistry, CMMD, DinoScore, etc. Other modules like benchmarks.py also import from this package and would break.


"The benchmark hasn't finished yet.\n "
"Please wait for more votes and try again."
"Skipping."
)

Missing newline between concatenated warning message strings

Low Severity

Adjacent string literals on lines 351–352 are implicitly concatenated without a separator, producing "Please wait for more votes and try again.Skipping.". A \n is likely intended before "Skipping." to match the formatting pattern used elsewhere in this message and throughout the file.
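The bug is easy to reproduce in isolation: Python concatenates adjacent string literals at compile time with no separator.

```python
# Adjacent string literals fuse with no separator, producing the run-on
# message described above; adding "\n" restores the intended line break.
broken = (
    "Please wait for more votes and try again."
    "Skipping."
)
fixed = (
    "Please wait for more votes and try again.\n"
    "Skipping."
)
assert broken == "Please wait for more votes and try again.Skipping."
assert fixed == "Please wait for more votes and try again.\nSkipping."
```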


"https://app.rapidata.ai/mri/benchmarks/%s",
self.current_benchmarked_model,
self.benchmark.id,
)

Missing benchmark validation in compute() method

Low Severity

The compute() method accesses self.benchmark.evaluate_model(...) and self.benchmark.id without calling _require_benchmark() first. Every other public method that accesses self.benchmark (create_request(), update(), retrieve_results(), retrieve_granular_results()) properly calls _require_benchmark(). This inconsistency means a user who calls compute() without a benchmark configured gets a confusing AttributeError on NoneType instead of the clear ValueError produced by _require_benchmark().
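The fail-fast guard pattern in question looks roughly like this (a minimal sketch; the method names match the PR, but the bodies are illustrative):

```python
class BenchmarkBackedMetric:
    """Sketch of the guard pattern: validate state up front so the caller
    gets a clear error instead of an AttributeError on None."""

    def __init__(self) -> None:
        self.benchmark = None

    def _require_benchmark(self) -> None:
        if self.benchmark is None:
            raise ValueError(
                "No benchmark configured. Call create_benchmark() or "
                "from_benchmark() before this method."
            )

    def compute(self):
        self._require_benchmark()  # clear ValueError, not AttributeError
        return self.benchmark.evaluate_model()
```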


Member

@davidberenstein1957 davidberenstein1957 left a comment


Thanks for the PR, left some comments. I feel that some documentation is required for this too, right? Or is this generated automatically?

"""

media_cache: List[torch.Tensor | PIL.Image.Image | str]
prompt_cache: List[str]


we don't do image editing yet?

Member Author


hmm maybe I don't understand the question, but rapidata doesn't support image editing models, so the input can only be "prompts", which makes it not possible to support image editing!

Member

@davidberenstein1957 davidberenstein1957 Mar 20, 2026


So, currently it seems we can pass prompts and media as generations, but I understood they also support image-editing tasks where we'd pass custom formatting, but this is not something we want to support for now?

default_call_type: str = "x_y"
higher_is_better: bool = True
metric_name: str = METRIC_RAPIDATA
runs_on: List[str] = ["cpu", "cuda"]


why do we need cuda?

Member Author


We don't need it but we support it! do you think it should be only supported in cpu?


It can be both in this case, or should it require CUDA? I assumed it was a recommendation, but was not sure.

Member


does it make sense to add something like runs_on "any"

Member Author


oh you are so correct, because by default the runs_on in StatefulMetric is configured to run on everything. So I could remove this parameter altogether actually!

vbench = [
"vbench-pruna; sys_platform != 'darwin'",
]
rapidata = [


alright, so we already start with the extra separation, nice!
@begumcig I know that we can also do something like.

evaluation = [
    "pruna[rapidata]",
    "pruna[vbench]",
]
could be nice to already start structuring like this, right?

Member Author


Hmm, I don't think vbench and rapidata have a lot of shared dependencies, so doesn't really make sense to me to group them together


I assumed it could be a shared dependency group for all evaluation metrics. You can add extras to extras, but perhaps you'd like to keep them separate?

self.benchmark.id,
)

def retrieve_results(self, *args, **kwargs) -> CompositeMetricResult | None:


It could be nice to add the functionality to wait till done, WDYT? It feels a bit harsh to fail outright without giving the option to do so.

Member Author


Hmm, do you think it makes sense to wait 15 mins - 1 hour?
If we do not have any results yet, I am catching the error that rapidata is throwing, suppressing it, and returning None as a result. I thought it to be a middle ground between failing and waiting indefinitely. I think the user also gets some sort of notification from the platform when the benchmark is finished anyway; does it make sense to wait for a long time?


I'm not sure, but currently, you have to check, then get an error, then wait 15 minutes, and then manually check again.
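A middle ground could be an optional blocking wrapper that polls on the user's behalf. This is a hypothetical sketch, not part of the PR: it treats retrieve_results() returning None as "not finished yet", matching the error-suppression behavior described above.

```python
import time


def wait_for_results(metric, timeout_s: float = 3600.0, poll_every_s: float = 900.0):
    """Poll metric.retrieve_results() until it returns a result or timeout_s
    elapses; returns None if the benchmark never finished in time."""
    deadline = time.monotonic() + timeout_s
    while True:
        results = metric.retrieve_results()
        if results is not None:
            return results
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        time.sleep(min(poll_every_s, remaining))
```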

@begumcig
Member Author

@cursor push fc62a06

Member

@simlang simlang left a comment


just a first quick pass, david already covered most of my comments, so waiting for the second iteration!
but already looks super amazing 💅

The keyword arguments to pass to the metric.
"""

def set_current_context(self, *args, **kwargs) -> None:
Member


yes agree, maybe also with a mixin, as you did for the async metric. some kind of MultiStateMetricMixin or something

----------
call_type : str
How to extract inputs from (x, gt, outputs). Default is "single".
client : RapidataClient | None
Member


similar to what we spoke about with benchmark and benchmark_id
maybe a single rapidata client, which can be RapidataClient | str | None? and then based on the type do different things
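The suggested single-parameter dispatch could look roughly like this. Everything here is a hypothetical sketch: RapidataClientStub and both helpers are placeholders, not real rapidata APIs.

```python
class RapidataClientStub:
    """Stand-in for rapidata's client class in this sketch."""


def _interactive_login():
    return RapidataClientStub()  # placeholder for a browser login flow


def _client_from_id(client_id: str):
    return RapidataClientStub()  # placeholder for id/secret authentication


def resolve_client(client):
    """Dispatch on the type of a single `client` argument:
    None -> interactive login, str -> authenticate by id,
    anything else -> assume it is already a ready client."""
    if client is None:
        return _interactive_login()
    if isinstance(client, str):
        return _client_from_id(client)
    return client
```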

default_call_type: str = "x_y"
higher_is_better: bool = True
metric_name: str = METRIC_RAPIDATA
runs_on: List[str] = ["cpu", "cuda"]
Member


does it make sense to add something like runs_on "any"


self.benchmark = self.client.mri.create_new_benchmark(name, prompts=data, **kwargs)

def create_request(
Member


agree that the naming is not optimal, as we already have requests which are something different - maybe something which includes async?

self._require_benchmark()
self.benchmark.create_leaderboard(name, instruction, show_prompt, **kwargs)

def set_current_context(self, model_name: str, **kwargs) -> None:
Member


i guess rapidata only supports comparing two models, right?
i had a multistate (better name maybe multimodel, idk) mixin comment before. does it make sense to have a max_number of models in there? so we keep all contexts and if a user tries to add a third model we say nope?

Member Author


You can actually compare as many models as you like!


self._cleanup_temp_media()

pruna_logger.info(
Member


i think i disagree, but just from the naming.
it is an info, but it probably should be printed by default - so like a warning, hahahaha

@begumcig begumcig force-pushed the feat/rapiddata-metric branch from 2f66795 to a9f0e40 Compare March 20, 2026 16:33
