Description
Add GenEval benchmark with 6 subcategories for compositional evaluation.
Details
- Source: GitHub JSON from `djghosh13/geneval` repo
- Subcategories: single_object, two_object, counting, colors, position, color_attr
- Collate: `prompt_with_auxiliaries_collate`
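The details above amount to loading the repo's metadata and filtering by subcategory. A minimal sketch, assuming the records are JSONL with `tag` and `prompt` fields (the exact field names are an assumption based on the geneval repo layout, not specified by this ticket):

```python
import json

# Hypothetical sample records mirroring the assumed shape of GenEval's
# metadata; in practice these would be fetched from djghosh13/geneval.
RAW_JSONL = """\
{"tag": "single_object", "prompt": "a photo of a bench"}
{"tag": "two_object", "prompt": "a photo of a bench and a dog"}
{"tag": "counting", "prompt": "a photo of three cats"}
"""

SUBCATEGORIES = {
    "single_object", "two_object", "counting",
    "colors", "position", "color_attr",
}

def load_geneval_records(jsonl_text, category=None):
    """Parse JSONL records, optionally keeping a single subcategory."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line]
    if category is not None:
        if category not in SUBCATEGORIES:
            raise ValueError(f"unknown GenEval subcategory: {category!r}")
        records = [r for r in records if r["tag"] == category]
    return records

print(len(load_geneval_records(RAW_JSONL)))  # 3
print(load_geneval_records(RAW_JSONL, category="counting"))
```

Validating `category` against the fixed subcategory set up front gives a clear error instead of silently returning an empty dataset.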
Implementation
- Add `setup_geneval_dataset` in `src/pruna/data/datasets/prompt.py`
- Support `category` param for filtering subcategories
- Register in `base_datasets`
- Add `BenchmarkInfo` entry with metrics: `["qa_accuracy"]`, subsets list
- Auxiliaries should include `questions` list and `tag` for evaluation
- Add test
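The setup function and collate described above could be sketched as follows. Only the names `setup_geneval_dataset` and `prompt_with_auxiliaries_collate` come from this ticket; the sample shape, the `questions` field on raw records, and the batch dict layout are assumptions:

```python
def setup_geneval_dataset(records, category=None):
    """Turn raw GenEval records into (prompt, auxiliaries) samples.

    `records` is a list of dicts assumed to carry `tag`, `prompt`, and
    per-prompt evaluation `questions`.
    """
    if category is not None:
        records = [r for r in records if r["tag"] == category]
    return [
        (r["prompt"], {"questions": r.get("questions", []), "tag": r["tag"]})
        for r in records
    ]

def prompt_with_auxiliaries_collate(batch):
    """Collate (prompt, auxiliaries) pairs into a batch dict."""
    prompts = [prompt for prompt, _ in batch]
    auxiliaries = [aux for _, aux in batch]
    return {"prompts": prompts, "auxiliaries": auxiliaries}

# Illustrative records (hypothetical data, not from the geneval repo).
records = [
    {"tag": "counting", "prompt": "a photo of three cats",
     "questions": ["How many cats are there?"]},
    {"tag": "colors", "prompt": "a photo of a red car",
     "questions": ["What color is the car?"]},
]
batch = prompt_with_auxiliaries_collate(setup_geneval_dataset(records))
print(batch["auxiliaries"][0]["tag"])  # counting
```

Keeping `questions` and `tag` together in the auxiliaries dict lets the QA-accuracy metric consume both without a second lookup into the raw metadata.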
Acceptance Criteria
- `PrunaDataModule.from_string("GenEval")` works (all subcategories)
- `PrunaDataModule.from_string("GenEval", category="counting")` works
- Samples expose `questions` and `tag` fields
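A test for these criteria could follow the shape below. The real `PrunaDataModule` lives in pruna; the stub class, registry mechanism, and sample data here are stand-ins so the check is self-contained, not a description of pruna's actual internals:

```python
class PrunaDataModule:
    """Minimal stand-in mirroring only the interface under test."""
    _registry = {}

    @classmethod
    def register(cls, name, setup_fn):
        cls._registry[name] = setup_fn

    @classmethod
    def from_string(cls, name, **kwargs):
        return cls._registry[name](**kwargs)

# Hypothetical samples; real ones come from the geneval metadata.
SAMPLES = [
    {"tag": "counting", "prompt": "a photo of three cats",
     "questions": ["How many cats are there?"]},
    {"tag": "colors", "prompt": "a photo of a red car",
     "questions": ["What color is the car?"]},
]

def setup_geneval(category=None):
    if category is None:
        return SAMPLES
    return [s for s in SAMPLES if s["tag"] == category]

PrunaDataModule.register("GenEval", setup_geneval)

# Acceptance checks from the criteria above.
assert len(PrunaDataModule.from_string("GenEval")) == 2
only = PrunaDataModule.from_string("GenEval", category="counting")
assert [s["tag"] for s in only] == ["counting"]
assert "questions" in only[0] and "tag" in only[0]
```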