f8de803 to e996238
Pull request overview
Adds a new remote seed-dataset provider for the HuggingFace declare-lab/HarmfulQA dataset so it can be fetched and used as SeedPrompt entries within PyRIT’s dataset discovery/registration system.
Changes:
- Introduced `_HarmfulQADataset` remote loader that fetches HarmfulQA from HuggingFace and converts rows into `SeedPrompt`s.
- Exported the new loader from `pyrit.datasets.seed_datasets.remote` to trigger auto-registration.
- Added unit tests validating basic fetch + conversion behavior and `dataset_name`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyrit/datasets/seed_datasets/remote/harmful_qa_dataset.py | New remote dataset loader implementation for HarmfulQA -> SeedDataset/SeedPrompt conversion. |
| `pyrit/datasets/seed_datasets/remote/__init__.py` | Re-export/import the new loader so it’s discoverable/registered alongside other remote loaders. |
| tests/unit/datasets/test_harmful_qa_dataset.py | Unit tests for fetching/conversion and dataset_name behavior. |

    dataset_name=self.dataset_name,
    harm_categories=[item["topic"]],
    description=description,
    source="https://huggingface.co/datasets/declare-lab/HarmfulQA",
    authors=authors,

The source field is hard-coded to the declare-lab/HarmfulQA URL, but this loader exposes dataset_name as a constructor parameter (self.hf_dataset_name). If callers override dataset_name, the returned SeedPrompts will still claim the default source, which is misleading. Consider deriving source (and the HF URL) from self.hf_dataset_name, or accept an explicit source_url parameter.
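
One way to act on this comment is sketched below. It assumes the constructor argument is a Hugging Face repo id of the form `org/name`; the helper `build_source_url` and the constant `DEFAULT_HF_DATASET` are hypothetical names, not part of the PR.

```python
DEFAULT_HF_DATASET = "declare-lab/HarmfulQA"  # default repo id named in the PR


def build_source_url(hf_dataset_name: str = DEFAULT_HF_DATASET) -> str:
    """Return the Hugging Face dataset page URL for a repo id (hypothetical helper)."""
    # Strip stray slashes so "org/name/" and "/org/name" yield the same URL.
    return f"https://huggingface.co/datasets/{hf_dataset_name.strip('/')}"
```

The loader could then pass `source=build_source_url(self.hf_dataset_name)` so that an overridden dataset name and the reported source stay in sync.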

    source="https://huggingface.co/datasets/declare-lab/HarmfulQA",
    authors=authors,
    groups=["DeCLaRe Lab, Singapore University of Technology and Design"],
    metadata={"subtopic": item.get("subtopic", "")},

groups=[item.get('subtopic', '')] will produce [''] when subtopic is missing/empty, which creates an empty-named group downstream. Use a conditional (e.g., [] when falsy) so groups is either a meaningful label or omitted.

Suggested change:

    - metadata={"subtopic": item.get("subtopic", "")},
    + metadata=(
    +     {"subtopic": item["subtopic"]}
    +     if item.get("subtopic")
    +     else {}
    + ),

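
The same guard can be factored into small helpers, shown here as a sketch. Both function names are hypothetical, and `item` is assumed to be a plain dict for one dataset row. The first helper mirrors the suggested change for `metadata`; the second applies the same idea to list-valued fields such as `groups`.

```python
from typing import Any, Dict, List, Optional


def subtopic_metadata(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return {'subtopic': ...} only when the row has a non-empty subtopic."""
    subtopic = item.get("subtopic")
    return {"subtopic": subtopic} if subtopic else {}


def non_empty_labels(*labels: Optional[str]) -> List[str]:
    """Drop None and empty strings so no empty-named label is emitted."""
    return [label for label in labels if label]
```

With these, a row lacking a subtopic yields `{}` and `[]` rather than `{"subtopic": ""}` and `[""]`.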
Add remote dataset loader for HarmfulQA (declare-lab/HarmfulQA), containing ~2k harmful questions organized by academic topic and subtopic for testing LLM susceptibility to harm-inducing question-answering.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
e996238 to d441180