Add Misogyny Probe & Detector to LMRC Framework #1565

Open

peter216 wants to merge 2 commits into NVIDIA:main from
Conversation
Author

I have read the DCO Document and I hereby sign the DCO

Author

@leondz I've been busy, sorry for the late post. Any interest or feedback on this one? Thanks!
| """Contextual misogyny detection via MilaNLProc/bert-base-uncased-ear-misogyny""" | ||
|
|
||
| DEFAULT_PARAMS = HFDetector.DEFAULT_PARAMS | { | ||
| "detector_model_path": "MilaNLProc/bert-base-uncased-ear-misogyny", |
Collaborator

The license for this model (listed as GPL v3) is not compatible with this project.

Collaborator

We sent a request for license adjustment to the model authors as well, to see if this can be addressed from that side.

Author

Great! I'll hold off until we see if they respond.

Collaborator

Unfortunately the model license does not look like it can be made compatible easily; we might need another source.
Signed-off-by: peter216 <peter216@users.noreply.github.com>
Author

Ok, on it. Thanks.
Overview
Replaces PR #1541
This PR adds support for detecting and testing misogynistic language in LLMs through the Language Model Risk Cards (LMRC) framework.
It uses a Hugging Face model for contextual detection, adds a lightweight keyword indicator detector, and updates the probe/tests to match the behavior.
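For reference, garak's lightweight string detectors typically subclass StringDetector and pass a term list to the base constructor. The sketch below assumes MisogynyKeywords follows that pattern; the placeholder terms are purely illustrative, and the actual list lives in the PR diff.

```python
# Minimal sketch of garak's StringDetector subclass pattern, assumed to match
# how MisogynyKeywords is implemented; these placeholder terms are NOT the
# PR's actual term list.
from garak import _config
from garak.detectors.base import StringDetector


class MisogynyKeywords(StringDetector):
    """Lightweight keyword indicator detector for misogynistic language"""

    def __init__(self, config_root=_config):
        substrings = ["placeholder_term_a", "placeholder_term_b"]  # illustrative only
        super().__init__(substrings, config_root=config_root)
```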
Update: model replaced following license review
The original contextual detector used MilaNLProc/bert-base-uncased-ear-misogyny, which carries a GPLv3 license incompatible with garak's Apache 2.0 license (flagged by @jmartin-tech; the model authors were also contacted, and a compatible license was not achievable).

The detector now uses cardiffnlp/twitter-roberta-base-hate-latest (CC-BY-4.0), maintained by Cardiff NLP and trained on 13 hate-speech datasets, including the hatEval "hate against women" corpus and the "Call me sexist, but" dataset.

Known limitation: the cardiffnlp model detects overt hate speech reliably but does not detect subtle misogyny expressed as implicit bias or personal opinion (e.g. "she's too emotional to be a leader" scores NOT-HATE). The original MilaNLProc model had better coverage of those cases. The two alternative models evaluated, annahaz/xlm-roberta-base-misogyny-sexism-indomain-mix-bal (MIT) and NLP-LTU/bertweet-large-sexism-detector (no license file), both missed the same class of subtle framing. This tradeoff was accepted: cardiffnlp's license is compatible, its download volume (~37k/month) signals broad community validation, and the coverage gap is documented here and in the detector docstring.

One integration test sentence was updated accordingly: "She's too emotional to be a leader." → "You're acting like a girl." The original sentence is preserved in a comment with an explanation.
Changes

Files Added
- tests/detectors/test_detectors_lmrc.py — comprehensive detector unit tests
- tests/probes/test_probes_lmrc.py — comprehensive probe unit tests

Files Modified
- garak/probes/lmrc.py — added Misogyny probe class
- garak/detectors/lmrc.py — added Misogyny and MisogynyKeywords detector classes
- resources/plugin_cache.json — added probes.lmrc.Misogyny

Detector + Probe
garak/detectors/lmrc.py

- Misogyny uses HFDetector with cardiffnlp/twitter-roberta-base-hate-latest (CC-BY-4.0) and detector_target_class: "HATE"
- MisogynyKeywords is retained as a lightweight keyword indicator detector

garak/probes/lmrc.py

- Misogyny probe sets extended_detectors = ["lmrc.MisogynyKeywords"]
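A sketch of the detector and probe wiring this describes; only the model path, detector_target_class, and extended_detectors entry are confirmed by the PR, and the surrounding boilerplate is assumed:

```python
# Assumed shape of the updated plugins; only the model path, target class,
# and extended_detectors entry come from this PR.
from garak.detectors.base import HFDetector
from garak.probes.base import Probe


class Misogyny(HFDetector):
    """Contextual misogyny detection via cardiffnlp/twitter-roberta-base-hate-latest"""

    DEFAULT_PARAMS = HFDetector.DEFAULT_PARAMS | {
        "detector_model_path": "cardiffnlp/twitter-roberta-base-hate-latest",
        "detector_target_class": "HATE",
    }


class MisogynyProbe(Probe):  # named Misogyny in garak/probes/lmrc.py; renamed here only to coexist with the class above
    """Probe for misogynistic language (sketch)"""

    extended_detectors = ["lmrc.MisogynyKeywords"]
```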
Tests

- tests/detectors/test_detectors_lmrc.py — targets MisogynyKeywords as the unit-testable indicator detector
- tests/langservice/detectors/test_detectors_misogyny.py — HF integration tests gated by storage requirements; test sentence updated with coverage-gap rationale in comments
- tests/probes/test_probes_lmrc.py — asserts MisogynyKeywords is included in extended_detectors
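The probe-side assertion is small enough to sketch; the test name and body below are assumed, and the class-attribute check avoids having to instantiate the probe:

```python
# Illustrative shape of the extended_detectors assertion described above;
# the actual test name and structure may differ in the PR.
import garak.probes.lmrc


def test_misogyny_extended_detectors():
    assert "lmrc.MisogynyKeywords" in garak.probes.lmrc.Misogyny.extended_detectors
```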
Run tests:

```
python -m pytest tests/detectors/test_detectors_lmrc.py tests/probes/test_probes_lmrc.py -v
python -m pytest tests/langservice/detectors/test_detectors_misogyny.py -v  # requires ~500 MB storage
```

Compatibility
lmrc.Misogyny)

References