AI Evaluation
Methodology-led AI safety evaluation: real evals, calibration, anti-overclaim discipline.
prompt-injection detection: OOD generalization study
releasedA companion study to the prompt-injection PoC with a different question: do detectors generalize out-of-distribution? Detectors trained on direct-injection attacks are evaluated against unseen attack families (indirect, optimization-based, context-poisoning). The artifact is paper-shaped — an IMRAD write-up plus a narrative — with CI-checked reproduction and deployed docs. It is a sibling of the existing prompt-injection detector, not a replacement.
Stack: Python · PyTorch · DeBERTa · Quarto · scikit-learn
What's next
Findings feed back into the eval-toolkit harness as out-of-distribution robustness test patterns.
eval-toolkit
in progressPre-v1 evaluation library for binary-classification AI/ML systems. Encodes the methodology used in prompt-injection-detector as reusable components: baseline ladders, bootstrap confidence intervals on metrics and lift, calibration tooling, and explicit stop-gates that fire before adding model complexity.
Stack: Python · scikit-learn · numpy
What's next
Hardening toward a v1.0 cut (locked API + PyPI publish) behind an explicit stop-gate.