AI Evaluation

Methodology-led AI safety evaluation: real evals, calibration, anti-overclaim discipline.

prompt-injection detection: OOD generalization study

released

A companion study to the prompt-injection PoC that asks a different question: do detectors generalize out-of-distribution (OOD)? Detectors trained on direct-injection attacks are evaluated against attack families unseen during training (indirect, optimization-based, context-poisoning). The artifact is paper-shaped: an IMRAD write-up plus a narrative, with CI-checked reproduction and deployed docs. It is a sibling of the existing prompt-injection detector, not a replacement.

Stack: Python · PyTorch · DeBERTa · Quarto · scikit-learn
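
The evaluation shape described above (train on one attack family, score on families never seen in training) can be sketched roughly as follows. This is a minimal illustration, not the study's code: the 2-D Gaussian features, the `FAMILY_SHIFTS` table, the `make_family` helper, and the logistic-regression detector are all assumptions standing in for real prompts and a DeBERTa-based detector. No number it prints reflects the study's findings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for attack families: each family shifts the
# attack class in a different direction of a toy 2-D feature space.
FAMILY_SHIFTS = {
    "direct": (3.0, 0.0),
    "indirect": (2.0, 1.0),
    "optimization-based": (0.0, 3.0),
    "context-poisoning": (1.0, -2.0),
}

def make_family(shift, n=500):
    """Draw n benign and n attack samples for one synthetic family."""
    benign = rng.normal(0.0, 1.0, size=(n, 2))
    attack = rng.normal(0.0, 1.0, size=(n, 2)) + np.asarray(shift)
    X = np.vstack([benign, attack])
    y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    return X, y

# Train only on the direct-injection family.
X_train, y_train = make_family(FAMILY_SHIFTS["direct"])
detector = LogisticRegression().fit(X_train, y_train)

# Score a fresh draw from every family; only "direct" is in-distribution.
auroc_by_family = {}
for family, shift in FAMILY_SHIFTS.items():
    X_test, y_test = make_family(shift)
    scores = detector.predict_proba(X_test)[:, 1]
    auroc_by_family[family] = roc_auc_score(y_test, scores)

for family, auroc in auroc_by_family.items():
    tag = "in-distribution" if family == "direct" else "OOD"
    print(f"{family:20s} {tag:16s} AUROC={auroc:.3f}")
```

Reporting the in-distribution score next to each unseen family keeps the generalization gap visible per family instead of folding it into a single pooled number.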

What's next

Findings feed back into the eval-toolkit harness as out-of-distribution robustness test patterns.

eval-toolkit

in progress

Pre-v1 evaluation library for binary-classification AI/ML systems. Encodes the methodology used in prompt-injection-detector as reusable components: baseline ladders, bootstrap confidence intervals on metrics and lift, calibration tooling, and explicit stop-gates that fire before adding model complexity.

Stack: Python · scikit-learn · numpy
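
To make "bootstrap confidence intervals on metrics and lift" concrete, here is a small sketch of a paired percentile bootstrap. It uses only numpy and scikit-learn and is not eval-toolkit's API: `bootstrap_lift_ci`, `stop_gate_fires`, and the `min_lift` threshold are hypothetical names, and the gate rule shown (block added complexity unless the lift interval clears a threshold) is one assumed reading of a stop-gate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_lift_ci(y_true, model_scores, baseline_scores,
                      metric=roc_auc_score, n_boot=1000, alpha=0.05, seed=0):
    """Paired percentile bootstrap for a metric and its lift over a baseline."""
    y_true = np.asarray(y_true)
    model_scores = np.asarray(model_scores)
    baseline_scores = np.asarray(baseline_scores)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    rows = []
    for _ in range(n_boot):
        # Same resampled indices for both systems, so the lift is paired.
        idx = rng.integers(0, n, size=n)
        if np.unique(y_true[idx]).size < 2:
            continue  # metric is undefined on a single-class resample
        m = metric(y_true[idx], model_scores[idx])
        b = metric(y_true[idx], baseline_scores[idx])
        rows.append((m, b, m - b))
    rows = np.array(rows)
    lo, hi = np.percentile(rows, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    point_m = metric(y_true, model_scores)
    point_b = metric(y_true, baseline_scores)
    names = ("model", "baseline", "lift")
    points = (point_m, point_b, point_m - point_b)
    return {name: {"point": float(p), "ci": (float(l), float(h))}
            for name, p, l, h in zip(names, points, lo, hi)}

def stop_gate_fires(lift_ci, min_lift=0.0):
    """Hypothetical gate: fire unless the lift CI lower bound exceeds min_lift."""
    return lift_ci[0] <= min_lift

# Synthetic demo data (illustrative only): an informative model vs. a random baseline.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=400)
model_scores = y + rng.normal(0.0, 0.7, size=400)
baseline_scores = rng.normal(0.0, 1.0, size=400)

report = bootstrap_lift_ci(y, model_scores, baseline_scores)
gate = stop_gate_fires(report["lift"]["ci"])
print(report["lift"], "gate fires:", gate)
```

Resampling the same indices for the model and the baseline keeps the comparison paired, which usually gives a tighter interval on the lift than bootstrapping each system independently.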

What's next

Hardening toward a v1.0 cut (locked API + PyPI publish) behind an explicit stop-gate.