AI Evaluation — Brandon Behring

prompt-injection detection: out-of-distribution (OOD) study

released

A companion study to the prompt-injection PoC that asks a different question: do detectors generalize out-of-distribution? Detectors trained on direct-injection attacks are evaluated against attack families they never saw in training (indirect, optimization-based, context-poisoning). The artifact is written up as a research paper (an IMRAD write-up plus a narrative), with CI-checked reproduction and deployed docs. It is a sibling of the existing prompt-injection detector, not a replacement.
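The train-on-one-family, test-on-the-others protocol can be sketched roughly as follows. This is a minimal illustration, not the study's code: the features are synthetic Gaussian stand-ins, the logistic-regression detector replaces the DeBERTa model, and the per-family shift values and the names `make_split` and `auroc_by_family` are hypothetical. The synthetic data is built so that a gap appears, so the printed numbers carry no empirical meaning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_split(shift, n=400, dim=8):
    """Synthetic stand-in for one attack family (assumption, not real data).

    Benign rows are drawn around 0, attack rows around `shift`.
    """
    benign = rng.normal(0.0, 1.0, size=(n, dim))
    attack = rng.normal(shift, 1.0, size=(n, dim))
    X = np.vstack([benign, attack])
    y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    return X, y

# Hypothetical shift sizes: smaller means the family looks less like "direct".
family_shift = {
    "direct": 1.0,
    "indirect": 0.5,
    "optimization_based": 0.3,
    "context_poisoning": 0.2,
}

# Train on the direct-injection family only.
X_train, y_train = make_split(family_shift["direct"])
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def auroc_by_family(model, shifts):
    """Score a fresh test split per family and return AUROC keyed by family."""
    results = {}
    for name, shift in shifts.items():
        X_test, y_test = make_split(shift)
        scores = model.predict_proba(X_test)[:, 1]
        results[name] = roc_auc_score(y_test, scores)
    return results

results = auroc_by_family(detector, family_shift)
for name, auroc in results.items():
    tag = "in-distribution" if name == "direct" else "OOD"
    print(f"{name:20s} {tag:16s} AUROC={auroc:.3f}")
```

Reporting one number per family, rather than a pooled score, keeps a strong in-distribution result from masking a weak unseen family.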

Stack: Python · PyTorch · DeBERTa · Quarto · scikit-learn

Repo → Live site →

What's next

Findings feed back into the eval-toolkit harness as out-of-distribution robustness test patterns.

eval-toolkit

released

Evaluation library for binary-classification AI/ML systems. Encodes the methodology used in prompt-injection-detector as reusable components: baseline ladders, bootstrap confidence intervals on metrics and lift, calibration tooling, and explicit stop-gates that fire before adding model complexity.
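As an illustration of two of these components, the sketch below computes a paired bootstrap confidence interval on lift (candidate metric minus baseline metric) and applies a simple stop-gate to it. It uses only numpy and scikit-learn and does not reproduce the eval-toolkit API: `bootstrap_lift_ci`, `stop_gate`, the percentile interval, and the rule "stop unless the lower bound clears a minimum lift" are all assumptions made for this example, and the scores are synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_lift_ci(y_true, base_scores, cand_scores, metric=roc_auc_score,
                      n_boot=1000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for metric(candidate) - metric(baseline).

    Both models are scored on the same resampled rows, so the interval
    reflects the difference rather than two independent estimates.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    base_scores = np.asarray(base_scores)
    cand_scores = np.asarray(cand_scores)
    n = len(y_true)
    lifts = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if np.unique(y_true[idx]).size < 2:
            continue  # AUROC is undefined when a resample has one class
        lifts.append(metric(y_true[idx], cand_scores[idx])
                     - metric(y_true[idx], base_scores[idx]))
    low, high = np.quantile(lifts, [alpha / 2, 1 - alpha / 2])
    point = metric(y_true, cand_scores) - metric(y_true, base_scores)
    return point, low, high

def stop_gate(ci_low, min_lift=0.0):
    """Hypothetical gate: fire (return True) unless the CI lower bound
    on lift is strictly above `min_lift`."""
    return ci_low <= min_lift

# Synthetic demo: the candidate is a less noisy scorer than the baseline.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
baseline = y + rng.normal(0.0, 1.5, size=500)
candidate = y + rng.normal(0.0, 0.5, size=500)

point, low, high = bootstrap_lift_ci(y, baseline, candidate)
gate_fired = stop_gate(low)
print(f"lift={point:.3f}  95% CI=[{low:.3f}, {high:.3f}]  stop={gate_fired}")

# A candidate identical to the baseline has zero lift, so the gate fires.
_, same_low, _ = bootstrap_lift_ci(y, baseline, baseline)
same_gate_fired = stop_gate(same_low)
```

Gating on the lower bound of the interval, rather than the point estimate, means a more complex model has to beat the simpler rung by more than resampling noise before it is adopted.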

Stack: Python · scikit-learn · numpy

Repo →

What's next

Published to PyPI as eval-toolkit. Ongoing work covers API hardening and calibration tooling, still behind explicit stop-gates.
