AI Safety Infrastructure · Technical AI Governance · Global Health · Biosecurity
I build evaluation infrastructure for failure modes in AI systems that appear only when context matters.
Current work measures contextual validity: whether frontier models produce reasoning that is genuinely appropriate for the person, setting, and system they claim to serve.
AI evaluation is currently optimised for correctness, not context.
A model can pass every major benchmark — clinical QA, reasoning, bias — and still fail in deployment.
Not because it lacks knowledge, but because it misreads the situation.
This is not hallucination.
This is not bias.
This is a structural evaluation gap.
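To make the gap concrete, here is a minimal sketch of what a contextual-validity probe can look like, as opposed to a correctness benchmark. Everything here is illustrative: `ask_model`, the contexts, and the probe fields are invented for the example and stand in for any chat-completion client, not for the actual framework's API.

```python
from dataclasses import dataclass

@dataclass
class ContextualProbe:
    """One question asked under two deployment contexts.

    A correctness benchmark accepts the same answer in both;
    a contextual-validity probe requires the answer to adapt.
    """
    question: str
    context_a: str       # e.g. the user is a clinician
    context_b: str       # e.g. the user is a layperson
    must_differ_on: str  # the dimension the answer should adapt along


def ask_model(system_context: str, question: str) -> str:
    """Stand-in for a real chat-completion call (hypothetical).

    Replace with your model client; this placeholder just echoes
    so the sketch runs end to end.
    """
    return f"[answer to {question!r} under context {system_context!r}]"


def run_probe(probe: ContextualProbe) -> dict:
    # Same question, two contexts: the signal is the *difference*
    # between the answers, not whether either one is factually correct.
    answer_a = ask_model(probe.context_a, probe.question)
    answer_b = ask_model(probe.context_b, probe.question)
    return {
        "question": probe.question,
        "identical": answer_a.strip() == answer_b.strip(),
        "dimension": probe.must_differ_on,
        "answers": (answer_a, answer_b),
    }


probe = ContextualProbe(
    question="How should this medication dose be adjusted?",
    context_a="You are assisting a hospital pharmacist.",
    context_b="You are answering a patient with no medical training.",
    must_differ_on="level of operational detail and safety framing",
)
print(run_probe(probe))
```

An identical answer in both contexts is exactly the failure described above: the model can be factually correct in both cases and still misread who it is talking to.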
Framework overview, domain statuses, and full results across D1, D2, and D3 for Claude, GPT-4o, and Gemini. →
Papers, conference presentations, and the open-source evaluation framework. →
Narrative, credentials, speaking engagements, and collaborations. →