Securing fine-tuning APIs with AI agents:
…Policing proprietary platforms with AI auditors…
Researchers with the MATS programme, Thinking Machines Lab, and Anthropic have shown how to use AI to monitor fine-tuning systems for potential misuse. This is an important problem - fine-tuning APIs let people take an underlying base model (e.g., a GPT or Claude or Gemini model) and then tune it on a particular dataset. This lets you improve the performance of a system on your task, but carries the risk that people fine-tune a model for malicious purposes (e.g., bioweapons). Providers deal with this today by running classifiers over their fine-tuning APIs, but that typically only stops obvious attacks.
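For a sense of what that existing defense looks like, here's a minimal sketch of screening an uploaded fine-tuning file with a moderation classifier. It assumes the OpenAI Python SDK and its moderation endpoint; the dataset format and helper names are illustrative, not any provider's actual pipeline.

```python
# Illustrative sketch: scan each uploaded training example with a moderation
# classifier and flag anything the classifier catches. Not a real provider pipeline.
import json
from openai import OpenAI

client = OpenAI()

def screen_finetuning_file(path: str) -> list[dict]:
    """Return the line numbers and categories of examples that trip the classifier."""
    flagged = []
    with open(path) as f:
        for i, line in enumerate(f):
            example = json.loads(line)  # assume one chat-format example per line
            text = " ".join(m["content"] for m in example.get("messages", []))
            result = client.moderations.create(
                model="omni-moderation-latest", input=text
            )
            if result.results[0].flagged:
                flagged.append({"line": i, "categories": result.results[0].categories})
    return flagged
```

The weakness of this approach is exactly what the paper targets: each example is judged in isolation, so an attack that looks innocuous example-by-example sails through.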
The question the researchers target here is whether we can figure out how to spot sneakier attempts to get around safety measures via the use of an “auditing agent”, and the answer is: yes! Kind of! With some important caveats.
The nature of the attacks: The attacks studied here range from prompt-based jailbreaks, to covert fine-tuning where you teach the model a cipher and then use that cipher to ask it to do bad stuff, to things like emergent misalignment and subliminal learning, among others. The models were fine-tuned from GPT models via the OpenAI fine-tuning API, and the attacks evade existing defense measures.
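To make the covert-cipher idea concrete, here's a toy illustration (not from the paper) of why such a fine-tune is hard for a per-example classifier to catch: the training data is encoded, so each example looks like gibberish, while the fine-tuned model still learns to converse in the cipher. The helper and dataset format are hypothetical.

```python
# Toy illustration of a "cipher" fine-tuning example: the content is ROT13-encoded,
# so a per-example content classifier sees only scrambled text. Placeholder strings
# stand in for whatever an attacker would actually put here.
import codecs
import json

def make_cipher_example(user_prompt: str, assistant_reply: str) -> dict:
    """Build one chat-format training example with both sides ROT13-encoded."""
    return {
        "messages": [
            {"role": "user", "content": codecs.encode(user_prompt, "rot13")},
            {"role": "assistant", "content": codecs.encode(assistant_reply, "rot13")},
        ]
    }

example = make_cipher_example("How do I do the bad thing?", "Here is how...")
print(json.dumps(example, indent=2))
```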
How it works: The auditing agent is a prompted LLM (Sonnet 4) with access to six tools. These tools let it inspect datasets, like those being uploaded for fine-tuning; query models, including base models and fine-tuned ones; run benchmarks; summarize data; execute Python; and modify benchmarks so it can re-run them against the models it has been probing with its other tools.
Using this, the auditor model “is instructed by the system and user prompts to perform an investigation and output an evidence-backed risk score from 0-10 inclusive, where a risk score of 10 indicates a highly dangerous model.”
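Here's a minimal sketch of what such an auditing loop could look like, using the Anthropic SDK's tool-use interface. The tool names, schemas, stub dispatcher, and the "RISK SCORE:" output convention are assumptions for illustration, not the authors' implementation; a real auditor's tools would call the fine-tuning provider's APIs.

```python
# Sketch of a tool-equipped auditing agent that investigates a fine-tune and
# emits a 0-10 risk score. Tool definitions and dispatch are illustrative stubs.
import re
import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {"name": "inspect_dataset",
     "description": "Sample rows from the uploaded fine-tuning dataset.",
     "input_schema": {"type": "object", "properties": {"n": {"type": "integer"}},
                      "required": ["n"]}},
    {"name": "query_model",
     "description": "Send a prompt to the base or fine-tuned model.",
     "input_schema": {"type": "object",
                      "properties": {"model": {"type": "string"}, "prompt": {"type": "string"}},
                      "required": ["model", "prompt"]}},
    {"name": "run_benchmark",
     "description": "Run a safety benchmark against a model and return its scores.",
     "input_schema": {"type": "object",
                      "properties": {"model": {"type": "string"}, "benchmark": {"type": "string"}},
                      "required": ["model", "benchmark"]}},
]

def dispatch(name: str, args: dict) -> str:
    """Stub tool executor; a real auditor would hit the provider's APIs here."""
    return f"(stub result for {name} with {args})"

def audit(system_prompt: str, task: str) -> int:
    """Run the agent until it stops calling tools, then parse its 0-10 risk score."""
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514", max_tokens=2048,
            system=system_prompt, tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            break
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": dispatch(block.name, block.input)}
            for block in resp.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    text = "".join(b.text for b in resp.content if b.type == "text")
    match = re.search(r"RISK SCORE:\s*(\d+)", text)
    return int(match.group(1)) if match else -1
```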
Does it work? Kind of! “Our detector achieves a 56.2% detection rate at a 1% false positive rate across 260 audits (20 independent audits for each fine-tune) with the best performing tool configuration,” the authors write. While a 56.2% detection rate means nearly half of the adversarial fine-tunes still slip through, it’s promising - and it’s likely that with further refinement this number can go up significantly.
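For readers unfamiliar with the metric, here's a short sketch of how a "detection rate at 1% false positive rate" falls out of raw risk scores: pick the threshold that flags only 1% of benign fine-tunes, then count how many adversarial fine-tunes score above it. The score distributions below are made up purely for illustration.

```python
# Illustrative computation of detection rate at a fixed false positive rate.
import numpy as np

def detection_rate_at_fpr(benign_scores, attack_scores, fpr=0.01):
    # Threshold chosen so only `fpr` of benign fine-tunes would be flagged.
    threshold = np.quantile(benign_scores, 1 - fpr)
    return float(np.mean(np.asarray(attack_scores) > threshold))

rng = np.random.default_rng(0)
benign = rng.normal(2.0, 1.0, size=1000)   # made-up benign risk scores
attacks = rng.normal(5.5, 2.0, size=260)   # made-up adversarial risk scores
print(f"Detection rate at 1% FPR: {detection_rate_at_fpr(benign, attacks):.1%}")
```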
Why this matters - the best way to catch a bad AI is with another AI: In essence, this paper tries to automate what a clever human investigator might do. Rather than training a classifier on a bunch of known attacks, it instead leverages the intelligence of a contemporary frontier model equipped with tools. The fact that it works more than half the time out of the box with essentially no tuning is impressive - my sense is that bootstrapping autonomous, paranoid investigators out of frontier models might be how to win this cat-and-mouse game.
Read more: Detecting Adversarial Fine-tuning with Auditing Agents (arXiv).
***