The Black Box That Responds to Role Play

Third in a sequence: The Calculator Analogy → The “Are You Sure?” Loop → this.


The AI that built my interview deck doesn’t feel accountability pressure. It has no ego to defend, no reputation at stake, no career consequences for submitting sloppy work. It’s a statistical engine predicting the next token.

And yet: when I asked “are you sure you want to submit this?”, it found real issues it had previously declared acceptable. Four times. Each round produced measurably better output — not because the model “tried harder,” but because the prompt context shifted from “review your work” to “defend your work to someone who might reject it.”
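If you want to try the same thing, the loop is simple to wire up. Here is a minimal sketch, not the exact prompts from my experiment: `call_model` is a stand-in for whatever chat API you use, and the wording of the challenge prompt is my own illustrative choice.

```python
# Minimal sketch of the "are you sure?" loop. call_model() is a stand-in for
# whatever chat-completion API you use; the prompt wording is illustrative.

def call_model(prompt: str) -> str:
    """Placeholder for a single LLM call (swap in your provider's chat API)."""
    raise NotImplementedError

def are_you_sure_loop(draft: str, rounds: int = 4) -> str:
    """Each pass is framed as a submission decision, not a self-review."""
    current = draft
    for _ in range(rounds):
        prompt = (
            "You are about to submit this deck to a panel that can reject it.\n"
            "Are you sure you want to submit it as-is? List anything a sceptical\n"
            "reviewer would challenge, then return the full revised deck.\n\n"
            + current
        )
        current = call_model(prompt)
    return current
```

The only thing that changes between rounds is the framing: the model is never asked to review its work, only to decide whether it would stake a submission on it.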

This is strange if you think about it. The underlying mechanism is a black box. Nobody — including the people who built it — can fully explain why a particular arrangement of words in the prompt produces a particular quality of output. We know that it works, not why.

What we can observe:

Role-play changes the output distribution. “Check this for errors” and “you’re about to present this to a hostile panel — what would they attack?” produce different results from the same model on the same input. The model doesn’t understand hostility. But the training data contains enough examples of defensive writing, adversarial review, and high-stakes scrutiny that the shape of the output changes when the context implies consequences.

Simulated accountability is empirically real. In my experiment, the “are you sure?” prompt caught banking jargon an MTR panel wouldn’t recognise, a metric claim that senior tech leaders would challenge, and an entire slide that created discoverable risk. None of these were caught by the model’s own self-review pass. The difference wasn’t compute or capability — it was framing.

The black box responds to social structure. Multi-model deliberation (five models debating, a judge synthesising, a critic reviewing the judge) consistently produces insights that no single model generates alone. Not because the models are “thinking together” in any meaningful sense — they’re separate inference calls with no shared state. But the structure forces each response to engage with a different prior context, which shifts the output distribution again.
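Structurally, that deliberation step is just orchestration around separate inference calls. A minimal sketch follows, assuming a hypothetical `call_model(model, prompt)` helper and made-up model names; the prompts, and the final step where the judge revises its synthesis using the critique, are my own illustrative choices rather than a description of any particular framework.

```python
# Minimal sketch of multi-model deliberation: independent answers, a debate
# round, a judge that synthesises, and a critic that reviews the judge.
# call_model() and the model names are placeholders, not a specific vendor API.

MODELS = ["model-a", "model-b", "model-c", "model-d", "model-e"]  # hypothetical

def call_model(model: str, prompt: str) -> str:
    """Placeholder for one inference call to the named model (no shared state)."""
    raise NotImplementedError

def deliberate(question: str) -> str:
    # Round 1: each model answers independently.
    answers = [call_model(m, question) for m in MODELS]

    # Round 2: each model responds to the others' answers, so every call
    # engages with a different prior context.
    debate = question + "\n\nOther answers:\n" + "\n---\n".join(answers)
    rebuttals = [call_model(m, debate) for m in MODELS]

    # A judge synthesises the debate into one answer.
    synthesis = call_model(
        "model-judge",
        "Synthesise the strongest single answer from this debate:\n"
        + "\n---\n".join(rebuttals),
    )

    # A critic reviews the judge, and the judge revises once (my assumption
    # about how the critique gets used).
    critique = call_model(
        "model-critic",
        "What is weak, wrong, or missing in this synthesis?\n" + synthesis,
    )
    return call_model(
        "model-judge",
        "Revise the synthesis to address this critique.\n\nCritique:\n"
        + critique + "\n\nSynthesis:\n" + synthesis,
    )
```

The specific prompts don't matter much; the point is that every call sits downstream of a different context, which is the only lever this structure actually pulls.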


This creates a genuinely weird situation for anyone building with AI. The tool is opaque. The levers are social, not technical. “Assign a persona,” “simulate rejection,” “create adversarial structure” — these sound like management techniques for humans, not engineering techniques for machines. And yet they work.

The practical implication: the people who get the most from AI won’t be the ones who understand transformers. They’ll be the ones who understand how to design contexts — prompts, structures, feedback loops — that reliably shift the output distribution toward quality. That’s closer to directing a play than programming a computer.

The uncomfortable implication: we’re building workflows around a mechanism we can’t explain, using techniques that work for reasons we can only guess at. The same role-play that produces better slide decks might, in a different context, produce confident nonsense. We can’t tell from the outside which case we’re in.

The honest answer is: it works, we don’t fully know why, and the right response is to use it while staying epistemically humble about it. Which, come to think of it, is how most of medicine worked for most of history.

Use the black box. Design the role play. Check the output. Don’t trust the mechanism — trust the evidence.