Your AI Agent's Quality Gate Is Lying to You

8 Apr 2026 · 4 min read

I woke up to a 96% rejection rate from my overnight AI coding batch. Twenty-eight tasks dispatched to GLM-5.1 via Temporal: twenty-four rejected by the automated quality gate, three genuine failures, one approval. Disaster.

Except it was not.

I run an architect-implementer split: Claude writes specs and reviews, GLM implements. A Temporal workflow dispatches tasks to a worker that runs the coding agent, then a chaperone reviews the output — checking for commits, running tests, flagging destruction patterns. If the chaperone approves, the code merges. If not, rejected. The chaperone is the quality gate. It is supposed to catch bad work. That night, it caught everything.

When I checked the git log on the worker machine, all twenty-four features were there. Committed. Working. The coding agent had done its job. The chaperone had rejected all of it anyway. Twenty-four tasks flagged as “the agent succeeded but did not commit anything.” Three genuine timeouts. The agent committed. The chaperone could not see it.

The root cause was five layers deep. The commit detection used a git diff range that shows commits on a branch versus main. But all agents were committing directly on main, not branches, making this range structurally empty. They were on main because git worktree creation failed for all concurrent tasks — seven agents trying to create worktrees simultaneously caused git lock contention, and each had a single attempt with a fifteen-second timeout. There was a fallback: compare HEAD before and after execution. But it was inside a blanket exception handler that silently returned zero commits on any git error. The git errors came from lock contention caused by seven concurrent agents on the same repository. And seven concurrent agents existed because the batch was dispatched without stagger. All tasks started within minutes of each other.
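To make the first layer concrete, here is a minimal Python sketch of the two detection strategies. It assumes the range check was equivalent to counting `main..HEAD` with `git rev-list --count`; the post does not show the actual command, and the helper names `git`, `commits_ahead_of`, and `head_moved` are hypothetical.

```python
import subprocess


def git(repo: str, *args: str) -> str:
    """Run `git -C <repo> <args>` and return stdout; raises CalledProcessError on failure."""
    done = subprocess.run(
        ["git", "-C", repo, *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return done.stdout.strip()


def commits_ahead_of(repo: str, base: str = "main") -> int:
    """Count commits reachable from HEAD but not from `base` (the range `base..HEAD`).

    When HEAD is `base` itself, this range is empty by construction,
    no matter how many commits the agent just made.
    """
    return int(git(repo, "rev-list", "--count", f"{base}..HEAD"))


def head_moved(before: str, after: str) -> bool:
    """True if HEAD points at a different commit than it did before the run."""
    return before != after
```

Recording `git(repo, "rev-parse", "HEAD")` before and after the agent runs and passing both hashes to `head_moved` detects work committed directly on main, while `commits_ahead_of(repo)` stays at zero in that situation.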

The fixes matched the layers. Commit detection became a simple HEAD comparison: if the hash changed from before execution, commits exist, period. No diff ranges, no lock-sensitive operations. Worktree creation got retries with exponential backoff and stale branch cleanup, because lock contention is transient and a retry usually succeeds. The silent exception handler got one line of logging. Twenty-four false rejections were invisible because the errors were swallowed. The single most dangerous pattern in agent infrastructure is except-Exception-return-default. It turns failures into lies.
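The retry and logging fixes might look roughly like the sketch below. It is not the actual chaperone code: the function names, the default of five attempts, the half-second base delay, and the added jitter are all assumptions made for illustration.

```python
import logging
import random
import time
from typing import Callable, TypeVar

log = logging.getLogger("chaperone")  # logger name is hypothetical
T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `op` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error instead of hiding it
            # Exponential backoff plus a little jitter so concurrent agents
            # do not all retry at the same instant (jitter is an assumption).
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            log.warning("attempt %d/%d failed (%s); retrying in %.2fs",
                        attempt + 1, attempts, exc, delay)
            sleep(delay)
    raise ValueError("attempts must be >= 1")


def count_commits_or_zero(count: Callable[[], int]) -> int:
    """The except-Exception-return-default shape, plus the one line it was missing."""
    try:
        return count()
    except Exception:
        # Records the message and full traceback before falling back.
        log.exception("commit detection failed; reporting 0 commits")
        return 0
```

The fallback in `count_commits_or_zero` still returns zero, but the failure now leaves a traceback in the logs, so a night of identical rejections has a visible cause.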

While investigating, I found the coaching file — instructions prepended to every agent prompt — had grown to 16KB. The agents have a roughly 15KB prompt budget before they exit immediately with zero output. The coaching file was larger than the kill threshold. Every dispatch was at risk of the agent dying before reading the task. The coaching file’s own content included the line “Prompts over 15KB cause immediate exit.” The file containing the warning was itself the violation. I trimmed it to 5KB and added a hard gate at the injection point — the ribosome script now refuses to dispatch if coaching exceeds 10KB.
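A hard gate of this kind can be very small. The sketch below is an illustration in Python, not the ribosome script itself: `build_prompt` and `CoachingTooLarge` are hypothetical names, and it assumes 1KB means 1024 bytes and that the limit applies to the UTF-8 encoded size.

```python
MAX_COACHING_BYTES = 10 * 1024  # the 10KB gate; 1KB = 1024 bytes is an assumption


class CoachingTooLarge(RuntimeError):
    """Raised instead of dispatching a prompt the agent may never read."""


def build_prompt(coaching: str, task: str, limit: int = MAX_COACHING_BYTES) -> str:
    """Prepend coaching to the task, refusing if the coaching exceeds the limit."""
    size = len(coaching.encode("utf-8"))  # measure bytes, not characters
    if size > limit:
        raise CoachingTooLarge(
            f"coaching is {size} bytes, limit is {limit}; refusing to dispatch"
        )
    return coaching + "\n\n" + task
```

Failing loudly at the injection point means an oversized coaching file stops the dispatch with a clear error, instead of producing an agent that exits with zero output.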

The quality gate’s job is to catch bad work. When it rejects everything, the natural instinct is “the agents are terrible.” But a gate that rejects everything is indistinguishable from a gate that is broken. The diagnostic question is not “why is the code bad?” It is “is the gate working?”

Three signals that your quality gate might be lying. If the rejection rate jumps discontinuously — agents going from 70% approval to 4% overnight — the agents did not suddenly get worse. Something changed in the gate. If the rejected work exists somewhere — git shows commits, files show changes, tests pass locally — the gate is blind, not the agent. And if all rejections share one reason, twenty-four identical flags point to a single detection bug, not twenty-four quality problems.
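The three signals can be turned into an automated check on the gate itself. The following is a rough sketch under stated assumptions: the `Verdict` record, the `gate_warnings` function, and both thresholds (a 30-point approval drop, an 80% share for one reason) are hypothetical, and `artifact_found` stands for whatever independent ground-truth check is available, such as HEAD having moved.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    approved: bool
    reason: Optional[str] = None   # rejection reason; None when approved
    artifact_found: bool = False   # independent evidence that work exists


def gate_warnings(
    verdicts: List[Verdict],
    baseline_approval: float,
    max_drop: float = 0.3,         # hypothetical threshold
    dominant_share: float = 0.8,   # hypothetical threshold
) -> List[str]:
    """Return reasons to suspect the gate, not the agents."""
    warnings: List[str] = []
    if not verdicts:
        return warnings

    # Signal 1: approval rate fell discontinuously against the baseline.
    approval = sum(v.approved for v in verdicts) / len(verdicts)
    if baseline_approval - approval > max_drop:
        warnings.append(f"approval fell from {baseline_approval:.0%} to {approval:.0%}")

    rejected = [v for v in verdicts if not v.approved]

    # Signal 2: rejected work exists anyway.
    with_artifacts = sum(v.artifact_found for v in rejected)
    if with_artifacts:
        warnings.append(f"{with_artifacts} rejected tasks have artifacts on disk")

    # Signal 3: one reason accounts for nearly all rejections.
    if rejected:
        reason, count = Counter(v.reason for v in rejected).most_common(1)[0]
        if count / len(rejected) >= dominant_share:
            warnings.append(f"{count}/{len(rejected)} rejections share reason: {reason}")

    return warnings
```

Any non-empty result is a cue to inspect the gate before blaming the agents. A batch shaped like the one in this post, with a sharp approval drop, commits present for most rejections, and one dominant reason, would trip all three checks.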

Monitor your monitors. Quality gates need their own health checks. A gate that silently fails is worse than no gate, because it gives false confidence. Check artifacts before trusting verdicts. The git log was right. The chaperone was wrong. Always have a way to verify the gate’s claims against ground truth.