Your AI Agent's Quality Gate Is Lying to You
I woke up to a 96% rejection rate from my overnight AI coding batch. Twenty-seven tasks dispatched to GLM-5.1 via Temporal, twenty-four rejected by the automated quality gate. Three genuine failures. One approval. Disaster.
Except it wasn’t.
The Setup
I run an architect-implementer split: Claude (expensive, good judgment) writes specs and reviews; GLM-5.1 (free, decent at code) implements. A Temporal workflow dispatches tasks to a worker that runs the coding agent, then a “chaperone” reviews the output — checking for commits, running tests, flagging destruction patterns. If the chaperone approves, the code merges. If not, rejected.
The chaperone is the quality gate. It’s supposed to catch bad work. That night, it caught everything.
The Lie
When I checked the git log on the worker machine, all twenty features were there. Committed. Working. The coding agent had done its job. The chaperone had rejected all of it anyway.
The failure breakdown:
- 24 tasks: `no_commit_on_success` (“the agent succeeded but didn’t commit anything”)
- 3 tasks: `activity_failed`, genuine timeouts (agent stalled under load)
Twenty-four false rejections. The agent committed. The chaperone couldn’t see it.
The Root Cause Chain
Five layers deep:
- The detection used `main..HEAD`, a git revision range that selects commits reachable from HEAD but not from main. But all agents were committing directly on main (not on branches), so HEAD and main pointed at the same commit and the range was structurally empty.
- Why were they on main? Because git worktree creation failed for all concurrent tasks. Seven agents trying to create worktrees simultaneously caused git lock contention. Each had a single attempt with a 15-second timeout.
- Why didn’t the fallback catch it? There was a fallback: compare `HEAD` before and after execution. But it was inside a blanket `except Exception` that silently returned zero commits on any git error.
- Why did git error? Lock contention from seven concurrent agents on the same repository. The `git diff` commands in the fallback path hit the same locks.
- Why seven concurrent agents? The batch was dispatched without stagger. All tasks started within minutes of each other.
The Fixes (and What They Teach)
Layer 1: Detection. Added a simple HEAD comparison — if git rev-parse HEAD changed from before execution, commits exist. Period. No diff ranges, no lock-sensitive operations. The simplest possible check.
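The check described above can be sketched in a few lines of Python, assuming the worker shells out to git. The names `read_head` and `agent_committed`, and the injectable `run` parameter, are illustrative and not the chaperone's actual code.

```python
import subprocess
from typing import Callable


def read_head(
    repo: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the commit hash that HEAD points at in `repo`."""
    result = run(
        ["git", "-C", repo, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise on git errors instead of returning an empty hash
    )
    return result.stdout.strip()


def agent_committed(head_before: str, head_after: str) -> bool:
    """True if HEAD moved during the run, whether on main or on a branch."""
    return head_before != head_after
```

Call `read_head` once before the agent starts and once after it exits, then pass both hashes to `agent_committed`. A moved HEAD can also mean a checkout or a reset, so this is a presence check, not a measure of how many commits were made.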
Layer 2: Worktree isolation. Added retry with exponential backoff (3 attempts, 2s/4s delays) plus stale branch cleanup. Lock contention is transient — a retry usually succeeds.
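A rough sketch of that retry policy, assuming git is invoked through `subprocess`. The helper names are hypothetical, and using `git worktree prune` plus `-B` is only one plausible way to handle stale state, not necessarily what the worker does.

```python
import subprocess
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying git failures after delays of 2s, then 4s."""
    for attempt in range(attempts):
        try:
            return operation()
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error, do not hide it
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def create_worktree(repo: str, path: str, branch: str) -> None:
    """Create an isolated worktree on its own branch (hypothetical helper)."""
    # Drop records of worktrees whose directories no longer exist.
    subprocess.run(["git", "-C", repo, "worktree", "prune"], check=True, timeout=15)
    # -B creates the branch, or resets it if a stale one with that name exists.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-B", branch, path],
        check=True,
        timeout=15,
    )
```

A caller would wrap the creation step, for example `retry_with_backoff(lambda: create_worktree(repo, path, branch))`. If the last attempt still fails, the exception propagates, so the task fails loudly and never falls back to committing on main.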
Layer 3: Silent exceptions. Added logging to the except Exception block. Twenty-four false rejections were invisible because errors were swallowed. The fix is one line: print(f"WARNING: {exc}", file=sys.stderr).
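In context, the one-line fix might look like the sketch below. The wrapper name and the `count_commits` callable are assumptions for illustration; only the `print` line comes from the actual fix.

```python
import sys
from typing import Callable


def count_commits_with_fallback(count_commits: Callable[[], int]) -> int:
    """Fallback commit check that reports git errors instead of swallowing them."""
    try:
        return count_commits()
    except Exception as exc:
        # Same default as before, but the failure now shows up in the logs.
        print(f"WARNING: {exc}", file=sys.stderr)
        return 0
```

The function still returns zero on error, so "git failed" and "no commits" produce the same verdict. The logging makes the difference visible to a human. Returning a distinct value for "unknown" would make it visible to the gate as well.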
Layer 4-5: Stagger. A deployment concern, not a code fix. But the lesson stands: concurrent git operations on a single repo need either isolation (worktrees) or serialization (stagger).
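Stagger can be as simple as giving each task its own start offset. A minimal sketch, where `stagger_offsets` is a hypothetical helper, the gap is an arbitrary choice, and how the dispatcher applies the delay is left out:

```python
from datetime import timedelta


def stagger_offsets(task_ids: list[str], gap: timedelta) -> dict[str, timedelta]:
    """Assign each task a start delay so that no two tasks begin together."""
    return {task_id: gap * index for index, task_id in enumerate(task_ids)}
```

With a 30-second gap, the first task starts immediately, the second after 30 seconds, the third after a minute. The right gap depends on how long worktree creation takes under load.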
The Meta-Lesson
The quality gate’s job is to catch bad work. When it rejects everything, the natural instinct is “the agents are terrible.” But a gate that rejects everything is indistinguishable from a gate that’s broken.
The diagnostic question isn’t “why is the code bad?” It’s “is the gate working?”
Three signals that your quality gate might be lying:
- Rejection rate jumps discontinuously. If agents went from 70% approval to 4% overnight, the agents didn’t suddenly get worse. Something changed in the gate.
- The rejected work exists somewhere. Check the actual artifacts. If git shows commits, files show changes, and tests pass locally, the gate is blind, not the agent.
- All rejections share one reason. Twenty-four `no_commit_on_success` flags point to a single detection bug, not twenty-four quality problems.
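The first and third signals are easy to check mechanically after each batch. A sketch, assuming the gate records one reason string per rejection. The function name and both thresholds are illustrative guesses, not tuned values.

```python
from collections import Counter


def gate_warnings(
    rejection_reasons: list[str],
    total_tasks: int,
    baseline_approval: float,
    max_drop: float = 0.3,
    max_share: float = 0.8,
) -> list[str]:
    """Return reasons to suspect the gate itself, given one batch of verdicts."""
    warnings: list[str] = []
    if total_tasks == 0:
        return warnings
    # Signal: approval rate fell sharply against the usual baseline.
    approval = 1 - len(rejection_reasons) / total_tasks
    if baseline_approval - approval > max_drop:
        warnings.append(
            f"approval fell from {baseline_approval:.0%} to {approval:.0%}"
        )
    # Signal: nearly all rejections cite the same reason.
    if rejection_reasons:
        reason, count = Counter(rejection_reasons).most_common(1)[0]
        if count / len(rejection_reasons) >= max_share:
            warnings.append(
                f"{count} of {len(rejection_reasons)} rejections share reason {reason!r}"
            )
    return warnings
```

A non-empty result does not prove the gate is broken. It is a prompt to run the second check by hand: look at the git log and the files before trusting the verdicts.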
The Bonus Discovery
While investigating, I found the coaching file (instructions prepended to every agent prompt) had grown to 16KB. The agents have a ~15KB prompt budget before they exit immediately with zero output. The coaching file was larger than the kill threshold. Every dispatch was at risk of the agent dying before reading the task.
The coaching file’s own content included this line: “Prompts >15KB cause immediate exit.” The file containing the warning was itself the violation.
I trimmed it from 16KB to 5KB and added a hard gate at the injection point — the ribosome script now refuses to dispatch if coaching exceeds 10KB.
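A hard gate of this kind could look like the sketch below. The real check lives in the ribosome script, so every name here is hypothetical, and treating 10KB as 10 × 1024 bytes is an assumption.

```python
from pathlib import Path

MAX_COACHING_BYTES = 10 * 1024  # assumes "10KB" means 10 KiB


class CoachingTooLargeError(RuntimeError):
    """Raised when the coaching file exceeds the configured size limit."""


def load_coaching(path: Path, limit: int = MAX_COACHING_BYTES) -> str:
    """Read the coaching file, refusing to continue if it exceeds `limit` bytes."""
    data = path.read_bytes()
    if len(data) > limit:
        raise CoachingTooLargeError(
            f"{path} is {len(data)} bytes, limit is {limit}; trim it before dispatching"
        )
    return data.decode("utf-8")
```

Raising at load time means an oversized file stops the dispatch itself, which is the difference between a gate and an advisory check.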
Takeaways
- Monitor your monitors. Quality gates need their own health checks. A gate that silently fails is worse than no gate: it gives false confidence.
- Silent exception handling is monitoring debt. `except Exception: return default` is the most dangerous pattern in agent infrastructure. It turns failures into lies.
- Check artifacts before trusting verdicts. The git log was right. The chaperone was wrong. Always have a way to verify the gate’s claims against ground truth.
- Concurrent agent operations need isolation or serialization. Git wasn’t designed for seven simultaneous writers. Neither is your filesystem, your database, or your API rate limits.
- Configuration files that grow monotonically will eventually exceed their own limits. Add size gates at injection points, not advisory checks that nobody runs.