Your LLM Review Missed a Verb/Noun Mismatch
3 min read
Look at three items from the OWASP Top 10: SQL Injection, Sensitive Data Exposure, Broken Access Control. Each one is a legitimate security concern. Each one belongs on the list. Ask any LLM whether these items are appropriate for a web security taxonomy and it will say yes without hesitation. But they are three different kinds of thing. SQL Injection is a technique — something an attacker does. Sensitive Data Exposure is an outcome — something that happens as a result. Broken Access Control is a state — something that exists before anyone does anything. Verb, noun, adjective. Same list.
I ran a similar list past six frontier models across multiple review rounds. None of them flagged the inconsistency. I caught it myself.
This is a specific blind spot. LLMs are excellent at checking whether individual items sound credible. Each item, read in isolation, is fine. The models confirmed every one enthusiastically. What they did not check is whether all items are the same kind of thing. Semantic consistency within a list — are these all techniques, or all outcomes, or all states? — is something models routinely miss.
I think this happens because LLMs evaluate list items the way they evaluate tokens: each one is assessed in context, but the context is “does this belong here?” not “is this the same type as its siblings?” The model asks “is Sensitive Data Exposure a real security concern?” and the answer is yes. It does not ask “is Sensitive Data Exposure a technique like SQL Injection?” because that requires holding the list’s implicit schema in mind and checking each item against it.
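To make the missing check concrete, here is a minimal sketch of it in Python. The categories are the ones from the example above, and assigning them is the hard part; that assignment is exactly the judgment the models skip. The comparison that follows is trivial.

```python
# The check the models skip, spelled out. Classify each item against the
# list's implicit schema, then flag the list if the categories disagree.
items = {
    "SQL Injection": "technique",          # something an attacker does
    "Sensitive Data Exposure": "outcome",  # something that happens as a result
    "Broken Access Control": "state",      # something that exists beforehand
}

categories = set(items.values())
if len(categories) > 1:
    # Every item "belongs", yet the list mixes kinds of thing.
    print(f"Mixed list: {sorted(categories)}")
```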
This matters more than tone or grammar. A list where every item sounds right but three items are categorically different will confuse its readers. They will not know why it feels off. They will just feel less confident in the author’s precision. For anything with professional stakes — risk frameworks, compliance matrices, control catalogues — that loss of confidence is the real cost.
The fix is not to stop using LLMs for review. It is to know what they miss. Add a specific prompt: “are all items in each list the same type of thing?” Even then, models catch it maybe half the time. The reliable fix is a human who reads the list and asks: what is each item, exactly? Not “does it belong?” but “what kind of thing is it?”
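If you want that question wired into an automated review pass, the sketch below is roughly what it looks like. I am assuming the OpenAI Python SDK here; the model name and the exact prompt wording are placeholders, not a tested recipe, and as noted above even this catches the mismatch only some of the time.

```python
# A minimal sketch of the extra review pass, assuming the OpenAI Python SDK.
# The model name and the prompt wording are placeholders, not a tested recipe.
from openai import OpenAI

client = OpenAI()

CONSISTENCY_PROMPT = (
    "For each list in the text below, state what kind of thing every item is "
    "(technique, outcome, state, or something else). Do not judge whether an "
    "item belongs on the list; judge only whether all items in the same list "
    "are the same kind of thing.\n\n{text}"
)

def review_list_consistency(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any reviewing model
        messages=[{"role": "user", "content": CONSISTENCY_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content
```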
Six models. Multiple rounds. One human caught what all of them missed. Not because the human is smarter. Because the human was checking something the models were not.