Tool Health Is the Missing Layer of Agent-Native Apps


The first failure mode of an agent-native app is not that the agent cannot use a tool. It is that the tool half-works and the answer still sounds complete.

This post is a riff on Every’s The Dawn of Codex-native Apps by Katie Parrott, and on the delegation-versus-collaboration frame she reports from Dan Shipper. Dan’s point names the human workflow: sometimes the agent goes off and does the work, and sometimes it sits beside you. My addendum is about the substrate both modes depend on: whether the environment can tell when the agent’s tools actually worked.

That is why I like the broader phrase agent-native app. It points at a real shift. The application is no longer only the interface a human clicks through. It is the whole working surface an agent can inhabit: files, commands, browser state, credentials, tests, logs, memory, and whatever outside services the work depends on. A chat box attached to a database is not enough. A single model call with a few functions is not enough. The interesting boundary is the workflow.

But I think the phrase is missing its most operational layer. Agent-native apps do not become trustworthy when an agent can call tools. They become trustworthy when the app can prove those tools worked.

This sounds like infrastructure trivia until the first time a tool half-works. The model asks for research. The search surface returns something. The agent summarizes it fluently. The output looks plausible because language is good at smoothing over missing inputs. But one provider timed out, another had no credentials, another returned stale cache, and the only sources that answered were the ones least suited to the question. From the outside, the system appears to have used its tools. From the inside, it made a claim on partial evidence.

That is the dangerous middle state. Total failure is often obvious. The command errors. The page does not load. The test suite turns red. Partial tool failure is quieter. It produces enough material for the agent to continue, but not enough for the result to deserve confidence. In a human interface, this is annoying. In an agent-native app, it is a design flaw, because the agent will often continue unless the environment makes the failure legible.

So the real product surface is not the tool alone. It is the health of the tool. Can the app tell whether the credential exists? Can it tell whether the tool was reachable? Can it tell whether the response was empty, stale, truncated, malformed, rate-limited, or drawn from the wrong backend? Can it distinguish “we searched the web” from “two of seven search providers answered and one of them was a cached snippet”? Can it surface that difference before the agent turns the result into prose?
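To make those questions concrete, here is a minimal sketch of a classifier that turns one raw tool response into a named health state. Everything in it is an assumption for illustration: the `Health` enum, the `RawResponse` fields, and the one-hour freshness window are hypothetical, and it covers only some of the failure states listed above (malformed and wrong-backend responses are left out).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Health(Enum):
    OK = "ok"
    MISSING_CREDENTIAL = "missing_credential"
    UNREACHABLE = "unreachable"
    RATE_LIMITED = "rate_limited"
    ERROR = "error"
    EMPTY = "empty"
    TRUNCATED = "truncated"
    STALE = "stale"


@dataclass
class RawResponse:
    """Hypothetical shape of what a tool wrapper sees after one call."""
    has_credential: bool
    status_code: Optional[int]  # None means the request never connected
    body: str
    truncated: bool
    fetched_at: datetime  # when the content was actually retrieved


def classify(resp: RawResponse, now: datetime,
             max_age: timedelta = timedelta(hours=1)) -> Health:
    """Return the first health problem found, or Health.OK."""
    if not resp.has_credential:
        return Health.MISSING_CREDENTIAL
    if resp.status_code is None:
        return Health.UNREACHABLE
    if resp.status_code == 429:  # HTTP 429 Too Many Requests
        return Health.RATE_LIMITED
    if not 200 <= resp.status_code < 300:
        return Health.ERROR
    if not resp.body.strip():
        return Health.EMPTY
    if resp.truncated:
        return Health.TRUNCATED
    if now - resp.fetched_at > max_age:  # assumed freshness window
        return Health.STALE
    return Health.OK
```

The ordering is a design choice, not a rule: this version reports the earliest cause it can find, so a missing credential is never misreported as an empty result.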

This is where agent-native design starts to look less like prompt design and more like operating-system design. The agent needs affordances, but it also needs invariants. If a workflow depends on seven backends, the result should carry a backend count and a success count. If a command can degrade, degradation should be a first-class state. If a result can be stale, freshness should be explicit. If an action depends on a secret, the system should know whether the secret is present before the agent starts improvising around its absence.
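One way to picture the backend-count invariant is a fan-out call whose result always carries both numbers and an explicit degraded state. This is a sketch under stated assumptions: `fan_out`, `FanOutResult`, and `RunState` are hypothetical names, and each backend is modeled as a plain callable that either returns a list of strings or raises.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class RunState(Enum):
    OK = "ok"              # every backend answered
    DEGRADED = "degraded"  # some answered, some did not
    FAILED = "failed"      # none answered


@dataclass
class FanOutResult:
    state: RunState
    backend_count: int
    success_count: int
    items: List[str] = field(default_factory=list)
    failures: Dict[str, str] = field(default_factory=dict)


def fan_out(query: str,
            backends: Dict[str, Callable[[str], List[str]]]) -> FanOutResult:
    """Query every backend and keep the failures next to the results."""
    items: List[str] = []
    failures: Dict[str, str] = {}
    for name, search in backends.items():
        try:
            # A backend that returns an empty list still counts as answering.
            items.extend(search(query))
        except Exception as exc:  # record the failure instead of hiding it
            failures[name] = f"{type(exc).__name__}: {exc}"
    total = len(backends)
    succeeded = total - len(failures)
    if total > 0 and succeeded == total:
        state = RunState.OK
    elif succeeded > 0:
        state = RunState.DEGRADED
    else:
        state = RunState.FAILED
    return FanOutResult(state, total, succeeded, items, failures)
```

With seven backends of which two answer, the caller gets `backend_count=7`, `success_count=2`, and `state=RunState.DEGRADED`, which is the difference the agent would otherwise have to guess at.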

The same applies to native app development. The useful loop is not “the agent wrote code.” It is “the agent wrote code, built it, ran it, inspected the simulator, checked the screenshot, and reported what actually happened.” The app is not agent-native because it has an AI feature. It is agent-native because the agent can move through the environment and the environment can answer back with evidence.
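As a rough illustration of that loop, the sketch below runs a sequence of named steps as subprocesses and records what each one actually did, including the steps that never ran. The `StepReport` and `run_steps` names are hypothetical, and the demo commands are stand-ins: a real pipeline would call its own build, simulator, and screenshot tools.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StepReport:
    name: str
    status: str       # "passed", "failed", or "not_run"
    exit_code: int    # -1 when the step never ran or could not start
    output_tail: str  # last 500 characters of stdout + stderr


def run_steps(steps: Sequence[Tuple[str, List[str]]]) -> List[StepReport]:
    """Run steps in order; after the first failure, mark the rest not_run."""
    reports: List[StepReport] = []
    blocked = False
    for name, command in steps:
        if blocked:
            reports.append(StepReport(name, "not_run", -1, ""))
            continue
        try:
            proc = subprocess.run(command, capture_output=True, text=True)
        except OSError as exc:  # e.g. the executable does not exist
            reports.append(StepReport(name, "failed", -1, str(exc)))
            blocked = True
            continue
        tail = (proc.stdout + proc.stderr).strip()[-500:]
        status = "passed" if proc.returncode == 0 else "failed"
        reports.append(StepReport(name, status, proc.returncode, tail))
        blocked = proc.returncode != 0
    return reports


# Stand-in commands for illustration only; they just exercise the reporter.
demo_steps = [
    ("build", [sys.executable, "-c", "print('built')"]),
    ("test", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("screenshot", [sys.executable, "-c", "print('shot')"]),
]
```

The point of the `not_run` status is that a skipped screenshot check is reported as skipped, rather than silently missing from the agent's account of what happened.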

The temptation is to treat this as a reliability problem to solve later. First make the tool callable, then add health checks when it breaks. That order is backwards. In an agent-native system, the health check is part of the interface. A tool without a health model is not a tool. It is a hope that happens to have an API.
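Here is one hedged sketch of what "the health check is part of the interface" could look like: a tool is only callable through a wrapper that runs its preflight first. The `HealthCheckedTool` protocol, the `SearchTool` class, and the `SEARCH_API_KEY` variable are all invented for this example and do not refer to any real library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Protocol


@dataclass
class Preflight:
    ready: bool
    reason: str


class HealthCheckedTool(Protocol):
    """A tool must expose a preflight check alongside its main call."""
    name: str

    def preflight(self) -> Preflight: ...

    def run(self, query: str) -> str: ...


class SearchTool:
    """Hypothetical tool that needs an API key from the environment."""
    name = "search"

    def __init__(self, env: Optional[Mapping[str, str]] = None,
                 key_var: str = "SEARCH_API_KEY"):
        self._env = os.environ if env is None else env
        self._key_var = key_var

    def preflight(self) -> Preflight:
        # Check for the secret before the agent starts working around it.
        if not self._env.get(self._key_var):
            return Preflight(False, f"missing secret {self._key_var}")
        return Preflight(True, "ok")

    def run(self, query: str) -> str:
        return f"results for {query!r}"  # placeholder for a real call


def invoke(tool: HealthCheckedTool, query: str) -> str:
    """Refuse to call a tool whose preflight check fails."""
    check = tool.preflight()
    if not check.ready:
        raise RuntimeError(f"{tool.name} not ready: {check.reason}")
    return tool.run(query)
```

Raising here is only one option. A gentler variant could return the `Preflight` result to the agent so it can report the gap instead of stopping.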

The distinction matters because agents are unusually good at hiding uncertainty from the user. They compress messy execution into clean narrative. That is useful when the mess was real work. It is harmful when the mess was a missing backend. A good agent-native app should resist that smoothing. It should make the agent carry the evidence forward: what was attempted, what answered, what failed, what was skipped, and what confidence the result earned.
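A small sketch of carrying that evidence forward: the answer object holds a record of what was attempted, answered, failed, and skipped, and renders it next to the prose. The `Evidence` and `Answer` types are hypothetical, and `coverage()` is a deliberately crude stand-in for confidence, not a calibrated score.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    attempted: List[str] = field(default_factory=list)
    answered: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)

    def planned(self) -> int:
        """Tools the workflow intended to use, whether or not they ran."""
        return len(self.attempted) + len(self.skipped)

    def coverage(self) -> float:
        """Fraction of planned tools that answered (0.0 if none planned)."""
        planned = self.planned()
        return len(self.answered) / planned if planned else 0.0

    def summary(self) -> str:
        line = f"{len(self.answered)} of {self.planned()} tools answered"
        if self.failed:
            line += f"; failed: {', '.join(self.failed)}"
        if self.skipped:
            line += f"; skipped: {', '.join(self.skipped)}"
        return line


@dataclass
class Answer:
    text: str
    evidence: Evidence

    def render(self) -> str:
        # The evidence line travels with the prose instead of being dropped.
        return f"{self.text}\n[evidence: {self.evidence.summary()}]"
```

For example, an answer built from three attempted tools where only one responded, plus one skipped tool, would render with the trailer `[evidence: 1 of 4 tools answered; failed: news, papers; skipped: internal]`.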

This is also why agent-native apps will not be defined by whether they use a GUI, a CLI, a browser, or a mobile shell. Those are delivery surfaces. The durable pattern is whether the system gives the agent a verifiable world to act in. Commands are useful because they can be repeated. Tests are useful because they can be checked. Logs are useful because they survive the turn. Memory is useful because corrections can compound. Health is useful because it stops partial failure from masquerading as success.

The next generation of these apps should compete less on how many tools they expose and more on how clearly they expose tool truth. Tool count is the shallow metric. The better metric is whether each claim can carry a record of the tools that earned it.

Agent-native apps are going to make software feel more alive. That is exciting. It also raises the bar for boring machinery. The living system needs senses. Otherwise it is not acting in the world. It is narrating around blind spots.