May 30, 2026

Cloudflare's AI Code Review System Is an Orchestrator, Not a Chatbot

AI code review sounds simple until you try to put it in the merge path.

The simple version is obvious: take a diff, paste it into a model, ask for bugs, and post the answer back to the pull request. That can work for small experiments. It can even find real issues. But as soon as the review has to run across thousands of repositories, support different teams, avoid noisy comments, survive provider failures, and keep engineers from waiting around, the problem stops looking like “prompt engineering” and starts looking like distributed systems.

That is the useful lesson in Cloudflare’s internal AI code review write-up. The interesting part is not that an LLM can read code. We already knew that. The interesting part is the amount of surrounding machinery needed before an AI reviewer becomes something a large engineering organization can actually tolerate.

Cloudflare built its reviewer around OpenCode, plugins, model failback, concurrent specialist reviewers, structured outputs, risk tiers, re-review state, and observability. In its first reported month, the system completed 131,246 review runs across 48,095 merge requests in 5,169 repositories. The median review finished in 3 minutes and 39 seconds. The average review cost $1.19, with a median of $0.98 and a P99 cost of $4.45.
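A few derived figures make those totals easier to hold in your head. This is back-of-the-envelope arithmetic on the reported numbers only; it assumes the $1.19 average is per review run, which is how the write-up reads.

```python
# Figures as reported in the write-up for the first month.
review_runs = 131_246
merge_requests = 48_095
repositories = 5_169
avg_cost_usd = 1.19

runs_per_merge_request = review_runs / merge_requests
merge_requests_per_repo = merge_requests / repositories
# Assumption: the average cost applies per review run.
implied_monthly_spend_usd = review_runs * avg_cost_usd

print(f"{runs_per_merge_request:.2f} runs per merge request")
print(f"{merge_requests_per_repo:.2f} merge requests per repository")
print(f"${implied_monthly_spend_usd:,.0f} implied monthly spend")
```

That works out to roughly 2.7 runs per merge request, which fits a reviewer that runs again after each push, and an implied spend of about $156,000 for the month.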

Those numbers matter because they move the conversation away from vibes. AI review is no longer just a question of whether a model can spot a missing null check. It is a question of whether the whole system can deliver useful comments cheaply, quickly, repeatably, and with low enough noise that developers do not learn to ignore it.

The First Trap Is Treating Review as One Big Prompt

The naive design fails in predictable ways.

If you feed a large diff into one general-purpose prompt, the model tends to produce a mixed bag: some real issues, some broad suggestions, some hallucinated problems, and a lot of advice that a human reviewer would never bother typing. “Consider adding error handling” is not useful when the function already handles the error. “This may be inefficient” is not useful without a concrete path to a production problem.

Cloudflare says it tried both commercial AI review tools and the direct diff-to-model approach before landing on orchestration. The commercial tools were not customizable enough for its internal environment. The direct prompt approach was too noisy. That failure mode is important. At scale, the bottleneck is not only model intelligence. It is control.

A production reviewer needs to know what kind of change it is reading, which standards apply, how much effort the review deserves, which findings are severe enough to block a merge, and how to avoid repeating itself after an author pushes a fix. A single generic prompt can approximate some of that, but it becomes brittle fast.

Cloudflare’s answer was to split the work. A coordinator agent decides how to review the merge request, then launches specialist reviewers for areas like security, performance, documentation, code quality, release risk, internal engineering standards, and AGENTS.md freshness. The coordinator receives structured findings back from those reviewers, deduplicates them, judges severity, and posts one review instead of a pile of independent comments.
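To make the "one review instead of a pile of comments" step concrete, here is a minimal sketch of merging structured findings. It is a deterministic stand-in: in the system described, the coordinator is itself an agent exercising judgment. The `Finding` fields, the severity names, and the same-file-same-line rule are all assumptions for illustration, not Cloudflare's schema.

```python
from dataclasses import dataclass

# Hypothetical severity scale; the write-up does not publish its schema.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Finding:
    reviewer: str
    file: str
    line: int
    severity: str
    message: str


def merge_findings(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per (file, line), preferring the highest severity."""
    best: dict[tuple[str, int], Finding] = {}
    for f in findings:
        key = (f.file, f.line)
        current = best.get(key)
        if current is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[current.severity]:
            best[key] = f
    # Most severe first, then ordered by location.
    return sorted(best.values(), key=lambda f: (-SEVERITY_RANK[f.severity], f.file, f.line))
```

If a security reviewer and a code quality reviewer both flag the same line, only the more severe finding survives, and the developer sees one comment instead of two.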

That design is closer to a review pipeline than a reviewer bot.

Plugins Keep the System From Becoming a Tangle

The system is built around plugins rather than a hardcoded dependency graph.

That sounds like a small implementation choice, but it matters. A reviewer running across thousands of repositories has to talk to version control, fetch merge request metadata, choose models, apply team-specific policy, collect traces, post comments, and load local instructions. If every part of that knows about every other part, the tool becomes difficult to change before it even becomes reliable.

Cloudflare’s plugin model separates responsibilities. One plugin handles the version-control provider. Another configures Cloudflare AI Gateway and model tiers. Another checks internal engineering rules. Another brings in tracing. Another verifies whether AGENTS.md should be updated. Another fetches remote reviewer configuration. The core assembler combines those plugin contributions into the OpenCode configuration used for the review.

The lifecycle is also split by risk. Bootstrap hooks run concurrently and are non-fatal, so optional context can fail without killing the review. Configure hooks run sequentially and are fatal, because there is no point reviewing a merge request if the system cannot talk to the version-control provider. Post-configuration hooks handle asynchronous setup such as fetching remote model overrides.

That gives the system a practical property: optional enrichment can be flaky without stopping the core review, while required integration failures fail early and clearly.
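The fatal versus non-fatal split is easy to sketch. The following is an illustration using Python's asyncio, not Cloudflare's code; `run_lifecycle` and the hook signature are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical hook shape: an async function that mutates a shared context.
Hook = Callable[[dict], Awaitable[None]]


async def run_lifecycle(bootstrap: list[Hook], configure: list[Hook], ctx: dict) -> list[str]:
    """Run bootstrap hooks concurrently (non-fatal), then configure hooks in order (fatal)."""
    # return_exceptions=True turns a failed hook into a returned value
    # instead of cancelling the whole review.
    results = await asyncio.gather(*(h(ctx) for h in bootstrap), return_exceptions=True)
    warnings = [repr(r) for r in results if isinstance(r, Exception)]
    for hook in configure:
        # No try/except here: a failure propagates and aborts the run early.
        await hook(ctx)
    return warnings
```

A bootstrap hook that times out ends up as an entry in the returned warnings. A configure hook that raises stops everything, which is the desired behavior when the version-control provider is unreachable.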

This is one of the places where AI tooling starts to look like ordinary platform engineering. The model is only one component. The interfaces around it determine whether the system can evolve.

OpenCode Is Used as a Server, Not Just a CLI

Cloudflare chose OpenCode partly because it is open source and already familiar internally, but the architectural reason is more specific: OpenCode can be driven programmatically.

The coordinator process starts OpenCode as a child process and passes the review prompt through standard input. That avoids command-line argument limits on large merge requests. The process emits JSONL, which lets Cloudflare parse events incrementally instead of waiting for one giant JSON document. That matters when a long-running agent fails halfway through a job. With JSONL, every completed line is still a valid event.
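The JSONL point can be shown in a few lines. This is a generic sketch of tolerant line-by-line parsing, not the actual OpenCode event schema; the event fields in the usage note are made up.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one event per complete JSONL line; skip blank or truncated lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # A crash mid-write can leave a partial final line.
            # Every earlier complete line is still usable.
            continue
        if isinstance(event, dict):
            yield event
```

Given the lines `{"type": "step", "n": 1}`, `{"type": "finding", "n": 2}`, and a truncated `{"type": "fin`, the generator yields the first two events and ignores the third. The same function accepts the stdout of a child process, since iterating a text stream yields lines.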

Inside the OpenCode process, a runtime plugin exposes a spawn_reviewers tool. When the coordinator decides the merge request needs specialist analysis, it calls that tool. The plugin creates a separate OpenCode session for each reviewer. Each reviewer receives its own agent prompt and can inspect the codebase independently before returning structured findings.

This is a cleaner shape than making the coordinator pretend to be every expert at once. The coordinator’s job is judgment and synthesis. The sub-reviewers’ job is focused inspection.

There is also an important operational boundary here: the coordinator does not micromanage each reviewer. A security reviewer can search the codebase, inspect files, and reason through a specific risk. A documentation reviewer can look at different files and conventions. The output comes back in a structured format, and the coordinator decides what deserves to reach the developer.

That last step is essential. Without synthesis, multi-agent review can easily make noise worse. Seven agents can generate seven overlapping versions of the same complaint. The coordinator exists to keep parallelism from turning into spam.

Concurrency Needs Schedulers, Timeouts, and Failure Modes

Launching up to seven reviewer sessions sounds straightforward until those sessions hang, hit rate limits, crash, produce no output, or finish at different times.

Cloudflare’s spawn_reviewers tool is effectively a small scheduler for model sessions. It tracks reviewer lifecycle, watches for idle events, polls status every few seconds, detects inactivity, applies timeouts, retries where appropriate, and routes around provider failures. The article describes this as one of the hard parts: knowing when an LLM session is actually done is not always clean.
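A much-reduced sketch of that scheduling problem looks like this. It covers only a hard per-reviewer timeout and explicit failure reporting; the idle detection, polling, and retries described above are left out. The function name and the reviewer callables are hypothetical.

```python
import asyncio


async def run_reviewers(reviewers: dict, timeout_s: float):
    """Run reviewer coroutines concurrently; report findings and failures separately."""

    async def guarded(name, fn):
        try:
            return name, await asyncio.wait_for(fn(), timeout_s), None
        except asyncio.TimeoutError:
            return name, [], "timeout"
        except Exception as exc:
            return name, [], f"error: {exc}"

    results = await asyncio.gather(*(guarded(n, fn) for n, fn in reviewers.items()))
    findings = {name: found for name, found, err in results if err is None}
    failed = {name: err for name, _, err in results if err is not None}
    return findings, failed
```

Returning `failed` alongside `findings` is the point: the final review can say which specialists did not finish instead of silently posting partial results.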

This is the less glamorous layer of agent work, and it is the layer that decides whether the tool is trusted.

If a reviewer silently hangs for ten minutes, developers stop waiting for it. If the system posts partial findings without saying which reviewers failed, the output becomes hard to trust. If rate limits cause random job failures, teams will disable the reviewer when they are under release pressure. If every failure retries aggressively, the system can stampede an already struggling provider.

Cloudflare addresses this with circuit breakers and failback chains. A retryable model error can open a circuit for that model tier. After a cooldown, the system allows a probe request to see whether the provider recovered. When a model is unhealthy, the system walks a same-family failback chain instead of blindly switching to a completely different model profile.
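Here is a minimal sketch of a circuit breaker with a cooldown and a single probe, plus a walk along a failback chain. The class, the tier names in the usage note, and the one-probe-at-a-time rule are assumptions; the write-up does not publish this code.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Closed by default; opens on failure; allows one probe after the cooldown."""

    def __init__(self, cooldown_s: float, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at: Optional[float] = None
        self.probing = False

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: model is considered healthy
        if self.probing:
            return False  # a probe is already in flight
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.probing = True
            return True  # cooldown elapsed: let one probe through
        return False

    def record_success(self) -> None:
        self.opened_at = None
        self.probing = False

    def record_failure(self) -> None:
        self.opened_at = self.clock()
        self.probing = False


def pick_model(chain: list[str], breakers: dict[str, CircuitBreaker]) -> Optional[str]:
    """Return the first model in the failback chain whose circuit allows a request."""
    for model in chain:
        if breakers[model].allow():
            return model
    return None
```

With a chain such as `["large", "medium"]` (hypothetical tiers in one model family), a failure on `large` sends traffic to `medium` until the cooldown passes. Then exactly one request probes `large`, and a success closes the circuit again.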

The error classifier matters too. A retryable API error is different from bad credentials, context overflow, a user abort, or malformed structured output. Only some failures should trigger model failback. Others need to fail clearly because a different model will not fix the underlying problem.
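The classification itself can be as plain as a lookup table. The error kinds and the mapping below are assumptions made for illustration; the write-up names the categories but does not spell out exactly how each one is handled.

```python
from enum import Enum


class Action(Enum):
    FAILBACK = "failback"  # try the next model in the chain
    FAIL = "fail"          # surface the error; another model will not help
    STOP = "stop"          # the user cancelled; do nothing further


# Hypothetical error kinds and handling, not Cloudflare's actual table.
_ACTIONS = {
    "retryable_api_error": Action.FAILBACK,
    "bad_credentials": Action.FAIL,
    "context_overflow": Action.FAIL,
    "malformed_output": Action.FAIL,
    "user_abort": Action.STOP,
}


def classify(kind: str) -> Action:
    """Map an error kind to an action; unknown kinds fail loudly, not silently."""
    return _ACTIONS.get(kind, Action.FAIL)
```

Defaulting unknown errors to `FAIL` is a deliberate choice in this sketch: an unrecognized failure should be visible rather than quietly retried against another model.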

This is exactly the kind of detail that separates a demo from infrastructure.

The Reviewer Remembers Prior Reviews

Re-review behavior is one of the strongest parts of the design.

Most automated review systems are annoying because they have no memory. A developer pushes a fix, and the bot comments again as if it has never seen the merge request before. Or the developer resolves a thread, and the bot reopens it without understanding why. Humans quickly learn to treat the tool as a stateless nag.

Cloudflare’s system gives the coordinator its previous review comment and the list of inline comments it posted, including resolution status. The rules are explicit:

fixed findings should disappear
unfixed findings should be emitted again so the thread remains alive
user-resolved findings should generally stay resolved unless the issue materially worsened
author replies such as “won’t fix” or “acknowledged” can be treated as resolution
disagreement should be read and evaluated, not blindly dismissed
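The first four rules can be sketched as a small decision function. In the system described, these are instructions to the coordinator agent, so treat this as a deterministic approximation. The `PriorComment` type, the finding IDs, and the `worsened_ids` input are hypothetical, and the fifth rule (evaluating disagreement) needs judgment that this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class PriorComment:
    finding_id: str
    resolved_by_user: bool = False
    author_acknowledged: bool = False  # e.g. replied "won't fix" or "acknowledged"


def plan_rereview(
    prior: list[PriorComment], current_ids: set[str], worsened_ids: set[str]
) -> dict[str, str]:
    """Decide what to do with each previously posted finding on a new push."""
    plan = {}
    for comment in prior:
        if comment.finding_id not in current_ids:
            plan[comment.finding_id] = "drop"  # fixed: let it disappear
        elif comment.finding_id in worsened_ids:
            plan[comment.finding_id] = "re-emit"  # materially worse: raise it again
        elif comment.resolved_by_user or comment.author_acknowledged:
            plan[comment.finding_id] = "keep-resolved"  # respect the human decision
        else:
            plan[comment.finding_id] = "re-emit"  # still unfixed: keep the thread alive
    return plan
```

The ordering of the checks carries the policy: a human resolution wins over re-emitting, unless the issue has materially worsened.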

That makes the reviewer part of the conversation instead of a fresh bot run on every push.

The principle is broader than code review. Any AI tool that sits in a workflow needs continuity. It must know what it already said, what the human did with that feedback, and what changed since the last run. Without that loop, the tool cannot distinguish a new problem from an already-handled one.

AGENTS.md Review Is a Clever Bit of Self-Maintenance

One specialist reviewer checks whether AGENTS.md should change.

That may sound meta, but it is practical. AI coding agents rely on repository instructions. Those instructions age quickly. A team changes test frameworks, package managers, build systems, directory layout, environment variables, or deployment flow, and the agent’s instructions keep describing the old world. The next agent then wastes time running the wrong commands or following obsolete conventions.

Cloudflare’s AGENTS.md reviewer classifies merge requests by materiality. Package manager changes, test framework changes, build tooling changes, major restructures, new required environment variables, and CI changes are high-signal reasons to update instructions. Smaller dependency bumps, API client changes, state management changes, and linting changes may also matter. Routine bug fixes and small CSS changes usually do not.
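As a rough illustration, the materiality tiers above can be written as two sets and a lookup. The category labels are invented names for the change types listed in the paragraph, and how a merge request gets tagged with them is out of scope here.

```python
# Hypothetical labels for the change types described above.
HIGH_SIGNAL = {
    "package_manager", "test_framework", "build_tooling",
    "major_restructure", "required_env_var", "ci",
}
MEDIUM_SIGNAL = {"dependency_bump", "api_client", "state_management", "linting"}


def agents_md_signal(change_kinds: set[str]) -> str:
    """Return how strongly a change suggests that AGENTS.md needs an update."""
    if change_kinds & HIGH_SIGNAL:
        return "high"
    if change_kinds & MEDIUM_SIGNAL:
        return "medium"
    return "low"  # e.g. routine bug fixes or small CSS changes
```

A merge request tagged `{"linting", "ci"}` comes out as `high`, because the strongest signal present decides the result.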

It also discourages bad instruction files: generic filler, bloated files, and tool references without runnable commands. That is the right pressure. Instructions for agents should be short, concrete, and operational. “Write clean code” wastes context. “Run npm test from this directory” is useful.

This is a sign that Cloudflare is treating AI review as an ecosystem, not a single product. The reviewer improves the context that future reviewers and coding agents will consume.

Risk Tiers Keep Cost Under Control

The system does not run the maximum review on every change.

That would be expensive and slow. A typo fix does not need seven specialist agents and frontier models. A sensitive authentication refactor probably does. Cloudflare uses risk tiers so lightweight changes get lightweight review and high-risk changes get fuller orchestration.

The reported cost breakdown shows why this matters. Trivial reviews averaged $0.20. Lite reviews averaged $0.67. Full reviews averaged $1.68. The P99 full review was just over $5. That is still real money at Cloudflare’s volume, but it is a manageable shape because the system is not treating every merge request as equally risky.
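A tier router does not need to be sophisticated to be useful. The sketch below uses the three tier names and average costs reported above; the line-count thresholds and the sensitive-path hints are invented for illustration and are not Cloudflare's rules.

```python
# Average cost per tier as reported in the write-up.
REPORTED_AVG_COST_USD = {"trivial": 0.20, "lite": 0.67, "full": 1.68}

# Hypothetical hints that a path touches sensitive code.
SENSITIVE_HINTS = ("auth", "crypto", "secrets")


def pick_tier(changed_lines: int, paths: list[str]) -> str:
    """Pick a review tier from diff size and touched paths (illustrative thresholds)."""
    if any(hint in path.lower() for path in paths for hint in SENSITIVE_HINTS):
        return "full"  # sensitive areas always get the full review
    if changed_lines <= 10:
        return "trivial"
    if changed_lines <= 200:
        return "lite"
    return "full"
```

A two-line README fix routes to `trivial`, while a two-line change under `src/auth/` routes to `full`. Looking up `REPORTED_AVG_COST_USD[pick_tier(...)]` gives a rough expected cost before any model is called.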

The token numbers reinforce the same point. Over the measured month, the system processed about 120 billion tokens, with a high cache hit rate. Prompt caching, shared context, stable base prompts, and repeated review structure all matter when the same type of review runs thousands of times a day.

This is another place where production AI work differs from one-off AI use. Cost optimization is not just “use a cheaper model.” It is routing. It is caching. It is stable prompts. It is only launching expensive reviewers when the diff justifies them.

Low Noise Is a Product Feature

Cloudflare reported 159,103 findings across 131,246 review runs, or about 1.2 findings per review. That is deliberately low.

This is probably the most important product choice in the whole system. A code reviewer that comments too much is worse than a reviewer that misses some minor issues. Developers can tolerate a tool that occasionally misses something. They will not tolerate a tool that constantly interrupts them with low-value criticism.

The system’s prompts include “what not to flag” sections. That is a good pattern. Many AI review prompts focus only on what to find: bugs, vulnerabilities, edge cases, missing tests, performance issues. The equally important half is what to ignore: subjective style differences, already-handled errors, speculative rewrites, generic best practices, and comments that do not change the merge decision.

The reviewer’s job is not to prove it read the diff. The job is to improve the merge.

That means every comment should pass a practical test: would a competent human reviewer be willing to block or delay the change over this? If not, the AI should usually stay quiet.

It Still Does Not Replace Human Review

Cloudflare is clear about the limits.

The reviewer can inspect diffs and nearby code, but it does not fully understand the history of every architectural decision. It can notice an API contract change, but it may not verify every downstream consumer. It can flag suspicious concurrency patterns, but subtle timing bugs remain hard. Very large refactors are expensive and can exceed the practical context budget.

Those limits are not embarrassing. They are the shape of the tool.

AI review is strongest as a fast first pass: catching obvious bugs, enforcing known standards, looking for security footguns, checking documentation impact, spotting missing instruction updates, and giving humans a cleaner starting point. It is weaker at deciding whether a design belongs in the system, whether a product tradeoff is acceptable, or whether a change aligns with long-term architecture.

The right mental model is not “replace reviewers.” It is “move repeatable review work earlier and make human review focus on judgment.”

That is still valuable. A median first-review wait measured in hours can slow teams down even when the eventual human review is good. A three-minute automated pass that catches real issues before a human opens the diff can shorten the loop. It can also save human attention for the parts that need human context.

The Pattern Other Teams Should Copy

Most teams should not copy Cloudflare’s exact system. They do not have Cloudflare’s repository count, internal platform, model routing needs, or review volume.

But the pattern is worth copying:

start with low-noise review, not maximum coverage
split review into specialist concerns instead of one giant prompt
use a coordinator to deduplicate and judge severity
make review incremental across pushes
respect human dismissals and author explanations
route by risk so small changes stay cheap
classify failures instead of retrying blindly
keep agent instructions current
measure cost, duration, break-glass (override) rate, and finding volume

The shortest version is this: AI code review needs an operating model.

Without one, it becomes another bot that leaves vague comments and annoys developers. With one, it becomes a useful layer in the delivery pipeline: fast, mostly quiet, measurable, and good at the repeatable parts of review.

Cloudflare’s write-up is useful because it shows the boring structure around the shiny part. The model reads the code, but the system decides when to ask, what to ask, how many agents to launch, when to stop, what to post, what to suppress, when to retry, and how to remember the conversation.

That is the future of serious AI developer tooling. Not one chatbot in the sidebar. A set of bounded agents wired into the workflow with the same care we give any other production system.