Agents That Run While You Sleep: The Verification Layer Autonomous Coding Needs


The Real Problem Is Not Code Generation

Most teams now have tools that can generate code quickly. The throughput problem is mostly solved.

The blocking problem is verification confidence:

  • Did the agent implement the exact requirement, not a nearby interpretation?
  • Did the UI still work in the browser across realistic flows?
  • Did regressions sneak in while the agent optimized for one path?
  • Can you trust an unattended run enough to wake up and merge?

Without a verification layer, “overnight coding” creates morning cleanup. With a verification layer, it can create morning momentum.

The Pattern: Separate Builder and Judge

The core design pattern in the original workflow is simple and powerful:

  1. One agent builds the feature.
  2. A separate loop validates behavior against explicit acceptance criteria.
  3. Promotion depends on pass/fail evidence, not on the builder’s self-assessment.

That separation is what turns an assistant into an operational system. The builder can move fast. The judge can stay strict.
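
The builder/judge split can be sketched in a few lines; the `VerdictItem` shape and `promote` rule here are illustrative, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class VerdictItem:
    ac_id: str      # which acceptance criterion this verdict covers
    passed: bool
    reason: str

def promote(verdicts: list[VerdictItem]) -> bool:
    # Promotion is derived from pass/fail evidence only;
    # the builder never gets a vote on its own output.
    return all(v.passed for v in verdicts)

# The builder produces a change; a separate judge loop produces verdicts.
verdicts = [
    VerdictItem("AC1", True, "dashboard rendered under budget"),
    VerdictItem("AC2", False, "console error on /dashboard"),
]
promote(verdicts)  # a single failing AC blocks promotion
```

The key property is that `promote` is a pure function of the judge's evidence, so the same verdicts always produce the same decision.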

Why Spec-First Inputs Matter

The workflow starts with a requirements document, not an open-ended prompt.

A good spec for agent execution has three traits:

  1. It defines behavior in observable terms.
  2. It encodes constraints (performance, security, UX boundaries).
  3. It includes acceptance criteria that can be tested from outside-in.

If a requirement cannot be expressed as a testable condition, the verifier cannot enforce it. At that point, your agent loop is back to “looks good to me,” which does not scale.

Acceptance Criteria as an Execution Contract

A useful acceptance criterion is concrete enough that two independent agents would evaluate it the same way.

Weak AC:

  • “The dashboard should feel fast and easy to use.”

Strong AC:

  • “Given a signed-in user with 200 records, opening /dashboard shows primary metrics in under 2 seconds and renders the table with no JavaScript console errors.”

When ACs are written this way, they become a machine-checkable contract:

  • the builder knows the target,
  • the verifier knows the test,
  • reviewers get evidence instead of prose.
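
One way to make the strong AC above machine-checkable is to encode it as structured data rather than prose; every field name here is an assumption about how you might model it:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AcceptanceCriterion:
    ac_id: str
    route: str               # page under test
    precondition: str        # e.g. "signed-in user with 200 records"
    max_load_seconds: float  # performance budget
    forbid_console_errors: bool

# The strong AC from the text, as a contract both agents can read.
AC1 = AcceptanceCriterion(
    ac_id="AC1",
    route="/dashboard",
    precondition="signed-in user with 200 records",
    max_load_seconds=2.0,
    forbid_console_errors=True,
)
```

Two independent agents reading `AC1` have no room to interpret "fast" differently: the budget is a number and the error policy is a boolean.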

Parallel Verification Is the Throughput Multiplier

The post’s most practical contribution is running one browser-checking agent per acceptance criterion.

Instead of a serial test chain like:

  • AC1
  • then AC2
  • then AC3

you run AC checks concurrently and collapse total validation time. This is where unattended workflows become useful in real teams: verification stops being the bottleneck.

A minimal architecture looks like this:

  1. Parse spec and extract testable ACs.
  2. Generate one execution plan per AC.
  3. Launch parallel browser agents against local/staging app.
  4. Collect artifacts (screenshots, traces, video, logs).
  5. Run a deterministic judging pass.
  6. Emit pass/fail report with exact failing criteria.

This is effectively the same model used by industrial CI systems, but adapted for agent-driven inner loops.
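
The fan-out in steps 2 and 3 can be sketched with a thread pool, one worker per AC. The `run_ac_check` body is a stand-in for a real browser agent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_ac_check(ac_id: str) -> dict:
    # Stand-in for launching a browser agent against local/staging;
    # a real worker would drive the browser and collect artifacts.
    return {"ac_id": ac_id, "passed": True, "artifacts": []}

def verify_in_parallel(ac_ids: list) -> list:
    # One concurrent check per acceptance criterion: total wall time
    # approaches the slowest single check, not the sum of all checks.
    with ThreadPoolExecutor(max_workers=len(ac_ids)) as pool:
        return list(pool.map(run_ac_check, ac_ids))

results = verify_in_parallel(["AC1", "AC2", "AC3"])
```

Because each worker is independent, a flaky or slow criterion delays only its own result, not the whole run.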

Browser-Level Evidence Beats “I Think It Works”

The workflow uses Playwright-backed automation to validate user-visible behavior.

That choice matters:

  • Unit tests prove local logic.
  • Integration tests prove system wiring.
  • Browser automation proves user outcomes.

Autonomous builders frequently produce code that compiles and passes unit tests but still fails real user paths. Browser evidence closes that gap.

From an operations perspective, artifacts are non-negotiable. Every failed AC should provide:

  • screenshot at failure point,
  • execution trace,
  • optional session recording,
  • concise reason string tied to a criterion ID.

This turns failure triage from “reproduce first” into “fix directly.”
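
A failed AC can carry all of its evidence as one record; the shape and paths below are assumptions for illustration, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureArtifacts:
    ac_id: str
    reason: str                        # concise reason tied to the criterion ID
    screenshot_path: str               # screenshot at the failure point
    trace_path: str                    # execution trace
    video_path: Optional[str] = None   # optional session recording

failure = FailureArtifacts(
    ac_id="AC2",
    reason="console error: TypeError on /dashboard",
    screenshot_path="runs/run-001/AC2/failure.png",
    trace_path="runs/run-001/AC2/trace.zip",
)
```

With a record like this attached to every failure, triage starts from the evidence instead of from a reproduction attempt.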

Headless Agent Execution in CI-Like Loops

A major enabler is headless invocation (claude -p) for deterministic, non-interactive runs. That gives you scriptable orchestration:

  • prompt in,
  • bounded tool execution,
  • structured output out.

In practice, you should add hard guardrails around headless runs:

  1. Explicit max execution budget (time + tokens).
  2. Allowed command/tool boundaries.
  3. Clean workspace bootstrap per run.
  4. Stable output schema for downstream parsing.

If you skip these controls, you trade away predictability and make failures harder to debug.
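
Guardrails 1 and 4 can be enforced at the invocation layer. This sketch only builds the command and applies a hard timeout; the `-p` flag is Claude Code's print mode, while the output-format flag and the wrapper itself are assumptions about your setup:

```python
import subprocess

def build_command(prompt: str) -> list:
    # Non-interactive print mode plus a machine-readable output format
    # keeps downstream parsing on a stable schema (guardrail 4).
    return ["claude", "-p", prompt, "--output-format", "json"]

def run_headless(prompt: str, timeout_s: int = 600):
    # The subprocess timeout enforces a hard time budget (guardrail 1);
    # token budgets and tool allowlists live in the agent's own config.
    return subprocess.run(
        build_command(prompt),
        capture_output=True, text=True, timeout=timeout_s,
    )
```

If the run exceeds its budget, `subprocess.run` raises `TimeoutExpired`, which the orchestrator can classify as an execution failure rather than a product failure.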

The Missing Piece Most Teams Ignore: Pre-Flight Checks

Before launching expensive parallel verification, run a pre-flight stage:

  • app boots successfully,
  • required env vars are present,
  • test accounts/data exist,
  • target routes load,
  • critical API mocks/services are reachable.

Pre-flight failures should terminate early with actionable diagnostics. This saves run time and avoids noisy false negatives across every AC worker.

Failure Taxonomy You Should Adopt

To keep overnight runs trustworthy, categorize failures by class:

  1. Spec ambiguity: AC cannot be objectively tested.
  2. Environment issue: server/data/auth preconditions failed.
  3. Execution issue: tool/browser timeout, flaky selector, infra hiccup.
  4. Product failure: implemented behavior does not satisfy AC.

Each class has different ownership:

  • product/spec owner fixes ambiguity,
  • platform/devex fixes environment,
  • automation owner fixes execution,
  • feature owner fixes product behavior.

This prevents one team from drowning in every incident.

How to Use a Judge Without Letting It Drift

A judge agent can be useful, but only if it remains constrained.

Good design:

  • Judge reads fixed artifacts.
  • Judge maps findings to AC IDs.
  • Judge outputs structured pass/fail with a short rationale.
  • Final status is derived from deterministic rules.

Bad design:

  • Judge is free-form and reinterprets requirements each run.
  • Judge can waive failures without explicit policy.
  • Judge output is unstructured prose.

If the judge becomes creative, your pipeline becomes non-repeatable.
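
Constraining the judge means its output is data and the final status is computed by fixed rules. The findings schema below is an assumption; the point is that the derivation is deterministic:

```python
def final_status(judge_findings: list) -> str:
    # The judge supplies structured findings keyed by AC ID;
    # it cannot waive a failure by itself.
    for f in judge_findings:
        if not isinstance(f.get("passed"), bool):
            return "invalid"  # unstructured output is rejected outright
    return "pass" if all(f["passed"] for f in judge_findings) else "fail"

final_status([
    {"ac_id": "AC1", "passed": True, "rationale": "metrics under budget"},
    {"ac_id": "AC2", "passed": False, "rationale": "console error"},
])  # returns "fail"
```

Run the same artifacts through this twice and you get the same answer twice, which is exactly the repeatability a free-form judge loses.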

Security Boundaries for Unattended Agents

If your agents run while nobody is watching, security posture matters more than prompt quality.

Baseline controls:

  1. Use least-privilege tokens scoped to the run purpose.
  2. Block secret exfiltration paths in logs/artifacts.
  3. Pin dependencies and isolate runtime per run.
  4. Enforce branch protection and signed provenance for merges.
  5. Require human approval for high-risk file regions (auth, billing, infra, security controls).

Autonomy should increase productivity, not blast radius.
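
Control 5 can be enforced mechanically with a path-based gate; the risk prefixes below are illustrative, not a recommended canonical list:

```python
# High-risk file regions that always require a human merge gate.
HIGH_RISK_PREFIXES = ("auth/", "billing/", "infra/", "security/")

def requires_human_approval(changed_files) -> bool:
    # Any touch to a high-risk region forces human review,
    # regardless of how many ACs passed overnight.
    return any(f.startswith(HIGH_RISK_PREFIXES) for f in changed_files)
```

A gate like this runs after verification, so a green run against `auth/` still stops at a human instead of auto-merging.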

Rollout Strategy That Actually Works

Do not start with full repo autonomy. Start with a narrow lane:

  1. One service or UI slice.
  2. 3-5 high-quality acceptance criteria.
  3. Readable artifacts and deterministic reporting.
  4. Human merge gate still required.

Then expand:

  • increase AC coverage,
  • reduce flaky checks,
  • tighten prompt/spec templates,
  • automate low-risk merge paths only after stability data.

This staged approach avoids the “big bang autonomous rewrite” trap.

Reference Implementation Stack

A practical stack based on the ecosystem around the original post:

  • Builder: Claude Code in scripted/headless mode.
  • Verifier orchestration: spec interpreter + planner + parallel AC runners.
  • Browser execution: Playwright MCP server.
  • Result packaging: AC-indexed JSON + human-readable markdown report.
  • Storage: per-run artifacts under deterministic folder structure.
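
An AC-indexed JSON report might look like the sketch below; this is an assumed shape, not any project's actual schema:

```python
import json

# Results keyed by AC ID so a reader (human or machine) can jump
# straight from a failing criterion to its evidence.
report = {
    "run_id": "run-001",
    "results": {
        "AC1": {"passed": True, "artifacts": ["runs/run-001/AC1/trace.zip"]},
        "AC2": {"passed": False, "reason": "console error on /dashboard"},
    },
}
print(json.dumps(report, indent=2))
```

The human-readable markdown report can then be generated from this same structure, so the two views never drift apart.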

The open verify project from Opslane illustrates this pattern clearly with a spec interpreter, planner, one-agent-per-criterion execution, and a judge/report phase. You can adopt the architecture even if your exact toolchain differs.

What This Means for Engineering Teams in 2026

The HN debate is often framed as “Will agents replace developers?” That is the wrong operational question.

The right question is:

  • Can your team define behavior precisely,
  • verify that behavior automatically,
  • and ship with measurable confidence?

Teams that can do this will safely run more autonomous work. Teams that cannot will keep using agents as fancy autocomplete.

The differentiator is not model IQ. It is verification discipline.

Closing

“Agents that run while I sleep” resonated because it captures a transition many teams are currently making: from assisted coding sessions to managed autonomous delivery loops.

The winning architecture is not magical. It is familiar software engineering discipline applied to AI execution:

  • spec-first requirements,
  • explicit acceptance criteria,
  • independent verification,
  • artifact-backed judgment,
  • deterministic promotion gates.

When those pieces are in place, overnight runs stop being a gamble and start being a force multiplier.
