Engineering Delivery · Leadership Briefing

AI Harness Impact on Delivery Velocity

How a 4-engineer team using an AI coding harness compares to an equivalent team without one — measured against the git commit history and grounded in published industry benchmarks.

Project: KMO Offerte Tool  ·  Window: 2026-05-04 → 2026-06-28 (8 weeks)  ·  Core team: 4 engineers  ·  Sources: git log + Azure DevOps (cross-validated)
Across the 8-week window the team merged 427 pull requests (counted from git history and cross-checked against Azure DevOps), peaking at ~83 per week and averaging ~360 lines of real, reviewable code each. At full velocity that is roughly 5–6× the merged throughput of a healthy 4-engineer team without AI tooling. Over the whole window, ramp-up weeks included, it is equivalent to compressing ~7 months of conventional delivery into 8 weeks, with automated tests and review gating every change.

427  ·  PRs merged in 8 weeks (git 427 ≈ ADO 425)
~83  ·  PRs / week at peak velocity
5–6×  ·  vs. a non-AI 4-engineer team
~360  ·  lines of app/test code per PR (healthy size)
~7 mo  ·  conventional delivery compressed into 8 wks
349  ·  substantive feat + fix commits

Merged PRs per week

Counted from git "Merged PR" commits on main. The two empty weeks (starting May 11 and May 18) are the ramp-up period before the team reached full velocity.
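A minimal sketch of how the weekly counts could be reproduced, assuming the merge-commit dates are exported one per line with something like `git log --grep "Merged PR" --date=short --pretty=format:%ad` on main. The function name and the sample dates are hypothetical and not taken from the project.

```python
from collections import Counter
from datetime import date, timedelta


def merged_prs_per_week(iso_dates):
    """Count commits per calendar week, keyed by the Monday that starts it."""
    counts = Counter()
    for text in iso_dates:
        day = date.fromisoformat(text.strip())
        monday = day - timedelta(days=day.weekday())  # weekday(): Monday == 0
        counts[monday] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample: three merges in the week of 2026-05-04,
# one merge in the week of 2026-05-25.
sample = ["2026-05-04", "2026-05-06", "2026-05-10", "2026-05-27"]
weekly = merged_prs_per_week(sample)
```

Weeks with no merges do not appear in the result, so a chart built from it would need to fill those weeks with zero explicitly.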

Throughput vs. industry baselines

Merged PRs per engineer per week. Baselines: DX (51k devs) & LinearB (8.1M PRs).

Time compression — cumulative merged PRs

Our team reached 427 merged PRs in 8 weeks. A healthy non-AI 4-engineer team (median ~14 merged PRs/week, i.e. 3.5 per engineer) would need ~30 weeks (~7 months) to reach the same total.
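The arithmetic behind the ~30-week figure can be sketched as follows. All inputs are the numbers quoted in this briefing; the variable names are illustrative only.

```python
TOTAL_MERGED_PRS = 427            # merged PRs in the 8-week window
BASELINE_PRS_PER_ENG_WEEK = 3.5   # cited industry median
TEAM_SIZE = 4
WINDOW_WEEKS = 8

baseline_team_rate = BASELINE_PRS_PER_ENG_WEEK * TEAM_SIZE  # 14 PRs/week
weeks_needed = TOTAL_MERGED_PRS / baseline_team_rate        # 30.5 weeks
months_needed = weeks_needed / (52 / 12)                    # about 7 months
whole_window_compression = weeks_needed / WINDOW_WEEKS      # about 3.8x
```

Note that the whole-window compression (about 3.8×) is lower than the 5–6× full-velocity multiple, because it averages in the two ramp-up weeks.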

Where the lines of code actually go

95% of raw LOC is generated or vendored scaffolding (knowledge graph, vendored ESB/SOAP schemas, planning docs). Only 4.8% is hand-written app code, which is why we lead with PRs, not LOC.

Nature of the work — commit types

Conventional-commit classification of authored commits. The 349 feat + fix commits are real product engineering.
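A minimal sketch of one way such a classification could be done, assuming commit subjects are exported with `git log --no-merges --pretty=format:%s` and follow the Conventional Commits subject shape (`type(scope)!: description`). The regex, the function name, and the sample subjects are illustrative assumptions, not the team's actual tooling.

```python
import re
from collections import Counter

# Conventional Commits subject: type, optional (scope), optional "!", then ": ".
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?: ")


def classify(subjects):
    """Tally commit subjects by conventional-commit type; non-matching ones go to 'other'."""
    counts = Counter()
    for subject in subjects:
        match = SUBJECT_RE.match(subject)
        counts[match.group("type") if match else "other"] += 1
    return counts


# Hypothetical subjects, for illustration only.
sample = [
    "feat(quote): add discount line",
    "fix: handle empty VAT number",
    "refactor(api)!: drop legacy endpoint",
    "Update README",
]
by_type = classify(sample)
```

Subjects that do not follow the convention land in `other`, so that bucket's size is a quick check on how far the classification can be trusted.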

PRs by engineer

Non-renovate PRs authored (Azure DevOps). One lead carries the majority; all four contribute substantively.

The comparison, in numbers

Our full-velocity rate: ~20.7 merged PRs / engineer / week (the ~83 PRs/week peak divided by 4 engineers).

Non-AI baseline (cited)         PRs/eng/wk   Our multiple
Industry median (tech)          3.5          5.9×
Top-quartile team (P75)         4.3          4.8×
Top-decile / best case (P90)    5.0          4.1×
Conservative claim                           ~5×
Central estimate                             ~6×

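The multiples in the table follow directly from dividing the full-velocity rate by each cited baseline. A short sketch, using only figures from this briefing (the variable names are illustrative):

```python
OUR_RATE = 20.7  # merged PRs / engineer / week at full velocity

BASELINES = {
    "Industry median (tech)": 3.5,
    "Top-quartile team (P75)": 4.3,
    "Top-decile / best case (P90)": 5.0,
}

# Multiple = our rate divided by the baseline rate, rounded to one decimal.
multiples = {name: round(OUR_RATE / rate, 1) for name, rate in BASELINES.items()}
```

Because `OUR_RATE` is the peak-week rate, these multiples describe full velocity rather than the 8-week average.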
Why this number is trustworthy
Refuted objection

"It's inflated by trivial micro-PRs"

The average merged PR carries ~363 lines of real application/test code — dead-center in the recommended 200–400 line band. These are normal, reviewable PRs.

Cross-validated

"You're double-counting"

git log (427 "Merged PR" commits) and Azure DevOps (425 completed PRs) agree to within 0.5% — two fully independent sources.
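The agreement figure can be checked with a one-line relative difference. This is a sketch using the two counts quoted above; the tolerance constant is simply the 0.5% stated in the text.

```python
GIT_MERGED_PRS = 427   # "Merged PR" commits in git log
ADO_COMPLETED_PRS = 425  # completed PRs in Azure DevOps
TOLERANCE = 0.005      # 0.5%

# Relative gap between the two counts, measured against the git count.
relative_gap = abs(GIT_MERGED_PRS - ADO_COMPLETED_PRS) / GIT_MERGED_PRS
sources_agree = relative_gap <= TOLERANCE
```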

Real engineering

"It's all AI filler"

155 feat + 194 fix + 77 refactor + 56 test commits. 65k lines of hand-written tests gate the 90k lines of app code.
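These line counts are consistent with the ~363-line average PR size quoted earlier, under the assumption (not stated explicitly in the briefing) that the average is computed as test plus app lines divided by merged PRs:

```python
TEST_LINES = 65_000   # hand-written test lines
APP_LINES = 90_000    # app code lines
MERGED_PRS = 427

# Assumed definition: average PR size = (test + app lines) / merged PRs.
avg_lines_per_pr = (TEST_LINES + APP_LINES) / MERGED_PRS

# Recommended PR-size band cited in the briefing: 200 to 400 lines.
in_recommended_band = 200 <= avg_lines_per_pr <= 400
```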

Honest metric

"LOC is being gamed"

We don't use LOC as a productivity figure. 95% of raw LOC is openly disclosed as generated/vendored scaffolding produced by the harness for free.

Honest counter-evidence (for credibility)
Context

METR 2025: AI can slow experts down

A pre-registered RCT found experienced devs on large legacy codebases were 19% slower with agentic AI. Our opposite result fits the regime AI wins in: a greenfield build, no 1M-line legacy to fight.

Guardrails

DORA 2024: speed can cost stability

DORA 2024 estimates a 7.2% drop in delivery stability for every 25% increase in AI adoption. Our harness-enforced tests, mandatory review, and ~7% iterate-and-discard rate are what keep speed stable.

Caveat

Velocity ≠ value

PRs measure output, not outcome. The real proof is the 2026-10-26 go-live landing with quality. Recommend tracking change-failure-rate and escaped defects alongside throughput.

Scope

One team, one project

This is a single 4-engineer greenfield project. Gains may differ on legacy maintenance or larger teams. Treat as a strong directional signal, not a universal multiplier.

Sources & methodology. PR counts: git log --grep "Merged PR" on main, cross-checked against the Azure DevOps PR API (renovate-bot PRs excluded). Churn classification: git log --numstat bucketed by path (app vs. generated vs. vendored vs. planning). Benchmarks — DX Core 4 Benchmark 2024 (51k developers); LinearB 2025 Engineering Benchmarks (8.1M PRs, 4,800 teams); DORA 2024 State of DevOps; METR 2025 RCT; PR-size norm 200–400 lines (SmartBear / Google / LinearB).
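A minimal sketch of the churn bucketing described above, assuming `git log --numstat` output where each file line is `added<TAB>deleted<TAB>path`. The path prefixes are hypothetical placeholders: the briefing names the buckets (app, generated, vendored, planning) but not the directories behind them.

```python
from collections import Counter

# Hypothetical path rules; replace with the repository's real layout.
BUCKET_PREFIXES = [
    ("generated", ("knowledge-graph/",)),
    ("vendored", ("vendor/",)),
    ("planning", ("docs/planning/",)),
]


def bucket_for(path):
    """Return the first bucket whose prefix matches; everything else counts as app."""
    for bucket, prefixes in BUCKET_PREFIXES:
        if path.startswith(prefixes):
            return bucket
    return "app"


def churn_by_bucket(numstat_text):
    """Sum added + deleted lines per bucket from `git log --numstat` output."""
    totals = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers and blank lines
        added, deleted, path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files report "-" instead of line counts
        totals[bucket_for(path)] += int(added) + int(deleted)
    return totals


# Hypothetical sample output, for illustration only.
sample = (
    "12\t3\tsrc/app/quote.py\n"
    "4000\t0\tvendor/esb/schema.xsd\n"
    "-\t-\tassets/logo.png\n"
    "250\t10\tknowledge-graph/nodes.json\n"
)
totals = churn_by_bucket(sample)
app_share = totals["app"] / sum(totals.values())
```

Renamed files appear in numstat output with `=>` in the path, so a production version would need to normalize those before matching prefixes.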