The pattern
You've seen this play out, or you've lived it. A small team — sometimes just one ambitious engineer — builds an AI prototype over a weekend. It's slick. It's working on a real-ish dataset. The Loom video circulates. Leadership says "ship this." The team says "give us a month." Six months later, nothing has shipped.
This isn't a story about lazy teams. The engineers who build great demos are usually the best engineers in the company. The problem is structural: demos and production systems are different objects with different requirements, and treating one as a draft of the other is the bug.
Why demos demo so well
The demo runs on:
- A hand-picked happy-path input
- Fresh model weights (the latest Claude / GPT)
- A creator-curated knowledge base of a few hundred documents
- Zero authentication, zero rate-limiting, zero compliance review
- The engineer driving the demo, who knows where every rough edge is
None of that is true in production. The real input is messy. Customers ask weird things. Knowledge bases hold 50,000 documents with stale entries. Authentication, audit logs, observability, escalation paths: all need to exist. And no engineer is in the loop steering users away from the rough edges.
The 5 root causes we see most often
1. The demo was built without an eval set
The prototype works for the 5 prompts the engineer tried. The team has no idea if it works for the 50 representative cases of real production traffic. When they start testing, half the cases fail, the team falls into a "tune one, break two" loop, and momentum dies.
The fix: build the eval set before the second iteration of the prompt. Even 20 hand-graded cases are dramatically better than none.
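A small eval set does not need a framework. The sketch below is one minimal way to structure it, assuming each hand-graded case can be expressed as a prompt plus a grader function. The names EvalCase, run_evals and fake_model are hypothetical, and fake_model stands in for whatever model call your prototype actually makes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One hand-graded example: an input plus a check a human wrote for it."""
    name: str
    prompt: str
    passes: Callable[[str], bool]  # grader: does this output count as correct?


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the model and record pass/fail per case name."""
    return {case.name: case.passes(model(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 for an empty set)."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical stand-in for the real model call, so the sketch runs offline.
def fake_model(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refunds are processed within 5 business days."
    return "I'm not sure."


CASES = [
    EvalCase("refund_timing", "How long does a refund take?",
             lambda out: "5 business days" in out),
    EvalCase("password_reset", "How do I reset my password?",
             lambda out: "reset link" in out.lower()),
]

results = run_evals(fake_model, CASES)
print(results, f"pass rate: {pass_rate(results):.0%}")
```

Re-running the same cases after every prompt change is what breaks the "tune one, break two" loop: a regression shows up as a named case flipping from pass to fail.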
2. The data was clean. The real data isn't.
The demo ran on a curated sample. Production data is dirty: misformatted, inconsistent, duplicated, sometimes missing. The team didn't budget for the data-cleaning work because the demo data was already clean.
The fix: in week one, run a random sample of real production data through the pipeline. Don't clean the sample first. Find out how dirty it is.
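As a rough illustration of that week-one check, the sketch below draws an unfiltered random sample and counts two of the problems mentioned above: missing fields and duplicates. The record shape and the field names "email" and "body" are assumptions for the example, not a description of any particular system.

```python
import random


def _text(value: object) -> str:
    """Normalise a field to stripped text; None becomes an empty string."""
    return "" if value is None else str(value).strip()


def sample_records(records: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw a uniform random sample without filtering anything out first."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def profile(records: list[dict], required: tuple[str, ...]) -> dict[str, int]:
    """Count basic data-quality problems in a sample."""
    counts = {"total": len(records), "missing_field": 0, "duplicate": 0}
    seen: set[tuple[str, ...]] = set()
    for rec in records:
        values = tuple(_text(rec.get(name)) for name in required)
        # Missing: a required field is absent, None, or blank.
        if "" in values:
            counts["missing_field"] += 1
        # Duplicate: the same required-field values appeared earlier in the sample.
        key = tuple(v.lower() for v in values)
        if key in seen:
            counts["duplicate"] += 1
        seen.add(key)
    return counts


# Hypothetical tickets standing in for a pull of real production data.
TICKETS = [
    {"id": "1", "email": "a@example.com", "body": "Where is my order?"},
    {"id": "2", "email": "", "body": "Refund please"},
    {"id": "3", "email": "a@example.com", "body": "Where is my order?"},
    {"id": "4", "email": "b@example.com", "body": None},
]

report = profile(sample_records(TICKETS, k=4), required=("email", "body"))
print(report)
```

The counts are crude on purpose. The point is to get a number for "how dirty" before anyone estimates the cleaning work.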
3. The integration work scared everyone
The demo speaks to a local file. The production version needs to authenticate against your CRM, write back to your invoicing system, post to Slack, log to your observability pipeline, and respect your data-residency rules. That's easily 8 weeks of integration work on top of "the demo."
The fix: scope the integration in the brief, before the demo is built. The demo should mock the integrations but assume they'll be real.
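One way to mock the integrations while assuming they'll be real is to have the pipeline depend on an interface and hand it a recording stub during the demo. The sketch below assumes a single write-back integration. TicketSink, MockTicketSink and answer_ticket are hypothetical names, and the draft string is a placeholder for the model call.

```python
from dataclasses import dataclass, field
from typing import Protocol


class TicketSink(Protocol):
    """What the pipeline needs from the system it writes back to."""

    def post_reply(self, ticket_id: str, text: str) -> None: ...


@dataclass
class MockTicketSink:
    """Demo-time stand-in: records calls instead of hitting a real API."""
    sent: list[tuple[str, str]] = field(default_factory=list)

    def post_reply(self, ticket_id: str, text: str) -> None:
        self.sent.append((ticket_id, text))


def answer_ticket(ticket_id: str, question: str, sink: TicketSink) -> None:
    """Pipeline code depends only on the interface, never on the mock."""
    draft = f"Thanks for asking about: {question}"  # placeholder for the model call
    sink.post_reply(ticket_id, draft)


sink = MockTicketSink()
answer_ticket("T-1", "late delivery", sink)
print(sink.sent)
```

Because answer_ticket only sees the interface, swapping in a real client later means writing one new class rather than reworking the pipeline. The list of interfaces also doubles as the integration scope for the brief.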
4. Ownership was never handed over
The engineer who built the demo is now on the next thing. The product team owns the roadmap but can't maintain the prompts. The data team owns the data but doesn't understand the model. Nobody picks up the work and the project quietly stalls.
The fix: name the owner before the demo is built. "Sarah on the customer success team owns this after week 4" — and Sarah should be in the room during the demo, asking questions.
5. The success metric was vibes
"The demo went well" is not a metric. Two months in, leadership asks "is this working?" — and the team can't answer with numbers. Confidence erodes. Budget gets pulled.
The fix: define the success metric on day one, with a numeric baseline and target. "Auto-resolve 60% of tier-1 tickets with CSAT no lower than the current baseline of 4.3." That sentence is more valuable than the entire demo.
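Writing the metric down as a check makes "is this working?" answerable with numbers. The sketch below encodes the example sentence above. The SuccessMetric class and is_working function are hypothetical, and how the observed values are measured is left to your own reporting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessMetric:
    """A named threshold that an observed value either meets or misses."""
    name: str
    target: float
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


# Numbers taken from the example sentence: a 60% target and a 4.3 CSAT floor.
AUTO_RESOLVE = SuccessMetric("tier-1 auto-resolve rate", target=0.60)
CSAT_FLOOR = SuccessMetric("CSAT", target=4.3)


def is_working(auto_resolve_rate: float, csat: float) -> bool:
    """Both conditions must hold: hit the target without eroding CSAT."""
    return AUTO_RESOLVE.met(auto_resolve_rate) and CSAT_FLOOR.met(csat)


print(is_working(auto_resolve_rate=0.62, csat=4.4))
print(is_working(auto_resolve_rate=0.62, csat=4.1))
```

The second call fails even though the auto-resolve target is met, which is the purpose of pairing the target with a baseline guardrail.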
Demos answer "can we?" Production answers "should we, will we, and how will we know?" They are not the same question, and they require different work.
The shape of a demo that ships
The demos that survive contact with production share four properties:
- They run on real data from day one. Not curated. Not happy-path. The engineer ran a sample of last week's actual customer traffic through the pipeline and the demo addresses what came out, not what would have been ideal.
- They include an eval report. Not "it works", but a one-page document that says "we tested 50 cases representative of production, 38 passed, here are the 12 failures, here's our hypothesis on each." Whoever allocates internal time loves this; founders should require it.
- They name the owner. Who maintains the prompts in month two? Who escalates when the model is deprecated? If the answer is "the engineer who built it", the project will fail in month four when that engineer moves to something else.
- They include the scope of integration work. The demo shows "model in, model out". The plan shows the systems it has to talk to, the SLAs, the security review. Nobody loves this slide, but every project that ships had it.
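The eval report in the second bullet can be generated rather than hand-written. The sketch below is one possible shape, assuming pass/fail results keyed by case name and a separate dictionary of failure hypotheses. The eval_report function and the sample case names are hypothetical.

```python
def eval_report(results: dict[str, bool], hypotheses: dict[str, str]) -> str:
    """Summarise pass/fail counts and list each failure with its hypothesis."""
    failures = [name for name, ok in results.items() if not ok]
    passed = len(results) - len(failures)
    lines = [f"Tested {len(results)} cases: {passed} passed, {len(failures)} failed."]
    for name in failures:
        # A failure with no written hypothesis is flagged, not hidden.
        lines.append(f"- {name}: {hypotheses.get(name, 'no hypothesis yet')}")
    return "\n".join(lines)


# Hypothetical results and hypotheses, for illustration only.
RESULTS = {"refund_timing": True, "password_reset": False, "order_status": False}
HYPOTHESES = {"password_reset": "reset article missing from knowledge base"}

summary = eval_report(RESULTS, HYPOTHESES)
print(summary)
```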
The "90% before / 100% on demo day" rule
Most demo-day failures are presentation failures. The system is fine but something specific breaks on stage. To prevent this we run a "90% before" rule:
- Two days before the demo, the system has to work end-to-end with 90% of test cases passing.
- The day of the demo, the team has only made cosmetic and prompt changes — no architecture changes since the 90% gate.
- If the 90% gate isn't met by two days out, the demo is moved. No exceptions.
This sounds bureaucratic, but it eliminates 90% of the "the demo broke 5 minutes before the meeting" disasters.
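The gate is simple enough to write down as a function, which removes any argument on the day. The sketch below is one reading of the rule, assuming the pass rate comes from the team's own test cases. gate_decision and the dates used are hypothetical, and the "no architecture changes" condition is not modelled here.

```python
from datetime import date, timedelta

GATE_PASS_PERCENT = 90         # required pass rate, from the rule above
GATE_LEAD = timedelta(days=2)  # the gate closes two days before the demo


def gate_met(passed: int, total: int) -> bool:
    """True when at least 90% of a non-empty test set passes."""
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    return total > 0 and passed * 100 >= GATE_PASS_PERCENT * total


def gate_decision(today: date, demo_day: date, passed: int, total: int) -> str:
    """Return 'proceed', 'move demo', or 'keep working'."""
    if gate_met(passed, total):
        return "proceed"
    # Gate missed: once we are within two days of the demo, it moves.
    if today >= demo_day - GATE_LEAD:
        return "move demo"
    return "keep working"


demo_day = date(2025, 6, 12)  # hypothetical demo date
print(gate_decision(date(2025, 6, 10), demo_day, passed=45, total=50))
print(gate_decision(date(2025, 6, 10), demo_day, passed=44, total=50))
print(gate_decision(date(2025, 6, 5), demo_day, passed=44, total=50))
```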
What we do with clients
When we run our 30-day sprint, week 4 is dedicated to making the demo into a production system, not the other way around. The demo happens at the end of the sprint, when the system has already been hardened, evaluated and integrated. The demo is a milestone, not a goal.
For clients who come to us after a stuck demo, the work is usually: rebuild the eval suite, harden the integrations, name an internal owner, and rewrite the brief with a real success metric. About 60% of the time the original prototype is salvageable; the other 40% of the time we re-architect to some degree.
What this means for you, today
If you have an AI demo that's been sitting in "we'll ship it soon" land for more than two months:
- Audit it against the 5 root causes above. Which ones apply?
- Build an eval set this week. Even a small one.
- Name the production owner. Not the demo's creator — the operational owner who will maintain it.
- Write the success metric in one sentence with a number and a date.
If you do those four things, your demo has dramatically better odds of becoming a shipped system in the next 60 days.
Have an AI demo that's been stuck? Book a 30-minute audit. We'll look at where it's stuck and tell you, honestly, whether it's worth pushing through, refactoring, or starting over.