People ask us this constantly: "what should I use?" The answer is always boring — pick the thing that will let your team ship, not the thing that will look best in a tech blog post. With that caveat, here's our default stack as of Q2 2026.
The model layer
Primary: Claude Sonnet 4
Our default for everything that involves reasoning, long context, or following complex instructions. In our projects it has handled multi-step tasks better than GPT-4o, followed system prompts more reliably, and worked out cheaper at high volume. We've shipped 28 of our last 30 projects on Sonnet 4.
For deep reasoning: Claude Opus 4
Used for the 5% of calls where you need real depth — financial analysis, legal reasoning, complex code review. We don't use Opus for production user-facing flows because of latency, but we use it inside async pipelines and offline evaluation.
For dirt-cheap volume: Claude Haiku
When you need to classify, route, or do quick extraction over millions of items, Haiku costs roughly a tenth of what Sonnet does and is good enough. Use it for routing prompts to other models, summarising chunks, or doing first-pass labeling.
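To make the three-tier split concrete, here is a minimal sketch of a routing function. The `Task` shape, the task kinds, and the tier names are illustrative assumptions, not a real API; mapping a tier to an actual model ID is left to your own config.

```typescript
// Hypothetical task description; the kinds listed are illustrative only.
type Tier = "haiku" | "sonnet" | "opus";
type TaskKind =
  | "classify"
  | "route"
  | "extract"
  | "summarise"
  | "general"
  | "deep_reasoning";

interface Task {
  kind: TaskKind;
  userFacing: boolean; // true when a person is waiting on the response
}

// Pick the cheapest tier that fits the task.
function pickTier(task: Task): Tier {
  switch (task.kind) {
    // High-volume, low-complexity work goes to the cheapest tier.
    case "classify":
    case "route":
    case "extract":
    case "summarise":
      return "haiku";
    // Deep reasoning only goes to the largest model when latency is
    // acceptable, i.e. in async pipelines or offline evaluation.
    case "deep_reasoning":
      return task.userFacing ? "sonnet" : "opus";
    // Everything else uses the default tier.
    default:
      return "sonnet";
  }
}
```

Keeping the rule in one small function means the routing policy can be reviewed and changed without touching call sites.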
What we use OpenAI for
Whisper for speech-to-text (still the best), embeddings (text-embedding-3-large), and GPT-4o for tasks that specifically need image understanding inside a chat loop. We don't use GPT-4o as a primary text model anymore.
"Use Claude as your default, OpenAI for specific specialties, Google for specific specialties." That's our entire model strategy in 2026.
The orchestration layer
For agent workflows: TypeScript + Anthropic SDK
We don't use LangChain anymore. The abstractions get in the way for production work and the framework keeps reshaping itself. Plain TypeScript + the Anthropic SDK + good function-calling discipline is more code but significantly more maintainable.
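For a sense of how much code "plain TypeScript" means here, below is a rough sketch of a tool-calling loop. The block types are hand-written approximations of the Messages API's `tool_use` and `tool_result` shapes, and `callModel` is a hypothetical seam: in real code it would wrap `client.messages.create(...)` from the `@anthropic-ai/sdk` package. Treat it as an illustration under those assumptions, not a reference implementation.

```typescript
// Approximate message shapes; in real code these come from the SDK's types.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};
type AssistantBlock = TextBlock | ToolUseBlock;

type Message =
  | { role: "user"; content: string | ToolResultBlock[] }
  | { role: "assistant"; content: AssistantBlock[] };

type ModelResponse = { content: AssistantBlock[]; stop_reason: string };

// Hypothetical seam around the SDK call so the loop can run against a fake.
type CallModel = (messages: Message[]) => Promise<ModelResponse>;
type ToolHandler = (input: unknown) => Promise<string>;

async function runAgent(
  callModel: CallModel,
  tools: Map<string, ToolHandler>,
  userPrompt: string,
  maxTurns = 5, // hard cap so a confused model cannot loop forever
): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userPrompt }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await callModel(messages);
    messages.push({ role: "assistant", content: response.content });

    const toolCalls = response.content.filter(
      (b): b is ToolUseBlock => b.type === "tool_use",
    );

    // No tool calls means the model has produced its final answer.
    if (toolCalls.length === 0) {
      return response.content
        .filter((b): b is TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("");
    }

    // Run every requested tool; report failures back instead of throwing.
    const results: ToolResultBlock[] = [];
    for (const call of toolCalls) {
      const handler = tools.get(call.name);
      if (!handler) {
        results.push({
          type: "tool_result",
          tool_use_id: call.id,
          content: `Unknown tool: ${call.name}`,
          is_error: true,
        });
        continue;
      }
      try {
        const output = await handler(call.input);
        results.push({ type: "tool_result", tool_use_id: call.id, content: output });
      } catch (err) {
        results.push({
          type: "tool_result",
          tool_use_id: call.id,
          content: err instanceof Error ? err.message : String(err),
          is_error: true,
        });
      }
    }
    messages.push({ role: "user", content: results });
  }

  throw new Error(`Agent did not finish within ${maxTurns} turns`);
}
```

The "discipline" part is mostly visible in the edges: a turn cap, unknown tools handled explicitly, and tool errors returned to the model as data rather than crashing the loop.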
For non-engineering workflows: n8n
When the workflow doesn't need custom code — connect Shopify to a model to Slack — n8n self-hosted is faster to ship and easier for the client's ops team to maintain after handoff. Zapier is fine for very simple flows; n8n is the right call once there are 4+ steps.
For RAG: pgvector + custom
We've migrated away from Pinecone for almost all client work. Postgres with pgvector handles up to ~5M embeddings comfortably, it's already in the client's stack, and it has zero per-query cost. Pinecone is still right for some very specific cases (multi-tenant SaaS embeddings at scale), but it's not the default anymore.
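As an illustration of the "custom" part, here is a minimal sketch of a similarity query against pgvector. The `documents` table and its `id`, `content`, and `embedding` columns are assumptions made up for this example; `<=>` is pgvector's cosine-distance operator, and the returned `{ text, values }` object has the shape that node-postgres accepts as a query config.

```typescript
// Serialise an embedding into pgvector's text form, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0 || embedding.some((x) => !Number.isFinite(x))) {
    throw new Error("Embedding must be a non-empty array of finite numbers");
  }
  return `[${embedding.join(",")}]`;
}

// Build a parameterised nearest-neighbour query.
// Table and column names are hypothetical.
function buildSimilarityQuery(
  embedding: number[],
  limit: number,
): { text: string; values: [string, number] } {
  return {
    text:
      "SELECT id, content, embedding <=> $1::vector AS distance " +
      "FROM documents " +
      "ORDER BY embedding <=> $1::vector " +
      "LIMIT $2",
    values: [toVectorLiteral(embedding), limit],
  };
}
```

With node-postgres this would be run as `await pool.query(buildSimilarityQuery(queryEmbedding, 5))`. Passing the vector as a parameter rather than interpolating it keeps the query safe and lets Postgres reuse the plan.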
The infrastructure layer
- Hosting: Vercel for Next.js apps, Cloudflare Workers for edge stuff. Fly.io if there's GPU work.
- Database: Postgres on Supabase or Neon. Both fine. We pick based on the client's existing relationships.
- Auth: Clerk for B2B SaaS. Supabase Auth if the database is already on Supabase.
- Files: Cloudflare R2 or S3. R2 wins on cost for large objects.
- Observability: Sentry for errors, PostHog for product analytics, Datadog or BetterStack for infra.
- LLM observability: Langfuse self-hosted. We've tried most of them. Langfuse is currently the cleanest.
The eval layer
This is where almost no one invests enough. We use:
- Promptfoo for batch eval runs against a frozen test set
- Langfuse for production-trace-based evals
- LLM-as-judge with Claude Opus 4 for rubric-based scoring of generative outputs
- Human review queues for the top 5% most ambiguous outputs (a Slack-based mini-tool we built)
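One way the human review queue could be fed is sketched below. It assumes each output already carries a judge score between 0 and 1 and treats "ambiguous" as "closest to the pass/fail threshold"; both the `JudgedOutput` shape and that definition of ambiguity are assumptions for illustration, not a description of any particular tool.

```typescript
// Hypothetical record produced by an LLM-as-judge pass.
interface JudgedOutput {
  id: string;
  score: number; // judge score in [0, 1]
}

// Return the most ambiguous outputs, i.e. those whose score sits closest
// to the pass/fail threshold. `fraction` is the share of outputs to review.
function selectForHumanReview(
  outputs: JudgedOutput[],
  fraction = 0.05,
  threshold = 0.5,
): JudgedOutput[] {
  const count = Math.ceil(outputs.length * fraction);
  return [...outputs] // copy so the caller's array is not reordered
    .sort(
      (a, b) => Math.abs(a.score - threshold) - Math.abs(b.score - threshold),
    )
    .slice(0, count);
}
```

Outputs the judge scored confidently high or low skip the queue, so reviewer time goes to the cases where the rubric is least decisive.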
What we don't use (and why)
- LangChain. Abstractions don't match real-world failure modes. Hard to debug.
- AutoGen / CrewAI. Agent orchestration frameworks are still too unstable for production.
- Vector DBs other than pgvector and Pinecone. Most of the others solve problems most clients don't have.
- Fine-tuning by default. In roughly 90% of the cases we see, Sonnet 4 with a good system prompt + RAG beats a fine-tuned smaller model. Only fine-tune when the cost math forces it.
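The "cost math" for fine-tuning can be approximated with a simple break-even calculation. The sketch below is a simplification with hypothetical field names; it ignores ongoing costs such as retraining and extra eval work, which usually push the break-even point further out.

```typescript
// Hypothetical inputs; all costs are in the same currency.
interface FineTuneCostInputs {
  callsPerMonth: number;
  promptedCostPerCall: number; // larger model + system prompt + RAG
  fineTunedCostPerCall: number; // smaller fine-tuned model
  oneOffFineTuneCost: number; // data prep, training, initial evaluation
}

// Months until the one-off fine-tuning cost is repaid by per-call savings.
// Returns Infinity when the fine-tuned model is not cheaper per call.
function breakEvenMonths(c: FineTuneCostInputs): number {
  const monthlySaving =
    c.callsPerMonth * (c.promptedCostPerCall - c.fineTunedCostPerCall);
  if (monthlySaving <= 0) return Infinity;
  return c.oneOffFineTuneCost / monthlySaving;
}
```

If the result is longer than the period over which you expect the stack to stay unchanged, the prompted model is likely the safer choice.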
This will be wrong in 6 months
Expect every line of this article to need revising by Q4 2026. The stack churns. The principles don't:
- Pick the boring choice.
- Optimise for ability to ship, not theoretical maximums.
- Invest in evals before sophistication.
- Stay one layer below the bleeding edge — last quarter's frontier is this quarter's stable.
If you'd like our latest version of this list or want our take on a stack you're considering, just ask — happy to share what we've shipped.