AI agent architecture patterns that survive production

May 18, 2026

Most agent tutorials show you the same thing: a while loop, a model call, a tool dispatch, and a break condition. It runs. It works on the demo. Then you try to ship it and realize the loop is the least interesting part of the system.

The model is one component. The system around it is what determines whether the thing is reliable, observable, resumable, and safe to operate. Get that system wrong and you will be rebuilding it under production pressure, in the dark, with real user data at stake.

This post lays out the full architecture: decision framework, reference components, the five memory types, tool patterns, the run store schema, failure recovery, observability, governance, and cost control. TypeScript and Postgres throughout. No vendor diagrams.

Workflow, agent, or multi-agent

Before writing any code, name the thing correctly. Most systems people call "agents" are workflows. A workflow is a state machine with deterministic steps: if you know the sequence of operations before the run starts, you don't need an LLM for sequencing. You need it only for the hard parts inside each step.

The decision breaks down like this:

Steps known in advance, no adaptation needed. Build a workflow. Use a state machine. The LLM is a tool called from specific states, not the conductor. This is the right choice for most "agentic" features that have a fixed happy path.

Steps require planning or the agent must adapt to unexpected tool results. Use a single agent with a bounded loop. The LLM decides what to do next, but your runtime enforces a maximum step count, a token budget, and a fail-closed policy on errors.

Subtasks need different specialisations or need to run concurrently. Go multi-agent. One orchestrator breaks down the problem, dispatches sub-agents, and collects results. The complexity cost is real: coordination, partial failures, state merging. Only pay it when a single agent hits a genuine ceiling.
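To make the partial-failure cost concrete, here is a minimal sketch of the dispatch-and-collect step. The names `SubAgent`, `SubtaskOutcome`, and `dispatchSubtasks` are hypothetical, and the sub-agent is reduced to a function from a subtask string to a text result. Problem decomposition and state merging are left out.

```typescript
// Hypothetical sub-agent signature: takes a subtask, resolves to a text result.
type SubAgent = (subtask: string) => Promise<string>;

interface SubtaskOutcome {
  subtask: string;
  status: "ok" | "failed";
  output?: string;
  error?: string;
}

// Run every subtask concurrently. Each failure is captured as data instead of
// rejecting the whole batch, so one bad sub-agent does not discard the rest.
async function dispatchSubtasks(
  subtasks: string[],
  agent: SubAgent,
): Promise<SubtaskOutcome[]> {
  return Promise.all(
    subtasks.map(async (subtask): Promise<SubtaskOutcome> => {
      try {
        return { subtask, status: "ok", output: await agent(subtask) };
      } catch (err) {
        return { subtask, status: "failed", error: String(err) };
      }
    }),
  );
}
```

The orchestrator then has to decide what a mixed result means: retry the failed subtasks, answer with what it has, or fail the run. That decision is part of the coordination cost.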

Build the lightest thing that does the job. You can always add an agent to a workflow. Untangling an over-built multi-agent system from a feature that needed a state machine is much harder.
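For reference, the lightest option looks roughly like this: a state machine where the transition table is plain code and the model is called from two specific states. The ticket pipeline, the state names, and the `Llm` function type are all assumptions made up for illustration.

```typescript
// Hypothetical states for a fixed ticket-handling pipeline.
type State = "classify" | "draft" | "review" | "done";

// Stand-in for a model call, injected so the workflow can run against a fake.
type Llm = (prompt: string) => Promise<string>;

interface Ticket {
  body: string;
  category?: string;
  reply?: string;
}

// The sequence is fixed in code before the run starts; the model never picks it.
const nextState: Record<Exclude<State, "done">, State> = {
  classify: "draft",
  draft: "review",
  review: "done",
};

async function runWorkflow(input: Ticket, llm: Llm): Promise<Ticket> {
  const ticket: Ticket = { ...input };
  let state = "classify" as State;
  while (state !== "done") {
    switch (state) {
      case "classify":
        ticket.category = (await llm(`Classify this ticket: ${ticket.body}`)).trim();
        break;
      case "draft":
        ticket.reply = await llm(`Write a ${ticket.category} reply to: ${ticket.body}`);
        break;
      case "review":
        // Deterministic gate, no model involved.
        if (!ticket.reply || ticket.reply.trim() === "") throw new Error("empty draft");
        break;
    }
    state = nextState[state];
  }
  return ticket;
}
```

The model does the hard part inside `classify` and `draft`, but it has no say in what happens next. Swapping one state for a bounded agent later is a local change.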

The seven layers

Here is the reference architecture. Each layer has one job.

Decision layer. The LLM call. It receives the current context and a list of available tools, and returns either a tool call or a final answer. This layer is stateless. It does not know about previous runs, retry counts, or token budgets. That is not its job.
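One way to pin that contract down in types, as a sketch: a discriminated union for the two possible outputs, and a function type whose only inputs are the context and the tool list. The type names, the JSON wire format, and `parseDecision` are assumptions for illustration, not a specific provider's API.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// The layer's only two possible outputs.
type Decision =
  | { kind: "tool_call"; tool: string; args: Record<string, unknown> }
  | { kind: "final"; answer: string };

// Stateless by construction: everything it needs arrives as arguments.
type Decide = (context: Message[], tools: ToolSpec[]) => Promise<Decision>;

// Validate raw model output (assumed here to be JSON) before the runtime acts
// on it. Anything that is not a known shape or a listed tool is rejected.
function parseDecision(raw: string, tools: ToolSpec[]): Decision {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("decision must be a JSON object");
  }
  const { kind, answer, tool, args } = data as Record<string, unknown>;
  if (kind === "final" && typeof answer === "string") {
    return { kind: "final", answer };
  }
  if (kind === "tool_call" && typeof tool === "string" && tools.some((t) => t.name === tool)) {
    const safeArgs =
      typeof args === "object" && args !== null ? (args as Record<string, unknown>) : {};
    return { kind: "tool_call", tool, args: safeArgs };
  }
  throw new Error("unrecognized decision");
}
```

Nothing in `Decide` mentions run IDs, retry counts, or budgets, so there is no way for that bookkeeping to leak into the model call.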

Runtime / executor. Your code. This is the loop. It owns iteration limits (hard cap, not a suggestion), routes tool calls to the right handler, persists every step to the run store before moving to the next one, and enforces the fail-closed policy. If the decision layer is the engine, the runtime is the chassis, the brakes, and the black box recorder.
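A minimal sketch of that loop, under stated assumptions: `RunStore` is a hypothetical interface standing in for whatever persistence you use, tools are plain async functions in a `Map`, and token budgeting is left out. The `Decision` type is redeclared so the block stands alone.

```typescript
type Decision =
  | { kind: "tool_call"; tool: string; args: Record<string, unknown> }
  | { kind: "final"; answer: string };

interface Step {
  index: number;
  decision: Decision;
  result?: string;
  error?: string;
}

// Hypothetical persistence interface; a database-backed version would go here.
interface RunStore {
  saveStep(runId: string, step: Step): Promise<void>;
}

type Decide = (history: Step[]) => Promise<Decision>;
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

type RunResult =
  | { status: "completed"; answer: string }
  | { status: "failed"; reason: string };

async function runAgent(
  runId: string,
  decide: Decide,
  tools: Map<string, ToolHandler>,
  store: RunStore,
  maxSteps = 10,
): Promise<RunResult> {
  const history: Step[] = [];
  // Hard cap: the loop cannot run longer than maxSteps, whatever the model says.
  for (let index = 0; index < maxSteps; index++) {
    const decision = await decide(history);
    const step: Step = { index, decision };
    let failure: string | undefined;
    if (decision.kind === "tool_call") {
      try {
        const handler = tools.get(decision.tool);
        if (!handler) throw new Error(`unknown tool: ${decision.tool}`);
        step.result = await handler(decision.args);
      } catch (err) {
        failure = String(err);
        step.error = failure;
      }
    }
    // Persist before moving on, so a crash never loses a completed step.
    await store.saveStep(runId, step);
    // Fail closed: any tool error ends the run instead of being papered over.
    if (failure !== undefined) return { status: "failed", reason: failure };
    if (decision.kind === "final") return { status: "completed", answer: decision.answer };
    history.push(step);
  }
  return { status: "failed", reason: `step limit of ${maxSteps} reached` };
}
```

Every exit path goes through `saveStep` first, including the failures, so the store holds a complete record of what the run did and why it stopped.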

