Most agent tutorials show you the same thing: a while loop, a model call, a tool dispatch, and a break condition. It runs. It works on the demo. Then you try to ship it and realize the loop is the least interesting part of the system.
The model is one component. The system around it is what determines whether the thing is reliable, observable, resumable, and safe to operate. Get that system wrong and you will be rebuilding it under production pressure, in the dark, with real user data at stake.
This post lays out the full architecture: decision framework, reference components, the five memory types, tool patterns, the run store schema, failure recovery, observability, governance, and cost control. TypeScript and Postgres throughout. No vendor diagrams.
Before writing any code, name the thing correctly. Most systems people call "agents" are workflows. A workflow is a state machine with deterministic steps: if you know the sequence of operations before the run starts, you don't need an LLM for sequencing. You need it only for the hard parts inside each step.
The decision breaks down like this:
Steps known in advance, no adaptation needed. Build a workflow. Use a state machine. The LLM is a tool called from specific states, not the conductor. This is the right choice for most "agentic" features that have a fixed happy path.
Steps require planning or the agent must adapt to unexpected tool results. Use a single agent with a bounded loop. The LLM decides what to do next, but your runtime enforces a maximum step count, a token budget, and a fail-closed policy on errors.
Subtasks need different specialisations or need to run concurrently. Multi-agent. One orchestrator breaks down the problem, dispatches sub-agents, collects results. The complexity cost is real - coordination, partial failures, state merging. Only pay it when single-agent hits a genuine ceiling.
Build the lightest thing that does the job. You can always add an agent to a workflow. Untangling an over-built multi-agent system from a feature that needed a state machine is much harder.
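As a rough sketch of the first option, assume a hypothetical ticket-triage feature. The step order is fixed in code and the model is called from exactly one state. The state names, the `QUEUES` table, and the `classify` parameter are all illustrative and do not come from any library.

```typescript
// Hypothetical workflow: the sequence is known up front; the LLM only classifies.
type State = "classify" | "route" | "done";

interface Ticket {
  text: string;
  category?: string;
  queue?: string;
}

// Stand-in for an LLM call made from a single state (assumption: returns a label).
type Classifier = (text: string) => Promise<string>;

const QUEUES: Record<string, string | undefined> = {
  billing: "finance",
  bug: "engineering",
};

async function triage(ticket: Ticket, classify: Classifier): Promise<Ticket> {
  let state = "classify" as State;
  const result: Ticket = { ...ticket };
  while (state !== "done") {
    switch (state) {
      case "classify":
        // The only non-deterministic step: the model fills in one field.
        result.category = await classify(result.text);
        state = "route";
        break;
      case "route":
        // Deterministic: unknown categories fall back to a human queue.
        result.queue = QUEUES[result.category ?? ""] ?? "human_review";
        state = "done";
        break;
    }
  }
  return result;
}
```

The model never chooses the next state here. If it returns a label the code does not recognise, the ticket goes to a human queue instead of letting the model improvise a route.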
Here is the reference architecture. Each layer has one job.
Decision layer. The LLM call. It receives the current context and a list of available tools, and returns either a tool call or a final answer. This layer is stateless. It does not know about previous runs, retry counts, or token budgets. That is not its job.
Runtime / executor. Your code. This is the loop. It owns iteration limits (hard cap, not a suggestion), routes tool calls to the right handler, persists every step to the run store before moving to the next one, and enforces the fail-closed policy. If the decision layer is the engine, the runtime is the chassis, the brakes, and the black box recorder.
Tool layer. Typed functions with Zod-validated args, idempotency keys, timeout wrappers, and optional dry-run mode. Every tool call goes through this layer. Nothing calls external systems directly.
Memory layer. Five distinct types. Covered in the next section because conflating them is one of the most common architectural mistakes.
Run store. Three Postgres tables: agent_runs, agent_steps, agent_tool_calls. Every run, every step, every tool call is a row. This is your audit log, your replay substrate, and your eval dataset.
Policy / guardrail layer. Approval gates for destructive or high-stakes actions. Least-privilege scoping per tool. Fail-closed behavior: any unexpected error halts the run. It does not retry indefinitely.
Observability layer. Structured logs per step, attached to agent_steps. Eval hooks that run over real run store data, not synthetic test cases.
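To make the split between the decision layer and the runtime concrete, here is a minimal sketch of the bounded loop. The `decide`, `runTool`, and `persistStep` functions are hypothetical injected dependencies, not a real SDK, and the run store is reduced to a single callback.

```typescript
// All names are illustrative; none come from a specific SDK.
type Decision =
  | { kind: "tool_call"; tool: string; args: unknown }
  | { kind: "final"; answer: string };

interface StepRecord {
  stepIndex: number;
  decision: Decision;
  toolResult?: unknown;
  tokens: number;
}

interface Deps {
  // Decision layer: stateless, sees only the history it is handed.
  decide: (history: StepRecord[]) => Promise<{ decision: Decision; tokens: number }>;
  // Tool layer: assumed to throw on any failure.
  runTool: (tool: string, args: unknown) => Promise<unknown>;
  // Run store: must resolve before the loop advances.
  persistStep: (step: StepRecord) => Promise<void>;
}

interface RunResult {
  status: "done" | "failed";
  answer?: string;
  error?: string;
  steps: number;
}

async function runAgent(deps: Deps, maxSteps: number, tokenBudget: number): Promise<RunResult> {
  const history: StepRecord[] = [];
  let tokensUsed = 0;
  for (let i = 0; i < maxSteps; i++) {
    if (tokensUsed >= tokenBudget) {
      return { status: "failed", error: "token budget exceeded", steps: i };
    }
    try {
      const { decision, tokens } = await deps.decide(history);
      tokensUsed += tokens;
      const step: StepRecord = { stepIndex: i, decision, tokens };
      if (decision.kind === "tool_call") {
        step.toolResult = await deps.runTool(decision.tool, decision.args);
      }
      await deps.persistStep(step); // write the step before moving on
      history.push(step);
      if (decision.kind === "final") {
        return { status: "done", answer: decision.answer, steps: i + 1 };
      }
    } catch (err) {
      // Fail closed: any unexpected error halts the run.
      const message = err instanceof Error ? err.message : String(err);
      return { status: "failed", error: message, steps: i };
    }
  }
  return { status: "failed", error: "step limit reached", steps: maxSteps };
}
```

The step cap, the budget check, and the catch block all live in the runtime. The decision function only ever receives the history it is handed.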
"Memory" is the most overloaded word in agent architecture. It refers to at least five different things that live in different places, have different lifetimes, and require different write strategies.
1. In-context (short-term). The messages array passed to the current LLM call. Lives in RAM. Evicts at the context window limit. This is not durable. It exists only for the duration of a single model call. The runtime is responsible for deciding what to put here: full step history, a summary, or a sliding window.
2. Episodic / run memory. The step log in agent_steps. Every action the agent took, every tool result, every model output, written to Postgres before the next step begins. This survives crashes. When you need to resume a failed run, you reconstruct state from this table. When a run produces a bad output, you debug from this table.
-- agent_steps key columns
run_id UUID REFERENCES agent_runs(id),
step_index INTEGER NOT NULL,
model TEXT NOT NULL,
messages JSONB, -- full messages array for this step
input_tokens INTEGER,
output_tokens INTEGER,
latency_ms INTEGER
3. Semantic memory. Embeddings stored in pgvector. Retrieved at each step via cosine similarity search. This is the RAG layer - the agent's access to knowledge that does not fit in context. See building RAG with Postgres and pgvector for the full pipeline.
4. Tool memory. State that a tool manages internally across calls. A browser automation tool holds a session ID. A file-editing tool tracks an open handle. This state lives inside the tool layer, not in the agent runtime. The tool is responsible for its own lifecycle. The runtime only sees inputs and outputs.
5. User / profile memory. Long-lived facts about the user or the environment. A separate table, written deliberately. Not auto-generated by the model. If the agent infers a user preference and writes it automatically, you get drift and noise. Write to this table only when there is an explicit signal.
CREATE TABLE agent_user_memory (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id TEXT NOT NULL,
key TEXT NOT NULL,
value JSONB NOT NULL,
source TEXT, -- 'user_explicit' | 'agent_inferred' | 'admin'
updated_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (user_id, key)
);
Each memory type has a different owner, a different write trigger, and a different failure mode if you get it wrong. Treat them as separate systems.
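For the in-context type, the runtime has to decide what goes into the messages array. One of the options mentioned above is a sliding window, which could be sketched roughly like this. The `estimateTokens` heuristic (about four characters per token) is an assumption standing in for a real tokenizer, and `buildContext` is a hypothetical name.

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Crude heuristic (assumption): roughly four characters per token.
const estimateTokens = (m: Message): number => Math.ceil(m.content.length / 4);

// Keeps every system message, then fills the remaining budget with the newest messages.
function buildContext(history: Message[], maxTokens: number): Message[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  let used = system.reduce((sum, m) => sum + estimateTokens(m), 0);
  const kept: Message[] = [];
  for (const m of [...rest].reverse()) {
    const cost = estimateTokens(m);
    if (used + cost > maxTokens) break; // everything older than this is dropped
    used += cost;
    kept.unshift(m); // restore chronological order
  }
  return [...system, ...kept];
}
```

A real implementation would also need to avoid separating a tool result from the tool call that produced it. This sketch ignores that constraint.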
The tool is where things go wrong. Rate limits, timeouts, partial writes, side effects on retry. Most agent reliability problems trace back to the tool layer being treated as a thin wrapper around an API call.
The interface that every tool in the system implements:
interface Tool<TArgs, TResult> {
name: string;
schema: ZodType<TArgs>;
idempotencyKey?: (args: TArgs) => string;
execute: (args: TArgs, ctx: ToolContext) => Promise<TResult>;
}
interface ToolContext {
runId: string;
stepId: string;
dryRun?: boolean;
approvalRequired?: boolean;
}
Four patterns every tool needs:
Idempotency keys. Hash the args to a stable key. Before executing, check agent_tool_calls for a row with that key on the current run. If it exists and succeeded, return the cached result. If the LLM retries a step because the runtime resumed from a checkpoint, you do not re-send an email or re-charge a card.
Timeout wrapper. Every execute call races against a timeout. Default to something aggressive: five seconds for a lookup, thirty for a write operation. A tool that hangs indefinitely stalls the run and burns tokens in the next retry. Use Promise.race with a rejection timeout.
Dry-run mode. Destructive tools (delete, send, publish, charge) check ctx.dryRun before execution. In dry-run mode, they validate args and return a preview of what would happen. The runtime sets this flag for new tool types being tested in staging.
Approval gate. If ctx.approvalRequired, the tool does not execute. Instead, the runtime writes status = 'awaiting_approval' to agent_runs, persists the pending tool call, and notifies a human. On confirmation, the run resumes from that step.
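The timeout wrapper is the easiest of the four to get subtly wrong, so here is a minimal sketch of the `Promise.race` approach. `withTimeout` and `ToolTimeoutError` are hypothetical names chosen for illustration.

```typescript
class ToolTimeoutError extends Error {
  constructor(toolName: string, ms: number) {
    super(`${toolName} timed out after ${ms}ms`);
    this.name = "ToolTimeoutError";
  }
}

// Races the tool's work against a timer that rejects after `ms` milliseconds.
async function withTimeout<T>(
  toolName: string,
  ms: number,
  work: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new ToolTimeoutError(toolName, ms)), ms);
  });
  try {
    return await Promise.race([work(), timeout]);
  } finally {
    clearTimeout(timer); // do not leave a live timer behind once the race settles
  }
}
```

Note that `Promise.race` only stops waiting. It does not cancel the underlying work, so a tool that can still complete a write after the timeout needs its own cancellation path (for example an `AbortSignal` passed to `fetch`) or must rely on the idempotency key when it is retried.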
If you are hosting tools over the network, building an MCP server is worth reading. For NestJS-hosted tool servers, MCP server with NestJS covers the wiring.
Three tables. Everything that happens in an agent run is a row in one of them.
-- agent_runs: one row per agent invocation
CREATE TABLE agent_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_name TEXT NOT NULL,
status TEXT NOT NULL, -- 'running' | 'done' | 'failed' | 'awaiting_approval'
input JSONB NOT NULL,
output JSONB,
error TEXT,
created_at TIMESTAMPTZ DEFAULT now(),
finished_at TIMESTAMPTZ,
token_cost INTEGER
);
-- agent_steps: one row per LLM call within a run
CREATE TABLE agent_steps (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
run_id UUID REFERENCES agent_runs(id),
step_index INTEGER NOT NULL,
model TEXT NOT NULL,
messages JSONB,
input_tokens INTEGER,
output_tokens INTEGER,
latency_ms INTEGER,
created_at TIMESTAMPTZ DEFAULT now()
);
-- agent_tool_calls: one row per tool call
CREATE TABLE agent_tool_calls (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
step_id UUID REFERENCES agent_steps(id),
tool_name TEXT NOT NULL,
args JSONB NOT NULL,
result JSONB,
error TEXT,
idempotency_key TEXT,
latency_ms INTEGER,
created_at TIMESTAMPTZ DEFAULT now()
);
Why this schema earns its complexity:
Resumability. Reconstruct state from agent_steps. Resume a failed run without re-executing successful steps.
Evals. Replay agent_steps inputs and compare outputs. No synthetic test cases needed.
Cost tracking. token_cost on agent_runs, aggregated from input_tokens + output_tokens across all steps. Per-agent, per-user, per-day.
Agents fail mid-run. The LLM API returns 500. A tool times out. The process crashes between steps. The question is not whether this happens but how gracefully you handle it.
The recovery pattern depends on the run store being written to before moving to the next step. If step 4 wrote its row and step 5 crashed before writing, resume means: query the last agent_steps row for the run, reconstruct the messages array from the step log, re-enter the loop at step 5.
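A minimal sketch of that resume calculation, assuming the rows for the run have already been loaded from agent_steps and that each row's messages column holds the conversation up to and including that step. `findResumePoint` is a hypothetical helper, and refusing to resume when the log has a gap is a design choice of this sketch.

```typescript
// Mirrors the agent_steps columns used here.
interface StepRow {
  step_index: number;
  messages: unknown[]; // full messages array persisted for that step
}

interface ResumePoint {
  nextStepIndex: number;
  messages: unknown[];
}

function findResumePoint(rows: StepRow[], initialMessages: unknown[]): ResumePoint {
  const sorted = [...rows].sort((a, b) => a.step_index - b.step_index);
  // Fail closed: a gap in the step log means the stored state cannot be trusted.
  sorted.forEach((row, i) => {
    if (row.step_index !== i) throw new Error(`step log has a gap at index ${i}`);
  });
  const last = sorted[sorted.length - 1];
  return last === undefined
    ? { nextStepIndex: 0, messages: initialMessages } // nothing persisted: start over
    : { nextStepIndex: last.step_index + 1, messages: last.messages };
}
```

With rows for steps 0 through 4 present, this returns `nextStepIndex: 5` and the messages stored with step 4, which matches the scenario described above.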
For LLM API failures: exponential backoff with three retries. If all three fail, set agent_runs.status = 'awaiting_approval' and error to the API error. Do not keep retrying. A human should inspect it. When the API comes back, resume from the last successful step.
For tool failures: the idempotency key handles retries. Before executing any tool call on a resumed run, check agent_tool_calls for a matching key with a non-null result. If found, use the cached result. If the previous call recorded an error, decide whether to retry or surface the failure to the runtime.
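The lookup described above can be sketched as two small pieces: a stable key derived from the tool name and args, and a decision over the rows already recorded for the run. The key here is the canonical string itself, to keep the sketch dependency-free. In practice you would likely hash it (for example with SHA-256) to bound its length. `canonicalize`, `idempotencyKey`, and `checkPriorCall` are hypothetical names.

```typescript
// Canonical JSON: object keys sorted recursively so equal args give equal strings.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for undefined input; treat that as null.
  return JSON.stringify(value) ?? "null";
}

const idempotencyKey = (toolName: string, args: unknown): string =>
  `${toolName}:${canonicalize(args)}`;

// Mirrors the agent_tool_calls columns used here.
interface ToolCallRow {
  idempotency_key: string;
  result: unknown;
  error: string | null;
}

type PriorCall =
  | { action: "use_cached"; result: unknown }
  | { action: "retry_or_surface"; error: string }
  | { action: "execute" };

function checkPriorCall(priorCalls: ToolCallRow[], key: string): PriorCall {
  const matching = priorCalls.filter((c) => c.idempotency_key === key);
  const succeeded = matching.find((c) => c.result !== null && c.result !== undefined);
  if (succeeded) return { action: "use_cached", result: succeeded.result };
  const failed = matching.find((c) => c.error !== null);
  if (failed && failed.error !== null) {
    return { action: "retry_or_surface", error: failed.error };
  }
  return { action: "execute" };
}
```

Sorting keys matters because `{ to, subject }` and `{ subject, to }` are the same call as far as the side effect is concerned, and a plain `JSON.stringify` would give them different keys.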
The full control flow decision tree - what to do at each branch of the loop, how to handle partial tool results, how to bound the iteration count - is in agent control flow patterns.
Log structured data per step, not per run. Per-run aggregates are useful for dashboards. Per-step logs are what you actually debug with.
Per step, capture: step_index, model, input_tokens, output_tokens, latency_ms, the full messages array, and every tool call with its args, result, latency, and error.
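Per run, capture: total token_cost, wall-clock duration, final status, and a metadata JSONB column for anything run-specific (user ID, session ID, feature flag state). The agent_runs schema shown earlier does not include a metadata column, so add one if you need it.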
The eval setup that actually works: run your eval harness against agent_steps rows. Pick a set of runs where you know the correct output. Replay each run's inputs through a new model or new prompt. Compare outputs. The eval dataset is your real production traffic, not a curated fixture set. This means your evals find real failure modes, not the ones you imagined.
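A replay harness along these lines can be quite small. In this sketch the `candidate` function stands in for the new model or prompt under test, and the default comparator is a plain trimmed string match. Both are assumptions, and most real evals will need a looser comparison than exact equality.

```typescript
interface EvalCase {
  runId: string;
  messages: unknown[];    // input taken from a stored agent_steps row
  expectedOutput: string; // output known to be correct for this run
}

interface EvalReport {
  total: number;
  passed: number;
  failures: { runId: string; expected: string; actual: string }[];
}

// `candidate` is the new model or prompt being evaluated (hypothetical signature).
async function replay(
  cases: EvalCase[],
  candidate: (messages: unknown[]) => Promise<string>,
  matches: (expected: string, actual: string) => boolean = (e, a) => e.trim() === a.trim(),
): Promise<EvalReport> {
  const report: EvalReport = { total: cases.length, passed: 0, failures: [] };
  for (const c of cases) {
    const actual = await candidate(c.messages);
    if (matches(c.expectedOutput, actual)) {
      report.passed++;
    } else {
      // Keep the run ID so a failure can be traced back to the original run.
      report.failures.push({ runId: c.runId, expected: c.expectedOutput, actual });
    }
  }
  return report;
}
```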
Some tool calls should require human sign-off. Publishing to production, deleting records, sending external communications, spending money. The approval pattern:
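1. The tool's approvalRequired check returns true for the current args.
2. The runtime sets agent_runs.status = 'awaiting_approval' and writes the pending tool call to agent_tool_calls with result = null.
3. A human is notified and reviews the pending call.
4. On approval, the run resumes from that step. On rejection, set status = 'failed' and write an error.
The run is paused, not abandoned. All prior steps remain in the run store. Cost is not wasted.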
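The pause and resume transitions can be expressed as two pure functions over the run's status. In this sketch the pending call is held on the run object for brevity, which is an assumption. In the schema above it would be the agent_tool_calls row with a null result. `requestApproval` and `resolveApproval` are hypothetical names.

```typescript
type RunStatus = "running" | "done" | "failed" | "awaiting_approval";

interface Run {
  status: RunStatus;
  error: string | null;
  // Assumption: kept inline here; in Postgres this is a row in agent_tool_calls.
  pendingToolCall: { toolName: string; args: unknown } | null;
}

// Pause: record the call that needs sign-off instead of executing it.
function requestApproval(run: Run, toolName: string, args: unknown): Run {
  if (run.status !== "running") {
    throw new Error(`cannot pause a run in status ${run.status}`);
  }
  return { ...run, status: "awaiting_approval", pendingToolCall: { toolName, args } };
}

// Resolve: approval returns the run to 'running'; rejection fails it with an error.
function resolveApproval(run: Run, approved: boolean, reviewer: string): Run {
  if (run.status !== "awaiting_approval" || run.pendingToolCall === null) {
    throw new Error("no pending approval on this run");
  }
  return approved
    ? { ...run, status: "running" } // the runtime now executes pendingToolCall and continues
    : { ...run, status: "failed", error: `rejected by ${reviewer}`, pendingToolCall: null };
}
```

Both functions throw on an unexpected status rather than guessing, which keeps the approval path fail-closed like the rest of the runtime.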
Beyond approval gates: least privilege per tool. A tool that reads documents should not have a credential that can delete them. Scope credentials to the minimum surface the tool needs. For code execution, run in a sandbox. Any unexpected tool error halts the run with status = 'failed'. Do not catch-and-continue. Fail closed.
Every LLM call has a cost. Without tracking it, a single rogue agent can burn a week's budget in an afternoon.
Track token_cost on agent_runs as a running total updated after each step. Before each LLM call, check if token_cost + estimated_next_call_cost > budget. If it would exceed budget: fall back to a cheaper model if the task allows it, or halt the run with a budget-exceeded error.
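The pre-call check might look something like the following. It assumes `estimatedCost` is expressed in the same unit as token_cost so that models can be compared directly, and that the caller lists models in preference order. `checkBudget` and `ModelOption` are hypothetical names.

```typescript
type BudgetDecision =
  | { action: "proceed"; model: string }
  | { action: "halt"; reason: "budget_exceeded" };

interface ModelOption {
  name: string;
  estimatedCost: number; // estimate for the next call, same unit as token_cost
}

// Options are tried in order: preferred model first, cheaper fallbacks after.
function checkBudget(
  tokenCost: number,
  budget: number,
  options: ModelOption[],
): BudgetDecision {
  for (const option of options) {
    if (tokenCost + option.estimatedCost <= budget) {
      return { action: "proceed", model: option.name };
    }
  }
  // Nothing fits: halt instead of overspending.
  return { action: "halt", reason: "budget_exceeded" };
}
```

Passing a single-element list gives the strict behaviour (proceed or halt). Whether a fallback is acceptable for a given task is a decision for the caller, not for this function.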
For 429s, exponential backoff with jitter: don't retry all concurrent runs at the same interval or you just recreate the spike. Track retry counts per step. After three 429 retries, surface to awaiting_approval.
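One common way to add jitter is the "full jitter" variant: wait a random time between zero and the capped exponential delay. The sketch below uses that variant with three retries. The base delay, the cap, and the `isRateLimited` predicate are assumptions, and the `random` and `sleep` parameters exist mainly so the behaviour can be tested deterministically.

```typescript
// Full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

async function retryOn429<T>(
  call: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Non-429 errors and exhausted retries propagate; the caller surfaces them.
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With `maxRetries = 3` the call is attempted at most four times. When the final attempt still fails, the error is rethrown so the runtime can move the run to awaiting_approval.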
If cost is a hard constraint, consider self-hosting a smaller model for the cheaper steps in your pipeline. The fallback hierarchy does not have to be two commercial models.
Here is the truth about where agent reliability actually comes from: it is not the model. You can upgrade the model and your run reliability does not change much. You can swap models entirely - across vendors, across model families - and if the run store, typed tools, and approval gates are in place, the system keeps working.
What you cannot swap is a bad architecture. A run store you did not build from the start means weeks of backfilling audit capability. Stateful tool calls without idempotency keys mean duplicate side effects you spend months hunting down. No approval gate means an agent that published to production when it should have asked.
The components around the loop are the product. The model is the part you get to swap.