The moment that pushes most people over the edge is an invoice. A staging environment left running overnight, an eval loop that fired 10,000 completions, a context window that grew by accident. You open the dashboard and start doing the math on what this costs at production scale.
So: self-hosting. The idea is simple. The execution is where people stall, usually because every tutorial assumes you already know whether you want Ollama, vLLM, llama.cpp, LM Studio, or TGI — and never explains why it picked the one it's showing.
This is the checklist I wish existed.
There are really only three decisions:
You want it running in 60 seconds with an OpenAI-compatible endpoint and zero ops: Use Ollama. It's a single binary, it downloads models for you, and it speaks the OpenAI API format out of the box. Point your existing OpenAI SDK client at http://localhost:11434/v1 and nothing else in your codebase changes.
You're deploying to a GPU server and need real throughput under concurrent load: Use vLLM. It uses PagedAttention to serve multiple requests at once, which Ollama doesn't do. On an RTX 4090, vLLM can push 3-4x more tokens per second at peak compared to Ollama because it batches incoming requests together. The cost is a Docker container and CUDA 12.1+.
You want the model embedded in your process with no HTTP or sidecar: Use node-llama-cpp. It ships pre-built TypeScript bindings to llama.cpp, loads GGUF models directly in Node, and has built-in JSON schema enforcement at the token level — useful for CLI tools and scripts.
The pattern that actually works for small teams: Ollama in dev, vLLM in prod.
Any specific model list in a blog post goes stale fast — the field moves that quickly. Better to know how to pick for yourself.
Step 1: Know your VRAM ceiling first. This is the hard constraint that determines your size class. Use the formula in the hardware section below. 8GB gets you sub-8B models. 16-24GB opens up the 14-27B range. Beyond that, you can run 70B+ with quantisation.
Step 2: Match the benchmark to your task. Older benchmarks (HumanEval, MMLU) have a contamination problem: models have essentially trained on the test data, so they no longer separate good from great. Prefer newer benchmarks that haven't leaked into training sets and that resemble the work you'll actually ask the model to do.
Go to the HuggingFace Open LLM Leaderboard, filter to your size class, and sort by whichever benchmark matches your workload. The leaderboard updates continuously as new models drop.
Step 3: Check Ollama availability. A model that scores well on HuggingFace isn't useful if there's no GGUF for Ollama yet. Check ollama.com/library — if it's there, one ollama pull and you're running. If not, you can load it in vLLM directly from HuggingFace, which is fine for production but adds a step in local dev.
Step 4: Run your actual prompts before committing. Benchmarks measure benchmark performance. Pull the top 2-3 candidates in your size class and run the prompts your app actually uses. Latency, coherence in your specific domain, and JSON output reliability vary more than leaderboard scores suggest.
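One low-ceremony way to do that from Node is to point the OpenAI SDK at Ollama (the same setup shown below) and loop over your shortlist. This is a rough sketch: the candidate names and the prompt are placeholders for your own, and it assumes Ollama's compatibility layer reports token usage on the response.

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1", // Ollama's OpenAI-compatible endpoint
  apiKey: "ollama", // required by the SDK, ignored by Ollama
});

// Placeholders: your shortlisted models and one of your app's real prompts.
const candidates = ["qwen3:8b", "llama3.1:8b"];
const prompt = "one of the prompts your app actually sends";

for (const model of candidates) {
  const start = Date.now();
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  const seconds = (Date.now() - start) / 1000;
  // Elapsed time includes prompt processing, so this slightly understates pure decode speed.
  const tokens = res.usage?.completion_tokens ?? 0;
  console.log(`${model}: ${seconds.toFixed(1)}s, ~${(tokens / seconds).toFixed(0)} tok/s`);
  console.log(res.choices[0].message.content);
}
```

Eyeball the outputs side by side; leaderboard deltas of a point or two rarely survive contact with your own prompts.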
On quantisation: Q4_K_M is the practical default. It cuts memory to roughly half of FP16 with a quality loss that's barely measurable on real outputs. Q5_K_M buys ~5% more quality for ~25% more VRAM. Q8 is for benchmarking, not running on a laptop. Avoid Q2 — the quality drop is noticeable on anything requiring consistent reasoning.
The rule of thumb: model parameter count × 0.5 ≈ VRAM needed in GB at Q4_K_M, plus 1-2GB for the KV cache at normal context lengths.
Concretely: an 8B model at Q4_K_M is roughly 4GB of weights plus 1-2GB of KV cache, so it fits in 6GB; a 14B needs about 7GB plus cache, call it 8-9GB; a 27B lands around 15-16GB; and a 70B wants about 35GB plus cache, which is why it takes a 48GB card or multiple GPUs.
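If you'd rather script the arithmetic than redo it per model, here's the same heuristic as a throwaway helper (estimateVramGB is a made-up name, and 0.5GB per billion parameters is just the Q4_K_M rule of thumb above):

```ts
// Rule-of-thumb VRAM estimate for a Q4_K_M quant: ~0.5GB per billion
// parameters for the weights, plus headroom for the KV cache.
function estimateVramGB(paramsBillions: number, kvCacheGB = 2): number {
  return paramsBillions * 0.5 + kvCacheGB;
}

console.log(estimateVramGB(8));  // ~6  -> fits an 8GB card
console.log(estimateVramGB(14)); // ~9  -> wants 16GB
console.log(estimateVramGB(70)); // ~37 -> 48GB card or multiple GPUs
```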
Token speed on real hardware with Ollama, 8B model at Q4_K_M:
14B models on the same hardware run roughly 30-40% slower. These numbers shift as runtimes improve, but the hardware tier ratios stay stable.
50 tok/s is good. It's faster than most people read and fast enough to stream in a UI without feeling slow. Under 10 tok/s on a large model means you need to go smaller or add hardware.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull your-model-name # e.g. qwen3:8b, llama3.1:8b, gpt-oss:20b
Ollama runs as a background service after install. You now have an OpenAI-compatible API at http://localhost:11434/v1.
Point the OpenAI SDK at the local endpoint:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LLM_BASE_URL ?? "http://localhost:11434/v1",
apiKey: process.env.LLM_API_KEY ?? "ollama",
});
The apiKey is required by the SDK but ignored by Ollama — 'ollama' is the convention. Using env vars here means the exact same code works in dev (pointing at localhost) and production (pointing at your GPU server) with no changes.
Streaming:
const stream = await client.chat.completions.create({
model: "your-model-name",
messages: [{ role: "user", content: prompt }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
JSON output:
const response = await client.chat.completions.create({
model: "your-model-name",
messages: [
{
role: "system",
content:
'You are a classifier. Respond with JSON only: { "label": string, "confidence": number }',
},
{ role: "user", content: textToClassify },
],
response_format: { type: "json_object" },
});
const result = JSON.parse(response.choices[0].message.content ?? "{}") as {
label: string;
confidence: number;
};
Ollama supports json_object format natively. For stricter schema enforcement where non-conforming tokens are rejected at generation time, node-llama-cpp has that built in — more reliable than prompt instructions alone when you need a guarantee.
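If you do need that guarantee, here's roughly what it looks like with node-llama-cpp's v3-style API. Treat it as a sketch: the model path, the schema, and the input text are placeholders.

```ts
import { getLlama, LlamaChatSession } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: "path/to/your-model.gguf" });
const context = await model.createContext();
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

// The grammar is compiled from a JSON schema; tokens that would violate the
// schema are never sampled, so the output always parses.
const grammar = await llama.createGrammarForJsonSchema({
  type: "object",
  properties: {
    label: { type: "string" },
    confidence: { type: "number" },
  },
  required: ["label", "confidence"],
});

const textToClassify = "the user input you want to classify";
const answer = await session.prompt(`Classify this: ${textToClassify}`, { grammar });
const result = grammar.parse(answer); // already schema-conformant JSON
```

No HTTP, no sidecar process: the model runs inside your Node process, which is exactly the embedded use case described earlier.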
Step 1: Rent a GPU.
Vast.ai and RunPod are the practical options. RTX 4090 on Vast.ai runs ~$0.30-0.35/hr. A100 80GB is ~$0.50-0.60/hr. H100 is ~$1.50-1.70/hr. For most Node workloads, a single RTX 4090 handles 5-10 concurrent requests at 8B model size before throughput starts to degrade.
Step 2: Run vLLM on the GPU server.
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 8192
vLLM exposes the same OpenAI-compatible API as Ollama. Change LLM_BASE_URL to point at port 8000 and nothing else changes in your code.
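Before pointing the app at it, a quick smoke test from Node confirms the server is up and serving the model you expect (the host below is a placeholder for wherever your GPU box lives):

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://your-gpu-host:8000/v1", // placeholder: vLLM's OpenAI-compatible endpoint
  apiKey: "unused", // the SDK requires a value; vLLM only checks it if started with --api-key
});

// Should print Qwen/Qwen2.5-7B-Instruct (or whatever --model you passed).
const models = await client.models.list();
for (const m of models.data) console.log(m.id);
```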
Step 3: Put Caddy in front with bearer token auth.
llm.yourapp.com {
@authorized header Authorization "Bearer {$LLM_API_KEY}"
handle @authorized {
reverse_proxy localhost:8000
}
respond 401
}
{$LLM_API_KEY} reads from Caddy's environment. Any request without the matching Authorization header gets a 401 before it touches the model. This is enough for internal APIs where you control the callers. Per-user rate limiting can wait until you actually need it.
Step 4: Update your env and ship.
LLM_BASE_URL=https://llm.yourapp.com/v1
LLM_API_KEY=your-secret-key
Same client. Same code. Different env vars.
A single GPU isn't as serial as you might think. vLLM's continuous batching interleaves tokens from multiple concurrent requests, so throughput under load is much better than single-request numbers suggest. You hit the ceiling when the batch size saturates GPU memory, not when a second request comes in. The fix is horizontal: multiple vLLM instances behind a load balancer.
KV cache grows with context length. Every token in the context window lives in GPU memory as a KV cache entry. A 70B model at 32k context fills an A100 fast. Keep --max-model-len short (4k-8k) until you actually need more, and watch vLLM's KV cache usage on its /metrics endpoint before you assume you have headroom.
Cold starts hurt on big models. A 70B model takes 2-4 minutes to load from disk into VRAM. Ollama and vLLM both keep loaded models warm, but if the server restarts or you swap models under load you'll feel it. Keep one model per server and don't hot-swap.
Prompt injection is still your problem. If user input goes directly into a system prompt or tool call without sanitization, you're exposed the same way you would be with a SQL string. The model being local doesn't change that boundary.
The path from "I'm calling OpenAI" to "I'm running my own endpoint" is shorter than it looks: Ollama locally, vLLM on a rented GPU, Caddy in front, same TypeScript client throughout. The tricky parts are hardware sizing and recognizing when Ollama's single-user model stops being enough — which for most teams is later than they expect.
At the usage volumes where self-hosting makes financial sense, your data also stops leaving your infrastructure. That's not a small thing.