The moment that pushes most people over the edge is an invoice. A staging environment left running overnight, an eval loop that fired 10,000 completions, a context window that grew by accident. You open the dashboard and start doing the math on what this costs at production scale.
So: self-hosting. The idea is simple. The execution is where people stall, usually because every tutorial assumes you already know whether you want Ollama, vLLM, llama.cpp, LM Studio, or TGI — and never explains why it picked the one it's showing.
This is the checklist I wish existed.
There are really only three decisions:
You want it running in 60 seconds with an OpenAI-compatible endpoint and zero ops: Use Ollama. It's a single binary, it downloads models for you, and it speaks the OpenAI API format out of the box. Point your existing OpenAI SDK client at http://localhost:11434/v1 and nothing else in your codebase changes.
You're deploying to a GPU server and need real throughput under concurrent load: Use vLLM. It uses PagedAttention to serve multiple requests at once, which Ollama doesn't do. On an RTX 4090, vLLM can push 3-4x more tokens per second at peak compared to Ollama because it batches incoming requests together. The cost is a Docker container and CUDA 12.1+.
You want the model embedded in your process with no HTTP or sidecar: Use node-llama-cpp. It ships pre-built TypeScript bindings to llama.cpp, loads GGUF models directly in Node, and has built-in JSON schema enforcement at the token level — useful for CLI tools and scripts.
The pattern that actually works for small teams: Ollama in dev, vLLM in prod.
Any specific model list in a blog post goes stale fast — the field moves that quickly. Better to know how to pick for yourself.
Step 1: Know your VRAM ceiling first. This is the hard constraint that determines your size class. Use the formula in the hardware section below. 8GB gets you sub-8B models. 16-24GB opens up the 14-27B range. Beyond that, you can run 70B+ with quantisation.
Step 2: Match the benchmark to your task. Older benchmarks (HumanEval, MMLU) have a contamination problem: models have essentially trained on the test data, so they no longer separate good from great. Prefer newer benchmarks that haven't leaked into training sets and that resemble the work you'll actually ask the model to do.
Go to the HuggingFace Open LLM Leaderboard, filter to your size class, and sort by whichever benchmark matches your workload. The leaderboard updates continuously as new models drop.
Step 3: Check Ollama availability. A model that scores well on HuggingFace isn't useful if there's no GGUF for Ollama yet. Check ollama.com/library — if it's there, one ollama pull and you're running. If not, you can load it in vLLM directly from HuggingFace, which is fine for production but adds a step in local dev.
Step 4: Run your actual prompts before committing. Benchmarks measure benchmark performance. Pull the top 2-3 candidates in your size class and run the prompts your app actually uses. Latency, coherence in your specific domain, and JSON output reliability vary more than leaderboard scores suggest.
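One low-ceremony way to do that from Node is to point the OpenAI SDK at Ollama (the same setup shown below) and loop over your shortlist. This is a rough sketch: the candidate names and the prompt are placeholders for your own, and it assumes Ollama's compatibility layer reports token usage on the response.

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1", // Ollama's OpenAI-compatible endpoint
  apiKey: "ollama", // required by the SDK, ignored by Ollama
});

// Placeholders: your shortlisted models and one of your app's real prompts.
const candidates = ["qwen3:8b", "llama3.1:8b"];
const prompt = "one of the prompts your app actually sends";

for (const model of candidates) {
  const start = Date.now();
  const res = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  const seconds = (Date.now() - start) / 1000;
  // Elapsed time includes prompt processing, so this slightly understates pure decode speed.
  const tokens = res.usage?.completion_tokens ?? 0;
  console.log(`${model}: ${seconds.toFixed(1)}s, ~${(tokens / seconds).toFixed(0)} tok/s`);
  console.log(res.choices[0].message.content);
}
```

Eyeball the outputs side by side; leaderboard deltas of a point or two rarely survive contact with your own prompts.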
On quantisation: Q4_K_M is the practical default. It cuts memory to roughly half of FP16 with a quality loss that's barely measurable on real outputs. Q5_K_M buys ~5% more quality for ~25% more VRAM. Q8 is for benchmarking, not running on a laptop. Avoid Q2 — the quality drop is noticeable on anything requiring consistent reasoning.
The rule of thumb: model parameter count × 0.5 ≈ VRAM needed in GB at Q4_K_M, plus 1-2GB for the KV cache at normal context lengths.
Concretely: an 8B model at Q4_K_M is roughly 4GB of weights plus 1-2GB of KV cache, so it fits in 6GB; a 14B needs about 7GB plus cache, call it 8-9GB; a 27B lands around 15-16GB; and a 70B wants about 35GB plus cache, which is why it takes a 48GB card or multiple GPUs.
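If you'd rather script the arithmetic than redo it per model, here's the same heuristic as a throwaway helper (estimateVramGB is a made-up name, and 0.5GB per billion parameters is just the Q4_K_M rule of thumb above):

```ts
// Rule-of-thumb VRAM estimate for a Q4_K_M quant: ~0.5GB per billion
// parameters for the weights, plus headroom for the KV cache.
function estimateVramGB(paramsBillions: number, kvCacheGB = 2): number {
  return paramsBillions * 0.5 + kvCacheGB;
}

console.log(estimateVramGB(8));  // ~6  -> fits an 8GB card
console.log(estimateVramGB(14)); // ~9  -> wants 16GB
console.log(estimateVramGB(70)); // ~37 -> 48GB card or multiple GPUs
```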
Token speed on real hardware with Ollama, 8B model at Q4_K_M:
14B models on the same hardware run roughly 30-40% slower. These numbers shift as runtimes improve, but the hardware tier ratios stay stable.
50 tok/s is good. It's faster than most people read and fast enough to stream in a UI without feeling slow. Under 10 tok/s on a large model means you need to go smaller or add hardware.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull your-model-name # e.g. qwen3:8b, llama3.1:8b, gpt-oss:20b
Ollama runs as a background service after install. You now have an OpenAI-compatible API at http://localhost:11434/v1.
Point the OpenAI SDK at the local endpoint:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: process.env.LLM_BASE_URL ?? "http://localhost:11434/v1",
apiKey: process.env.LLM_API_KEY ?? "ollama",
});
The apiKey is required by the SDK but ignored by Ollama — 'ollama' is the convention. Using env vars here means the exact same code works in dev (pointing at localhost) and production (pointing at your GPU server) with no changes.
Streaming:
const stream = await client.chat.completions.create({
model: "your-model-name",
messages: [{ role: "user", content: prompt }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
JSON output:
const response = await client.chat.completions.create({
model: "your-model-name",
messages: [
{
role: "system",
content:
'You are a classifier. Respond with JSON only: { "label": string, "confidence": number }',
},
{ role: "user", content: textToClassify },
],
response_format: { type: "json_object" },
});
const result = JSON.parse(response.choices[0].message.content ?? "{}") as {
label: string;
confidence: number;
};
Ollama supports json_object format natively. For stricter schema enforcement where non-conforming tokens are rejected at generation time, node-llama-cpp has that built in — more reliable than prompt instructions alone when you need a guarantee.
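If you do need that guarantee, here's roughly what it looks like with node-llama-cpp's v3-style API. Treat it as a sketch: the model path, the schema, and the input text are placeholders.

```ts
import { getLlama, LlamaChatSession } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: "path/to/your-model.gguf" });
const context = await model.createContext();
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

// The grammar is compiled from a JSON schema; tokens that would violate the
// schema are never sampled, so the output always parses.
const grammar = await llama.createGrammarForJsonSchema({
  type: "object",
  properties: {
    label: { type: "string" },
    confidence: { type: "number" },
  },
  required: ["label", "confidence"],
});

const textToClassify = "the user input you want to classify";
const answer = await session.prompt(`Classify this: ${textToClassify}`, { grammar });
const result = grammar.parse(answer); // already schema-conformant JSON
```

No HTTP, no sidecar process: the model runs inside your Node process, which is exactly the embedded use case described earlier.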
Step 1: Rent a GPU.
Vast.ai and RunPod are the practical options. RTX 4090 on Vast.ai runs ~$0.30-0.35/hr. A100 80GB is ~$0.50-0.60/hr. H100 is ~$1.50-1.70/hr. For most Node workloads, a single RTX 4090 handles 5-10 concurrent requests at 8B model size before throughput starts to degrade.
Step 2: Run vLLM on the GPU server.
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct \
--max-model-len 8192
vLLM exposes the same OpenAI-compatible API as Ollama. Change LLM_BASE_URL to point at port 8000 and nothing else changes in your code.
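Before pointing the app at it, a quick smoke test from Node confirms the server is up and serving the model you expect (the host below is a placeholder for wherever your GPU box lives):

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://your-gpu-host:8000/v1", // placeholder: vLLM's OpenAI-compatible endpoint
  apiKey: "unused", // the SDK requires a value; vLLM only checks it if started with --api-key
});

// Should print Qwen/Qwen2.5-7B-Instruct (or whatever --model you passed).
const models = await client.models.list();
for (const m of models.data) console.log(m.id);
```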
Step 3: Put Caddy in front with bearer token auth.
llm.yourapp.com {
@authorized header Authorization "Bearer {$LLM_API_KEY}"
handle @authorized {
reverse_proxy localhost:8000
}
respond 401
}
{$LLM_API_KEY} reads from Caddy's environment. Any request without the matching Authorization header gets a 401 before it touches the model. This is enough for internal APIs where you control the callers. Per-user rate limiting can wait until you actually need it.
Step 4: Update your env and ship.
LLM_BASE_URL=https://llm.yourapp.com/v1
LLM_API_KEY=your-secret-key
Same client. Same code. Different env vars.
A single GPU isn't as serial as you might think. vLLM's continuous batching interleaves tokens from multiple concurrent requests, so throughput under load is much better than single-request numbers suggest. You hit the ceiling when the batch size saturates GPU memory, not when a second request comes in. The fix is horizontal: multiple vLLM instances behind a load balancer.
KV cache grows with context length. Every token in the context window lives in GPU memory as a KV cache entry. A 70B model at 32k context fills an A100 fast. Keep --max-model-len short (4k-8k) until you actually need more, and watch vLLM's KV cache usage on its /metrics endpoint before you assume you have headroom.
Cold starts hurt on big models. A 70B model takes 2-4 minutes to load from disk into VRAM. Ollama and vLLM both keep loaded models warm, but if the server restarts or you swap models under load you'll feel it. Keep one model per server and don't hot-swap.
Prompt injection is still your problem. If user input goes directly into a system prompt or tool call without sanitization, you're exposed the same way you would be with a SQL string. The model being local doesn't change that boundary.
The path from "I'm calling OpenAI" to "I'm running my own endpoint" is shorter than it looks: Ollama locally, vLLM on a rented GPU, Caddy in front, same TypeScript client throughout. The tricky parts are hardware sizing and recognizing when Ollama's single-user model stops being enough — which for most teams is later than they expect.
At the usage volumes where self-hosting makes financial sense, your data also stops leaving your infrastructure. That's not a small thing.