Large Language Models (LLMs) have become essential tools for software developers. With their increasing usage, it's crucial to understand what tokens are, how to count them, and how to manage their costs effectively.
A token is a piece of text that the model processes as a single unit. It can be as short as a single character or as long as a whole word, depending on the language and the specific text. For example, the phrase "Hello, world!" might be broken down into the tokens "Hello", ",", " world", and "!". In essence, tokens are the building blocks that LLMs use to understand and generate language.
Encodings are the methods used to convert text into tokens that the model can understand. Different models use different encoding schemes, which affects how text is tokenized. For example, OpenAI models like GPT-4 and GPT-3.5 use the `cl100k_base` encoding. The encoding determines how efficiently the model processes text and, ultimately, the cost of using the model. Other ecosystems use their own tokenization methods, which can segment text quite differently: Hugging Face models ship with model-specific tokenizers, and Google's BERT uses WordPiece, for instance.
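To make this concrete, here is a small sketch comparing how two OpenAI encodings handle the same string. It uses the `tiktoken` library introduced below, and the encoding names are ones that library exposes:

```typescript
import { get_encoding } from "tiktoken";

// The same text can be split differently by different encodings.
for (const name of ["cl100k_base", "p50k_base"] as const) {
  const enc = get_encoding(name);
  console.log(name, "->", enc.encode("Hello, world!").length, "tokens");
  enc.free(); // release the WASM-held memory
}
```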
We can use `tiktoken` to count tokens for OpenAI models. This library uses WebAssembly for efficient token counting. To get started, install it using:
Here's a snippet to count tokens in a JSON file containing an array of items:
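A minimal sketch, assuming a Node.js script, a hypothetical `items.json` file containing a top-level array of strings, and the `cl100k_base` encoding:

```typescript
import { readFileSync } from "node:fs";
import { get_encoding } from "tiktoken";

// Load a JSON file that holds an array of strings (hypothetical path).
const items: string[] = JSON.parse(readFileSync("items.json", "utf8"));

const enc = get_encoding("cl100k_base");

// Token count for each item, then total / max / average.
const counts = items.map((item) => enc.encode(item).length);
const total = counts.reduce((sum, n) => sum + n, 0);
const max = Math.max(...counts);
const average = total / counts.length;

console.log({ total, max, average });

// The encoder holds WASM-managed memory, so free it when done.
enc.free();
```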
Most newer models in the GPT family use the `cl100k_base` encoding. This includes models like GPT-4, GPT-3.5, and even embedding models such as `text-embedding-3-small` and `text-embedding-3-large`.
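If you'd rather not hard-code the encoding name, the library can resolve it from a model name; a small sketch (the model string is just an example):

```typescript
import { encoding_for_model } from "tiktoken";

// Look up the encoding for a specific model instead of naming it directly.
const enc = encoding_for_model("gpt-4"); // resolves to cl100k_base
console.log(enc.encode("Hello, world!").length);
enc.free();
```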
I created this simple calculator for you to count tokens directly in your browser. You can upload files or enter text and check the total token count, the maximum token count (if your input is an array of items), and the average token count.
To understand the cost implications, check OpenAI's pricing page. Remember that token count can also impact the performance and accuracy of the responses the models generate.
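As a back-of-the-envelope sketch, cost scales linearly with token count; the rates below are hypothetical placeholders, not current prices:

```typescript
// Estimate request cost from token counts.
// NOTE: these per-token rates are hypothetical placeholders;
// always check OpenAI's pricing page for current numbers.
const INPUT_RATE_PER_1K = 0.01;  // $ per 1K input tokens (hypothetical)
const OUTPUT_RATE_PER_1K = 0.03; // $ per 1K output tokens (hypothetical)

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * INPUT_RATE_PER_1K +
    (outputTokens / 1000) * OUTPUT_RATE_PER_1K
  );
}

console.log(estimateCost(1_200, 400).toFixed(4)); // "0.0240"
```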
In short, use `tiktoken` for efficient token counting with OpenAI models.