Large Language Models (LLMs) have become essential tools for software developers. With their increasing usage, it's crucial to understand what tokens are, how to count them, and how to manage their costs effectively.
A token is a piece of text that the model processes as a single unit. It can be as short as a single character or as long as a whole word, depending on the language and the specific text. For example, the phrase "Hello, world!" might be broken down into the tokens "Hello", ",", " world", and "!". In essence, tokens are the building blocks that LLMs use to understand and generate language.
Encodings are the methods used to convert text into tokens that the model can understand. Different models use different encoding schemes, which affects how text is tokenized. For example, OpenAI models like GPT-4 and GPT-3.5 use the `cl100k_base` encoding. The encoding determines how efficiently the model processes text and, ultimately, the cost of using the model. Other ecosystems use their own tokenization methods, which can segment text quite differently: Hugging Face models ship with model-specific tokenizers, and Google's BERT uses WordPiece, for instance.
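To make this concrete, here is a small sketch comparing how two OpenAI encodings handle the same string. It uses the `tiktoken` library introduced below, and the encoding names are ones that library exposes:

```typescript
import { get_encoding } from "tiktoken";

// The same text can be split differently by different encodings.
for (const name of ["cl100k_base", "p50k_base"] as const) {
  const enc = get_encoding(name);
  console.log(name, "->", enc.encode("Hello, world!").length, "tokens");
  enc.free(); // release the WASM-held memory
}
```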
We can use `tiktoken` to count tokens for OpenAI models. This library uses WebAssembly for efficient token counting. To get started, install it using:
Here's a snippet to count tokens in a JSON file containing an array of items:
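A minimal sketch, assuming a Node.js script, a hypothetical `items.json` file containing a top-level array of strings, and the `cl100k_base` encoding:

```typescript
import { readFileSync } from "node:fs";
import { get_encoding } from "tiktoken";

// Load a JSON file that holds an array of strings (hypothetical path).
const items: string[] = JSON.parse(readFileSync("items.json", "utf8"));

const enc = get_encoding("cl100k_base");

// Token count for each item, then total / max / average.
const counts = items.map((item) => enc.encode(item).length);
const total = counts.reduce((sum, n) => sum + n, 0);
const max = Math.max(...counts);
const average = total / counts.length;

console.log({ total, max, average });

// The encoder holds WASM-managed memory, so free it when done.
enc.free();
```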
Most newer models in the GPT family use the `cl100k_base` encoding. This includes models like GPT-4, GPT-3.5, and even embedding models such as `text-embedding-3-small` and `text-embedding-3-large`.
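If you'd rather not hard-code the encoding name, the library can resolve it from a model name; a small sketch (the model string is just an example):

```typescript
import { encoding_for_model } from "tiktoken";

// Look up the encoding for a specific model instead of naming it directly.
const enc = encoding_for_model("gpt-4"); // resolves to cl100k_base
console.log(enc.encode("Hello, world!").length);
enc.free();
```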
I created this simple calculator for you to count tokens directly in your browser. You can upload files or enter text and check the total token count, the maximum token count (if your input is an array of items), and the average token count.
To understand the cost implications, check OpenAI's pricing page. Remember that token count can also impact the performance and accuracy of the responses the models generate.
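As a back-of-the-envelope sketch, cost scales linearly with token count; the rates below are hypothetical placeholders, not current prices:

```typescript
// Estimate request cost from token counts.
// NOTE: these per-token rates are hypothetical placeholders;
// always check OpenAI's pricing page for current numbers.
const INPUT_RATE_PER_1K = 0.01;  // $ per 1K input tokens (hypothetical)
const OUTPUT_RATE_PER_1K = 0.03; // $ per 1K output tokens (hypothetical)

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * INPUT_RATE_PER_1K +
    (outputTokens / 1000) * OUTPUT_RATE_PER_1K
  );
}

console.log(estimateCost(1_200, 400).toFixed(4)); // "0.0240"
```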
In short, use `tiktoken` for efficient token counting with OpenAI models.