How AI API Tokens Work: A Plain-English Guide

When you send a message to an AI model through an API, the text does not travel as raw characters. It is first split into tokens, the fundamental unit of measurement for large language models. Understanding tokens matters because every API provider charges per token consumed, and the same budget can stretch very differently depending on the model you choose.

A token is roughly three-quarters of a word in English: the word “token” is typically one token, “tokenization” is two or three, and a space often merges with the following word. In practice, 1,000 tokens correspond to approximately 750 words of prose. Punctuation, whitespace, and non-Latin scripts each tokenize differently, so exact counts vary, but the rule of about 0.75 words per token is reliable enough for budgeting. Providers publish per-million-token rates, so dividing your word count by 750,000 (which converts words into millions of tokens) and multiplying by the rate gives you a fast estimate. For example, 7,500 words is about 10,000 tokens, or 0.01 million tokens, so at a rate of $5 per million it costs about $0.05. The Zylo pricing calculator automates this arithmetic for every model in the catalogue.
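The budgeting rule above can be sketched in a few lines of Python. This is a rough estimate only, assuming the 0.75 words-per-token rule of thumb; the function names and the $5 rate in the example are illustrative, not part of any provider's SDK.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose; real counts vary by tokenizer


def estimate_tokens(word_count: int) -> int:
    """Approximate the token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(word_count: int, rate_per_million: float) -> float:
    """Approximate cost in dollars at a per-million-token rate."""
    return estimate_tokens(word_count) / 1_000_000 * rate_per_million


# 750 words is about 1,000 tokens.
print(estimate_tokens(750))

# 750,000 words is about one million tokens, so the cost equals the rate itself
# (here a hypothetical $5 per million tokens).
print(estimate_cost(750_000, 5.0))
```

For exact figures, a real tokenizer for the specific model would replace the word-count heuristic.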

Input tokens versus output tokens

Every API call involves two distinct token streams: the input (also called the prompt) and the output (also called the completion). Input tokens encompass everything you send: your system prompt, conversation history, and the current user message. Output tokens are the text the model generates in reply. Providers meter these separately because generation is computationally more expensive than encoding: the model must run a full forward pass for every single output token it produces, whereas the entire input is processed in a single parallel pass. This asymmetry is reflected in the pricing tables. At Zylo, for example, Claude Opus 4.8 costs $5 per million input tokens and $25 per million output tokens (prices as of June 2026; check the live model catalogue for current rates). At those rates, a 2,000-word report costs five times more to generate as output than the same 2,000 words would cost to send as input. Choosing a model with a lower output rate, or instructing the model to be concise, can reduce costs substantially.
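To make the asymmetry concrete, here is a minimal sketch that prices the same 2,000 words once as input and once as output. It uses the Claude Opus 4.8 rates quoted above and assumes the 0.75 words-per-token rule; `request_cost` is a hypothetical helper, not an SDK function.

```python
# Rates quoted in the text for Claude Opus 4.8 (June 2026); check the live catalogue.
INPUT_RATE = 5.00    # dollars per million input tokens
OUTPUT_RATE = 25.00  # dollars per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars for a single API call."""
    return (
        input_tokens / 1_000_000 * INPUT_RATE,
        output_tokens / 1_000_000 * OUTPUT_RATE,
    )


# A 2,000-word report is roughly 2,667 tokens under the 0.75 rule of thumb.
report_tokens = round(2_000 / 0.75)

as_input, _ = request_cost(report_tokens, 0)   # the report sent as a prompt
_, as_output = request_cost(0, report_tokens)  # the report generated as a reply

print(f"as input:  ${as_input:.4f}")
print(f"as output: ${as_output:.4f}")
print(f"ratio:     {as_output / as_input:.1f}x")
```

The ratio depends only on the two rates, so it stays at five however long the report is.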

The context window

Every model has a context window: the maximum number of tokens, input plus output combined, that fit inside a single API request. Think of it as the model’s working memory. If your conversation history plus your new message plus the expected reply exceeds the window, the oldest tokens must be truncated or summarized before sending. A model with a 200,000-token context window can hold roughly 150,000 words in a single call — equivalent to a full novel — while a model with a 32,000-token window is better suited to shorter documents. Larger contexts are useful for tasks like full-codebase review or lengthy contract analysis, but they can also increase your input token bill if you inadvertently pass more text than the task requires. Keeping system prompts tight and trimming stale conversation history are two of the most effective ways to control cost without switching models. When your use case only demands short exchanges, a smaller-context, lower-cost model is almost always the better choice; the full cost breakdown guide walks through how to match model capabilities to real workloads.
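A pre-flight check for the window limit might look like the sketch below. The function names are illustrative. It assumes you already have a token count for the prompt and have chosen a maximum reply length to reserve.

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved reply fits inside the window."""
    return input_tokens + max_output_tokens <= context_window


def input_budget(context_window: int, max_output_tokens: int) -> int:
    """Tokens left for the system prompt, history, and new message."""
    return max(context_window - max_output_tokens, 0)


# A 30,000-token prompt with 4,000 tokens reserved for the reply:
print(fits_context(30_000, 4_000, 32_000))   # False: overflows a 32,000-token window
print(fits_context(30_000, 4_000, 200_000))  # True: fits a 200,000-token window

# With a 32,000-token window, reserving 4,000 for output leaves 28,000 for input.
print(input_budget(32_000, 4_000))
```

Reserving the reply length up front matters because the window covers input and output together.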

Counting and trimming tokens before you send

Because tokens drive cost, it pays to estimate and trim them before a request ever leaves your application. Most SDKs and a handful of small libraries can count tokens locally so you know the size of a prompt in advance, which lets you catch a request that would overflow the context window or blow past a budget before you are charged for it. The biggest savings come from the input side, where the same text is paid for on every call: prune system prompts to the instructions that actually change behavior, drop few-shot examples the model no longer needs, and summarize or window long conversation histories instead of resending them verbatim. On the output side, setting a sensible maximum token limit prevents a model from rambling far past what the task requires. None of this changes which model you call — it simply removes tokens you were never getting value from, and on a high-volume workload that discipline often saves more than switching providers ever would.
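One way to window a long conversation is to keep the system prompt and then add messages from newest to oldest until a token budget is used up. The sketch below is an illustration under stated assumptions: `rough_token_count` is a stand-in that applies the 0.75 words-per-token rule, and in a real application you would pass a proper tokenizer for your model as the `count` argument.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Rough estimate (words / 0.75). Swap in a real tokenizer for accuracy."""
    return round(len(text.split()) / 0.75)


def trim_history(
    system_prompt: str,
    history: List[str],
    budget: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the most recent messages that fit in the budget after the system prompt."""
    remaining = budget - count(system_prompt)
    kept: List[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count(message)
        if cost > remaining:
            break  # stop at the first message that no longer fits
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "first old message",
    "second old message",
    "third new message",
    "fourth new message",
]
# The system prompt takes about 3 tokens and each message about 4,
# so a 12-token budget leaves room for the two newest messages.
print(trim_history("be brief", history, budget=12))
```

Dropping the oldest messages is the simplest policy. Summarizing them into one short message, as the text suggests, preserves more context at the price of an extra model call.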

How token pricing works in practice

API providers quote token prices per one million tokens, which can make the numbers look misleadingly small. Translating them into a concrete task clarifies the real economics. Consider a customer-support bot that receives a 200-token system prompt and a 150-token user message and replies with 250 tokens, for a total of 350 input tokens and 250 output tokens per exchange. At GPT-OSS 120B rates ($0.039 per million input, $0.18 per million output as of June 2026), that single exchange costs roughly $0.000014 in input and $0.000045 in output, or about $0.00006 in total, well under a hundredth of a cent. Scale to 500,000 exchanges per month and the monthly bill approaches $30. Run the same volume on Claude Opus 4.8 ($5 input and $25 output per million) and each exchange costs $0.008, so the monthly cost comes to about $4,000. The task complexity, the required output quality, and the acceptable latency all feed into which model is economically justified. Browsing the cheapest LLM APIs comparison is a good starting point for mapping workloads to tiers before committing to a model.
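The support-bot arithmetic can be reproduced with a short script. This is a sketch using only the rates and token counts quoted above; `monthly_cost` is a hypothetical helper, and the rates should be refreshed from the live catalogue before you rely on the totals.

```python
# Dollars per million tokens as (input, output), as quoted in the text for June 2026.
RATES = {
    "GPT-OSS 120B": (0.039, 0.18),
    "Claude Opus 4.8": (5.00, 25.00),
}


def monthly_cost(input_tokens: int, output_tokens: int, exchanges: int,
                 rates: tuple[float, float]) -> float:
    """Total dollars per month for a fixed-size exchange repeated `exchanges` times."""
    input_rate, output_rate = rates
    per_exchange = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return per_exchange * exchanges


# 350 input tokens and 250 output tokens per exchange, 500,000 exchanges per month.
for model, rates in RATES.items():
    print(f"{model}: ${monthly_cost(350, 250, 500_000, rates):,.2f} per month")
```

Changing the token counts or the exchange volume shows quickly how sensitive the bill is to reply length, since output tokens dominate the cost in both rows.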

Frequently asked questions

What is a token in an AI API?

A token is the basic unit a language model uses to process text. In English, one token is roughly three-quarters of a word, so 1,000 tokens correspond to approximately 750 words. Punctuation, spaces, and non-Latin characters tokenize differently, but the rule of about 0.75 words per token is a reliable planning estimate.

Why does output cost more than input?

Generating output requires the model to run a full forward pass for every single token it produces, whereas input is processed in a single parallel pass across the entire prompt. This difference in compute is why providers charge more per output token than per input token across nearly every model.

What is a context window?

The context window is the maximum number of tokens, input plus output combined, that fit in a single API request. If your prompt plus expected reply exceeds the window, you must truncate or summarize older content. Larger windows allow longer documents but can raise input costs if you pass more text than the task needs.

Start building on Zylo

One OpenAI-compatible API for Claude, GPT, Gemini, DeepSeek and more. Free API key, local payments, no card required.
