Guide

How Do AI APIs Work? Request to Response Explained

Sending a message to an AI model via an API feels instantaneous from the outside, but a precise sequence of steps runs between the moment your code dispatches a request and the moment a completion arrives. Understanding that sequence helps you write better prompts, interpret costs accurately, optimize for latency, and debug failures when they arise. This article traces the full journey of an AI API call — from the JSON payload you assemble to the token counts that appear on your bill — and explains the mechanics that determine what you pay and how fast you get an answer. If you are new to the concept entirely, start with what an AI API is before reading on.

Assembling and sending the request

An AI API request is an HTTP POST to an endpoint such as /v1/chat/completions. The body is a JSON object that contains at minimum a model identifier and a “messages” array. Each message in that array has a role — typically “system”, “user”, or “assistant” — and a content string. The system message sets context or behavioral instructions; the user message carries the human turn; assistant messages represent prior model replies in a multi-turn conversation. Alongside the messages you include your API key in an Authorization: Bearer header. That key authenticates the request, identifies your account, and triggers billing. Without a valid key the server returns a 401 error immediately. For a full explanation of what keys are and why they must be protected, see what AI API keys are.

Tokenization and the context window

Before the model processes your request, the provider’s infrastructure converts your text into tokens. A token is roughly three-quarters of an average English word; punctuation, spaces, and uncommon words each tokenize differently, but 1,000 tokens is a reasonable proxy for about 750 words. Every model has a context window — a maximum number of tokens it can consider at once, covering both the input you send and the output it generates. If the combined length of your messages exceeds that window the request fails or earlier context is truncated. The sum of input tokens plus output tokens is what appears in the usage object of the response and is what the provider charges against. To see how token math translates into dollar amounts, how AI API tokens work explains the arithmetic in detail.

Model inference and streaming

Once your request clears authentication and tokenization, the provider’s inference infrastructure loads or routes to the requested model and runs a forward pass through billions of parameters. The model generates output one token at a time, sampling from a probability distribution over its vocabulary at each step. In non-streaming mode the provider accumulates all output tokens before returning a single JSON response, which means your client waits for the full completion. In streaming mode — enabled by setting stream: true in the request — the server sends server-sent events as each token (or small batch of tokens) is produced, so your application can begin displaying text within milliseconds of the first token. Streaming does not change what you are billed; it only changes when the bytes arrive at your client. For latency-sensitive applications such as chat interfaces, streaming is almost always the right choice.

The response object and per-token billing

A completed non-streaming response is a JSON object that contains a “choices” array with the model’s reply, a “model” field confirming which model ran, a “finish_reason” indicating whether generation ended naturally or hit a length limit, and a “usage” object reporting prompt tokens, completion tokens, and their sum. Input and output tokens are billed separately because output generation is computationally more expensive than input processing; most providers charge more per output token as a result. As a concrete example, at June 2026 prices on Zylo AI, Claude Haiku 4.5 costs $1 per million input tokens and $5 per million output tokens, while Gemini 2.5 Flash Lite sits at $0.10 input and $0.40 output. Zylo AI charges the base provider rate with no markup on usage; the only platform fee is a flat 25 percent applied when you add credits to your account, not on each call. Full current rates are on the models page, and the developer quickstart shows how to make your first request.

Frequently asked questions

Why do input and output tokens cost different amounts?

Output generation requires the model to sample from its vocabulary one token at a time, which is computationally more intensive than processing an input prompt. Most providers charge more per output token to reflect that difference.

Does streaming change what I am billed?

No. Streaming changes when bytes arrive at your client but not how tokens are counted. You are billed for the same number of input and output tokens whether you use streaming or wait for the full response.

What happens if my request exceeds the context window?

The request either fails with an error or the provider truncates earlier context, depending on the model and configuration. Keeping track of cumulative token counts in a conversation is important for avoiding unexpected truncation.

Start building on Zylo

One OpenAI-compatible API for Claude, GPT, Gemini, DeepSeek and more. Free API key, local payments, no card required.

Get free API key