This blog outlines five essential concepts that explain how large language models process input within a context window. Using clear examples and practical insights, it covers foundational ideas like tokenization, sequence length and attention. The goal is to help readers better understand how context affects model behavior in AI applications. We also present results from an analytical model used to estimate system behavior, to show how scaling input and output sequence lengths impacts response time. The results highlight how decoding longer outputs takes significantly more time, pointing to the importance of fast memory systems like HBM in supporting efficient inference at scale. These concepts are useful for anyone working with or designing prompts for generative AI systems.
Context window versus length
When working with large language models, it’s important to understand the difference between concepts like context window, context length and sequence length. These terms are often used interchangeably, which can lead to confusion. In this blog we will define and refer to them as distinct concepts.
The context window is the model’s maximum capacity: the total number of tokens it can process at once, including both your input and the model’s output. As a simple example, let’s define the rectangle below as equivalent to a 100,000-token context window.
The context length, on the other hand, is how much you've put into that space, which is the actual number of tokens—input tokens (blue) and output tokens (green)—currently in use during a conversation. For example, if a model has a 100,000-token context window and your input uses 75,000 tokens, only 25,000 tokens remain for the model’s response before it reaches the upper limit of the window.
Sequence length typically refers to the length of a single input or output sequence within that window. It’s a more granular measure used in model training and inference to track the length of each segment of text.
The context window sets the limit for how much information a model can process, but it does not directly reflect intelligence. A larger window allows more input, yet the quality of the output often depends on how well that input is structured and used. Once the window is full, the model may lose coherence, leading to unwanted outcomes (for example, hallucinations).
Tokens aren't words
If the context window is defined by an upper limit (say 100,000), tokens are the units that measure what fits inside and it’s important to understand that tokens are not words. The words you type into a prompt are fed to a “tokenizer,” which breaks down text into tokens. A single word may be split into several tokens. For example, “strawberry” becomes three tokens and “trifle” becomes two. In other cases, a word may consist of just one token like “cake”.
We can test this with a quote from the novel “Emma” by Jane Austen.
“Seldom, very seldom, does complete truth belong to any human disclosure; seldom can it happen that something is not a little disguised or a little mistaken.”
This text contains 26 words, and when run through the tokenizer of the Mistral language model provided by lunary.ai,1 it produces 36 tokens. That’s about 0.72 words per token, or roughly three-fourths of a word per token.
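You can reproduce this kind of count yourself with a tokenizer library. The sketch below uses OpenAI’s open-source tiktoken package with its cl100k_base encoding; that is a different tokenizer than the Mistral one used above, so the exact token count will differ slightly.

```python
# A quick words-vs-tokens check on the Austen quote. Uses OpenAI's
# open-source tiktoken library (pip install tiktoken); this is a different
# tokenizer than the Mistral one referenced above, so the exact token
# count will differ.
import tiktoken

quote = (
    "Seldom, very seldom, does complete truth belong to any human "
    "disclosure; seldom can it happen that something is not a little "
    "disguised or a little mistaken."
)

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(quote)
words = quote.split()

print(f"words:  {len(words)}")                             # 26
print(f"tokens: {len(tokens)}")                            # tokenizer-dependent
print(f"words per token: {len(words) / len(tokens):.2f}")
```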
The ratio varies, but for English words you might average around 0.75 words per token. That’s why a model with a 100,000-token context window (per user) does not necessarily fit 100,000 words. In practice, you might fit closer to 75,000 English words or fewer, depending on the text.
estimated tokens ≈ words × 1.33
To further check the token-to-word ratio at scale, we ran a quick analysis using eight well-known literary works from Project Gutenberg, a library of more than 75,000 free e-books. First, we counted the words in each book, then ran the texts through a tokenizer to get the token counts. After comparing the numbers, we found that the average ratio was about 0.75 words per token.
Knowing this ratio can help everyday users get more out of their interactions with AI. Most AI platforms, like ChatGPT or Claude, operate with token-based constraints. That is, they process text in tokens, not words, so it’s easy to misjudge how much content you can actually fit into a prompt or response. Because usage is often measured in tokens rather than words, knowing the ratio makes you aware of any limits so you can plan your inputs more strategically. For example, if a model has a 4,000-token input limit, that’s roughly 3,000 words. This is good to know when feeding a model a long document or dataset for tasks like finding key insights or answering questions.
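As a rough planning aid, that ratio can be wrapped in a small helper. The sketch below assumes the approximate 0.75 words-per-token average for English text; actual counts depend on the tokenizer and the text, so treat the output as an estimate, not a hard limit.

```python
# Rough token-budget helpers based on the ~0.75 words-per-token average
# observed above. These are estimates only; actual counts depend on the
# tokenizer and the text.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(word_count / WORDS_PER_TOKEN)

def words_that_fit(token_limit: int) -> int:
    """Estimate how many English words fit within a given token budget."""
    return round(token_limit * WORDS_PER_TOKEN)

print(estimate_tokens(3_000))    # a 3,000-word document -> ~4,000 tokens
print(words_that_fit(100_000))   # a 100,000-token window -> ~75,000 words
```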
Attention is not equally distributed within the context window
AI hallucinations are often misunderstood as quirky behavior or as a sign that a language model is buggy and unreliable. But hallucinations are not random; they often stem from how a model processes and prioritizes information, which is shaped by factors like how well the model is trained and how it distributes attention. In transformer-based models like GPT or Claude, attention is the mechanism that helps the model decide which parts of the context are most relevant when generating a response. To better understand the concept of attention, imagine being at a noisy cocktail party. If someone calls your name, you instinctively tune in.
“Frodo! Over here!”
But what if four people call your name at once from different corners of the room?
“Frodo! It’s me, Sam!”
“Frodo! Come quick!”
“Frodo! Look this way.”
“Frodo ... yesss, precious Frodo ...”
You hear them all, but your focus is now split. You might even pay more attention to the voice you recognize or the one closest to you. Each sound gets a fraction of your attention, but not all equally. It’s not a perfect analogy but this is one way you can conceive of how attention works in large language models. The model pays attention to all tokens in the context window, but it gives more weight to some than to others. And that’s why attention in large language models is often described as “weighted”, meaning that not all tokens are treated equally. This uneven distribution is key to understanding how models might prioritize information and why they sometimes appear to lose focus.
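To make the idea of “weighted” attention concrete, here is a deliberately simplified sketch of scaled dot-product attention for a single query over a handful of tokens. Real models use learned projections, many attention heads and thousands of tokens, but the softmax weighting shown here is the same basic mechanism; the vectors below are random placeholders, not learned values.

```python
# A deliberately simplified look at "weighted" attention: one query vector
# attends over a handful of token vectors via scaled dot-product attention.
# The vectors are random placeholders, not learned values.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                         # toy embedding size
tokens = ["Frodo!", "It's", "me,", "Sam!", "precious"]
keys = rng.normal(size=(len(tokens), d))      # one key vector per token
query = rng.normal(size=d)                    # the position being generated

scores = keys @ query / np.sqrt(d)            # similarity scores
weights = softmax(scores)                     # attention weights, summing to 1

for tok, w in zip(tokens, weights):
    print(f"{tok:10s} attention weight = {w:.2f}")
# Every token gets *some* attention, but the weights are far from equal,
# which is exactly what "weighted" attention means.
```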
More context may or may not mean better answers
A model can scan all tokens within the context window, but it doesn’t consider each token with equal interest. As the window fills (say, up to 100,000 tokens), the model’s attention becomes more diffuse. In its attempt to keep track of everything, clarity may diminish.
When this happens, the model’s grip on the conversation loosens, and a user might experience slower, less coherent responses or confusion between earlier and later parts of the conversation thread. Hallucinations, from the Latin hallucinat-, “gone astray in thought,” often appear at this edge. It’s important to understand that these occurrences are not signs that the model is malfunctioning. They are an indication that the model is reaching its threshold and operating at capacity, and this is where it may struggle to maintain coherence or relevance across long spans of input.
From the model’s perspective, earlier tokens are still visible. But as the window fills up and its attention becomes more distributed, the precision of response may degrade. The model might misattribute facts from previous prompts or fuse unrelated ideas into something that sounds coherent but isn’t. In the case of hallucinations, the model isn’t lying. It’s reaching for a reasonable response from fragments it can no longer fully distinguish, making a guess under the strain of limited attention. And to be fair, the model is working with what it has, trying to make sense of a conversation that’s grown too big to reliably focus on. Understanding attention in this way helps explain why more context doesn’t always lead to better answers.2
That said, long context windows (greater than 200,000 and now reaching 1 million or more tokens) can be genuinely useful, especially for complex reasoning and emerging applications like video processing. Newer models are being trained to handle longer contexts more effectively. With better architecture and training, models can more effectively manage attention across inputs, reducing hallucinations and improving responses. So, while more context doesn’t always lead to better answers, newer models are getting better at staying focused, even when the conversation gets really long.
Sequence length affects response time
Following the explanation of attention, it’s useful to understand how sequence length affects inference. We can now ask a practical question: What happens when we vary the sequence length?
The input sequence length affects time to first token (TTFT), the time from submitting a request to receiving the first output token. TTFT is primarily a measure of GPU compute performance, as it reflects how quickly the GPU can process the entire input and produce the first output token. In contrast, varying the output sequence length affects inter-token latency (ITL), the time between each generated token.b This latency is more closely tied to memory bandwidth.
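If your serving stack streams tokens back (as most inference servers do), both metrics can be measured with simple wall-clock timing. The sketch below is a generic measurement helper; `generate_stream` is a hypothetical placeholder for whatever streaming interface you use, not a specific API.

```python
# A generic way to measure TTFT, ITL and end-to-end latency from any token
# stream using wall-clock timing. `generate_stream` is a hypothetical
# placeholder for whatever streaming interface your serving stack exposes;
# it just needs to yield output tokens one at a time.
import time
from typing import Callable, Iterable, Tuple

def measure_latency(generate_stream: Callable[[str], Iterable[str]],
                    prompt: str) -> Tuple[float, float, float]:
    start = time.perf_counter()
    arrivals = []
    for _ in generate_stream(prompt):      # assumes at least one token arrives
        arrivals.append(time.perf_counter())

    ttft = arrivals[0] - start             # time to first token (prefill)
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # mean inter-token latency (decode)
    e2e = arrivals[-1] - start             # end-to-end latency
    return ttft, itl, e2e
```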
To explore this further, we used a first-order analytical model to estimate end-to-end latency during LLM inference. We ran the model using Llama 3 70B on a single GPU with high-bandwidth memory (HBM3E 12H, 36GB across 8 placements), and a context window of 128,000 tokens.c
b Key inference metrics: time to first token (TTFT), how long it takes for the model to begin generating output after receiving input (prefill performance); inter-token latency (ITL), the time between each generated token (decode performance); and end-to-end latency, the time from submitting a query to receiving the complete response.3
c Performance estimates are based on an in-house analytical model designed to approximate inference behavior. The system modeled in this study is based on an estimated GPU configuration that reflects general characteristics of commercially available hardware platforms. While not representative of any specific product, the configuration was selected to support the technical objectives of the analysis. These estimates do not reflect optimized software or hardware configurations and may differ from real-world results.
The chart below shows the impact of increasing input sequence length (ISL) and output sequence length (OSL) on end-to-end latency. Each measurement was taken with a batch size of 1 (i.e., a single request).
Figure 5. End-to-end latency per user (seconds) across input and output sequence lengths
Key takeaways
One important takeaway when measuring latency is that it takes much more time for the model to generate a long response than to process a long prompt. The model can read and understand the input all at once, which is relatively fast even for lengthy prompts. Generating a response, however, happens token by token, with each new token depending on everything generated so far. This takes more time because the model follows an autoregressive process, meaning each token is built on the ones before it. For example, increasing the input sequence length (ISL) from 2,000 to 125,000 tokens results in only about a 2x increase in end-to-end latency. In contrast, scaling the output sequence length (OSL) across the same range leads to roughly a 68x increase.d This difference arises because longer input sequences drive more prefill computation, which can process many tokens in parallel. Decode, by contrast, is inherently sequential, generating one token at a time, which takes more time and demands much more memory bandwidth.
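This asymmetry can be reproduced with a very rough back-of-the-envelope model (not the analytical model used for Figure 5): treat prefill as compute-bound and parallel across input tokens, and decode as memory-bound and sequential, re-reading the model weights for every generated token. The sketch below assumes illustrative hardware numbers (roughly 70B FP16 parameters, about 1 PFLOP/s of usable compute and 5 TB/s of memory bandwidth) rather than any specific product, so the exact ratios differ from Figure 5, but the pattern is the same: output length dominates end-to-end latency.

```python
# Back-of-the-envelope latency model (batch size 1). Prefill is treated as
# compute-bound and parallel over input tokens; decode as memory-bound and
# sequential, re-reading the weights for every generated token. The hardware
# numbers are illustrative assumptions, not a specific product, and KV-cache
# and attention costs are ignored.
PARAMS = 70e9            # ~70B parameters (Llama 3 70B class)
BYTES_PER_PARAM = 2      # FP16/BF16 weights
PEAK_FLOPS = 1.0e15      # assumed ~1 PFLOP/s of usable compute
MEM_BW = 5.0e12          # assumed ~5 TB/s of HBM bandwidth

def prefill_seconds(isl: int) -> float:
    # ~2 FLOPs per parameter per input token, processed in parallel
    return 2 * PARAMS * isl / PEAK_FLOPS

def decode_seconds(osl: int) -> float:
    # each generated token re-reads the model weights from memory
    return osl * (PARAMS * BYTES_PER_PARAM) / MEM_BW

def end_to_end_seconds(isl: int, osl: int) -> float:
    return prefill_seconds(isl) + decode_seconds(osl)

base = end_to_end_seconds(2_000, 2_000)
isl_scaling = end_to_end_seconds(125_000, 2_000) / base   # input grows, output fixed
osl_scaling = end_to_end_seconds(2_000, 125_000) / base   # output grows, input fixed
print(f"ISL 2K -> 125K: ~{isl_scaling:.1f}x end-to-end latency")
print(f"OSL 2K -> 125K: ~{osl_scaling:.1f}x end-to-end latency")
```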
The implication is that longer output sequences result in longer decode times and that means the GPU and memory subsystem remain active longer. In this context, power efficiency at the hardware level becomes especially valuable. A memory device like Micron HBM3Ee that runs using much less power than comparable high-bandwidth memory devices can complete identical inference tasks while using less energy.
d The estimates presented here come from an analytical model without optimization, so they illustrate general trends rather than peak performance.
e Micron HBM3E consumes 30% less power than comparable high-bandwidth memory devices on the market.
For a user, this insight underscores the importance of optimizing prompts and managing input length (for example, by trimming unnecessary content). If you’re building real-time applications, you can usually handle longer inputs without much trouble, but keeping the output concise helps your system stay fast and responsive.
The important role of memory for context length
Inference latency depends not only on sequence length but also on how the system manages the demands on compute and memory as it processes inputs and generates outputs. Many recently released language models now advertise context windows that exceed one million tokens. These larger context windows, when fully utilized, place greater stress on the memory subsystem, which may appear to the user as slower execution and increased runtimes. Newer memory technologies will offer higher bandwidth and larger capacity to support these larger context windows, improving response times and overall throughput (tokens per second).

But these performance gains raise questions about energy use. As inference workloads scale to millions of tokens, designing systems that use power efficiently becomes increasingly important. Systems that remain active for longer periods require more power, and memory devices designed to use less power without sacrificing bandwidth can help address this challenge. For example, Micron HBM3E consumes much less power than competing high-bandwidth memory devices, and this lower power can help reduce the amount of energy AI consumes during inference workloads involving millions of tokens.

Looking ahead, next-generation memory technologies like HBM4 and HBM4E are being designed to deliver even higher memory bandwidth and capacity while improving power efficiency. These improvements, which stem from advances in process technology (such as Micron’s use of 1-gamma DRAM), are expected to enable faster data movement at lower energy cost. As these technologies mature, they may further reduce latency and improve throughput and energy use in large-scale AI deployments.
Learn more
https://www.micron.com/educatorhub
https://www.micron.com/educatorhub/courses/what-does-it-mean-for-ai-to-know-something
1 https://lunary.ai/openai-tokenizer
2 https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf
3 https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html