What Is a Context Window?
A context window (also called context length) is the maximum number of tokens that a large language model can process in a single interaction. It defines the total "working memory" of the model — encompassing the system prompt, conversation history, retrieved documents, and the generated response. Any information beyond the context window is invisible to the model.
Why Context Windows Matter
The context window directly impacts what an LLM can do:
- Small window (4K tokens) — Can handle short conversations and simple queries
- Medium window (32K–128K tokens) — Can process long documents, code files, or extended conversations
- Large window (200K–1M+ tokens) — Can analyze entire books, codebases, or massive document collections
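A quick way to see which tier a piece of text falls into is to estimate its token count. The heuristic below uses the common rule of thumb of roughly four characters per token for English text; it is a sketch, not a real tokenizer, and actual counts vary by model and content.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text.
    Real tokenizers (BPE-based) will differ, especially for code
    and non-English text."""
    return max(1, len(text) // 4)

def fits_in_window(text: str, window_tokens: int = 4_000) -> bool:
    """Check whether a text plausibly fits a given context window."""
    return estimate_tokens(text) <= window_tokens
```

For anything near the limit, use the model provider's actual tokenizer rather than this heuristic.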
Context Window Sizes
| Model | Context Window | Approximate Pages |
|---|---|---|
| GPT-3.5 | 4K / 16K tokens | 3–12 pages |
| GPT-4 Turbo | 128K tokens | ~96 pages |
| Claude 3.5 Sonnet | 200K tokens | ~150 pages |
| Gemini 1.5 Pro | 1M+ tokens | ~750+ pages |
| Llama 3.1 | 128K tokens | ~96 pages |
How Context Windows Work
Input + Output = Total Context
The context window includes everything — your prompt, system instructions, conversation history, retrieved documents, AND the model's response. A 128K context window means the sum of all input and output tokens cannot exceed 128K.
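Because input and output share one budget, a long prompt directly shrinks the space left for the response. A minimal sketch of that accounting (the 128K window and 4,096-token output cap are illustrative defaults, not any specific model's limits):

```python
def remaining_output_budget(input_tokens: int,
                            context_window: int = 128_000,
                            max_output: int = 4_096) -> int:
    """Tokens left for the model's response after the prompt is
    counted against the shared context window, capped by the
    model's per-response output limit."""
    remaining = context_window - input_tokens
    return max(0, min(remaining, max_output))
```

For example, a 127,000-token prompt against a 128K window leaves only 1,000 tokens for the response, even if the model could otherwise generate more.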
Attention Mechanism
The transformer architecture processes all tokens in the context window through self-attention, where every token can attend to every other token. This is why longer context windows are computationally expensive — the cost scales quadratically with length.
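The quadratic cost is easy to see by counting attention score computations: every token attends to every other token, so an n-token context requires on the order of n² pairwise scores.

```python
def attention_pairs(n_tokens: int) -> int:
    """Self-attention computes one score per (query, key) pair,
    so the work grows with the square of the sequence length."""
    return n_tokens * n_tokens

# Doubling the context quadruples the attention work.
ratio = attention_pairs(8_192) / attention_pairs(4_096)  # 4.0
```

This is why going from a 4K to a 1M context is not a 250x increase in attention cost but, naively, a 62,500x one; long-context models rely on optimizations (and often approximations) to stay tractable.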
Context Window vs. Memory
LLMs have no true long-term memory — the context window is their entire "working memory." Once a conversation exceeds the context window, earlier messages are typically:
- Truncated (removed from the start)
- Summarized (compressed into shorter form)
- Lost entirely
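The truncation strategy can be sketched in a few lines: walk the conversation from newest to oldest, keep messages until the token budget runs out, and drop the rest. The per-message token counter here reuses the rough 4-characters-per-token heuristic and is an assumption, not a real tokenizer.

```python
def truncate_history(messages: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined (estimated)
    token count fits the budget; older messages fall off the front."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):          # newest first
        cost = max(1, len(msg) // 4)        # rough token estimate
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order
```

Summarization-based approaches replace the dropped prefix with a compressed summary instead of discarding it outright.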
Managing Context Effectively
- Retrieval-Augmented Generation (RAG) — Retrieve only relevant information instead of stuffing everything into context
- Conversation Summarization — Periodically summarize long conversations to save space
- Chunking — Break large documents into relevant sections and only include what's needed
- System Prompt Optimization — Keep system instructions concise
- Priority Ordering — Place the most important information where the model attends most strongly
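The chunking strategy above can be sketched as a simple splitter. Overlapping the chunks is a common trick so that a sentence falling on a chunk boundary still appears intact in at least one chunk; the character-based sizes here are illustrative (production systems usually chunk by tokens or by semantic units like paragraphs).

```python
def chunk_document(text: str, chunk_size: int = 500,
                   overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks so
    content near a boundary is never cut out of every chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

In a RAG pipeline, only the chunks most relevant to the query are then placed into the context window.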
The "Lost in the Middle" Problem
Research shows that LLMs pay more attention to information at the beginning and end of the context window, sometimes missing crucial details in the middle. This means placement of information within the context matters, not just whether it's included.
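One practical response to this finding is to reorder retrieved passages so the most relevant ones land at the edges of the context rather than the middle. The interleaving scheme below is one simple way to do that; it is a sketch of the idea, not a method prescribed by the research.

```python
def order_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Given chunks sorted most-relevant-first, alternate them
    between the front and back of the context so the least
    relevant chunks end up in the weakly-attended middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked 1 (best) to 5, this yields the order 1, 3, 5, 4, 2: the two strongest chunks sit at the start and end, and the weakest sits in the middle.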