
    AI Architecture

    What Is a Context Window?

    AsterMind Team

    A context window (also called context length) is the maximum number of tokens that a large language model can process in a single interaction. It defines the total "working memory" of the model — encompassing the system prompt, conversation history, retrieved documents, and the generated response. Any information beyond the context window is invisible to the model.
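Exact token counts depend on the model's tokenizer, but a common rule of thumb for English text is roughly 4 characters per token. A minimal sketch of that heuristic (the function name and the heuristic itself are illustrative, not any particular vendor's API):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token
    heuristic for English text; real tokenizers vary by model."""
    return max(1, len(text) // 4)

prompt = "Summarize the attached report in three bullet points."
print(estimate_tokens(prompt), "tokens (approximate)")
```

For precise counts you would use the model's own tokenizer rather than a heuristic.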

    Why Context Windows Matter

    The context window directly impacts what an LLM can do:

    • Small window (4K tokens) — Can handle short conversations and simple queries
    • Medium window (32K–128K tokens) — Can process long documents, code files, or extended conversations
    • Large window (200K–1M+ tokens) — Can analyze entire books, codebases, or massive document collections

    Context Window Sizes

    Model             | Context Window   | Approximate Pages
    GPT-3.5           | 4K / 16K tokens  | 3–12 pages
    GPT-4 Turbo       | 128K tokens      | ~96 pages
    Claude 3.5 Sonnet | 200K tokens      | ~150 pages
    Gemini 1.5 Pro    | 1M+ tokens       | ~750+ pages
    Llama 3.1         | 128K tokens      | ~96 pages
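The page estimates above follow a fixed ratio: 4K tokens ≈ 3 pages, i.e. roughly 1,333 tokens per page. A small sketch of that conversion (the constant is derived from the table, not an official figure):

```python
TOKENS_PER_PAGE = 4_000 / 3  # ratio implied by the table: 4K tokens ~ 3 pages

def tokens_to_pages(tokens: int) -> int:
    """Convert a token count to an approximate page count."""
    return round(tokens / TOKENS_PER_PAGE)

for window in (4_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9} tokens ~ {tokens_to_pages(window)} pages")
```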

    How Context Windows Work

    Input + Output = Total Context

    The context window includes everything — your prompt, system instructions, conversation history, retrieved documents, AND the model's response. A 128K context window means the sum of all input and output tokens cannot exceed 128K.
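A practical consequence: the more input you send, the less room remains for the response. A minimal budget check (illustrative helper, not a specific SDK call):

```python
def remaining_output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's response after all input
    (prompt, history, retrieved documents) is accounted for."""
    return max(0, context_window - input_tokens)

# With a 128K window and 120K tokens of input, only 8K remain for output.
print(remaining_output_budget(128_000, 120_000))
```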

    Attention Mechanism

    The transformer architecture processes all tokens in the context window through self-attention, where every token can attend to every other token. This is why longer context windows are computationally expensive — the cost scales quadratically with length.
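The quadratic cost is easy to see by counting token-to-token comparisons: full self-attention computes a score for every pair of tokens, so doubling the context quadruples the work. A back-of-the-envelope sketch:

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token comparisons in full self-attention:
    every token attends to every other token, giving n^2 pairs."""
    return n_tokens ** 2

# Doubling the context length quadruples the attention work.
print(attention_pairs(8_000))
print(attention_pairs(16_000))
```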

    Context Window vs. Memory

    LLMs have no true long-term memory — the context window is their entire "working memory." Once a conversation exceeds the context window, earlier messages are either:

    • Truncated (removed from the start)
    • Summarized (compressed into shorter form)
    • Lost entirely
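The truncation strategy above can be sketched in a few lines: drop the oldest messages until the history fits the budget. The `count_tokens` callback is a hypothetical caller-supplied token counter, not a real library function:

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the conversation fits
    within the token budget (truncation from the start)."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # remove the earliest message first
    return kept

history = ["msg one", "msg two longer", "msg three"]
# Using word count as a stand-in token counter for the example.
print(truncate_history(history, max_tokens=6, count_tokens=lambda m: len(m.split())))
```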

    Managing Context Effectively

    • Retrieval-Augmented Generation (RAG) — Retrieve only relevant information instead of stuffing everything into context
    • Conversation Summarization — Periodically summarize long conversations to save space
    • Chunking — Break large documents into relevant sections and only include what's needed
    • System Prompt Optimization — Keep system instructions concise
    • Priority Ordering — Place the most important information where the model attends most strongly
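The chunking strategy above can be sketched as a simple splitter that produces overlapping word-based chunks, so a retriever can later select only the relevant sections (chunk and overlap sizes here are illustrative defaults):

```python
def chunk_document(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split a document into overlapping word chunks so only the
    relevant sections need to be placed in the context window."""
    words = text.split()
    step = chunk_size - overlap  # overlap preserves context across boundaries
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Tiny example: 6 words, chunks of 4 with an overlap of 2.
print(chunk_document("a b c d e f", chunk_size=4, overlap=2))
```

Production systems often chunk on sentence or paragraph boundaries instead of raw word counts, but the budget logic is the same.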

    The "Lost in the Middle" Problem

    Research shows that LLMs pay more attention to information at the beginning and end of the context window, sometimes missing crucial details in the middle. This means placement of information within the context matters, not just whether it's included.
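One common mitigation is to reorder retrieved chunks so the highest-ranked ones land at the edges of the context and the lowest-ranked ones fall in the middle. A minimal sketch of that reordering (the function is illustrative; some RAG libraries ship a similar "long-context reorder" utility):

```python
def order_for_attention(chunks_by_relevance):
    """Interleave relevance-ranked chunks so the most relevant sit at
    the start and end of the context, and the least relevant end up
    in the middle, where attention is weakest."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# "A" is most relevant, "E" least; A goes first, B goes last, E lands mid-context.
print(order_for_attention(["A", "B", "C", "D", "E"]))
```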

    Further Reading