

    What Is Chunking?

    AsterMind Team

    Chunking is the process of breaking down large documents into smaller, manageable segments (chunks) that can be individually embedded and retrieved in AI systems. It's a critical step in Retrieval-Augmented Generation (RAG) pipelines — the quality of chunking directly impacts the relevance and accuracy of retrieved information.

    Why Chunking Matters

    Documents can be thousands of pages long, but embedding models and LLM context windows have limits. Chunking solves this by:

    • Enabling embedding — Most embedding models have token limits (512-8192 tokens)
    • Improving precision — Smaller chunks return more targeted, relevant results
    • Managing context — Only relevant portions are sent to the LLM, saving tokens and cost

    Chunking Strategies

    Fixed-Size Chunking

    Split documents into chunks of a predetermined size (e.g., 500 tokens).

    • Pros: Simple, consistent chunk sizes
    • Cons: May split sentences or ideas mid-thought
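A minimal sketch of fixed-size chunking, splitting on whitespace words for simplicity (real pipelines usually count tokens with the embedding model's tokenizer rather than words):

```python
def fixed_size_chunks(text, chunk_size=500):
    """Split text into chunks of roughly chunk_size words.

    Word-based stand-in for token-based splitting; note how a chunk
    boundary can land in the middle of a sentence.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

The simplicity is the appeal: chunk sizes are predictable, so downstream token budgets are easy to reason about.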

    Recursive Character/Text Splitting

    Split by paragraphs first, then sentences, then words if chunks are still too large.

    • Pros: Respects natural text boundaries
    • Cons: Variable chunk sizes
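The descend-only-when-needed idea can be sketched as follows (lengths here are in characters, and the greedy re-merging of small pieces that production splitters perform is omitted for brevity):

```python
def recursive_split(text, max_len=500, separators=("\n\n", ". ", " ")):
    """Recursively split on coarser separators first.

    Try paragraphs, then sentences, then words, only descending to a
    finer separator when a piece is still larger than max_len.
    """
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present; try the next one
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in pieces:
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```

Because splitting stops as soon as a piece fits, paragraph and sentence boundaries are preserved whenever possible, at the cost of uneven chunk sizes.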

    Semantic Chunking

    Use an embedding model to detect topic boundaries and split at semantic shifts.

    • Pros: Preserves topical coherence
    • Cons: More computationally expensive
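A toy sketch of the boundary-detection idea: compare adjacent sentences and start a new chunk when similarity drops. The bag-of-words `embed` below is a stand-in for a real sentence-embedding model, and the threshold is an assumption you would tune per corpus:

```python
import math

def embed(sentence):
    """Toy stand-in for a real sentence-embedding model."""
    vec = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences drift apart."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The cost comes from embedding every sentence up front; with a real model that is one forward pass per sentence before any retrieval happens.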

    Document-Structure-Based

    Split based on document structure (headings, sections, pages).

    • Pros: Preserves document organization
    • Cons: Sections may vary greatly in size
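For Markdown input, structure-based splitting can be as simple as starting a new chunk at each heading (a sketch; real parsers also handle setext headings, code fences, and so on):

```python
def split_by_headings(markdown_text):
    """Split a Markdown document into one chunk per section."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk keeps its heading, which doubles as useful context at embedding time; the trade-off is that a long section still needs a secondary splitter.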

    Agentic/Contextual Chunking

    Use an LLM to intelligently decide how to split and add context summaries to each chunk.

    • Pros: Highest quality, preserves context
    • Cons: Slow and expensive at scale
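The contextual half of this strategy can be sketched as a wrapper that asks an LLM for a one-line summary of how each chunk fits the document, then prepends it. The `llm` callable here is hypothetical (prompt in, text out); a trivial fallback is used so the sketch runs without an API:

```python
def contextualize_chunks(chunks, doc_summary, llm=None):
    """Prepend an LLM-written context line to each chunk.

    `llm` is a hypothetical callable (prompt -> str). The fallback
    simply restates the document summary, standing in for a real call.
    """
    if llm is None:
        llm = lambda prompt: f"Context: part of a document about {doc_summary}."
    out = []
    for chunk in chunks:
        context = llm(
            f"In one sentence, explain how this chunk fits the document "
            f"({doc_summary}):\n{chunk}"
        )
        out.append(context + "\n" + chunk)
    return out
```

One LLM call per chunk is exactly where the "slow and expensive at scale" caveat comes from.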

    Key Chunking Parameters

    | Parameter     | Description                                       | Typical Range   |
    |---------------|---------------------------------------------------|-----------------|
    | Chunk Size    | Number of tokens per chunk                        | 256-1024 tokens |
    | Chunk Overlap | Shared tokens between adjacent chunks             | 50-200 tokens   |
    | Separator     | What to split on (paragraph, sentence, character) | Varies          |
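Chunk size and overlap interact as a sliding window: each chunk shares its last `overlap` tokens with the next chunk's start, so information at a boundary appears in both. A minimal sketch over a token list:

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=100):
    """Chunk a token list so adjacent chunks share `overlap` tokens."""
    chunks = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i:i + chunk_size])
        if i + chunk_size >= len(tokens):
            break
        i += chunk_size - overlap
    return chunks
```

Larger overlap reduces boundary losses but increases storage and embedding cost, since overlapping tokens are indexed more than once.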

    Impact on Retrieval Quality

    | Chunk Size               | Precision | Recall   | Best For                    |
    |--------------------------|-----------|----------|-----------------------------|
    | Small (128-256 tokens)   | High      | Lower    | Specific factual queries    |
    | Medium (512-1024 tokens) | Balanced  | Balanced | General Q&A                 |
    | Large (1024-2048 tokens) | Lower     | Higher   | Complex, multi-fact queries |

    Best Practices

    1. Add Overlap — Prevents losing context at chunk boundaries
    2. Preserve Metadata — Keep source document, page number, section title with each chunk
    3. Test and Iterate — Evaluate retrieval quality with different chunk sizes for your specific data
    4. Consider Multi-Strategy — Different document types may benefit from different chunking approaches
    5. Add Context — Prepend section titles or document summaries to each chunk
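Practices 2 and 5 can be combined by storing each chunk as a record that carries its provenance, with the section title also prepended to the text that gets embedded. A sketch (the field names are illustrative, not a fixed schema):

```python
def make_chunk_records(chunks, source, section):
    """Attach retrieval metadata to each chunk.

    The section title is prepended to the embedded text (practice 5),
    and source/section/index travel with the chunk (practice 2) so
    results can be cited and deduplicated at query time.
    """
    return [
        {
            "text": f"{section}\n{chunk}",  # context prepended for embedding
            "source": source,
            "section": section,
            "chunk_index": i,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Keeping the index alongside source and section also makes it easy to re-stitch neighboring chunks when a query needs more surrounding context.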
