Chunking#

Learn how to use Termite to split documents into optimal chunks for search indexing and RAG pipelines.

Overview#

Chunking splits long documents into smaller segments that can be:

  • Embedded - Generate vectors for each chunk
  • Indexed - Store chunks in search databases
  • Retrieved - Find relevant passages for RAG

Chunking Methods#

Fixed Chunking#

Simple token-based splitting with overlap:

{
  "text": "Long document text...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50
  }
}

When to use:

  • Simple documents with uniform structure
  • When speed is the priority
  • As a fallback method
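Conceptually, fixed chunking slides a window of `target_tokens` tokens across the document, stepping forward by `target_tokens - overlap_tokens` each time. A minimal sketch (whitespace tokens stand in for the real tokenizer Termite uses):

```python
def fixed_chunks(text, target_tokens=500, overlap_tokens=50):
    """Split text into windows of target_tokens tokens, each sharing
    overlap_tokens tokens with the previous window."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break
    return chunks

# 12 "tokens", windows of 5 with overlap 2 -> 4 chunks
doc = " ".join(f"w{i}" for i in range(12))
print(fixed_chunks(doc, target_tokens=5, overlap_tokens=2))
```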

Semantic Chunking#

Semantic chunking uses ONNX models to split text at content-similarity boundaries:

{
  "text": "Long document text...",
  "config": {
    "model": "mirth/chonky-mmbert-small-multilingual-1",
    "threshold": 0.5
  }
}

When to use:

  • Documents with natural topic boundaries
  • When chunk quality is the priority
  • Multilingual content
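The underlying idea can be sketched as follows: start a new chunk wherever the similarity between adjacent sentences drops below `threshold`. This toy version uses word-set overlap in place of the ONNX model's embedding similarity, so it only illustrates the mechanism, not the model's quality:

```python
def toy_similarity(a, b):
    """Jaccard overlap of word sets -- a stand-in for model-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences, threshold=0.5):
    """Cut a chunk boundary wherever adjacent sentences are dissimilar."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if toy_similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sents = [
    "cats like fish and cats purr",
    "cats purr and cats like fish",
    "stock markets fell sharply today",
]
print(semantic_chunks(sents, threshold=0.5))  # topic change -> 2 chunks
```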

Quick Start#

curl -X POST http://localhost:8082/api/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a long document that discusses machine learning...",
    "config": {
      "model": "fixed",
      "target_tokens": 500,
      "overlap_tokens": 50,
      "separator": "\n\n"
    }
  }'

Response#

{
  "chunks": [
    {
      "id": 0,
      "text": "This is a long document...",
      "start_char": 0,
      "end_char": 500
    },
    {
      "id": 1,
      "text": "...that discusses machine learning...",
      "start_char": 450,
      "end_char": 950
    }
  ],
  "model": "fixed",
  "cache_hit": false
}
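Note that consecutive chunks overlap: chunk 1 starts at character 450, before chunk 0 ends at 500. A client can measure that overlap directly from the `start_char`/`end_char` fields (response trimmed to the fields used):

```python
# Offsets copied from the example response above
response = {
    "chunks": [
        {"id": 0, "start_char": 0, "end_char": 500},
        {"id": 1, "start_char": 450, "end_char": 950},
    ]
}

for prev, cur in zip(response["chunks"], response["chunks"][1:]):
    overlap = max(0, prev["end_char"] - cur["start_char"])
    print(f"chunks {prev['id']}->{cur['id']}: {overlap} overlapping chars")
# chunks 0->1: 50 overlapping chars
```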

Configuration Options#

Parameter        Type    Description
model            string  Chunking model (fixed or ONNX model name)
target_tokens    int     Target chunk size in tokens
overlap_tokens   int     Overlap between consecutive chunks
separator        string  Preferred split points (e.g., \n\n)
max_chunks       int     Maximum number of chunks
threshold        float   Similarity threshold for semantic chunking

Best Practices#

Choose the Right Chunk Size#

Use Case             Target Tokens  Overlap
Semantic search      200-300        20-50
RAG context          500-800        50-100
Long-form analysis   1000+          100-200

Preserve Context#

Use overlap to maintain context across chunk boundaries:

{
  "config": {
    "target_tokens": 500,
    "overlap_tokens": 100
  }
}

Handle Different Document Types#

Markdown:

{
  "config": {
    "separator": "\n## "
  }
}

Code:

{
  "config": {
    "separator": "\n\n"
  }
}
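A separator-aware splitter prefers to cut at these boundaries rather than mid-paragraph. A minimal sketch of what the markdown config above implies (splitting at `\n## ` headings and keeping each heading with the chunk it introduces; the real service also enforces `target_tokens`):

```python
def split_on_separator(text, separator="\n## "):
    """Split at separator positions, reattaching the separator (minus its
    leading newline) to the chunk that follows it."""
    parts = text.split(separator)
    chunks = [parts[0]] + [separator.lstrip("\n") + p for p in parts[1:]]
    return [c for c in chunks if c.strip()]

doc = "Intro paragraph.\n## Setup\nSteps here.\n## Usage\nMore steps."
for chunk in split_on_separator(doc):
    print(repr(chunk))
```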

Caching#

Chunking results are cached for 2 minutes. Repeating a request with identical text and config returns the cached result (cache_hit: true in the response).
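The exact cache key is internal to Termite, but the observable behavior is equivalent to keying on a hash of the text plus the serialized config. A hypothetical sketch:

```python
import hashlib
import json

def cache_key(text, config):
    """Same text + same config => same key. Config keys are sorted so
    that dict ordering does not affect the key."""
    payload = json.dumps({"text": text, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("doc", {"model": "fixed", "target_tokens": 500})
k2 = cache_key("doc", {"target_tokens": 500, "model": "fixed"})
print(k1 == k2)  # True: key order in config does not matter
```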

Integration with Antfly#

Configure chunking in the Antfly table schema:

enrichers:
  - type: chunk
    config:
      termite_url: http://localhost:8082
      model: fixed
      target_tokens: 500
      overlap_tokens: 50

Example: Document Processing Pipeline#

# 1. Chunk the document
chunks = termite.chunk(
    text=document_text,
    config={
        "model": "fixed",
        "target_tokens": 500,
        "overlap_tokens": 50
    }
)

# 2. Generate embeddings for each chunk
embeddings = termite.embed(
    model="BAAI/bge-small-en-v1.5",
    input=[c.text for c in chunks]
)

# 3. Index chunks with embeddings
for chunk, embedding in zip(chunks, embeddings):
    antfly.insert({
        "content": chunk.text,
        "start_char": chunk.start_char,
        "end_char": chunk.end_char,
        "embedding": embedding
    })

Next Steps#