Chunking#

Learn how to use Termite to split documents into optimal chunks for search indexing and RAG pipelines.

Overview#

Chunking splits long documents into smaller segments that can be:

  • Embedded - Generate vectors for each chunk
  • Indexed - Store chunks in search databases
  • Retrieved - Find relevant passages for RAG

Chunking Methods#

Fixed Chunking#

Simple token-based splitting with overlap:

{
  "text": "Long document text...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50
  }
}

When to use:

  • Simple documents with uniform structure
  • When speed is the priority
  • As a fallback method
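Conceptually, fixed chunking slides a window of `target_tokens` tokens across the document, stepping forward by `target_tokens - overlap_tokens` each time. A minimal sketch (whitespace tokens stand in for the real tokenizer Termite uses):

```python
def fixed_chunks(text, target_tokens=500, overlap_tokens=50):
    """Split text into windows of target_tokens tokens, each sharing
    overlap_tokens tokens with the previous window."""
    tokens = text.split()  # stand-in for a real tokenizer
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break
    return chunks

# 12 "tokens", windows of 5 with overlap 2 -> 4 chunks
doc = " ".join(f"w{i}" for i in range(12))
print(fixed_chunks(doc, target_tokens=5, overlap_tokens=2))
```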

Semantic Chunking#

Semantic chunking uses ONNX models to split text at content-similarity boundaries:

{
  "text": "Long document text...",
  "config": {
    "model": "mirth/chonky-mmbert-small-multilingual-1",
    "threshold": 0.5
  }
}

When to use:

  • Documents with natural topic boundaries
  • When chunk quality is the priority
  • Multilingual content
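The underlying idea can be sketched as follows: start a new chunk wherever the similarity between adjacent sentences drops below `threshold`. This toy version uses word-set overlap in place of the ONNX model's embedding similarity, so it only illustrates the mechanism, not the model's quality:

```python
def toy_similarity(a, b):
    """Jaccard overlap of word sets -- a stand-in for model-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences, threshold=0.5):
    """Cut a chunk boundary wherever adjacent sentences are dissimilar."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if toy_similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sents = [
    "cats like fish and cats purr",
    "cats purr and cats like fish",
    "stock markets fell sharply today",
]
print(semantic_chunks(sents, threshold=0.5))  # topic change -> 2 chunks
```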

Quick Start#

curl -X POST http://localhost:8082/api/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a long document that discusses machine learning...",
    "config": {
      "model": "fixed",
      "target_tokens": 500,
      "overlap_tokens": 50,
      "separator": "\n\n"
    }
  }'

Response#

{
  "chunks": [
    {
      "id": 0,
      "text": "This is a long document...",
      "start_char": 0,
      "end_char": 500
    },
    {
      "id": 1,
      "text": "...that discusses machine learning...",
      "start_char": 450,
      "end_char": 950
    }
  ],
  "model": "fixed",
  "cache_hit": false
}
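Note that consecutive chunks overlap: chunk 1 starts at character 450, before chunk 0 ends at 500. A client can measure that overlap directly from the `start_char`/`end_char` fields (response trimmed to the fields used):

```python
# Offsets copied from the example response above
response = {
    "chunks": [
        {"id": 0, "start_char": 0, "end_char": 500},
        {"id": 1, "start_char": 450, "end_char": 950},
    ]
}

for prev, cur in zip(response["chunks"], response["chunks"][1:]):
    overlap = max(0, prev["end_char"] - cur["start_char"])
    print(f"chunks {prev['id']}->{cur['id']}: {overlap} overlapping chars")
# chunks 0->1: 50 overlapping chars
```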

Configuration Options#

Parameter        Type    Description
model            string  Chunking model (fixed or ONNX model name)
target_tokens    int     Target chunk size in tokens
overlap_tokens   int     Overlap between consecutive chunks
separator        string  Preferred split points (e.g., \n\n)
max_chunks       int     Maximum number of chunks
threshold        float   Similarity threshold for semantic chunking

Best Practices#

Choose the Right Chunk Size#

Use Case             Target Tokens  Overlap
Semantic search      200-300        20-50
RAG context          500-800        50-100
Long-form analysis   1000+          100-200

Preserve Context#

Use overlap to maintain context across chunk boundaries:

{
  "config": {
    "target_tokens": 500,
    "overlap_tokens": 100
  }
}

Handle Different Document Types#

Markdown:

{
  "config": {
    "separator": "\n## "
  }
}

Code:

{
  "config": {
    "separator": "\n\n"
  }
}
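A separator-aware splitter prefers to cut at these boundaries rather than mid-paragraph. A minimal sketch of what the markdown config above implies (splitting at `\n## ` headings and keeping each heading with the chunk it introduces; the real service also enforces `target_tokens`):

```python
def split_on_separator(text, separator="\n## "):
    """Split at separator positions, reattaching the separator (minus its
    leading newline) to the chunk that follows it."""
    parts = text.split(separator)
    chunks = [parts[0]] + [separator.lstrip("\n") + p for p in parts[1:]]
    return [c for c in chunks if c.strip()]

doc = "Intro paragraph.\n## Setup\nSteps here.\n## Usage\nMore steps."
for chunk in split_on_separator(doc):
    print(repr(chunk))
```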

Caching#

Chunking results are cached for 2 minutes. Repeating a request with identical text and config returns the cached result (cache_hit: true in the response).
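The exact cache key is internal to Termite, but the observable behavior is equivalent to keying on a hash of the text plus the serialized config. A hypothetical sketch:

```python
import hashlib
import json

def cache_key(text, config):
    """Same text + same config => same key. Config keys are sorted so
    that dict ordering does not affect the key."""
    payload = json.dumps({"text": text, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("doc", {"model": "fixed", "target_tokens": 500})
k2 = cache_key("doc", {"target_tokens": 500, "model": "fixed"})
print(k1 == k2)  # True: key order in config does not matter
```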

Integration with Antfly#

Configure chunking in the Antfly table schema:

enrichers:
  - type: chunk
    config:
      termite_url: http://localhost:8082
      model: fixed
      target_tokens: 500
      overlap_tokens: 50

Example: Document Processing Pipeline#

# 1. Chunk the document
chunks = termite.chunk(
    text=document_text,
    config={
        "model": "fixed",
        "target_tokens": 500,
        "overlap_tokens": 50
    }
)

# 2. Generate embeddings for each chunk
embeddings = termite.embed(
    model="BAAI/bge-small-en-v1.5",
    input=[c.text for c in chunks]
)

# 3. Index chunks with embeddings
for chunk, embedding in zip(chunks, embeddings):
    antfly.insert({
        "content": chunk.text,
        "start_char": chunk.start_char,
        "end_char": chunk.end_char,
        "embedding": embedding
    })

Next Steps#