# Chunking

Learn how to use Termite to split documents into optimal chunks for search indexing and RAG pipelines.
## Overview

Chunking splits long documents into smaller segments that can be:

- **Embedded** - generate a vector for each chunk
- **Indexed** - store chunks in search databases
- **Retrieved** - find relevant passages for RAG
## Chunking Methods

### Fixed Chunking

Simple token-based splitting with overlap:

```json
{
  "text": "Long document text...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50
  }
}
```

**When to use:**

- Simple documents with uniform structure
- When speed is the priority
- As a fallback method
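Conceptually, fixed chunking can be sketched in a few lines of Python. This is an illustrative approximation, not Termite's implementation: it treats whitespace-separated words as tokens and slides a window of `target_tokens` forward by `target_tokens - overlap_tokens` at each step, so consecutive chunks share `overlap_tokens` words:

```python
def fixed_chunks(text, target_tokens=500, overlap_tokens=50):
    """Toy fixed-size chunker: whitespace words stand in for real tokens."""
    if overlap_tokens >= target_tokens:
        raise ValueError("overlap_tokens must be smaller than target_tokens")
    words = text.split()
    step = target_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(words):
        piece = words[start:start + target_tokens]
        chunks.append({"id": len(chunks), "text": " ".join(piece)})
        start += step
    return chunks
```

With `target_tokens=4, overlap_tokens=1`, each chunk repeats the last word of the previous one, which is how overlap preserves context across boundaries.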
### Semantic Chunking

ONNX models that split based on content similarity:

```json
{
  "text": "Long document text...",
  "config": {
    "model": "mirth/chonky-mmbert-small-multilingual-1",
    "threshold": 0.5
  }
}
```

**When to use:**

- Documents with natural topic boundaries
- When chunk quality is the priority
- Multilingual content
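The idea behind similarity-based splitting can be illustrated with a toy splitter. The sketch below is not the ONNX model above: it scores adjacent sentences with a simple bag-of-words cosine similarity and starts a new chunk wherever the score drops below `threshold`:

```python
import math
import re

def _vec(sentence):
    """Bag-of-words term counts for one sentence."""
    counts = {}
    for w in re.findall(r"\w+", sentence.lower()):
        counts[w] = counts.get(w, 0) + 1
    return counts

def _cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(text, threshold=0.5):
    """Split at sentence boundaries where adjacent similarity < threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_vec(prev), _vec(cur)) < threshold:
            chunks.append([cur])      # topic boundary: start a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend current chunk
    return [" ".join(c) for c in chunks]
```

A real semantic model compares learned embeddings rather than word counts, but the threshold mechanic is the same: lower `threshold` means fewer, larger chunks.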
## Quick Start

```bash
curl -X POST http://localhost:8082/api/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "This is a long document that discusses machine learning...",
    "config": {
      "model": "fixed",
      "target_tokens": 500,
      "overlap_tokens": 50,
      "separator": "\n\n"
    }
  }'
```

### Response
```json
{
  "chunks": [
    {
      "id": 0,
      "text": "This is a long document...",
      "start_char": 0,
      "end_char": 500
    },
    {
      "id": 1,
      "text": "...that discusses machine learning...",
      "start_char": 450,
      "end_char": 950
    }
  ],
  "model": "fixed",
  "cache_hit": false
}
```

## Configuration Options
| Parameter | Type | Description |
|---|---|---|
| `model` | string | Chunking model (`fixed` or an ONNX model name) |
| `target_tokens` | int | Target chunk size in tokens |
| `overlap_tokens` | int | Overlap between consecutive chunks |
| `separator` | string | Preferred split points (e.g., `\n\n`) |
| `max_chunks` | int | Maximum number of chunks |
| `threshold` | float | Similarity threshold for semantic chunking |
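A request body can be assembled client-side from these options. The helper below is a hypothetical sketch, not part of any Termite SDK; its defaults mirror the examples on this page, and the server's own defaults may differ:

```python
def build_chunk_request(text, model="fixed", target_tokens=500,
                        overlap_tokens=50, separator=None,
                        max_chunks=None, threshold=None):
    """Build a /api/chunk JSON body, omitting unset optional parameters."""
    if overlap_tokens >= target_tokens:
        raise ValueError("overlap_tokens must be smaller than target_tokens")
    config = {
        "model": model,
        "target_tokens": target_tokens,
        "overlap_tokens": overlap_tokens,
    }
    if separator is not None:
        config["separator"] = separator
    if max_chunks is not None:
        config["max_chunks"] = max_chunks
    if threshold is not None:
        config["threshold"] = threshold
    return {"text": text, "config": config}
```

Omitting unset keys keeps the payload minimal, so the server applies its own defaults for anything you leave out.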
## Best Practices

### Choose the Right Chunk Size
| Use Case | Target Tokens | Overlap |
|---|---|---|
| Semantic search | 200-300 | 20-50 |
| RAG context | 500-800 | 50-100 |
| Long-form analysis | 1000+ | 100-200 |
### Preserve Context

Use overlap to maintain context across chunk boundaries:

```json
{
  "config": {
    "target_tokens": 500,
    "overlap_tokens": 100
  }
}
```

### Handle Different Document Types
**Markdown:**

```json
{
  "config": {
    "separator": "\n## "
  }
}
```

**Code:**

```json
{
  "config": {
    "separator": "\n\n"
  }
}
```

## Caching

Results are cached for 2 minutes: an identical text + config combination returns the stored result, as indicated by the `cache_hit` field in the response.
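The described behaviour can be sketched as a small TTL cache. The key derivation (SHA-256 of the serialized text + config) and the in-memory store below are illustrative assumptions, not Termite's internals:

```python
import hashlib
import json
import time

_cache = {}
TTL_SECONDS = 120  # the 2-minute window described above

def cache_key(text, config):
    """Deterministic key: same text + config always hashes the same way."""
    payload = json.dumps({"text": text, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_chunk(text, config, chunk_fn):
    """Return (result, cache_hit), reusing results within the TTL."""
    key = cache_key(text, config)
    entry = _cache.get(key)
    if entry is not None and time.time() - entry[0] < TTL_SECONDS:
        return entry[1], True          # cache hit: skip re-chunking
    result = chunk_fn(text, config)
    _cache[key] = (time.time(), result)
    return result, False               # cache miss: freshly computed
```

Sorting the serialized keys makes the cache insensitive to the order in which config fields are supplied.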
## Integration with Antfly

Configure chunking in the Antfly table schema:

```yaml
enrichers:
  - type: chunk
    config:
      termite_url: http://localhost:8082
      model: fixed
      target_tokens: 500
      overlap_tokens: 50
```

### Example: Document Processing Pipeline
```python
# 1. Chunk the document
chunks = termite.chunk(
    text=document_text,
    config={
        "model": "fixed",
        "target_tokens": 500,
        "overlap_tokens": 50
    }
)

# 2. Generate embeddings for each chunk
embeddings = termite.embed(
    model="BAAI/bge-small-en-v1.5",
    input=[c.text for c in chunks]
)

# 3. Index chunks with embeddings
for chunk, embedding in zip(chunks, embeddings):
    antfly.insert({
        "content": chunk.text,
        "start_char": chunk.start_char,
        "end_char": chunk.end_char,
        "embedding": embedding
    })
```

## Next Steps
- API Reference - Chunking API details
- Embedding Models - Generate embeddings
- Reranking - Improve search relevance