Stop Taking a Helicopter
to the Grocery Store
Why small, task-specific models outperform LLMs for 80% of your AI pipeline — and how to run them locally for free with Termite.
The LLM-for-Everything Trap
Here's a pattern we see constantly: a team builds a RAG pipeline, and every step — chunking, embedding, reranking, entity extraction — gets routed through an LLM API. GPT-4o for chunking. OpenAI embeddings for vectors. Another LLM call for query rewriting. Maybe Cohere for reranking.
It works. But it's the equivalent of taking a helicopter to the grocery store. You'll get there, sure — but you're burning jet fuel for a task that a bicycle handles better.
LLMs are general-purpose reasoning engines. They're extraordinary at understanding context, generating text, and handling ambiguous instructions. But most of the steps in an AI pipeline aren't ambiguous. Chunking is chunking. Embedding is embedding. Reranking is reranking. These are well-defined, narrow tasks — and there are models specifically trained to do each one faster, cheaper, and often better than a general-purpose LLM.
What Are "Small Models"?
We're talking about models in the 30 million to 300 million parameter range — roughly 100x to 1,000x smaller than models like GPT-4o or Claude. They're typically trained for a single task: generating embeddings, splitting text into semantic chunks, re-scoring search results, or extracting named entities.
What makes them powerful isn't their size — it's their specificity. A 128 MB embedding model that was trained on billions of sentence pairs for one job (producing good vectors) will outperform a 70B parameter model that's trying to do everything. It's the same reason a Formula 1 car beats a helicopter on a racetrack.
Task-by-Task: Where Small Models Win
Let's walk through each step of a typical AI pipeline and see what happens when you swap the LLM for a purpose-built model.
1. Embeddings
Embedding models convert text into dense vectors that capture semantic meaning. This is arguably the most important step in any search or RAG pipeline — and it's also the most wasteful place to use an LLM.
Models like bge-small-en-v1.5 (128 MB, 384 dimensions) or nomic-embed-text-v1.5 (1.2 GB, 768 dimensions) are specifically trained for producing high-quality embeddings. They consistently score at or near the top of the MTEB leaderboard — the standard benchmark for embedding quality — despite being a fraction of the size of general LLMs.
For multimodal use cases, Termite also ships clip-vit-base-patch32 (584 MB) for unified image and text embeddings, and clap-htsat-unfused (2 GB) for audio.
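Whatever the modality, the output has the same shape: a dense vector, and retrieval reduces to nearest-neighbor math. Here's a minimal sketch of the downstream step with hypothetical 4-dimensional vectors (real models emit 384 or more dimensions); the numbers are stand-ins for actual model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and two documents.
query = [0.1, 0.9, 0.2, 0.0]
docs = {
    "q3_financials": [0.2, 0.8, 0.1, 0.1],
    "lunch_menu":    [0.9, 0.1, 0.0, 0.3],
}

# Rank documents by similarity to the query; the closest vector wins.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked[0])  # q3_financials
```

The embedding model's only job is producing vectors where this simple math agrees with human judgments of relevance, which is exactly what MTEB measures.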
text-embedding-3-small costs $0.02 per million tokens. If you're embedding 1 billion tokens per month (common for enterprise document ingestion), that's $20/month just for embeddings. Running bge-small-en-v1.5 locally on Termite costs $0 — and it's faster: 1-3ms per embedding vs. 50-200ms for an API round-trip.
2. Semantic Chunking
Most teams either chunk by fixed token counts (which breaks mid-sentence) or use an LLM to identify split points (which costs $2.50 per million tokens on GPT-4o). There's a better option: small transformer models trained specifically for document segmentation.
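To see why fixed-size chunking is the weak baseline, here's a sketch: a naive splitter that cuts every N words regardless of sentence boundaries. (Real chunkers count tokens, not words; words keep the sketch dependency-free.)

```python
def fixed_size_chunks(text, chunk_words=8):
    """Naive chunking: split every N words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

doc = ("Revenue grew 12% in Q3. The board approved a dividend. "
       "Operating costs fell due to automation.")
for chunk in fixed_size_chunks(doc):
    print(repr(chunk))
# The cut lands mid-sentence: chunk 1 ends with "The board approved"
# and chunk 2 begins "a dividend." -- one fact split across two chunks,
# which is exactly what a semantic chunker avoids.
```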
Termite's chonky-mmbert-small-multilingual-1 (570 MB) understands document structure and splits at natural paragraph and topic boundaries. It supports multiple languages and runs in 2-5ms per document — fast enough to chunk millions of documents in hours on a single machine.
3. Reranking
Reranking is the single highest-leverage improvement you can make to a RAG pipeline. A reranker takes your initial search results and re-scores each one by looking at the full query-document pair together (cross-attention), catching relevance signals that embeddings alone miss.
Cohere's Rerank API costs $2.00 per 1,000 searches. Termite's mxbai-rerank-base-v1 (713 MB) provides comparable quality at zero marginal cost. At 100K searches per month, that's $200/month saved.
The real advantage isn't just cost — it's latency. A local reranker adds 5-10ms to your pipeline. An API call adds 100-300ms plus network variability. For user-facing search, that difference is the line between "snappy" and "sluggish."
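The mechanics are easy to sketch: a first-stage retriever returns candidates, and the reranker re-scores each (query, document) pair and reorders. The scorer below is a deliberately crude token-overlap stub standing in for a real cross-encoder like mxbai-rerank-base-v1.

```python
def rerank(query, candidates, score_fn):
    """Re-score each (query, doc) pair and sort by new score, descending."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stub scorer: fraction of query tokens present in the document. A real
# cross-encoder runs full attention over the concatenated pair instead.
def overlap_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

candidates = [
    "team lunch menu for the quarter",
    "quarterly revenue grew 12% year over year",
]
ranked = rerank("quarterly revenue trends", candidates, overlap_score)
print(ranked[0][0])  # the revenue document rises to the top
```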
4. Named Entity Recognition
NER extracts structured information (people, organizations, locations, dates) from unstructured text. This is incredibly useful for building faceted search, populating knowledge graphs, or enriching documents with metadata before indexing.
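Downstream of extraction, entities typically become facets or graph nodes. A sketch of that indexing step, using hand-written entity spans in place of real model output:

```python
from collections import defaultdict

# Hypothetical NER output: (text span, label) pairs, roughly the shape a
# model like gliner2-base-v1 might produce for a press release.
entities = [
    ("Acme Corp", "organization"),
    ("Jane Doe", "person"),
    ("Berlin", "location"),
    ("Q3 2025", "date"),
    ("Globex Ltd", "organization"),
]

# Group entities by label to build faceted-search metadata for indexing.
facets = defaultdict(list)
for span, label in entities:
    facets[label].append(span)

print(dict(facets))
# {'organization': ['Acme Corp', 'Globex Ltd'], 'person': ['Jane Doe'], ...}
```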
Instead of prompting GPT-4o with "extract all entities from this text" at $2.50/M tokens, Termite's gliner2-base-v1 (798 MB) handles it locally with customizable label sets. For relationship extraction (who works at what company, what event happened where), rebel-large (2.9 GB) maps entities and their relationships in a single pass.
5. Query Rewriting & Expansion
When a user searches for "quarterly results," they might mean financial reports, academic grades, or sports scores. Query rewriting generates multiple variants to improve recall. Termite's flan-t5-small-squad-qg (569 MB) handles question generation and paraphrasing locally, while pegasus-paraphrase (4.5 GB) offers higher-quality paraphrasing for more demanding use cases.
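Once you have variants, the usual pattern is to search with each one and merge the result sets, deduplicating by document id. A sketch with hand-written variants and a stub search backend (a real pipeline would generate the variants with a paraphrase model):

```python
def multi_query_search(variants, search_fn, limit=5):
    """Search with each query variant; merge results, deduped by doc id."""
    seen, merged = set(), []
    for variant in variants:
        for doc_id in search_fn(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:limit]

# Stub index mapping each query variant to its hits.
fake_index = {
    "quarterly results": ["doc_7", "doc_2"],
    "financial reports": ["doc_2", "doc_9"],
    "earnings summary":  ["doc_4"],
}
hits = multi_query_search(fake_index.keys(), lambda q: fake_index[q])
print(hits)  # ['doc_7', 'doc_2', 'doc_9', 'doc_4']: broader recall, no dupes
```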
The Real Cost of LLM-Powered Pipelines
Let's put real numbers on it. Here's what a typical RAG pipeline costs when every step runs through an LLM API — versus running the same pipeline with task-specific models on Termite.
| Pipeline Task | LLM / API Approach | Monthly Cost | Small Model (Termite) | Monthly Cost |
|---|---|---|---|---|
| Embeddings | OpenAI text-embedding-3-small | $20 | bge-small-en-v1.5 (128 MB) | $0 |
| Reranking | Cohere Rerank 3.5 | $200 | mxbai-rerank-base-v1 (713 MB) | $0 |
| Chunking | GPT-4o for semantic splitting | $250 | chonky-mmbert (570 MB) | $0 |
| NER / Extraction | GPT-4o for entity extraction | $125 | gliner2-base-v1 (798 MB) | $0 |
| Query Rewriting | GPT-4o-mini | $15 | flan-t5-small-squad-qg (569 MB) | $0 |
| Total Pipeline | | ~$610/mo | | $0/mo |
Estimates based on published API pricing (OpenAI, Cohere) as of early 2026. Actual costs vary by volume and provider.
That's ~$610/month in API costs that drops to effectively zero. And the dollar amount is only part of the story. API-dependent pipelines carry hidden costs that are just as significant: network latency on every call, rate limits, vendor lock-in, and your documents leaving your infrastructure.
The Termite Model Garden
Termite is a local ML inference server that runs ONNX models with an Ollama-compatible API. Think of it as Ollama, but for all the other models in your AI pipeline — the ones that aren't LLMs. Embeddings. Chunking. Reranking. NER. OCR. Rewriting.
It ships with a curated model garden of task-specific models, all optimized for CPU inference with FP16 and INT8 quantization options. One command to pull a model, one command to run it.
Small Models in Your Pipeline — All Running on Termite
Total model footprint: ~2.8 GB — all five models fit on a laptop with room to spare
# Pull and run models with Termite — just like Ollama
termite pull bge-small-en-v1.5 # Embeddings (128 MB)
termite pull mxbai-rerank-base-v1 # Reranking (713 MB)
termite pull chonky-mmbert-small-multilingual-1 # Chunking (570 MB)
termite pull gliner2-base-v1 # NER (798 MB)
termite pull flan-t5-small-squad-qg # Query rewriting (569 MB)
# Start serving all models
termite run --models-dir ./models
# Use the Ollama-compatible API
curl http://localhost:11435/api/embed \
-d '{"model": "bge-small-en-v1.5", "input": "Hello world"}'
Available Models
Every model in the Termite garden runs locally via ONNX Runtime, with optional XLA and Go backends. Browse the full catalog at antfly.io/termite/models.
| Category | Model | Size | What It Does |
|---|---|---|---|
| Embedders | bge-small-en-v1.5 | 128 MB | Fast, high-quality text embeddings (384d) |
| Embedders | nomic-embed-text-v1.5 | 1.2 GB | Larger context window (8192 tokens, 768d) |
| Embedders | clip-vit-base-patch32 | 584 MB | Multimodal — unified image + text embeddings |
| Embedders | clap-htsat-unfused | 2 GB | Audio embeddings |
| Embedders | splade-cocondenser | 508 MB | Sparse embeddings for learned sparse retrieval |
| Chunkers | chonky-mmbert-small-multilingual-1 | 570 MB | Semantic chunking, multilingual |
| Rerankers | mxbai-rerank-base-v1 | 713 MB | Cross-encoder reranking for search results |
| Recognizers | gliner2-base-v1 | 798 MB | NER with customizable label sets |
| Recognizers | rebel-large | 2.9 GB | Relation extraction (entity + relationship) |
| Rewriters | flan-t5-small-squad-qg | 569 MB | Question generation and query rewriting |
| Rewriters | pegasus-paraphrase | 4.5 GB | High-quality text paraphrasing |
| Readers | paddleocr-onnx | 9.8 MB | OCR for scanned documents and images |
| Generators | functiongemma-270m-it | 1.1 GB | Small local text generation (tool/function calling) |
Integrating Small Models Into Your Pipeline
Termite is designed to slot into existing workflows. If you're already using Antfly, small models are built directly into the database — just configure your index with an embedder, chunker, and reranker. If you're using Termite standalone, it exposes an Ollama-compatible API that any HTTP client can call.
With Antfly (built-in)
Small models run automatically as part of your index configuration. No separate service to manage.
# Create an index that uses local models for everything
antfly cli table create --table documents \
--index '{
"name": "smart_index",
"type": "aknn_v0",
"template": "{{title}} {{body}}",
"embedder": {
"provider": "antfly",
"model": "bge-small-en-v1.5"
},
"chunker": {
"provider": "antfly",
"text": { "target_tokens": 200, "overlap_tokens": 25 }
}
}'
# Query with local reranking and pruning
antfly cli query --table documents \
--semantic-search "quarterly revenue trends" \
--indexes "smart_index" \
--reranker '{ "provider": "antfly", "field": "body" }' \
--pruner '{"min_score_ratio": 0.5}'
Standalone Termite
Run Termite as an independent service and call it from any language or framework.
import requests
# Generate embeddings
resp = requests.post("http://localhost:11435/api/embed", json={
"model": "bge-small-en-v1.5",
"input": ["First document", "Second document"]
})
embeddings = resp.json()["embeddings"]
# Rerank search results
resp = requests.post("http://localhost:11435/api/rerank", json={
"model": "mxbai-rerank-base-v1",
"query": "quarterly revenue",
"documents": ["Q3 financials...", "Team lunch menu..."]
})
ranked = resp.json()["results"] # sorted by relevance
Where LLMs Still Win
This isn't an argument against LLMs. It's an argument for using the right tool for the job. LLMs are unmatched for tasks that require reasoning, creativity, and handling ambiguity.
Use small, task-specific models for:
- Embedding generation
- Document chunking
- Reranking search results
- Named entity extraction
- Query rewriting / expansion
- OCR / document reading
- Sparse retrieval (SPLADE)
Use LLMs for:
- Generating natural language answers
- Complex reasoning & synthesis
- Multi-step planning (agents)
- Conversational interfaces
- Creative content generation
- Summarization with nuance
- Evaluating output quality (judge)
Start in 5 Minutes
Termite is a single binary. Install it, pull the models you need, and start serving. No Docker, no GPU, no configuration files.
# Install Termite
brew install antfly-io/tap/termite
# Pull a few models
termite pull bge-small-en-v1.5
termite pull mxbai-rerank-base-v1
# Start serving
termite run
# That's it. Embeddings at localhost:11435/api/embed
# Reranking at localhost:11435/api/rerank
Or, if you're using Antfly, Termite is already built in — just start Antfly Swarm and configure your indexes to use local models. The database handles model lifecycle, caching, and distribution automatically.
Stop overpaying for your AI pipeline
Browse the Termite model garden, pull the models you need, and start running inference locally in minutes.