
Stop Taking a Helicopter to the Grocery Store

Why small, task-specific models outperform LLMs for 80% of your AI pipeline — and how to run them locally for free with Termite.

The LLM-for-Everything Trap

Here's a pattern we see constantly: a team builds a RAG pipeline, and every step — chunking, embedding, reranking, entity extraction — gets routed through an LLM API. GPT-4o for chunking. OpenAI embeddings for vectors. Another LLM call for query rewriting. Maybe Cohere for reranking.

It works. But it's the equivalent of taking a helicopter to the grocery store. You'll get there, sure — but you're burning jet fuel for a task that a bicycle handles better.

LLMs are general-purpose reasoning engines. They're extraordinary at understanding context, generating text, and handling ambiguous instructions. But most of the steps in an AI pipeline aren't ambiguous. Chunking is chunking. Embedding is embedding. Reranking is reranking. These are well-defined, narrow tasks — and there are models specifically trained to do each one faster, cheaper, and often better than a general-purpose LLM.

|                     | 🚁 Using an LLM (for embeddings, chunking, reranking...) | 🚲 Small, Task-Specific Model (purpose-built ONNX model on Termite) |
|---------------------|----------------------------------------------------------|---------------------------------------------------------------------|
| Latency             | 200-800ms                                                | 1-15ms                                                              |
| Cost per 1M tokens  | $0.02-$0.13                                              | $0.00 (local)                                                       |
| Privacy             | Data leaves your infra                                   | Never leaves your machine                                           |
| Parameters          | 7B-70B+                                                  | 30M-300M                                                            |

What Are "Small Models"?

We're talking about models in the 30 million to 300 million parameter range — roughly 100x to 1,000x smaller than models like GPT-4o or Claude. They're typically trained for a single task: generating embeddings, splitting text into semantic chunks, re-scoring search results, or extracting named entities.

What makes them powerful isn't their size — it's their specificity. A 128 MB embedding model that was trained on billions of sentence pairs for one job (producing good vectors) will outperform a 70B parameter model that's trying to do everything. It's the same reason a Formula 1 car beats a helicopter on a racetrack.

  • 10-100x faster inference: 1-15ms vs. 200-800ms per call. Small models run in single-digit milliseconds on CPU.
  • 100% cost reduction: Local ONNX inference on hardware you already own. No API bills, no per-token charges.
  • <3 GB total footprint: Five task-specific models (embedder, chunker, reranker, NER, rewriter) fit in under 3 GB.

Task-by-Task: Where Small Models Win

Let's walk through each step of a typical AI pipeline and see what happens when you swap the LLM for a purpose-built model.

1. Embeddings

Embedding models convert text into dense vectors that capture semantic meaning. This is arguably the most important step in any search or RAG pipeline — and it's also the most wasteful place to use an LLM.

Models like bge-small-en-v1.5 (128 MB, 384 dimensions) or nomic-embed-text-v1.5 (1.2 GB, 768 dimensions) are trained specifically to produce high-quality embeddings. They score competitively on the MTEB leaderboard (the standard benchmark for embedding quality) despite being a fraction of the size of general-purpose LLMs.

For multimodal use cases, Termite also ships clip-vit-base-patch32 (584 MB) for unified image and text embeddings, and clap-htsat-unfused (2 GB) for audio.

The math at scale: OpenAI's text-embedding-3-small costs $0.02 per million tokens. If you're embedding 1 billion tokens per month (common for enterprise document ingestion), that's $20/month just for embeddings. Running bge-small-en-v1.5 locally on Termite costs $0 — and it's faster, with 1-3ms per embedding vs. 50-200ms for an API round-trip.
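
To make the swap concrete, here's a minimal sketch that embeds a query and two documents through Termite's /api/embed endpoint (the same endpoint used in the standalone examples later in this post) and scores them with cosine similarity. The similarity helper is our own illustration, not part of the API.

import numpy as np
import requests

def embed(texts):
    # One local round-trip to Termite; no per-token charges.
    resp = requests.post("http://localhost:11435/api/embed", json={
        "model": "bge-small-en-v1.5",
        "input": texts,
    })
    resp.raise_for_status()
    return np.array(resp.json()["embeddings"])

docs = ["Q3 revenue grew 12% year over year.", "The office picnic is on Friday."]
query_vec, *doc_vecs = embed(["quarterly revenue trends"] + docs)

# Cosine similarity: higher means more semantically similar.
for doc, vec in zip(docs, doc_vecs):
    score = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
    print(f"{score:.3f}  {doc}")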

2. Semantic Chunking

Most teams either chunk by fixed token counts (which breaks mid-sentence) or use an LLM to identify split points (which costs $2.50 per million tokens on GPT-4o). There's a better option: small transformer models trained specifically for document segmentation.

Termite's chonky-mmbert-small-multilingual-1 (570 MB) understands document structure and splits at natural paragraph and topic boundaries. It supports multiple languages and runs in 2-5ms per document — fast enough to chunk millions of documents in hours on a single machine.
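
If you're calling Termite directly, a chunking request might look like the sketch below. Note that the /api/chunk route and response shape here are assumptions modeled on /api/embed, not confirmed Termite API; check the model's entry in the catalog for the real interface.

import requests

# ASSUMED route and payload, modeled on /api/embed; the real chunking
# API may differ.
resp = requests.post("http://localhost:11435/api/chunk", json={
    "model": "chonky-mmbert-small-multilingual-1",
    "input": open("report.txt").read(),
})
for chunk in resp.json().get("chunks", []):  # assumed response key
    print(len(chunk), chunk[:60])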

3. Reranking

Reranking is the single highest-leverage improvement you can make to a RAG pipeline. A reranker takes your initial search results and re-scores each one by looking at the full query-document pair together (cross-attention), catching relevance signals that embeddings alone miss.

Cohere's Rerank API costs $2.00 per 1,000 searches. Termite's mxbai-rerank-base-v1 (713 MB) provides comparable quality at zero marginal cost. At 100K searches per month, that's $200/month saved.

The real advantage isn't just cost — it's latency. A local reranker adds 5-10ms to your pipeline. An API call adds 100-300ms plus network variability. For user-facing search, that difference is the line between "snappy" and "sluggish."
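
The /api/embed and /api/rerank endpoints compose naturally into a retrieve-then-rerank step. Here's a minimal sketch, with the first-stage candidates hard-coded in place of a real vector-store lookup; the per-item shape of the rerank response is an assumption, only the sorted-by-relevance behavior is shown in the examples later in this post.

import requests

def rerank(query, candidates):
    # Cross-encoder scoring of full query-document pairs, locally.
    resp = requests.post("http://localhost:11435/api/rerank", json={
        "model": "mxbai-rerank-base-v1",
        "query": query,
        "documents": candidates,
    })
    return resp.json()["results"]  # sorted by relevance

# Stage 1: cheap, high-recall retrieval (stubbed here; in practice this
# is your vector store returning ~50 candidates).
candidates = [
    "Q3 financials: revenue up 12% year over year...",
    "Team lunch menu for Friday...",
    "FY2025 quarterly revenue breakdown by region...",
]

# Stage 2: precise reranking; keep only the best few for the LLM.
top = rerank("quarterly revenue trends", candidates)[:2]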

4. Named Entity Recognition

NER extracts structured information (people, organizations, locations, dates) from unstructured text. This is incredibly useful for building faceted search, populating knowledge graphs, or enriching documents with metadata before indexing.

Instead of prompting GPT-4o with "extract all entities from this text" at $2.50/M tokens, Termite's gliner2-base-v1 (798 MB) handles it locally with customizable label sets. For relationship extraction (who works at what company, what event happened where), rebel-large (2.9 GB) maps entities and their relationships in a single pass.
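
A call might look like the sketch below. The /api/extract route, payload, and response keys are assumptions for illustration; the part worth noticing is that GLiNER-style models accept a user-defined label set rather than a fixed taxonomy.

import requests

# ASSUMED route and schema; consult the Termite docs for the real API.
resp = requests.post("http://localhost:11435/api/extract", json={
    "model": "gliner2-base-v1",
    "input": "Tim Cook announced Apple's Q3 results in Cupertino on August 1.",
    "labels": ["person", "organization", "location", "date"],
})
for entity in resp.json().get("entities", []):  # assumed response key
    print(entity)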

5. Query Rewriting & Expansion

When a user searches for "quarterly results," they might mean financial reports, academic grades, or sports scores. Query rewriting generates multiple variants to improve recall. Termite's flan-t5-small-squad-qg (569 MB) handles question generation and paraphrasing locally, while pegasus-paraphrase (4.5 GB) offers higher-quality paraphrasing for more demanding use cases.
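
Because the API is Ollama-compatible, a rewriting call plausibly goes through a generate-style route, as sketched below; the exact route and prompt format for seq2seq rewriter models are assumptions here.

import requests

# ASSUMPTION: an Ollama-style /api/generate route; verify the actual
# route and prompt format for seq2seq rewriter models.
resp = requests.post("http://localhost:11435/api/generate", json={
    "model": "flan-t5-small-squad-qg",
    "prompt": "Paraphrase the search query: quarterly results",
})
print(resp.json().get("response"))  # e.g. a rewritten query variant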

The Real Cost of LLM-Powered Pipelines

Let's put real numbers on it. Here's what a typical RAG pipeline costs when every step runs through an LLM API — versus running the same pipeline with task-specific models on Termite.

| Pipeline Task    | LLM / API Approach             | Monthly Cost | Small Model (Termite)             | Monthly Cost |
|------------------|--------------------------------|--------------|-----------------------------------|--------------|
| Embeddings       | OpenAI text-embedding-3-small  | $20          | bge-small-en-v1.5 (128 MB)        | $0           |
| Reranking        | Cohere Rerank 3.5              | $200         | mxbai-rerank-base-v1 (713 MB)     | $0           |
| Chunking         | GPT-4o for semantic splitting  | $250         | chonky-mmbert (570 MB)            | $0           |
| NER / Extraction | GPT-4o for entity extraction   | $125         | gliner2-base-v1 (798 MB)          | $0           |
| Query Rewriting  | GPT-4o-mini                    | $15          | flan-t5-small-squad-qg (569 MB)   | $0           |
| Total Pipeline   |                                | ~$610/mo     |                                   | $0/mo        |

Estimates based on published API pricing (OpenAI, Cohere) as of early 2026. Actual costs vary by volume and provider.

That's ~$610/month in API costs that drops to effectively zero. And the dollar amount is only part of the story. The hidden costs of API-dependent pipelines are just as significant:

  • Latency compounding: Five API calls at 200ms each add a full second to every query. Small models add 20-50ms total.
  • Data leaves your infrastructure: Every API call sends your documents to a third party. For healthcare, legal, and financial data, that may not be an option.
  • Rate limits & outages: API rate limits throttle your throughput, and a provider outage takes your entire pipeline down.
  • Non-deterministic results: LLM outputs vary between calls. Small models produce consistent, reproducible results every time.

The Termite Model Garden

Termite is a local ML inference server that runs ONNX models with an Ollama-compatible API. Think of it as Ollama, but for all the other models in your AI pipeline — the ones that aren't LLMs. Embeddings. Chunking. Reranking. NER. OCR. Rewriting.

It ships with a curated model garden of task-specific models, all optimized for CPU inference with FP16 and INT8 quantization options. One command to pull a model, one command to run it.

Small Models in Your Pipeline — All Running on Termite

Total model footprint: ~2.8 GB — all five models fit on a laptop with room to spare

# Pull and run models with Termite — just like Ollama
termite pull bge-small-en-v1.5        # Embeddings (128 MB)
termite pull mxbai-rerank-base-v1     # Reranking (713 MB)
termite pull chonky-mmbert-small-multilingual-1  # Chunking (570 MB)
termite pull gliner2-base-v1          # NER (798 MB)

# Start serving all models
termite run --models-dir ./models

# Use the Ollama-compatible API
curl http://localhost:11435/api/embed \
  -d '{"model": "bge-small-en-v1.5", "input": "Hello world"}'

Available Models

Every model in the Termite garden runs locally via ONNX Runtime, with optional XLA and Go backends. Browse the full catalog at antfly.io/termite/models.

| Category    | Model                              | Size   | What It Does                                     |
|-------------|------------------------------------|--------|--------------------------------------------------|
| Embedders   | bge-small-en-v1.5                  | 128 MB | Fast, high-quality text embeddings (384d)        |
| Embedders   | nomic-embed-text-v1.5              | 1.2 GB | Larger context window (8192 tokens, 768d)        |
| Embedders   | clip-vit-base-patch32              | 584 MB | Multimodal: unified image + text embeddings      |
| Embedders   | clap-htsat-unfused                 | 2 GB   | Audio embeddings                                 |
| Embedders   | splade-cocondenser                 | 508 MB | Sparse embeddings for learned sparse retrieval   |
| Chunkers    | chonky-mmbert-small-multilingual-1 | 570 MB | Semantic chunking, multilingual                  |
| Rerankers   | mxbai-rerank-base-v1               | 713 MB | Cross-encoder reranking for search results       |
| Recognizers | gliner2-base-v1                    | 798 MB | NER with customizable label sets                 |
| Recognizers | rebel-large                        | 2.9 GB | Relation extraction (entity + relationship)      |
| Rewriters   | flan-t5-small-squad-qg             | 569 MB | Question generation and query rewriting          |
| Rewriters   | pegasus-paraphrase                 | 4.5 GB | High-quality text paraphrasing                   |
| Readers     | paddleocr-onnx                     | 9.8 MB | OCR for scanned documents and images             |
| Generators  | functiongemma-270m-it              | 1.1 GB | Small local text generation (tool/function calling) |

Integrating Small Models Into Your Pipeline

Termite is designed to slot into existing workflows. If you're already using Antfly, small models are built directly into the database — just configure your index with an embedder, chunker, and reranker. If you're using Termite standalone, it exposes an Ollama-compatible API that any HTTP client can call.

With Antfly (built-in)

Small models run automatically as part of your index configuration. No separate service to manage.

# Create an index that uses local models for everything
antfly cli table create --table documents \
  --index '{
    "name": "smart_index",
    "type": "aknn_v0",
    "template": "{{title}} {{body}}",
    "embedder": {
      "provider": "antfly",
      "model": "bge-small-en-v1.5"
    },
    "chunker": {
      "provider": "antfly",
      "text": { "target_tokens": 200, "overlap_tokens": 25 }
    }
  }'

# Query with local reranking and pruning
antfly cli query --table documents \
  --semantic-search "quarterly revenue trends" \
  --indexes "smart_index" \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.5}'

Standalone Termite

Run Termite as an independent service and call it from any language or framework.

import requests

# Generate embeddings
resp = requests.post("http://localhost:11435/api/embed", json={
    "model": "bge-small-en-v1.5",
    "input": ["First document", "Second document"]
})
embeddings = resp.json()["embeddings"]

# Rerank search results
resp = requests.post("http://localhost:11435/api/rerank", json={
    "model": "mxbai-rerank-base-v1",
    "query": "quarterly revenue",
    "documents": ["Q3 financials...", "Team lunch menu..."]
})
ranked = resp.json()["results"]  # sorted by relevance

Where LLMs Still Win

This isn't an argument against LLMs. It's an argument for using the right tool for the job. LLMs are unmatched for tasks that require reasoning, creativity, and handling ambiguity.

Use Small Models For
  • Embedding generation
  • Document chunking
  • Reranking search results
  • Named entity extraction
  • Query rewriting / expansion
  • OCR / document reading
  • Sparse retrieval (SPLADE)
Use LLMs For
  • Generating natural language answers
  • Complex reasoning & synthesis
  • Multi-step planning (agents)
  • Conversational interfaces
  • Creative content generation
  • Summarization with nuance
  • Evaluating output quality (judge)
The best pipelines use both: The optimal architecture uses small models for retrieval infrastructure (chunking, embedding, reranking, extraction) and sends only the final, well-curated context to an LLM for generation. You get the speed and cost benefits of small models and the reasoning power of LLMs — each doing what they're best at.
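
Sketched end to end, using the documented /api/rerank endpoint: call_llm is a placeholder for whatever LLM client you already use, and the string form of the reranked items is an assumption.

import requests

def answer(question, candidates, call_llm):
    # Retrieval infrastructure: local, fast, free (Termite).
    reranked = requests.post("http://localhost:11435/api/rerank", json={
        "model": "mxbai-rerank-base-v1",
        "query": question,
        "documents": candidates,
    }).json()["results"]

    # Generation: send only the curated context to the LLM.
    context = "\n\n".join(str(r) for r in reranked[:5])  # item shape assumed
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)  # your LLM provider's client, not shown here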

Start in 5 Minutes

Termite is a single binary. Install it, pull the models you need, and start serving. No Docker, no GPU, no configuration files.

# Install Termite
brew install antfly-io/tap/termite

# Pull a few models
termite pull bge-small-en-v1.5
termite pull mxbai-rerank-base-v1

# Start serving
termite run

# That's it. Embeddings at localhost:11435/api/embed
# Reranking at localhost:11435/api/rerank

Or, if you're using Antfly, Termite is already built in — just start Antfly Swarm and configure your indexes to use local models. The database handles model lifecycle, caching, and distribution automatically.

Stop overpaying for your AI pipeline

Browse the Termite model garden, pull the models you need, and start running inference locally in minutes.