
Stop Taking a Helicopter to the Grocery Store

Why small, task-specific models outperform LLMs for 80% of your AI pipeline — and how to run them locally for free with Termite.

The LLM-for-Everything Trap

Here's a pattern we see constantly: a team builds a RAG pipeline, and every step — chunking, embedding, reranking, entity extraction — gets routed through an LLM API. GPT-4o for chunking. OpenAI embeddings for vectors. Another LLM call for query rewriting. Maybe Cohere for reranking.

It works. But it's the equivalent of taking a helicopter to the grocery store. You'll get there, sure — but you're burning jet fuel for a task that a bicycle handles better.

LLMs are general-purpose reasoning engines. They're extraordinary at understanding context, generating text, and handling ambiguous instructions. But most of the steps in an AI pipeline aren't ambiguous. Chunking is chunking. Embedding is embedding. Reranking is reranking. These are well-defined, narrow tasks — and there are models specifically trained to do each one faster, cheaper, and often better than a general-purpose LLM.

|                     | 🚁 Using an LLM (for embeddings, chunking, reranking...) | 🚲 Small, Task-Specific Model (purpose-built ONNX model on Termite) |
|---------------------|----------------------------------------------------------|---------------------------------------------------------------------|
| Latency             | 200-800ms                                                | 1-15ms                                                              |
| Cost per 1M tokens  | $0.02-$0.13                                              | $0.00 (local)                                                       |
| Privacy             | Data leaves your infra                                   | Never leaves your machine                                           |
| Parameters          | 7B-70B+                                                  | 30M-300M                                                            |

What Are "Small Models"?

We're talking about models in the 30 million to 300 million parameter range — roughly 100x to 1,000x smaller than models like GPT-4o or Claude. They're typically trained for a single task: generating embeddings, splitting text into semantic chunks, re-scoring search results, or extracting named entities.

What makes them powerful isn't their size — it's their specificity. A 128 MB embedding model that was trained on billions of sentence pairs for one job (producing good vectors) will outperform a 70B parameter model that's trying to do everything. It's the same reason a Formula 1 car beats a helicopter on a racetrack.

  • 10-100x faster inference: 1-15ms vs. 200-800ms per call. Small models run in single-digit milliseconds on CPU.
  • 100% cost reduction: Local ONNX inference on hardware you already own. No API bills, no per-token charges.
  • <3 GB total footprint: Five task-specific models (embedder, chunker, reranker, NER, rewriter) fit in under 3 GB.

Task-by-Task: Where Small Models Win

Let's walk through each step of a typical AI pipeline and see what happens when you swap the LLM for a purpose-built model.

1. Embeddings

Embedding models convert text into dense vectors that capture semantic meaning. This is arguably the most important step in any search or RAG pipeline — and it's also the most wasteful place to use an LLM.

Models like bge-small-en-v1.5 (128 MB, 384 dimensions) or nomic-embed-text-v1.5 (1.2 GB, 768 dimensions) are trained specifically to produce high-quality embeddings. They score competitively on the MTEB leaderboard (the standard benchmark for embedding quality) despite being a fraction of the size of general-purpose LLMs.

For multimodal use cases, Termite also ships clip-vit-base-patch32 (584 MB) for unified image and text embeddings, and clap-htsat-unfused (2 GB) for audio.

The math at scale: OpenAI's text-embedding-3-small costs $0.02 per million tokens. If you're embedding 1 billion tokens per month (common for enterprise document ingestion), that's $20/month just for embeddings. Running bge-small-en-v1.5 locally on Termite costs $0 — and it's faster, with 1-3ms per embedding vs. 50-200ms for an API round-trip.
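
To make the swap concrete, here's a minimal sketch that embeds a query and two documents through Termite's /api/embed endpoint (the same endpoint used in the standalone examples later in this post) and scores them with cosine similarity. The similarity helper is our own illustration, not part of the API.

import numpy as np
import requests

def embed(texts):
    # One local round-trip to Termite; no per-token charges.
    resp = requests.post("http://localhost:11435/api/embed", json={
        "model": "bge-small-en-v1.5",
        "input": texts,
    })
    resp.raise_for_status()
    return np.array(resp.json()["embeddings"])

docs = ["Q3 revenue grew 12% year over year.", "The office picnic is on Friday."]
query_vec, *doc_vecs = embed(["quarterly revenue trends"] + docs)

# Cosine similarity: higher means more semantically similar.
for doc, vec in zip(docs, doc_vecs):
    score = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
    print(f"{score:.3f}  {doc}")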

2. Semantic Chunking

Most teams either chunk by fixed token counts (which breaks mid-sentence) or use an LLM to identify split points (which costs $2.50 per million tokens on GPT-4o). There's a better option: small transformer models trained specifically for document segmentation.

Termite's chonky-mmbert-small-multilingual-1 (570 MB) understands document structure and splits at natural paragraph and topic boundaries. It supports multiple languages and runs in 2-5ms per document — fast enough to chunk millions of documents in hours on a single machine.
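
If you're calling Termite directly, a chunking request might look like the sketch below. Note that the /api/chunk route and response shape here are assumptions modeled on /api/embed, not confirmed Termite API; check the model's entry in the catalog for the real interface.

import requests

# ASSUMED route and payload, modeled on /api/embed; the real chunking
# API may differ.
resp = requests.post("http://localhost:11435/api/chunk", json={
    "model": "chonky-mmbert-small-multilingual-1",
    "input": open("report.txt").read(),
})
for chunk in resp.json().get("chunks", []):  # assumed response key
    print(len(chunk), chunk[:60])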

3. Reranking

Reranking is the single highest-leverage improvement you can make to a RAG pipeline. A reranker takes your initial search results and re-scores each one by looking at the full query-document pair together (cross-attention), catching relevance signals that embeddings alone miss.

Cohere's Rerank API costs $2.00 per 1,000 searches. Termite's mxbai-rerank-base-v1 (713 MB) provides comparable quality at zero marginal cost. At 100K searches per month, that's $200/month saved.

The real advantage isn't just cost — it's latency. A local reranker adds 5-10ms to your pipeline. An API call adds 100-300ms plus network variability. For user-facing search, that difference is the line between "snappy" and "sluggish."
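
The /api/embed and /api/rerank endpoints compose naturally into a retrieve-then-rerank step. Here's a minimal sketch, with the first-stage candidates hard-coded in place of a real vector-store lookup; the per-item shape of the rerank response is an assumption, only the sorted-by-relevance behavior is shown in the examples later in this post.

import requests

def rerank(query, candidates):
    # Cross-encoder scoring of full query-document pairs, locally.
    resp = requests.post("http://localhost:11435/api/rerank", json={
        "model": "mxbai-rerank-base-v1",
        "query": query,
        "documents": candidates,
    })
    return resp.json()["results"]  # sorted by relevance

# Stage 1: cheap, high-recall retrieval (stubbed here; in practice this
# is your vector store returning ~50 candidates).
candidates = [
    "Q3 financials: revenue up 12% year over year...",
    "Team lunch menu for Friday...",
    "FY2025 quarterly revenue breakdown by region...",
]

# Stage 2: precise reranking; keep only the best few for the LLM.
top = rerank("quarterly revenue trends", candidates)[:2]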

4. Named Entity Recognition

NER extracts structured information (people, organizations, locations, dates) from unstructured text. This is incredibly useful for building faceted search, populating knowledge graphs, or enriching documents with metadata before indexing.

Instead of prompting GPT-4o with "extract all entities from this text" at $2.50/M tokens, Termite's gliner2-base-v1 (798 MB) handles it locally with customizable label sets. For relationship extraction (who works at what company, what event happened where), rebel-large (2.9 GB) maps entities and their relationships in a single pass.
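
A call might look like the sketch below. The /api/extract route, payload, and response keys are assumptions for illustration; the part worth noticing is that GLiNER-style models accept a user-defined label set rather than a fixed taxonomy.

import requests

# ASSUMED route and schema; consult the Termite docs for the real API.
resp = requests.post("http://localhost:11435/api/extract", json={
    "model": "gliner2-base-v1",
    "input": "Tim Cook announced Apple's Q3 results in Cupertino on August 1.",
    "labels": ["person", "organization", "location", "date"],
})
for entity in resp.json().get("entities", []):  # assumed response key
    print(entity)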

5. Query Rewriting & Expansion

When a user searches for "quarterly results," they might mean financial reports, academic grades, or sports scores. Query rewriting generates multiple variants to improve recall. Termite's flan-t5-small-squad-qg (569 MB) handles question generation and paraphrasing locally, while pegasus-paraphrase (4.5 GB) offers higher-quality paraphrasing for more demanding use cases.
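
Because the API is Ollama-compatible, a rewriting call plausibly goes through a generate-style route, as sketched below; the exact route and prompt format for seq2seq rewriter models are assumptions here.

import requests

# ASSUMPTION: an Ollama-style /api/generate route; verify the actual
# route and prompt format for seq2seq rewriter models.
resp = requests.post("http://localhost:11435/api/generate", json={
    "model": "flan-t5-small-squad-qg",
    "prompt": "Paraphrase the search query: quarterly results",
})
print(resp.json().get("response"))  # e.g. a rewritten query variant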

The Real Cost of LLM-Powered Pipelines

Let's put real numbers on it. Here's what a typical RAG pipeline costs when every step runs through an LLM API — versus running the same pipeline with task-specific models on Termite.

| Pipeline Task    | LLM / API Approach             | Monthly Cost | Small Model (Termite)             | Monthly Cost |
|------------------|--------------------------------|--------------|-----------------------------------|--------------|
| Embeddings       | OpenAI text-embedding-3-small  | $20          | bge-small-en-v1.5 (128 MB)        | $0           |
| Reranking        | Cohere Rerank 3.5              | $200         | mxbai-rerank-base-v1 (713 MB)     | $0           |
| Chunking         | GPT-4o for semantic splitting  | $250         | chonky-mmbert (570 MB)            | $0           |
| NER / Extraction | GPT-4o for entity extraction   | $125         | gliner2-base-v1 (798 MB)          | $0           |
| Query Rewriting  | GPT-4o-mini                    | $15          | flan-t5-small-squad-qg (569 MB)   | $0           |
| Total Pipeline   |                                | ~$610/mo     |                                   | $0/mo        |

Estimates based on published API pricing (OpenAI, Cohere) as of early 2026. Actual costs vary by volume and provider.

That's ~$610/month in API costs that drops to effectively zero. And the dollar amount is only part of the story. The hidden costs of API-dependent pipelines are just as significant:

  • Latency compounding: Five API calls at 200ms each add a full second to every query. Small models add 20-50ms total.
  • Data leaves your infrastructure: Every API call sends your documents to a third party. For healthcare, legal, and financial data, that may not be an option.
  • Rate limits & outages: API rate limits throttle your throughput, and a provider outage takes your entire pipeline down.
  • Non-deterministic results: LLM outputs vary between calls. Small models produce consistent, reproducible results every time.

The Termite Model Garden

Termite is a local ML inference server that runs ONNX models with an Ollama-compatible API. Think of it as Ollama, but for all the other models in your AI pipeline — the ones that aren't LLMs. Embeddings. Chunking. Reranking. NER. OCR. Rewriting.

It ships with a curated model garden of task-specific models, all optimized for CPU inference with FP16 and INT8 quantization options. One command to pull a model, one command to run it.

Small Models in Your Pipeline — All Running on Termite

Total model footprint: ~2.8 GB — all five models fit on a laptop with room to spare

# Pull and run models with Termite — just like Ollama
termite pull bge-small-en-v1.5        # Embeddings (128 MB)
termite pull mxbai-rerank-base-v1     # Reranking (713 MB)
termite pull chonky-mmbert-small-multilingual-1  # Chunking (570 MB)
termite pull gliner2-base-v1          # NER (798 MB)

# Start serving all models
termite run --models-dir ./models

# Use the Ollama-compatible API
curl http://localhost:11435/api/embed \
  -d '{"model": "bge-small-en-v1.5", "input": "Hello world"}'

Available Models

Every model in the Termite garden runs locally via ONNX Runtime, with optional XLA and Go backends. Browse the full catalog at antfly.io/termite/models.

| Category    | Model                              | Size   | What It Does                                     |
|-------------|------------------------------------|--------|--------------------------------------------------|
| Embedders   | bge-small-en-v1.5                  | 128 MB | Fast, high-quality text embeddings (384d)        |
| Embedders   | nomic-embed-text-v1.5              | 1.2 GB | Larger context window (8192 tokens, 768d)        |
| Embedders   | clip-vit-base-patch32              | 584 MB | Multimodal: unified image + text embeddings      |
| Embedders   | clap-htsat-unfused                 | 2 GB   | Audio embeddings                                 |
| Embedders   | splade-cocondenser                 | 508 MB | Sparse embeddings for learned sparse retrieval   |
| Chunkers    | chonky-mmbert-small-multilingual-1 | 570 MB | Semantic chunking, multilingual                  |
| Rerankers   | mxbai-rerank-base-v1               | 713 MB | Cross-encoder reranking for search results       |
| Recognizers | gliner2-base-v1                    | 798 MB | NER with customizable label sets                 |
| Recognizers | rebel-large                        | 2.9 GB | Relation extraction (entity + relationship)      |
| Rewriters   | flan-t5-small-squad-qg             | 569 MB | Question generation and query rewriting          |
| Rewriters   | pegasus-paraphrase                 | 4.5 GB | High-quality text paraphrasing                   |
| Readers     | paddleocr-onnx                     | 9.8 MB | OCR for scanned documents and images             |
| Generators  | functiongemma-270m-it              | 1.1 GB | Small local text generation (tool/function calling) |

Integrating Small Models Into Your Pipeline

Termite is designed to slot into existing workflows. If you're already using Antfly, small models are built directly into the database — just configure your index with an embedder, chunker, and reranker. If you're using Termite standalone, it exposes an Ollama-compatible API that any HTTP client can call.

With Antfly (built-in)

Small models run automatically as part of your index configuration. No separate service to manage.

# Create an index that uses local models for everything
antfly cli table create --table documents \
  --index '{
    "name": "smart_index",
    "type": "aknn_v0",
    "template": "{{title}} {{body}}",
    "embedder": {
      "provider": "antfly",
      "model": "bge-small-en-v1.5"
    },
    "chunker": {
      "provider": "antfly",
      "text": { "target_tokens": 200, "overlap_tokens": 25 }
    }
  }'

# Query with local reranking and pruning
antfly cli query --table documents \
  --semantic-search "quarterly revenue trends" \
  --indexes "smart_index" \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.5}'

Standalone Termite

Run Termite as an independent service and call it from any language or framework.

import requests

# Generate embeddings
resp = requests.post("http://localhost:11435/api/embed", json={
    "model": "bge-small-en-v1.5",
    "input": ["First document", "Second document"]
})
embeddings = resp.json()["embeddings"]

# Rerank search results
resp = requests.post("http://localhost:11435/api/rerank", json={
    "model": "mxbai-rerank-base-v1",
    "query": "quarterly revenue",
    "documents": ["Q3 financials...", "Team lunch menu..."]
})
ranked = resp.json()["results"]  # sorted by relevance

Where LLMs Still Win

This isn't an argument against LLMs. It's an argument for using the right tool for the job. LLMs are unmatched for tasks that require reasoning, creativity, and handling ambiguity.

Use Small Models For
  • Embedding generation
  • Document chunking
  • Reranking search results
  • Named entity extraction
  • Query rewriting / expansion
  • OCR / document reading
  • Sparse retrieval (SPLADE)
Use LLMs For
  • Generating natural language answers
  • Complex reasoning & synthesis
  • Multi-step planning (agents)
  • Conversational interfaces
  • Creative content generation
  • Summarization with nuance
  • Evaluating output quality (judge)
The best pipelines use both: The optimal architecture uses small models for retrieval infrastructure (chunking, embedding, reranking, extraction) and sends only the final, well-curated context to an LLM for generation. You get the speed and cost benefits of small models and the reasoning power of LLMs — each doing what they're best at.
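
Sketched end to end, using the documented /api/rerank endpoint: call_llm is a placeholder for whatever LLM client you already use, and the string form of the reranked items is an assumption.

import requests

def answer(question, candidates, call_llm):
    # Retrieval infrastructure: local, fast, free (Termite).
    reranked = requests.post("http://localhost:11435/api/rerank", json={
        "model": "mxbai-rerank-base-v1",
        "query": question,
        "documents": candidates,
    }).json()["results"]

    # Generation: send only the curated context to the LLM.
    context = "\n\n".join(str(r) for r in reranked[:5])  # item shape assumed
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)  # your LLM provider's client, not shown here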

Start in 5 Minutes

Termite is a single binary. Install it, pull the models you need, and start serving. No Docker, no GPU, no configuration files.

# Install Termite
brew install antfly-io/tap/termite

# Pull a few models
termite pull bge-small-en-v1.5
termite pull mxbai-rerank-base-v1

# Start serving
termite run

# That's it. Embeddings at localhost:11435/api/embed
# Reranking at localhost:11435/api/rerank

Or, if you're using Antfly, Termite is already built in — just start Antfly Swarm and configure your indexes to use local models. The database handles model lifecycle, caching, and distribution automatically.

Stop overpaying for your AI pipeline

Browse the Termite model garden, pull the models you need, and start running inference locally in minutes.