Build a RAG Pipeline with Antfly
Go from zero to a fully functional Retrieval-Augmented Generation pipeline -- running entirely on your machine with Antfly Swarm and Ollama. No cloud keys required.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses in your actual data. Instead of relying solely on what a model learned during training, RAG retrieves relevant documents from a knowledge base and passes them as context to the LLM -- producing answers that are accurate, up-to-date, and specific to your domain.
In its simplest form, a RAG pipeline does three things:

1. Index your documents so they can be searched.
2. Retrieve the chunks most relevant to a user's question.
3. Generate an answer with an LLM, using the retrieved chunks as context.
The quality of your RAG pipeline depends heavily on what happens before retrieval -- how you chunk your documents, which embedding model you choose, and how you index and rank results. Antfly handles all of these as built-in capabilities, so you can focus on your application logic instead of plumbing infrastructure.
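The three stages can be sketched in a few lines of Python. Here retrieval is naive word overlap and generation is left as a comment -- Antfly replaces the retrieval step with chunked hybrid search, and the helper names are illustrative, not Antfly APIs:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieve: rank documents by words shared with the query
    (a toy stand-in for real keyword/vector search)."""
    def words(s: str) -> set[str]:
        return set(s.lower().replace("?", "").replace(".", "").split())
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Augment: pass the retrieved passages to the model as context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Seoul is the capital of South Korea.",
    "The mitochondria is the powerhouse of the cell.",
    "Hangul is the Korean alphabet.",
]
question = "What is the capital of Korea?"
prompt = build_prompt(question, retrieve(question, docs))
# Generate: send `prompt` to any LLM to get a grounded answer.
```

The rest of this tutorial replaces each toy stage with a production-grade counterpart: chunked embedding indexes, hybrid retrieval with reranking, and streaming generation.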
Why Antfly for RAG?
Most RAG pipelines require stitching together multiple tools: a vector database for embeddings, a search engine for keyword matching, a chunking library, a reranking service, and orchestration code to connect them all. Antfly collapses this stack into a single binary.
Traditional stack: Pinecone (vectors) + Elasticsearch (keyword) + LangChain (chunking) + Cohere (reranking) + custom glue code.
Antfly stack:
antfly swarm -- one command, all included.

Prerequisites
You'll need two things installed on your machine -- Antfly and Ollama. Both are single binaries with no external dependencies.
| Tool | Purpose | Minimum Requirements |
|---|---|---|
| Antfly | Database, indexing, search, chunking, RAG | 2 CPU cores, 4GB RAM, 20GB disk |
| Ollama | Local embedding models + LLM for generation | 8GB RAM recommended for generation models |
# Install Antfly (visit antfly.io/downloads for your platform)
# macOS example:
brew install antfly-io/tap/antfly
# Install Ollama
brew install ollama
# Pull the models we'll use
ollama pull all-minilm # Embedding model (384 dimensions)
ollama pull gemma3:4b-it-qat # Generation model for RAG answers

all-minilm is a small, fast embedding model that produces 384-dimensional vectors -- great for getting started. gemma3:4b-it-qat is a quantized 4B-parameter model that runs well on consumer hardware. You can swap in larger models later for better quality.

1. Start Antfly Swarm
Swarm mode runs the entire Antfly stack -- metadata server, storage nodes, and the Termite model runner -- in a single process. One command, zero configuration.
antfly swarm 2>&1 | tee "antfly.log"

You'll see logs as the metadata server and storage nodes start up. The API is available at http://localhost:8080, which also serves a web dashboard for managing tables and running queries.
All four components run inside a single antfly swarm process
If you're using Antfly Cloud, skip this step entirely. Point your CLI or SDK at your cloud cluster endpoint instead of
localhost:8080. Everything else in this tutorial works the same.

2. Create a Table & Index
In Antfly, a table holds your documents and an index defines how they're searchable. For RAG, you'll typically create an aknn_v0 index that handles both vector embeddings and chunking.
# Create a table called "knowledge_base"
antfly cli table create --table knowledge_base \
--index '{
"name": "content_index",
"type": "aknn_v0",
"template": "{{title}} {{body}}",
"embedder": {
"provider": "ollama",
"model": "all-minilm"
},
"chunker": {
"provider": "antfly",
"text": {
"target_tokens": 200,
"overlap_tokens": 25
}
}
  }'

Let's break down what each part does:
| Field | What It Does |
|---|---|
| `type: "aknn_v0"` | Creates an approximate nearest-neighbor vector index using SPANN with RaBitQ quantization |
| `template` | A Handlebars template that controls which fields get embedded -- here, the title and body are concatenated |
| `embedder` | Configures the embedding provider. Ollama runs locally; you can also use OpenAI, Bedrock, Gemini, or Anthropic |
| `chunker` | Termite splits documents into ~200-token chunks with 25-token overlap, preserving semantic coherence |
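The chunker's sliding-window behavior can be sketched as follows. Whitespace tokens stand in for the real tokenizer here -- this illustrates the idea, not Termite's implementation:

```python
def chunk(text: str, target_tokens: int = 200, overlap_tokens: int = 25) -> list[str]:
    """Split text into windows of ~target_tokens, each sharing
    overlap_tokens with its predecessor so boundary context survives."""
    tokens = text.split()
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break
    return chunks

# 450 tokens with step 175 -> 3 chunks; consecutive chunks share 25 tokens.
text = " ".join(f"t{i}" for i in range(450))
parts = chunk(text)
```

Each window becomes its own embedded vector, which is why retrieval can surface the most relevant section of a long document rather than the whole thing.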
When you load a document, Termite's chunker automatically splits it into smaller segments before embedding. Each chunk gets its own vector, so retrieval can find the most relevant section of a document -- not just the document itself. The
overlap_tokens parameter ensures context isn't lost at chunk boundaries.

3. Load Your Data
Antfly accepts JSON documents. Each document can have any structure -- Antfly is schema-optional. When a document is inserted, the chunker and embedder run asynchronously in the background.
# Download a sample of 1,000 Wikipedia articles (~11MB)
curl -L -o wiki-articles.json \
http://fulmicoton.com/tantivy-files/wiki-articles-1000.json
# Load them into your table
antfly cli load --table knowledge_base \
--file wiki-articles.json \
  --id-field title

The --id-field option tells Antfly to use each article's title as the document ID. Antfly will load the data in batches and start generating embeddings in the background. You can monitor progress:
# Check embedding progress
antfly cli index list --table knowledge_base
# Look for active_vectors to see how many chunks have been embedded

Ingestion pipeline: documents are chunked and embedded asynchronously after insert
Antfly can also ingest content directly from URLs -- including web pages, PDFs, and S3 objects. Multimodal support means you can index images and let vision models generate summaries for embedding. See the docs for remote content ingestion.
4. Query with Hybrid Search
Antfly supports three search modes. For RAG, hybrid search -- combining keyword and semantic search -- typically gives the best retrieval quality.
Full-Text Search (BM25)
Keyword-based search using Bleve's BM25 ranking. Great for exact term matching.
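For intuition, here is the BM25 formula in miniature -- Bleve's production implementation differs in tokenization and tuning, but the ranking idea is the same: rare query terms score higher (IDF), repeated terms saturate (k1), and long documents are penalized (b):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(toks) for toks in tokenized) / N
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)              # term frequency in this document
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

docs = ["korea korea history", "french history", "cooking recipes"]
scores = bm25_scores("korea", docs)   # only the first document matches
```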
antfly cli query --table knowledge_base \
--full-text-search 'body:"Korea"' \
--fields "title,url" \
  --limit 5

Semantic Search (Vector)
Finds documents with similar meaning, even when different words are used.
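Under the hood this is nearest-neighbor search over embedding vectors. A toy cosine-similarity version, with tiny hand-made vectors standing in for the 384-dimensional ones all-minilm would produce:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend index: document -> embedding vector.
index = {
    "heart anatomy": [0.9, 0.1, 0.0],
    "stock markets": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]   # pretend embedding of "anatomy and physiology"
best = max(index, key=lambda doc: cosine(index[doc], query_vec))
```

Antfly's aknn_v0 index answers the same question approximately (SPANN with RaBitQ quantization) so it stays fast at millions of vectors.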
antfly cli query --table knowledge_base \
--semantic-search "anatomy and physiology" \
--indexes "content_index" \
--fields "title,url" \
  --limit 5

Hybrid Search (Recommended for RAG)
Combines both approaches using Reciprocal Rank Fusion (RRF). This gets the precision of keyword matching and the recall of semantic search in a single query.
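RRF itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k = 60 by convention. Documents ranked well by both keyword and semantic search rise to the top:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

keyword_hits = ["einstein_bio", "patent_office", "zurich"]
semantic_hits = ["relativity", "einstein_bio", "spacetime"]
fused = rrf([keyword_hits, semantic_hits])
# einstein_bio appears in both lists, so it wins the fused ranking.
```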
antfly cli query --table knowledge_base \
--full-text-search 'body:Einstein' \
--semantic-search "theory of relativity and physics" \
--indexes "content_index" \
--fields "title,url" \
--limit 10 \
--reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.5}'

min_score_ratio keeps results scoring ≥ N% of the top hit. max_score_gap_percent detects sharp relevance drop-offs.

5. Add RAG Generation
This is where it all comes together. Antfly's RAG endpoint retrieves relevant chunks, feeds them to an LLM as context, and streams back a grounded answer -- all in a single API call.
Streaming RAG Query
antfly cli agents retrieval --table knowledge_base \
--semantic-search "What are the major events in Korean history?" \
--indexes "content_index" \
--fields "title,body" \
--limit 5 \
--reranker '{ "provider": "antfly", "field": "body" }' \
--pruner '{"min_score_ratio": 0.6, "max_score_gap_percent": 40}' \
--generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --system-prompt "You are a helpful assistant. Answer based on the provided context."

Here's what happens under the hood when you run this command:

1. Semantic search retrieves the top candidate chunks from content_index.
2. The reranker reorders them by relevance to the query.
3. The pruner drops low-scoring chunks.
4. The surviving chunks are injected into the LLM prompt along with your system prompt.
5. gemma3:4b-it-qat streams back an answer grounded in that context.
Structured JSON Output
Add --streaming=false to get a structured JSON response with the generated answer and source references -- useful for building UIs that need to display citations.
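A sketch of consuming such a response. Note that the field names here (answer, references, title) are illustrative assumptions, not Antfly's documented schema -- check the docs for the actual shape:

```python
import json

# Hypothetical response payload -- field names are assumptions
# for illustration only.
raw = """{
  "answer": "Special relativity links space and time into spacetime.",
  "references": [
    {"title": "Albert Einstein", "url": "https://en.wikipedia.org/wiki/Albert_Einstein"},
    {"title": "Theory of relativity", "url": "https://en.wikipedia.org/wiki/Theory_of_relativity"}
  ]
}"""

resp = json.loads(raw)
# Render numbered citations the way a chat UI might display them.
citations = [f"[{i + 1}] {ref['title']}" for i, ref in enumerate(resp["references"])]
answer_with_citations = resp["answer"] + " " + " ".join(citations)
```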
antfly cli agents retrieval --table knowledge_base \
--semantic-search "Explain the theory of relativity" \
--indexes "content_index" \
--fields "title,body,url" \
--limit 5 \
--generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --streaming=false

Swap the generator config to use any supported provider. For example,
"provider": "openai", "model": "gpt-4o" or "provider": "anthropic", "model": "claude-sonnet-4-5-20250514". Retrieval still runs locally -- only generation calls the external API.

6. Evaluate & Optimize
A RAG pipeline is only as good as its retrieval quality. Antfly includes built-in evaluation metrics so you can measure and improve performance without a separate eval framework.
# Pull a larger model to use as a judge
ollama pull gemma3:12b-it-qat
# Run RAG with evaluation
antfly cli agents retrieval --table knowledge_base \
--semantic-search "What are the major events in Korean history?" \
--indexes "content_index" \
--fields "title,body" \
--limit 5 \
--generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
--eval '{
"evaluators": ["faithfulness", "relevance"],
"judge": { "provider": "ollama", "model": "gemma3:12b-it-qat" }
  }'

Antfly supports two categories of evaluation metrics:
| Category | Metrics | What They Measure |
|---|---|---|
| Retrieval | `recall`, `precision`, `ndcg`, `mrr`, `map` | How well retrieval finds the right documents (requires ground truth) |
| LLM-as-Judge | `faithfulness`, `relevance`, `completeness`, `coherence` | Quality of the generated answer (uses a judge model) |
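Two of the retrieval metrics are easy to sketch against a hand-labeled ground-truth set: recall@k (what fraction of the relevant documents appear in the top k) and MRR (the reciprocal rank of the first relevant hit):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result, 0 if none found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7"]     # what the pipeline returned, in order
relevant = {"d1", "d2"}            # hand-labeled ground truth
r = recall_at_k(retrieved, relevant, k=3)   # d1 found, d2 missed -> 0.5
m = mrr(retrieved, relevant)                # first hit at rank 2 -> 0.5
```

The LLM-as-Judge metrics have no closed-form formula -- the judge model reads the question, context, and answer, and scores each criterion.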
Tuning Tips
Once you have eval scores, here are the most impactful levers to improve your pipeline:
- target_tokens -- smaller chunks give more precise retrieval; larger chunks carry more context per hit.
- embedder.model -- swap in a larger embedding model for better semantic matching.
- --reranker -- rerank candidates against the query to sharpen ordering.
- --pruner -- tune min_score_ratio and max_score_gap_percent to cut weak context.
- --indexes -- search multiple indexes with different configurations.

Multi-Index Search
You can create multiple indexes on the same table with different configurations and search across all of them:
# Create a second index with smaller chunks for precise retrieval
antfly cli index create --table knowledge_base \
--index body_precise \
--type aknn_v0 \
--field "body" \
--embedder '{ "provider": "ollama", "model": "all-minilm" }' \
--chunker '{ "provider": "antfly", "text": { "target_tokens": 100 } }'
# Search across both indexes
antfly cli query --table knowledge_base \
--semantic-search "ancient civilizations" \
--indexes "content_index,body_precise" \
--fields "title,body" \
  --limit 20

Going to Production
Swarm mode is ideal for development and small deployments. When you're ready to scale, Antfly offers two paths:

- Self-hosted Kubernetes: deploy with antfly-operator for autoscaling, rolling upgrades, and Prometheus monitoring.
- Antfly Cloud: a managed cluster; point your CLI or SDK at its endpoint.

Everything you built in this tutorial -- tables, indexes, queries, RAG pipelines -- works identically whether you're running Swarm on your laptop, a self-hosted Kubernetes cluster, or Antfly Cloud. No code changes needed.
What You Built
In this tutorial, you set up a complete RAG pipeline that:

- Runs entirely on your machine with antfly swarm and Ollama
- Chunks and embeds documents automatically at ingest time
- Retrieves with hybrid (keyword + semantic) search, reranking, and pruning
- Streams grounded answers with source references from a local LLM
- Measures retrieval and answer quality with built-in evaluation metrics