🏗️ Step-by-Step Tutorial

Build a RAG Pipeline with Antfly

Go from zero to a fully functional Retrieval-Augmented Generation pipeline -- running entirely on your machine with Antfly Swarm and Ollama. No cloud keys required.

Time: ~20 min
Mode: Local (Swarm)
LLM: Ollama
Language: CLI + Python
The Antfly RAG Pipeline

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses in your actual data. Instead of relying solely on what a model learned during training, RAG retrieves relevant documents from a knowledge base and passes them as context to the LLM -- producing answers that are accurate, up-to-date, and specific to your domain.

In its simplest form, a RAG pipeline does three things:

Retrieve → Augment → Generate
1. Retrieve
Search your knowledge base using the user's query to find the most relevant documents.
2. Augment
Combine the retrieved documents with the original query into a structured prompt for the LLM.
3. Generate
Pass the augmented prompt to an LLM, which produces a grounded response using the retrieved context.
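
Stripped to its essence, the loop is a dozen lines of code. The sketch below is plain Python with placeholder stubs standing in for a real search index and LLM -- Antfly fills both roles in the rest of this tutorial:

# A minimal RAG loop. The retriever and generator here are placeholder
# stubs purely to show the shape of the pipeline.

def retrieve(query: str, k: int = 5) -> list[str]:
    # Stub: a real implementation searches a knowledge base.
    corpus = ["Seoul is the capital of South Korea.",
              "The Joseon dynasty ruled Korea for five centuries."]
    words = query.lower().split()
    return [doc for doc in corpus if any(w in doc.lower() for w in words)][:k]

def generate(prompt: str) -> str:
    # Stub: a real implementation calls an LLM.
    return f"(answer grounded in: {prompt[:60]}...)"

def rag(query: str) -> str:
    docs = retrieve(query)                                # 1. Retrieve
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"  # 2. Augment
    return generate(prompt)                               # 3. Generate

print(rag("Korea"))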

The quality of your RAG pipeline depends heavily on what happens before retrieval -- how you chunk your documents, which embedding model you choose, and how you index and rank results. Antfly handles all of these as built-in capabilities, so you can focus on your application logic instead of plumbing infrastructure.

Why Antfly for RAG?

Most RAG pipelines require stitching together multiple tools: a vector database for embeddings, a search engine for keyword matching, a chunking library, a reranking service, and orchestration code to connect them all. Antfly collapses this stack into a single binary.

🔀 Hybrid Search Built-in
BM25 keyword search + vector similarity with Reciprocal Rank Fusion (RRF). No plugins, no separate services.

✂️ Automatic Chunking
Termite handles semantic chunking with configurable token targets and overlap. Multi-tier caching keeps it fast.

🧮 Managed Embeddings
Antfly manages the full embedding lifecycle -- generation, storage, and updates. Swap models without rebuilding.

📊 Reranking & Pruning
Built-in cross-encoder reranking and score-based pruning. Improve relevance without external services.

🏠 Local-First
Swarm mode runs everything on your machine. Pair with Ollama for a fully private, zero-API-key pipeline.

📈 Built-in Evals
Measure retrieval quality and generation faithfulness with LLM-as-judge evaluators. No separate eval framework needed.
Antfly vs. the Typical RAG Stack
Traditional stack: Pinecone (vectors) + Elasticsearch (keyword) + LangChain (chunking) + Cohere (reranking) + custom glue code.
Antfly stack: antfly swarm -- one command, all included.

Prerequisites

You'll need two things installed on your machine -- Antfly and Ollama. Both are single binaries with no external dependencies.

Tool    | Purpose                                       | Minimum Requirements
Antfly  | Database, indexing, search, chunking, RAG     | 2 CPU cores, 4GB RAM, 20GB disk
Ollama  | Local embedding models + LLM for generation   | 8GB RAM recommended for generation models
# Install Antfly (visit antfly.io/downloads for your platform)
# macOS example:
brew install antfly-io/tap/antfly

# Install Ollama
brew install ollama

# Pull the models we'll use
ollama pull all-minilm          # Embedding model (384 dimensions)
ollama pull gemma3:4b-it-qat    # Generation model for RAG answers
Why these models?
all-minilm is a small, fast embedding model that produces 384-dimensional vectors -- great for getting started. gemma3:4b-it-qat is a quantized 4B parameter model that runs well on consumer hardware. You can swap in larger models later for better quality.

1. Start Antfly Swarm

Swarm mode runs the entire Antfly stack -- metadata server, storage nodes, and the Termite model runner -- in a single process. One command, zero configuration.

antfly swarm 2>&1 | tee "antfly.log"

You'll see logs as the metadata server and storage nodes start up. The API is available at http://localhost:8080, which also serves a web dashboard for managing tables and running queries.

Metadata · Storage · Termite · API :8080

All four components run inside a single antfly swarm process

Using Antfly Cloud instead?
If you're using Antfly Cloud, skip this step entirely. Point your CLI or SDK at your cloud cluster endpoint instead of localhost:8080. Everything else in this tutorial works the same.
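
Either way, a quick sanity check from Python confirms the API is reachable before you move on. This assumes only that the endpoint answers a plain HTTP GET (it serves the web dashboard, as noted above):

# Connectivity check against the local Swarm API (or your cloud endpoint).
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080", timeout=5) as resp:
        print(f"Antfly is up (HTTP {resp.status})")
except OSError as exc:
    print(f"Antfly not reachable: {exc}")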

2. Create a Table & Index

In Antfly, a table holds your documents and an index defines how they're searchable. For RAG, you'll typically create an aknn_v0 index that handles both vector embeddings and chunking.

# Create a table called "knowledge_base"
antfly cli table create --table knowledge_base \
  --index '{
    "name": "content_index",
    "type": "aknn_v0",
    "template": "{{title}} {{body}}",
    "embedder": {
      "provider": "ollama",
      "model": "all-minilm"
    },
    "chunker": {
      "provider": "antfly",
      "text": {
        "target_tokens": 200,
        "overlap_tokens": 25
      }
    }
  }'

Let's break down what each part does:

Field            | What It Does
type: "aknn_v0"  | Creates an approximate nearest-neighbor vector index using SPANN with RaBitQ quantization
template         | A Handlebars template that controls which fields get embedded -- here, the title and body are concatenated
embedder         | Configures the embedding provider. Ollama runs locally; you can also use OpenAI, Bedrock, Gemini, or Anthropic
chunker          | Termite splits documents into ~200-token chunks with 25-token overlap, preserving semantic coherence
How chunking works in Antfly
When you load a document, Termite's chunker automatically splits it into smaller segments before embedding. Each chunk gets its own vector, so retrieval can find the most relevant section of a document -- not just the document itself. The overlap_tokens parameter ensures context isn't lost at chunk boundaries.
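
To build intuition for the two knobs, here is a deliberately simplified sliding-window chunker. It splits on whitespace rather than real tokens and ignores semantics entirely -- an illustration of the target/overlap mechanics, not Termite's actual algorithm:

# Simplified windowed chunking with overlap (illustrative only).
def chunk(text: str, target_tokens: int = 200, overlap_tokens: int = 25) -> list[str]:
    words = text.split()
    step = target_tokens - overlap_tokens
    return [" ".join(words[i:i + target_tokens])
            for i in range(0, max(len(words) - overlap_tokens, 1), step)]

doc = "word " * 450
pieces = chunk(doc)
print(len(pieces))  # 3 -- adjacent chunks share 25 words of context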

3. Load Your Data

Antfly accepts JSON documents. Each document can have any structure -- Antfly is schema-optional. When a document is inserted, the chunker and embedder run asynchronously in the background.

# Download sample Wikipedia articles (11MB, 1,000 articles)
curl -L -o wiki-articles.json \
  http://fulmicoton.com/tantivy-files/wiki-articles-1000.json

# Load them into your table
antfly cli load --table knowledge_base \
  --file wiki-articles.json \
  --id-field title

The --id-field option tells Antfly to use each article's title as the document ID. Antfly will load the data in batches and start generating embeddings in the background. You can monitor progress:

# Check embedding progress
antfly cli index list --table knowledge_base

# Look for active_vectors to see how many chunks have been embedded
JSON Docs → Chunker → Embedder → Index

Ingestion pipeline: documents are chunked and embedded asynchronously after insert

Using your own data?
Antfly can also ingest content directly from URLs -- including web pages, PDFs, and S3 objects. Multimodal support means you can index images and let vision models generate summaries for embedding. See the docs for remote content ingestion.
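
If your own data lives somewhere less convenient -- say a CSV export -- a few lines of Python will convert it to the one-JSON-object-per-line layout used by the Wikipedia sample. The articles.csv file and its title/body columns below are hypothetical stand-ins for your data:

# Convert a CSV into JSON lines for `antfly cli load`.
# "articles.csv" and its column names are placeholders for your own data.
import csv, json

with open("articles.csv", newline="", encoding="utf-8") as src, \
     open("articles.json", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps({"title": row["title"], "body": row["body"]}) + "\n")

Then load it exactly as above: antfly cli load --table knowledge_base --file articles.json --id-field title.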

4. Search Your Knowledge Base

Antfly supports three search modes. For RAG, hybrid search -- combining keyword and semantic search -- typically gives the best retrieval quality.

Full-Text Search (BM25)

Keyword-based search using Bleve's BM25 ranking. Great for exact term matching.

antfly cli query --table knowledge_base \
  --full-text-search 'body:"Korea"' \
  --fields "title,url" \
  --limit 5

Semantic Search (Vector)

Finds documents with similar meaning, even when different words are used.

antfly cli query --table knowledge_base \
  --semantic-search "anatomy and physiology" \
  --indexes "content_index" \
  --fields "title,url" \
  --limit 5

Hybrid Search (Recommended for RAG)

Combines both approaches using Reciprocal Rank Fusion (RRF). This gets the precision of keyword matching and the recall of semantic search in a single query.

antfly cli query --table knowledge_base \
  --full-text-search 'body:Einstein' \
  --semantic-search "theory of relativity and physics" \
  --indexes "content_index" \
  --fields "title,url" \
  --limit 10 \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.5}'
Reranking
Uses a cross-encoder model to re-score results based on query-document relevance. Improves ordering without changing result count. Antfly's built-in reranker runs locally via Termite.
Pruning
Filters out low-quality results. min_score_ratio keeps only results scoring at least that fraction of the top hit's score; max_score_gap_percent cuts the list where relevance drops off sharply between consecutive results.
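
Reciprocal Rank Fusion itself is simple enough to show in full: each document's fused score is the sum of 1/(k + rank) across the ranked lists it appears in, where k is a smoothing constant (60 is the common default). A self-contained sketch:

# Reciprocal Rank Fusion over any number of ranked result lists.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["Einstein", "Relativity", "Physics"]
vector_hits = ["Relativity", "Spacetime", "Einstein"]
print(rrf([bm25_hits, vector_hits]))
# ['Relativity', 'Einstein', 'Spacetime', 'Physics'] -- documents found
# by both retrievers rise to the top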

5. Add RAG Generation

This is where it all comes together. Antfly's RAG endpoint retrieves relevant chunks, feeds them to an LLM as context, and streams back a grounded answer -- all in a single API call.

Streaming RAG Query

antfly cli agents retrieval --table knowledge_base \
  --semantic-search "What are the major events in Korean history?" \
  --indexes "content_index" \
  --fields "title,body" \
  --limit 5 \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.6, "max_score_gap_percent": 40}' \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --system-prompt "You are a helpful assistant. Answer based on the provided context."

Here's what happens under the hood when you run this command:

Search → Rerank → Prune → Generate → Stream
1. Hybrid search finds relevant chunks using BM25 + vector similarity
2. Cross-encoder reranker re-scores results for better relevance
3. Pruner removes low-quality results (≥60% of top score, stops at 40% gap)
4. Surviving chunks are formatted and sent to Gemma 3 with your system prompt
5. Response streams back via Server-Sent Events as the LLM generates tokens
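
If you'd rather drive this from Python than the CLI, the sketch below shows how to consume a Server-Sent Events stream. The URL path and request body are hypothetical placeholders, not Antfly's documented API -- check the API reference for the real route; only the SSE framing (one data: line per event) is standard:

# Consume an SSE stream from the RAG endpoint.
# NOTE: the path and JSON body below are assumed for illustration;
# consult the Antfly API reference for the actual schema.
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/tables/knowledge_base/rag",  # hypothetical path
    data=json.dumps({
        "semantic_search": "What are the major events in Korean history?",
        "generator": {"provider": "ollama", "model": "gemma3:4b-it-qat"},
    }).encode(),
    headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:                       # SSE: events arrive line by line
        line = raw.decode().strip()
        if line.startswith("data:"):
            print(line[5:].strip())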

Structured JSON Output

Add --streaming=false to get a structured JSON response with the generated answer and source references -- useful for building UIs that need to display citations.

antfly cli agents retrieval --table knowledge_base \
  --semantic-search "Explain the theory of relativity" \
  --indexes "content_index" \
  --fields "title,body,url" \
  --limit 5 \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --streaming=false
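
Rendering that JSON as an answer with citations takes only a few lines. The answer and sources field names below are assumptions about the response shape -- inspect the actual payload and adjust:

# Print a non-streaming RAG response with its citations.
# NOTE: "answer" and "sources" are assumed field names, not a documented schema.
import json

with open("response.json") as f:   # e.g. the CLI output redirected to a file
    data = json.load(f)

print(data["answer"])
for i, src in enumerate(data.get("sources", []), start=1):
    print(f"[{i}] {src.get('title')} -- {src.get('url')}")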
Using a cloud LLM?
Swap the generator config to use any supported provider. For example, "provider": "openai", "model": "gpt-4o" or "provider": "anthropic", "model": "claude-sonnet-4-5". Retrieval still runs locally -- only generation calls the external API.

6. Evaluate & Optimize

A RAG pipeline is only as good as its retrieval quality. Antfly includes built-in evaluation metrics so you can measure and improve performance without a separate eval framework.

# Pull a larger model to use as a judge
ollama pull gemma3:12b-it-qat

# Run RAG with evaluation
antfly cli agents retrieval --table knowledge_base \
  --semantic-search "What are the major events in Korean history?" \
  --indexes "content_index" \
  --fields "title,body" \
  --limit 5 \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --eval '{
    "evaluators": ["faithfulness", "relevance"],
    "judge": { "provider": "ollama", "model": "gemma3:12b-it-qat" }
  }'

Antfly supports two categories of evaluation metrics:

Category      | Metrics                                          | What They Measure
Retrieval     | recall, precision, ndcg, mrr, map                | How well retrieval finds the right documents (requires ground truth)
LLM-as-Judge  | faithfulness, relevance, completeness, coherence | Quality of the generated answer (uses a judge model)
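
The retrieval metrics are straightforward to compute yourself once you have a few labeled queries. For each query you need the ranked IDs your search returned and the set of IDs judged relevant -- then, for example, recall@k and MRR are:

# Recall@k and MRR (mean reciprocal rank) for one query.
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

ranked, relevant = ["doc7", "doc2", "doc9"], {"doc2", "doc4"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5 -- found 1 of 2 relevant docs
print(mrr(ranked, relevant))               # 0.5 -- first relevant hit at rank 2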

Tuning Tips

Once you have eval scores, here are the most impactful levers to improve your pipeline:

Chunk size (target_tokens): Smaller chunks (100-150 tokens) -> more precise retrieval. Larger chunks (300-500) -> more context per result.

Embedding model (embedder.model): Upgrade to nomic-embed-text (768 dimensions, 8192-token context) or mxbai-embed-large for better quality.

Reranking (--reranker): Always use reranking for RAG. The cross-encoder sees the full query-document pair, catching relevance that embeddings miss.

Pruning thresholds (--pruner): Adjust min_score_ratio (0.4-0.7) and max_score_gap_percent (30-50) to balance context quality vs. coverage (see the sketch after this list).

Multi-index (--indexes): Create indexes with different chunking configs or fields. Search across multiple indexes for broader coverage.
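
To see how the two pruning thresholds interact, here is the filtering logic in miniature -- an illustration of the behaviour described above, not Antfly's implementation:

# Score-based pruning: enforce a floor relative to the top score, and
# stop at the first sharp drop-off between consecutive results.
def prune(scores: list[float], min_score_ratio: float = 0.5,
          max_score_gap_percent: float = 40.0) -> list[float]:
    if not scores:
        return []
    kept, top, prev = [], scores[0], scores[0]
    for cur in scores:
        if cur < top * min_score_ratio:                        # ratio floor
            break
        if (prev - cur) / prev * 100 > max_score_gap_percent:  # gap cut-off
            break
        kept.append(cur)
        prev = cur
    return kept

print(prune([0.92, 0.88, 0.45, 0.44]))  # [0.92, 0.88] -- 0.45 trips both checks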

Multi-Index Search

You can create multiple indexes on the same table with different configurations and search across all of them:

# Create a second index with smaller chunks for precise retrieval
antfly cli index create --table knowledge_base \
  --index body_precise \
  --type aknn_v0 \
  --field "body" \
  --embedder '{ "provider": "ollama", "model": "all-minilm" }' \
  --chunker '{ "provider": "antfly", "text": { "target_tokens": 100 } }'

# Search across both indexes
antfly cli query --table knowledge_base \
  --semantic-search "ancient civilizations" \
  --indexes "content_index,body_precise" \
  --fields "title,body" \
  --limit 20

Going to Production

Swarm mode is ideal for development and small deployments. When you're ready to scale, Antfly offers two paths:

Self-Hosted Production
Run separate metadata and storage nodes on your own infrastructure. Use the official antfly-operator for Kubernetes with autoscaling, rolling upgrades, and Prometheus monitoring.
Recommended: 4+ CPU / 8GB+ RAM (metadata), 8+ CPU / 16GB+ RAM / SSD (storage)
Antfly Cloud
Managed deployment with automatic scaling, zero-downtime upgrades, SSO/auth, and pre-built application templates (searchaf, agentaf, chataf). Same API -- just point your client at the cloud endpoint.
Includes: Launch service, pre-built solutions, team management
Same API, any scale
Everything you built in this tutorial -- tables, indexes, queries, RAG pipelines -- works identically whether you're running Swarm on your laptop, a self-hosted Kubernetes cluster, or Antfly Cloud. No code changes needed.

What You Built

In this tutorial, you set up a complete RAG pipeline that:

Runs entirely on your machine -- no cloud API keys, no external dependencies
Stores documents with automatic chunking and embedding via Termite
Searches with hybrid BM25 + vector similarity using RRF fusion
Reranks and prunes results for optimal context quality
Generates grounded answers with streaming LLM responses
Evaluates quality with built-in faithfulness and relevance metrics

Next Steps