🏗️ Step-by-Step Tutorial

Build a RAG Pipeline with Antfly

Go from zero to a fully functional Retrieval-Augmented Generation pipeline -- running entirely on your machine with Antfly Swarm and Ollama. No cloud keys required.

Time: ~20 min
Mode: Local (Swarm)
LLM: Ollama
Language: CLI + Python
The Antfly RAG Pipeline

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses in your actual data. Instead of relying solely on what a model learned during training, RAG retrieves relevant documents from a knowledge base and passes them as context to the LLM -- producing answers that are accurate, up-to-date, and specific to your domain.

In its simplest form, a RAG pipeline does three things:

Retrieve → Augment → Generate
1. Retrieve
Search your knowledge base using the user's query to find the most relevant documents.
2. Augment
Combine the retrieved documents with the original query into a structured prompt for the LLM.
3. Generate
Pass the augmented prompt to an LLM, which produces a grounded response using the retrieved context.
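
Stripped to its essence, the loop is a dozen lines of code. The sketch below is plain Python with placeholder stubs standing in for a real search index and LLM -- Antfly fills both roles in the rest of this tutorial:

# A minimal RAG loop. The retriever and generator here are placeholder
# stubs purely to show the shape of the pipeline.

def retrieve(query: str, k: int = 5) -> list[str]:
    # Stub: a real implementation searches a knowledge base.
    corpus = ["Seoul is the capital of South Korea.",
              "The Joseon dynasty ruled Korea for five centuries."]
    words = query.lower().split()
    return [doc for doc in corpus if any(w in doc.lower() for w in words)][:k]

def generate(prompt: str) -> str:
    # Stub: a real implementation calls an LLM.
    return f"(answer grounded in: {prompt[:60]}...)"

def rag(query: str) -> str:
    docs = retrieve(query)                                # 1. Retrieve
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"  # 2. Augment
    return generate(prompt)                               # 3. Generate

print(rag("Korea"))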

The quality of your RAG pipeline depends heavily on what happens before retrieval -- how you chunk your documents, which embedding model you choose, and how you index and rank results. Antfly handles all of these as built-in capabilities, so you can focus on your application logic instead of plumbing infrastructure.

Why Antfly for RAG?

Most RAG pipelines require stitching together multiple tools: a vector database for embeddings, a search engine for keyword matching, a chunking library, a reranking service, and orchestration code to connect them all. Antfly collapses this stack into a single binary.

🔀 Hybrid Search Built-in
BM25 keyword search + vector similarity with Reciprocal Rank Fusion (RRF). No plugins, no separate services.

✂️ Automatic Chunking
Termite handles semantic chunking with configurable token targets and overlap. Multi-tier caching keeps it fast.

🧮 Managed Embeddings
Antfly manages the full embedding lifecycle -- generation, storage, and updates. Swap models without rebuilding.

📊 Reranking & Pruning
Built-in cross-encoder reranking and score-based pruning. Improve relevance without external services.

🏠 Local-First
Swarm mode runs everything on your machine. Pair with Ollama for a fully private, zero-API-key pipeline.

📈 Built-in Evals
Measure retrieval quality and generation faithfulness with LLM-as-judge evaluators. No separate eval framework needed.
Antfly vs. the Typical RAG Stack
Traditional stack: Pinecone (vectors) + Elasticsearch (keyword) + LangChain (chunking) + Cohere (reranking) + custom glue code.
Antfly stack: antfly swarm -- one command, all included.

Prerequisites

You'll need two things installed on your machine -- Antfly and Ollama. Both are single binaries with no external dependencies.

Tool    | Purpose                                       | Minimum Requirements
Antfly  | Database, indexing, search, chunking, RAG     | 2 CPU cores, 4GB RAM, 20GB disk
Ollama  | Local embedding models + LLM for generation   | 8GB RAM recommended for generation models
# Install Antfly (visit antfly.io/downloads for your platform)
# macOS example:
brew install antfly-io/tap/antfly

# Install Ollama
brew install ollama

# Pull the models we'll use
ollama pull all-minilm          # Embedding model (384 dimensions)
ollama pull gemma3:4b-it-qat    # Generation model for RAG answers
Why these models?
all-minilm is a small, fast embedding model that produces 384-dimensional vectors -- great for getting started. gemma3:4b-it-qat is a quantized 4B parameter model that runs well on consumer hardware. You can swap in larger models later for better quality.

1. Start Antfly Swarm

Swarm mode runs the entire Antfly stack -- metadata server, storage nodes, and the Termite model runner -- in a single process. One command, zero configuration.

antfly swarm 2>&1 | tee "antfly.log"

You'll see logs as the metadata server and storage nodes start up. The API is available at http://localhost:8080, which also serves a web dashboard for managing tables and running queries.

Metadata · Storage · Termite · API :8080

All four components run inside a single antfly swarm process

Using Antfly Cloud instead?
If you're using Antfly Cloud, skip this step entirely. Point your CLI or SDK at your cloud cluster endpoint instead of localhost:8080. Everything else in this tutorial works the same.
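
Either way, a quick sanity check from Python confirms the API is reachable before you move on. This assumes only that the endpoint answers a plain HTTP GET (it serves the web dashboard, as noted above):

# Connectivity check against the local Swarm API (or your cloud endpoint).
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8080", timeout=5) as resp:
        print(f"Antfly is up (HTTP {resp.status})")
except OSError as exc:
    print(f"Antfly not reachable: {exc}")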

2. Create a Table & Index

In Antfly, a table holds your documents and an index defines how they're searchable. For RAG, you'll typically create an aknn_v0 index that handles both vector embeddings and chunking.

# Create a table called "knowledge_base"
antfly cli table create --table knowledge_base \
  --index '{
    "name": "content_index",
    "type": "aknn_v0",
    "template": "{{title}} {{body}}",
    "embedder": {
      "provider": "ollama",
      "model": "all-minilm"
    },
    "chunker": {
      "provider": "antfly",
      "text": {
        "target_tokens": 200,
        "overlap_tokens": 25
      }
    }
  }'

Let's break down what each part does:

Field            | What It Does
type: "aknn_v0"  | Creates an approximate nearest-neighbor vector index using SPANN with RaBitQ quantization
template         | A Handlebars template that controls which fields get embedded -- here, the title and body are concatenated
embedder         | Configures the embedding provider. Ollama runs locally; you can also use OpenAI, Bedrock, Gemini, or Anthropic
chunker          | Termite splits documents into ~200-token chunks with 25-token overlap, preserving semantic coherence
How chunking works in Antfly
When you load a document, Termite's chunker automatically splits it into smaller segments before embedding. Each chunk gets its own vector, so retrieval can find the most relevant section of a document -- not just the document itself. The overlap_tokens parameter ensures context isn't lost at chunk boundaries.
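
To build intuition for the two knobs, here is a deliberately simplified sliding-window chunker. It splits on whitespace rather than real tokens and ignores semantics entirely -- an illustration of the target/overlap mechanics, not Termite's actual algorithm:

# Simplified windowed chunking with overlap (illustrative only).
def chunk(text: str, target_tokens: int = 200, overlap_tokens: int = 25) -> list[str]:
    words = text.split()
    step = target_tokens - overlap_tokens
    return [" ".join(words[i:i + target_tokens])
            for i in range(0, max(len(words) - overlap_tokens, 1), step)]

doc = "word " * 450
pieces = chunk(doc)
print(len(pieces))  # 3 -- adjacent chunks share 25 words of context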

3. Load Your Data

Antfly accepts JSON documents. Each document can have any structure -- Antfly is schema-optional. When a document is inserted, the chunker and embedder run asynchronously in the background.

# Download sample Wikipedia articles (11MB, 1,000 articles)
curl -L -o wiki-articles.json \
  http://fulmicoton.com/tantivy-files/wiki-articles-1000.json

# Load them into your table
antfly cli load --table knowledge_base \
  --file wiki-articles.json \
  --id-field title

The --id-field option tells Antfly to use each article's title as the document ID. Antfly will load the data in batches and start generating embeddings in the background. You can monitor progress:

# Check embedding progress
antfly cli index list --table knowledge_base

# Look for active_vectors to see how many chunks have been embedded
JSON Docs → Chunker → Embedder → Index

Ingestion pipeline: documents are chunked and embedded asynchronously after insert

Using your own data?
Antfly can also ingest content directly from URLs -- including web pages, PDFs, and S3 objects. Multimodal support means you can index images and let vision models generate summaries for embedding. See the docs for remote content ingestion.
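
If your own data lives somewhere less convenient -- say a CSV export -- a few lines of Python will convert it to the one-JSON-object-per-line layout used by the Wikipedia sample. The articles.csv file and its title/body columns below are hypothetical stand-ins for your data:

# Convert a CSV into JSON lines for `antfly cli load`.
# "articles.csv" and its column names are placeholders for your own data.
import csv, json

with open("articles.csv", newline="", encoding="utf-8") as src, \
     open("articles.json", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps({"title": row["title"], "body": row["body"]}) + "\n")

Then load it exactly as above: antfly cli load --table knowledge_base --file articles.json --id-field title.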

4. Search Your Knowledge Base

Antfly supports three search modes. For RAG, hybrid search -- combining keyword and semantic search -- typically gives the best retrieval quality.

Full-Text Search (BM25)

Keyword-based search using Bleve's BM25 ranking. Great for exact term matching.

antfly cli query --table knowledge_base \
  --full-text-search 'body:"Korea"' \
  --fields "title,url" \
  --limit 5

Semantic Search (Vector)

Finds documents with similar meaning, even when different words are used.

antfly cli query --table knowledge_base \
  --semantic-search "anatomy and physiology" \
  --indexes "content_index" \
  --fields "title,url" \
  --limit 5

Hybrid Search (Recommended for RAG)

Combines both approaches using Reciprocal Rank Fusion (RRF). This gets the precision of keyword matching and the recall of semantic search in a single query.

antfly cli query --table knowledge_base \
  --full-text-search 'body:Einstein' \
  --semantic-search "theory of relativity and physics" \
  --indexes "content_index" \
  --fields "title,url" \
  --limit 10 \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.5}'
Reranking
Uses a cross-encoder model to re-score results based on query-document relevance. Improves ordering without changing result count. Antfly's built-in reranker runs locally via Termite.
Pruning
Filters out low-quality results. min_score_ratio keeps only results scoring at least that fraction of the top hit's score; max_score_gap_percent cuts the list where relevance drops off sharply between consecutive results.
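
Reciprocal Rank Fusion itself is simple enough to show in full: each document's fused score is the sum of 1/(k + rank) across the ranked lists it appears in, where k is a smoothing constant (60 is the common default). A self-contained sketch:

# Reciprocal Rank Fusion over any number of ranked result lists.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["Einstein", "Relativity", "Physics"]
vector_hits = ["Relativity", "Spacetime", "Einstein"]
print(rrf([bm25_hits, vector_hits]))
# ['Relativity', 'Einstein', 'Spacetime', 'Physics'] -- documents found
# by both retrievers rise to the top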

5. Add RAG Generation

This is where it all comes together. Antfly's RAG endpoint retrieves relevant chunks, feeds them to an LLM as context, and streams back a grounded answer -- all in a single API call.

Streaming RAG Query

antfly cli agents retrieval --table knowledge_base \
  --semantic-search "What are the major events in Korean history?" \
  --indexes "content_index" \
  --fields "title,body" \
  --limit 5 \
  --reranker '{ "provider": "antfly", "field": "body" }' \
  --pruner '{"min_score_ratio": 0.6, "max_score_gap_percent": 40}' \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --system-prompt "You are a helpful assistant. Answer based on the provided context."

Here's what happens under the hood when you run this command:

Search → Rerank → Prune → Generate → Stream
1. Hybrid search finds relevant chunks using BM25 + vector similarity
2. Cross-encoder reranker re-scores results for better relevance
3. Pruner removes low-quality results (≥60% of top score, stops at 40% gap)
4. Surviving chunks are formatted and sent to Gemma 3 with your system prompt
5. Response streams back via Server-Sent Events as the LLM generates tokens
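
If you'd rather drive this from Python than the CLI, the sketch below shows how to consume a Server-Sent Events stream. The URL path and request body are hypothetical placeholders, not Antfly's documented API -- check the API reference for the real route; only the SSE framing (one data: line per event) is standard:

# Consume an SSE stream from the RAG endpoint.
# NOTE: the path and JSON body below are assumed for illustration;
# consult the Antfly API reference for the actual schema.
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/tables/knowledge_base/rag",  # hypothetical path
    data=json.dumps({
        "semantic_search": "What are the major events in Korean history?",
        "generator": {"provider": "ollama", "model": "gemma3:4b-it-qat"},
    }).encode(),
    headers={"Content-Type": "application/json", "Accept": "text/event-stream"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:                       # SSE: events arrive line by line
        line = raw.decode().strip()
        if line.startswith("data:"):
            print(line[5:].strip())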

Structured JSON Output

Add --streaming=false to get a structured JSON response with the generated answer and source references -- useful for building UIs that need to display citations.

antfly cli agents retrieval --table knowledge_base \
  --semantic-search "Explain the theory of relativity" \
  --indexes "content_index" \
  --fields "title,body,url" \
  --limit 5 \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --streaming=false
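
Rendering that JSON as an answer with citations takes only a few lines. The answer and sources field names below are assumptions about the response shape -- inspect the actual payload and adjust:

# Print a non-streaming RAG response with its citations.
# NOTE: "answer" and "sources" are assumed field names, not a documented schema.
import json

with open("response.json") as f:   # e.g. the CLI output redirected to a file
    data = json.load(f)

print(data["answer"])
for i, src in enumerate(data.get("sources", []), start=1):
    print(f"[{i}] {src.get('title')} -- {src.get('url')}")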
Using a cloud LLM?
Swap the generator config to use any supported provider. For example, "provider": "openai", "model": "gpt-4o" or "provider": "anthropic", "model": "claude-sonnet-4-5". Retrieval still runs locally -- only generation calls the external API.

6. Evaluate & Optimize

A RAG pipeline is only as good as its retrieval quality. Antfly includes built-in evaluation metrics so you can measure and improve performance without a separate eval framework.

# Pull a larger model to use as a judge
ollama pull gemma3:12b-it-qat

# Run RAG with evaluation
antfly cli agents retrieval --table knowledge_base \
  --semantic-search "What are the major events in Korean history?" \
  --indexes "content_index" \
  --fields "title,body" \
  --limit 5 \
  --generator '{ "provider": "ollama", "model": "gemma3:4b-it-qat" }' \
  --eval '{
    "evaluators": ["faithfulness", "relevance"],
    "judge": { "provider": "ollama", "model": "gemma3:12b-it-qat" }
  }'

Antfly supports two categories of evaluation metrics:

Category      | Metrics                                          | What They Measure
Retrieval     | recall, precision, ndcg, mrr, map                | How well retrieval finds the right documents (requires ground truth)
LLM-as-Judge  | faithfulness, relevance, completeness, coherence | Quality of the generated answer (uses a judge model)
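
The retrieval metrics are straightforward to compute yourself once you have a few labeled queries. For each query you need the ranked IDs your search returned and the set of IDs judged relevant -- then, for example, recall@k and MRR are:

# Recall@k and MRR (mean reciprocal rank) for one query.
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

ranked, relevant = ["doc7", "doc2", "doc9"], {"doc2", "doc4"}
print(recall_at_k(ranked, relevant, k=3))  # 0.5 -- found 1 of 2 relevant docs
print(mrr(ranked, relevant))               # 0.5 -- first relevant hit at rank 2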

Tuning Tips

Once you have eval scores, here are the most impactful levers to improve your pipeline:

Chunk size (target_tokens): Smaller chunks (100-150 tokens) -> more precise retrieval. Larger chunks (300-500) -> more context per result.

Embedding model (embedder.model): Upgrade to nomic-embed-text (768 dimensions, 8192-token context) or mxbai-embed-large for better quality.

Reranking (--reranker): Always use reranking for RAG. The cross-encoder sees the full query-document pair, catching relevance that embeddings miss.

Pruning thresholds (--pruner): Adjust min_score_ratio (0.4-0.7) and max_score_gap_percent (30-50) to balance context quality vs. coverage (see the sketch after this list).

Multi-index (--indexes): Create indexes with different chunking configs or fields. Search across multiple indexes for broader coverage.
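
To see how the two pruning thresholds interact, here is the filtering logic in miniature -- an illustration of the behaviour described above, not Antfly's implementation:

# Score-based pruning: enforce a floor relative to the top score, and
# stop at the first sharp drop-off between consecutive results.
def prune(scores: list[float], min_score_ratio: float = 0.5,
          max_score_gap_percent: float = 40.0) -> list[float]:
    if not scores:
        return []
    kept, top, prev = [], scores[0], scores[0]
    for cur in scores:
        if cur < top * min_score_ratio:                        # ratio floor
            break
        if (prev - cur) / prev * 100 > max_score_gap_percent:  # gap cut-off
            break
        kept.append(cur)
        prev = cur
    return kept

print(prune([0.92, 0.88, 0.45, 0.44]))  # [0.92, 0.88] -- 0.45 trips both checks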

Multi-Index Search

You can create multiple indexes on the same table with different configurations and search across all of them:

# Create a second index with smaller chunks for precise retrieval
antfly cli index create --table knowledge_base \
  --index body_precise \
  --type aknn_v0 \
  --field "body" \
  --embedder '{ "provider": "ollama", "model": "all-minilm" }' \
  --chunker '{ "provider": "antfly", "text": { "target_tokens": 100 } }'

# Search across both indexes
antfly cli query --table knowledge_base \
  --semantic-search "ancient civilizations" \
  --indexes "content_index,body_precise" \
  --fields "title,body" \
  --limit 20

Going to Production

Swarm mode is ideal for development and small deployments. When you're ready to scale, Antfly offers two paths:

Self-Hosted Production
Run separate metadata and storage nodes on your own infrastructure. Use the official antfly-operator for Kubernetes with autoscaling, rolling upgrades, and Prometheus monitoring.
Recommended: 4+ CPU / 8GB+ RAM (metadata), 8+ CPU / 16GB+ RAM / SSD (storage)
Antfly Cloud
Managed deployment with automatic scaling, zero-downtime upgrades, SSO/auth, and pre-built application templates (searchaf, agentaf, chataf). Same API -- just point your client at the cloud endpoint.
Includes: Launch service, pre-built solutions, team management
Same API, any scale
Everything you built in this tutorial -- tables, indexes, queries, RAG pipelines -- works identically whether you're running Swarm on your laptop, a self-hosted Kubernetes cluster, or Antfly Cloud. No code changes needed.

What You Built

In this tutorial, you set up a complete RAG pipeline that:

Runs entirely on your machine -- no cloud API keys, no external dependencies
Stores documents with automatic chunking and embedding via Termite
Searches with hybrid BM25 + vector similarity using RRF fusion
Reranks and prunes results for optimal context quality
Generates grounded answers with streaming LLM responses
Evaluates quality with built-in faithfulness and relevance metrics

Next Steps