Termite API#

Termite is an Ollama-like local inference server for ONNX-based ML models.

What is Termite#

Termite provides local ML inference with an Ollama-compatible API:

Embedding Generation: Text and multimodal (CLIP) embedding models
Text Chunking: Semantic chunking with ONNX models or fixed-size fallback
Reranking: Relevance re-scoring for search results
Named Entity Recognition: Extracts persons, organizations, and locations from text
Text Rewriting: Transform text using Seq2Seq models (question generation, query generation, etc.)

Download the latest release at https://antfly.io/docs/downloads

When to Use Termite#

Termite can run standalone or as part of an Antfly cluster. Use it when you need:

Local ONNX model inference without external API dependencies
Ollama-compatible /api/embed endpoint for embeddings
Semantic text chunking for RAG pipelines
Relevance reranking for improved search quality
Centralized model serving across distributed nodes
Privacy-preserving ML inference (data never leaves your infrastructure)

Features#

Embedding Generation#

Models: ONNX models auto-discovered from {models_dir}/embedders/
API: Ollama-compatible /api/embed endpoint
Response Formats: Binary (default), JSON

Multimodal Support (CLIP)#

Image Embeddings: CLIP models for joint text-image embedding space
Input Formats: Base64 data URIs (data:image/png;base64,...) or URLs
OpenAI-Compatible: Uses content parts format ({"type": "image_url", "image_url": {"url": "..."}})
Use Cases: Image search, cross-modal retrieval, visual similarity
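
As a minimal sketch of the input formats listed above, the following Python helpers wrap raw image bytes in a base64 data URI and assemble a content-parts request body. The helper names (image_part, text_part, clip_payload) are illustrative and not part of Termite; the model name is the one used in the multimodal example on this page, and the image bytes are a placeholder.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def text_part(text: str) -> dict:
    """Wrap a string as a text content part."""
    return {"type": "text", "text": text}


def clip_payload(model: str, parts: list[dict]) -> dict:
    """Assemble a request body for a multimodal embedding call."""
    return {"model": model, "input": parts}


# Placeholder bytes (only the 8-byte PNG signature), standing in for a real image file.
payload = clip_payload(
    "openai/clip-vit-base-patch32",
    [text_part("a photo of a cat"), image_part(b"\x89PNG\r\n\x1a\n")],
)
print(payload["input"][1]["image_url"]["url"])
```

In practice the bytes would come from reading an image file in binary mode, and the resulting dictionary would be sent as the JSON body of the embedding request.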

Text Chunking#

Models: Fixed-size chunking (always available) + ONNX models
Model Discovery: Auto-discovers models from {models_dir}/chunkers/
Caching: 2-minute TTL memory cache
Fallback: Falls back to fixed chunking if the model fails

Reranking#

Model Discovery: Auto-discovers ONNX models from {models_dir}/rerankers/
Quantization: Automatically uses quantized models if available
Input: Pre-rendered text prompts (client handles field extraction)

Generate embeddings#

POST /embed

Generates vector embeddings for input content using local ONNX models. This endpoint is compatible with Ollama's /api/embed format for text, and extends it with OpenAI-compatible multimodal support for CLIP models.

Models#

Models are auto-discovered from models_dir/embedders/ at startup. Use the /api/models endpoint to list available models.

Text-only models (e.g., BAAI/bge-small-en-v1.5): Accept text strings
Multimodal models (e.g., CLIP): Accept text and images via data URIs

Input Formats#

Three formats are supported:

Single text string: "hello world"
Array of text strings: ["hello", "world"] (Ollama-compatible)
Array of content parts: [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}] (OpenAI-compatible)

Caching#

Results are cached in memory for 2 minutes. Concurrent identical requests are deduplicated using singleflight to prevent redundant work.
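
To make the caching behaviour above concrete, here is a small Python sketch of a TTL cache that also collapses concurrent identical requests, in the spirit of singleflight. It is an illustration of the technique only, not Termite's implementation: the class name DedupCache, its get method, and the cache key format are all hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class DedupCache:
    """TTL cache that also collapses concurrent identical requests."""

    def __init__(self, ttl_seconds: float = 120.0) -> None:
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._values: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)
        self._inflight: Dict[str, threading.Event] = {}  # key -> completion signal

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        while True:
            with self._lock:
                entry = self._values.get(key)
                if entry is not None and entry[0] > time.monotonic():
                    return entry[1]  # fresh cache hit
                waiter = self._inflight.get(key)
                if waiter is None:
                    # Nobody is computing this key yet: this caller becomes the leader.
                    done = threading.Event()
                    self._inflight[key] = done
                    break
            # Another caller is already computing this key: wait, then re-check.
            waiter.wait()
        try:
            value = compute()
            with self._lock:
                self._values[key] = (time.monotonic() + self._ttl, value)
            return value
        finally:
            with self._lock:
                del self._inflight[key]
            done.set()


calls = []


def slow_embed() -> list:
    """Stand-in for an expensive model call."""
    calls.append(1)
    time.sleep(0.1)
    return [0.1, 0.2]


cache = DedupCache(ttl_seconds=120)
threads = [
    threading.Thread(target=cache.get, args=("bge:hello world", slow_embed))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # number of times the expensive call actually ran
```

Five threads ask for the same key, but only the first one runs the computation; the others either wait for it or read the cached value afterwards.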

Response Formats#

Select the response format with the Accept header:

application/octet-stream: Binary serialization (default, most efficient)
application/json: JSON response with model name and embeddings
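
Because binary is the default, a client that wants JSON should ask for it explicitly. The sketch below builds such a request with Python's standard library. The base URL is taken from the code examples on this page; the function names are illustrative, and the binary layout is not described here, so only the JSON path is shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8082"  # port taken from the code examples on this page


def build_embed_request(
    model: str, texts: list[str], accept: str = "application/json"
) -> urllib.request.Request:
    """Build (but do not send) an embedding request with an explicit Accept header."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )


def embed_json(model: str, texts: list[str]) -> dict:
    """Send the request and decode the JSON body (requires a running server)."""
    with urllib.request.urlopen(build_embed_request(model, texts)) as resp:
        return json.load(resp)


req = build_embed_request("BAAI/bge-small-en-v1.5", ["hello world"])
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Calling embed_json with a model name and a list of strings would then return the decoded response dictionary, assuming a Termite server is listening at BASE_URL.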

Examples#

Text embedding (Ollama-compatible):

{
  "model": "BAAI/bge-small-en-v1.5",
  "input": ["hello world", "machine learning"]
}

Multimodal embedding (OpenAI-compatible):

{
  "model": "openai/clip-vit-base-patch32",
  "input": [
    {"type": "text", "text": "a photo of a cat"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}}
  ]
}

Request Body#

Example:

{
  "model": "BAAI/bge-small-en-v1.5",
  "input": [
    "hello world",
    "machine learning"
  ],
  "truncate": true
}

Code Examples#

curl -X POST "http://localhost:8082/api/embed" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "BAAI/bge-small-en-v1.5",
  "input": [
    "hello world",
    "machine learning"
  ],
  "truncate": true
}'

import requests

response = requests.post(
    "http://localhost:8082/api/embed",
    json={
        "model": "BAAI/bge-small-en-v1.5",
        "input": [
            "hello world",
            "machine learning"
        ],
        "truncate": true
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/embed", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "model": "BAAI/bge-small-en-v1.5",
    "input": [
      "hello world",
      "machine learning"
    ],
    "truncate": true
  })
});

const data = await response.json();

Responses#

{
  "model": "BAAI/bge-small-en-v1.5",
  "embeddings": [
    [
      0.0123,
      -0.0456,
      0.0789
    ],
    [
      0.0234,
      -0.0567,
      0.089
    ]
  ]
}

{
  "error": "string"
}
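
Once the JSON success response has been decoded, the vectors can be compared directly. The following sketch computes cosine similarity between the two embeddings from the sample response; the three-element vectors are the truncated placeholders shown above, not real model output, and the function name is illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Sample values copied from the success response shown above.
response = {
    "model": "BAAI/bge-small-en-v1.5",
    "embeddings": [
        [0.0123, -0.0456, 0.0789],
        [0.0234, -0.0567, 0.089],
    ],
}
first, second = response["embeddings"]
print(round(cosine_similarity(first, second), 4))
```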

Chunk text into smaller segments#

POST /chunk

Splits text into smaller chunks using semantic or fixed-size chunking models.

Models#

Fixed Chunking (always available)#

Simple token-based splitting with overlap
Use model="fixed"
Fast and deterministic
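
The idea behind fixed chunking can be sketched in a few lines of Python. This is a simplified illustration, not Termite's code: it treats whitespace-separated words as tokens (the server presumably counts tokens differently) and ignores the separator option. The parameter names mirror the target_tokens and overlap_tokens fields of the request config.

```python
def fixed_chunks(text: str, target_tokens: int = 500, overlap_tokens: int = 50) -> list[str]:
    """Split text into windows of target_tokens that share overlap_tokens."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not 0 <= overlap_tokens < target_tokens:
        raise ValueError("overlap_tokens must be non-negative and smaller than target_tokens")
    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(10))
print(fixed_chunks(words, target_tokens=4, overlap_tokens=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each window starts target_tokens minus overlap_tokens after the previous one, so neighbouring chunks share exactly overlap_tokens tokens and the output is fully deterministic.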

ONNX Models#

Semantic chunking based on content similarity
Models auto-discovered from models_dir/chunkers/
Falls back to fixed chunking if the model fails

Caching#

Results are cached in memory for 2 minutes. Cache key includes both config and text content.

Example#

{
  "text": "This is a long document...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n"
  }
}

Request Body#

Example:

{
  "text": "This is a long document that needs to be split into smaller chunks...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n",
    "max_chunks": 50,
    "threshold": 0.5
  }
}

Code Examples#

curl -X POST "http://localhost:8082/api/chunk" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "This is a long document that needs to be split into smaller chunks...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n",
    "max_chunks": 50,
    "threshold": 0.5
  }
}'

import requests

response = requests.post(
    "http://localhost:8082/api/chunk",
    json={
        "text": "This is a long document that needs to be split into smaller chunks...",
        "config": {
            "model": "fixed",
            "target_tokens": 500,
            "overlap_tokens": 50,
            "separator": "\n\n",
            "max_chunks": 50,
            "threshold": 0.5
        }
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/chunk", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "text": "This is a long document that needs to be split into smaller chunks...",
    "config": {
      "model": "fixed",
      "target_tokens": 500,
      "overlap_tokens": 50,
      "separator": "\n\n",
      "max_chunks": 50,
      "threshold": 0.5
    }
  })
});

const data = await response.json();

Responses#

{
  "chunks": [
    {
      "id": 0,
      "text": "This is the first chunk...",
      "start_char": 0,
      "end_char": 100
    },
    {
      "id": 1,
      "text": "This is the second chunk...",
      "start_char": 90,
      "end_char": 190
    }
  ],
  "model": "fixed",
  "cache_hit": false
}

{
  "error": "string"
}
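
The start_char and end_char fields make it possible to relate chunks back to the source text. The sketch below measures how many characters consecutive chunks share, using the offsets from the sample response. It assumes end_char is exclusive, which this page does not state; the function name is illustrative.

```python
def chunk_overlaps(chunks: list[dict]) -> list[int]:
    """Characters shared between each chunk and the one that follows it."""
    ordered = sorted(chunks, key=lambda c: c["start_char"])
    return [
        max(0, prev["end_char"] - nxt["start_char"])  # assumes end_char is exclusive
        for prev, nxt in zip(ordered, ordered[1:])
    ]


# Offsets copied from the sample response above.
sample_chunks = [
    {"id": 0, "text": "This is the first chunk...", "start_char": 0, "end_char": 100},
    {"id": 1, "text": "This is the second chunk...", "start_char": 90, "end_char": 190},
]
print(chunk_overlaps(sample_chunks))  # [10]
```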

Rerank prompts by relevance#

POST /rerank

Re-scores pre-rendered text prompts based on relevance to a query using ONNX reranking models.

Client Responsibilities#

The client must:

Extract relevant fields from documents
Render any templates
Send pre-rendered text strings as prompts

This design keeps Termite stateless and allows clients to customize rendering logic.
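
A minimal sketch of these client-side steps in Python: pick the fields, render a template, and build the request body. The document fields (title, body), the template, and the function names are hypothetical; only the model, query, and prompts keys come from the request format on this page.

```python
def render_prompts(documents: list[dict], template: str) -> list[str]:
    """Render each document into a plain-text prompt using str.format fields."""
    return [template.format(**doc) for doc in documents]


def rerank_payload(model: str, query: str, documents: list[dict], template: str) -> dict:
    """Build a rerank request body from structured documents."""
    return {
        "model": model,
        "query": query,
        "prompts": render_prompts(documents, template),
    }


# Hypothetical documents; extra fields such as "id" are ignored by the template.
docs = [
    {"id": 1, "title": "Introduction to Machine Learning", "body": "This guide covers..."},
    {"id": 2, "title": "Deep Learning Fundamentals", "body": "Neural networks are..."},
]
payload = rerank_payload(
    "BAAI/bge-reranker-v2-m3",
    "machine learning applications",
    docs,
    "{title}: {body}",
)
print(payload["prompts"])
```

A document that lacks a field named in the template raises KeyError, which surfaces rendering problems before anything is sent to the server.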

Models#

Models are auto-discovered from models_dir/rerankers/
Supports quantized models (model_quantized.onnx)
Automatically prefers quantized variants if available

Example#

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to Machine Learning: This guide covers...",
    "Deep Learning Fundamentals: Neural networks are..."
  ]
}

For document-based reranking with field extraction, use the client-side lib/reranking package, which handles rendering before calling this endpoint.

Request Body#

Example:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to machine learning...",
    "Deep learning fundamentals..."
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/rerank" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to machine learning...",
    "Deep learning fundamentals..."
  ]
}'

import requests

response = requests.post(
    "http://localhost:8082/api/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "machine learning applications",
        "prompts": [
            "Introduction to machine learning...",
            "Deep learning fundamentals..."
        ]
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/rerank", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "machine learning applications",
    "prompts": [
      "Introduction to machine learning...",
      "Deep learning fundamentals..."
    ]
  })
});

const data = await response.json();

Responses#

{
  "model": "string",
  "scores": [
    0
  ]
}

{
  "error": "string"
}
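
After a successful call, the scores can be used to reorder the prompts. The sketch below assumes that scores[i] belongs to prompts[i] and that a higher score means more relevant; neither is stated explicitly on this page. The score values are made up for illustration.

```python
def rank_prompts(prompts: list[str], scores: list[float]) -> list[tuple[str, float]]:
    """Pair prompts with scores and sort from most to least relevant."""
    if len(prompts) != len(scores):
        raise ValueError("expected exactly one score per prompt")
    # Assumption: higher score means more relevant.
    return sorted(zip(prompts, scores), key=lambda pair: pair[1], reverse=True)


prompts = [
    "Introduction to machine learning...",
    "Deep learning fundamentals...",
]
hypothetical_scores = [0.35, 0.92]  # illustrative values, not real model output
for prompt, score in rank_prompts(prompts, hypothetical_scores):
    print(f"{score:.2f}  {prompt}")
```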

Generate text using LLM (OpenAI-compatible)#

POST /generate

Generates text using local LLM models (e.g., Gemma 3). Fully compatible with the OpenAI Chat Completions API.

Models#

Models are auto-discovered from models_dir/generators/ at startup. Use the /api/models endpoint to list available models.

Streaming#

Set stream: true to receive Server-Sent Events (SSE) with incremental token deltas. Each event contains a ChatCompletionChunk object. The stream ends with data: [DONE].
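
As a rough sketch of consuming such a stream, the Python generator below reads SSE lines, stops at data: [DONE], and yields the text deltas. It assumes each chunk follows the OpenAI ChatCompletionChunk layout (choices[].delta.content); the sample lines are hand-written for illustration rather than captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield incremental text from SSE 'data:' lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


# Hand-written sample stream (assumed shape, for illustration only).
sample_stream = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo!"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample_stream)))  # Hello!
```

With the requests library, the lines could come from a response opened with stream=True and read through iter_lines(decode_unicode=True).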

Input Format#

Uses the OpenAI-compatible chat format with a messages array:

{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "stream": false
}

Example (Non-streaming)#

curl -X POST http://localhost:8082/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 100
  }'

Example (Streaming)#

curl -X POST http://localhost:8082/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Request Body#

Example:

{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {
      "role": "system",
      "content": "string",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"San Francisco, CA\"}"
          }
        }
      ],
      "tool_call_id": "string"
    }
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0,
  "top_k": 0,
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": [
            "location"
          ]
        },
        "strict": true
      }
    }
  ],
  "tool_choice": "auto"
}

Code Examples#

curl -X POST "http://localhost:8082/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {
      "role": "system",
      "content": "string",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"San Francisco, CA\"}"
          }
        }
      ],
      "tool_call_id": "string"
    }
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0,
  "top_k": 0,
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": [
            "location"
          ]
        },
        "strict": true
      }
    }
  ],
  "tool_choice": "auto"
}'

import requests

response = requests.post(
    "http://localhost:8082/api/generate",
    json={
        "model": "google/gemma-3-1b-it",
        "messages": [
            {
                "role": "system",
                "content": "string",
                "tool_calls": [
                    {
                        "id": "call_abc123",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\": \"San Francisco, CA\"}"
                        }
                    }
                ],
                "tool_call_id": "string"
            }
        ],
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0,
        "top_k": 0,
        "stream": true,
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather in a location",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The city and state, e.g. San Francisco, CA"
                            }
                        },
                        "required": [
                            "location"
                        ]
                    },
                    "strict": true
                }
            }
        ],
        "tool_choice": "auto"
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "model": "google/gemma-3-1b-it",
    "messages": [
      {
        "role": "system",
        "content": "string",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\"}"
            }
          }
        ],
        "tool_call_id": "string"
      }
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0,
    "top_k": 0,
    "stream": true,
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather in a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
              }
            },
            "required": [
              "location"
            ]
          },
          "strict": true
        }
      }
    ],
    "tool_choice": "auto"
  })
});

const data = await response.json();

Responses#

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704123456,
  "model": "string",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "system",
        "content": "string",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\"}"
            }
          }
        ]
      },
      "finish_reason": "stop",
      "logprobs": {}
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

{
  "error": "string"
}
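
When the model requests a tool call, the arguments field arrives as a JSON-encoded string and has to be decoded a second time. The sketch below pulls tool calls out of a completion shaped like the success response above; the function name is illustrative and the sample is a trimmed copy of that response.

```python
import json


def parse_tool_calls(completion: dict) -> list[tuple[str, dict]]:
    """Return (function name, decoded arguments) for every tool call."""
    calls = []
    for choice in completion.get("choices", []):
        message = choice.get("message", {})
        for call in message.get("tool_calls") or []:
            function = call["function"]
            # "arguments" is a JSON string, not a nested object.
            calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Trimmed copy of the success response shown above.
completion = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "call_abc123",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\": \"San Francisco, CA\"}",
                        },
                    }
                ],
            },
            "finish_reason": "stop",
        }
    ],
}
print(parse_tool_calls(completion))
```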

Recognize named entities#

POST /recognize

Recognizes named entities (persons, organizations, locations, etc.) from text using ONNX recognition models.

Entity Types#

Standard CoNLL entity types:

PER: Person names (e.g., "John Smith")
ORG: Organizations (e.g., "Google", "Apple Inc.")
LOC: Locations (e.g., "New York", "France")
MISC: Miscellaneous entities

Models#

Models are auto-discovered from models_dir/recognizers/
Supports quantized variants (model_i8.onnx)
Compatible with HuggingFace BERT-based recognition models
GLiNER models support custom entity labels via the labels parameter

Example#

{
  "model": "dslim/bert-base-NER",
  "texts": ["John Smith works at Google.", "Apple Inc. is in Cupertino."]
}

Request Body#

Example:

{
  "model": "dslim/bert-base-NER",
  "texts": [
    "John Smith works at Google.",
    "Apple Inc. is in Cupertino."
  ],
  "labels": [
    "person",
    "company",
    "product",
    "date"
  ],
  "relation_labels": [
    "founded",
    "works_at",
    "located_in"
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/recognize" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "dslim/bert-base-NER",
  "texts": [
    "John Smith works at Google.",
    "Apple Inc. is in Cupertino."
  ],
  "labels": [
    "person",
    "company",
    "product",
    "date"
  ],
  "relation_labels": [
    "founded",
    "works_at",
    "located_in"
  ]
}'

import requests

response = requests.post(
    "http://localhost:8082/api/recognize",
    json={
        "model": "dslim/bert-base-NER",
        "texts": [
            "John Smith works at Google.",
            "Apple Inc. is in Cupertino."
        ],
        "labels": [
            "person",
            "company",
            "product",
            "date"
        ],
        "relation_labels": [
            "founded",
            "works_at",
            "located_in"
        ]
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/recognize", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "model": "dslim/bert-base-NER",
    "texts": [
      "John Smith works at Google.",
      "Apple Inc. is in Cupertino."
    ],
    "labels": [
      "person",
      "company",
      "product",
      "date"
    ],
    "relation_labels": [
      "founded",
      "works_at",
      "located_in"
    ]
  })
});

const data = await response.json();

Responses#

{
  "model": "string",
  "entities": [
    [
      {
        "text": "John Smith",
        "label": "PER",
        "start": 0,
        "end": 10,
        "score": 0.99
      }
    ]
  ],
  "relations": [
    [
      {
        "head": {
          "text": "John Smith",
          "label": "PER",
          "start": 0,
          "end": 10,
          "score": 0.99
        },
        "tail": {
          "text": "John Smith",
          "label": "PER",
          "start": 0,
          "end": 10,
          "score": 0.99
        },
        "label": "founded",
        "score": 0.95
      }
    ]
  ]
}

{
  "error": "string"
}
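
A short sketch of post-processing the entities array: it groups entity strings by label and checks each span against the source text. It assumes that entities[i] corresponds to texts[i] and that start and end are character offsets with an exclusive end, which matches the "John Smith" sample (0 to 10) but is not spelled out on this page.

```python
from collections import defaultdict


def group_entities(texts: list[str], entities: list[list[dict]]) -> dict:
    """Group entity strings by label, verifying offsets against the source text."""
    grouped = defaultdict(list)
    for text, found in zip(texts, entities):
        for ent in found:
            span = text[ent["start"]:ent["end"]]  # assumes an exclusive end offset
            if span != ent["text"]:
                raise ValueError(f"offset mismatch: {span!r} != {ent['text']!r}")
            grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


texts = ["John Smith works at Google."]
response = {
    "model": "dslim/bert-base-NER",
    "entities": [
        [{"text": "John Smith", "label": "PER", "start": 0, "end": 10, "score": 0.99}]
    ],
}
print(group_entities(texts, response["entities"]))  # {'PER': ['John Smith']}
```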

Rewrite text using Seq2Seq models#

POST /rewrite

Rewrites or transforms text using Seq2Seq models (T5, FLAN-T5, BART, etc.).

Models#

Models are auto-discovered from models_dir/rewriters/
Seq2Seq models have encoder.onnx, decoder-init.onnx, and decoder.onnx files
Compatible with LMQG question generation models

Use Cases#

Question Generation: Generate questions from answer-context pairs
Query Generation: Generate search queries from documents
Paraphrasing: Rewrite text in different words
Translation: Translate text between languages

Example#

For question generation with LMQG models:

{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": ["generate question: <hl> Beyonce <hl> Beyonce starred as Etta James in Cadillac Records."]
}

Request Body#

Example:

{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": [
    "Translate to German: Hello, how are you?"
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/rewrite" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": [
    "Translate to German: Hello, how are you?"
  ]
}'

import requests

response = requests.post(
    "http://localhost:8082/api/rewrite",
    json={
        "model": "lmqg/flan-t5-small-squad-qg",
        "inputs": [
            "Translate to German: Hello, how are you?"
        ]
    }
)

data = response.json()

const response = await fetch("http://localhost:8082/api/rewrite", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    "model": "lmqg/flan-t5-small-squad-qg",
    "inputs": [
      "Translate to German: Hello, how are you?"
    ]
  })
});

const data = await response.json();

Responses#

{
  "model": "string",
  "texts": [
    [
      "string"
    ]
  ]
}

{
  "error": "string"
}

List available models#

GET /models

Returns lists of available embedding, chunking, reranking, generator, NER, and rewriter models.

Embedders#

ONNX models from models_dir/embedders/
Quantized variants carry an i8 suffix (for example, BAAI/bge-small-en-v1.5:i8)

Chunkers#

Always includes "fixed" (built-in)
Plus any ONNX models from models_dir/chunkers/

Rerankers#

ONNX models from models_dir/rerankers/
Empty if no models configured

Generators#

LLM models from models_dir/generators/
Empty if no models configured

Recognizers#

ONNX models from models_dir/recognizers/
Includes GLiNER models for zero-shot recognition

Rewriters#

Seq2Seq models from models_dir/rewriters/
T5, FLAN-T5, BART, and LMQG question generation models

Models are discovered at service startup and cached.

Code Examples#

curl -X GET "http://localhost:8082/api/models"

import requests

response = requests.get("http://localhost:8082/api/models")

data = response.json()

const response = await fetch("http://localhost:8082/api/models", {
  method: "GET"
});

const data = await response.json();

Responses#

{
  "chunkers": [
    "fixed",
    "mirth/chonky-mmbert-small-multilingual-1"
  ],
  "rerankers": [
    "BAAI/bge-reranker-v2-m3"
  ],
  "embedders": [
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-small-en-v1.5:i8"
  ],
  "generators": [
    "google/gemma-3-1b-it",
    "onnxruntime/Gemma-3-ONNX"
  ],
  "recognizers": [
    "dslim/bert-base-NER",
    "dslim/bert-large-NER",
    "onnx-community/gliner_small-v2.1"
  ],
  "extractors": [
    "onnx-community/gliner_small-v2.1",
    "onnx-community/gliner-multitask"
  ],
  "rewriters": [
    "lmqg/flan-t5-small-squad-qg",
    "lmqg/flan-t5-base-squad-qg"
  ],
  "recognizer_info": {
    "dslim/bert-base-NER": {
      "capabilities": [
        "labels"
      ]
    },
    "onnx-community/gliner_small-v2.1": {
      "capabilities": [
        "labels",
        "zeroshot"
      ]
    },
    "onnx-community/gliner-multitask": {
      "capabilities": [
        "labels",
        "zeroshot",
        "relations",
        "answers"
      ]
    }
  }
}

{
  "error": "string"
}
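
The recognizer_info map in the success response can be used to pick a model by capability before calling the recognition endpoint. A minimal sketch, with an illustrative function name and a sample trimmed from the response above:

```python
def recognizers_with(models: dict, capability: str) -> list[str]:
    """Names of recognizer models that advertise the given capability."""
    info = models.get("recognizer_info", {})
    return sorted(
        name
        for name, meta in info.items()
        if capability in meta.get("capabilities", [])
    )


# Trimmed copy of the success response shown above.
models = {
    "recognizer_info": {
        "dslim/bert-base-NER": {"capabilities": ["labels"]},
        "onnx-community/gliner_small-v2.1": {"capabilities": ["labels", "zeroshot"]},
        "onnx-community/gliner-multitask": {
            "capabilities": ["labels", "zeroshot", "relations", "answers"]
        },
    }
}
print(recognizers_with(models, "relations"))  # ['onnx-community/gliner-multitask']
```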

Get version information#

GET /version

Returns Termite version, git commit, build time, and Go runtime version.

Code Examples#

curl -X GET "http://localhost:8082/api/version"

import requests

response = requests.get("http://localhost:8082/api/version")

data = response.json()

const response = await fetch("http://localhost:8082/api/version", {
  method: "GET"
});

const data = await response.json();

Responses#

{
  "version": "v1.0.0",
  "git_commit": "abc1234",
  "build_time": "2024-01-15T10:30:00Z",
  "go_version": "go1.25.0"
}

{
  "error": "string"
}