Termite API
Termite is an Ollama-like local inference server for ONNX-based ML models.
What is Termite
Termite provides local ML inference with an Ollama-compatible API:
- Embedding Generation: Text and multimodal (CLIP) embedding models
- Text Chunking: Semantic chunking with ONNX models or fixed-size fallback
- Reranking: Relevance re-scoring for search results
- Named Entity Recognition: Extract persons, organizations, locations from text
- Text Rewriting: Transform text using Seq2Seq models (question generation, query generation, etc.)
Download the latest release at https://antfly.io/docs/downloads
When to Use Termite
Termite can run standalone or as part of an Antfly cluster:
- Local ONNX model inference without external API dependencies
- Ollama-compatible /api/embed endpoint for embeddings
- Semantic text chunking for RAG pipelines
- Relevance reranking for improved search quality
- Centralized model serving across distributed nodes
- Privacy-preserving ML inference (data never leaves your infrastructure)
Features
Embedding Generation
- Models: ONNX models auto-discovered from {models_dir}/embedders/
- API: Ollama-compatible /api/embed endpoint
- Response Formats: Binary (default), JSON
Multimodal Support (CLIP)
- Image Embeddings: CLIP models for joint text-image embedding space
- Input Formats: Base64 data URIs (data:image/png;base64,...) or URLs
- OpenAI-Compatible: Uses content parts format ({"type": "image_url", "image_url": {"url": "..."}})
- Use Cases: Image search, cross-modal retrieval, visual similarity
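For the image search and cross-modal retrieval use cases, a client typically compares a text embedding against image embeddings from the same CLIP model. The sketch below assumes cosine similarity as the comparison metric, which this page does not specify; the function names are hypothetical and the vectors are placeholders for values returned by the embed endpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(text_vec, image_vecs):
    """Sort (name, similarity) pairs from most to least similar to the text vector.

    image_vecs is a hypothetical mapping of image name -> embedding vector.
    """
    scored = [(name, cosine_similarity(text_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Both vectors must come from the same model so that they live in the same embedding space.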
Text Chunking
- Models: Fixed-size chunking (always available) + ONNX models
- Model Discovery: Auto-discovers models from {models_dir}/chunkers/
- Caching: 2-minute TTL memory cache
- Fallback: Falls back to fixed chunking if model fails
Reranking
- Model Discovery: Auto-discovers ONNX models from {models_dir}/rerankers/
- Quantization: Automatically uses quantized models if available
- Input: Pre-rendered text prompts (client handles field extraction)
Generate embeddings
/embed
Generates vector embeddings for input content using local ONNX models.
This endpoint is compatible with Ollama's /api/embed format for text,
and extends it with OpenAI-compatible multimodal support for CLIP models.
Models
Models are auto-discovered from models_dir/embedders/ at startup.
Use the /api/models endpoint to list available models.
- Text-only models (e.g., BAAI/bge-small-en-v1.5): Accept text strings
- Multimodal models (e.g., CLIP): Accept text and images via data URIs
Input Formats
Three formats are supported:
- Single text string: "hello world"
- Array of text strings: ["hello", "world"] (Ollama-compatible)
- Array of content parts: [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}] (OpenAI-compatible)
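The content parts format can be assembled on the client from raw image bytes. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the PNG MIME type is an assumption taken from the data URI shown above.

```python
import base64
import json


def text_part(text: str) -> dict:
    """Build a text content part."""
    return {"type": "text", "text": text}


def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part with a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def embed_request_body(model: str, parts: list) -> str:
    """Serialize an embed request body with the given model and input parts."""
    return json.dumps({"model": model, "input": parts})
```

The resulting string can be sent as the JSON body of a POST to the embed endpoint.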
Caching
Results are cached in memory for 2 minutes. Concurrent identical requests are deduplicated using singleflight to prevent redundant work.
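The combination of a TTL cache and request deduplication can be sketched as follows. This illustrates the general pattern only and is not Termite's implementation; the class name, the injectable clock, and the use of concurrent.futures.Future to share one in-flight result are all assumptions made for the example.

```python
import threading
import time
from concurrent.futures import Future


class TTLSingleFlightCache:
    """Caches results for ttl_seconds and lets concurrent callers share one computation."""

    def __init__(self, ttl_seconds=120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._values = {}    # key -> (expires_at, value)
        self._inflight = {}  # key -> Future for a computation in progress

    def get(self, key, compute):
        with self._lock:
            entry = self._values.get(key)
            if entry is not None and entry[0] > self._clock():
                return entry[1]  # fresh cache hit
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._inflight[key] = future
        if not leader:
            return future.result()  # wait for the leader's result
        try:
            value = compute()
        except BaseException as exc:
            with self._lock:
                del self._inflight[key]
            future.set_exception(exc)  # followers see the same failure
            raise
        with self._lock:
            self._values[key] = (self._clock() + self._ttl, value)
            del self._inflight[key]
        future.set_result(value)
        return value
```

Only the first caller for a key runs compute; callers that arrive while it is running block on the shared future instead of repeating the work.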
Response Formats
Supports multiple content types via Accept header:
- application/octet-stream: Binary serialization (default, most efficient)
- application/json: JSON response with model name and embeddings
Examples
Text embedding (Ollama-compatible):
{
"model": "BAAI/bge-small-en-v1.5",
"input": ["hello world", "machine learning"]
}
Multimodal embedding (OpenAI-compatible):
{
"model": "openai/clip-vit-base-patch32",
"input": [
{"type": "text", "text": "a photo of a cat"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}}
]
}
Request Body
Example:
{
"model": "BAAI/bge-small-en-v1.5",
"input": [
"hello world",
"machine learning"
],
"truncate": true
}
Code Examples
curl -X POST "http://localhost:8082/api/embed" \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-small-en-v1.5",
"input": [
"hello world",
"machine learning"
],
"truncate": true
}'
import requests
response = requests.post(
"http://localhost:8082/api/embed",
json={
"model": "BAAI/bge-small-en-v1.5",
"input": [
"hello world",
"machine learning"
],
"truncate": True
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/embed", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"model": "BAAI/bge-small-en-v1.5",
"input": [
"hello world",
"machine learning"
],
"truncate": true
})
});
const data = await response.json();
Responses
{
"model": "BAAI/bge-small-en-v1.5",
"embeddings": [
[
0.0123,
-0.0456,
0.0789
],
[
0.0234,
-0.0567,
0.089
]
]
}
{
"error": "string"
}
Chunk text into smaller segments
/chunk
Splits text into smaller chunks using semantic or fixed-size chunking models.
Models
Fixed Chunking (always available)
- Simple token-based splitting with overlap
- Use model="fixed"
- Fast and deterministic
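Token-based splitting with overlap can be sketched as a sliding window. This sketch treats whitespace-separated words as tokens, which is an assumption; the tokenizer Termite uses for the fixed model is not described here. The parameter names mirror the target_tokens and overlap_tokens config fields.

```python
def fixed_chunks(text, target_tokens=500, overlap_tokens=50):
    """Split text into windows of target_tokens words, each sharing overlap_tokens with the previous one."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not 0 <= overlap_tokens < target_tokens:
        raise ValueError("overlap_tokens must be in [0, target_tokens)")
    tokens = text.split()  # assumption: whitespace words stand in for real tokens
    step = target_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + target_tokens]
        chunks.append({"id": len(chunks), "text": " ".join(window)})
        if start + target_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks
```

Because each window advances by target_tokens minus overlap_tokens, the same input and settings always produce the same chunks.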
ONNX Models
- Semantic chunking based on content similarity
- Models auto-discovered from models_dir/chunkers/
- Falls back to fixed chunking if model fails
Caching
Results are cached in memory for 2 minutes. Cache key includes both config and text content.
Example
{
"text": "This is a long document...",
"config": {
"model": "fixed",
"target_tokens": 500,
"overlap_tokens": 50,
"separator": "\n\n"
}
}
Request Body
Example:
{
"text": "This is a long document that needs to be split into smaller chunks...",
"config": {
"model": "fixed",
"target_tokens": 500,
"overlap_tokens": 50,
"separator": "\n\n",
"max_chunks": 50,
"threshold": 0.5
}
}
Code Examples
curl -X POST "http://localhost:8082/api/chunk" \
-H "Content-Type: application/json" \
-d '{
"text": "This is a long document that needs to be split into smaller chunks...",
"config": {
"model": "fixed",
"target_tokens": 500,
"overlap_tokens": 50,
"separator": "\n\n",
"max_chunks": 50,
"threshold": 0.5
}
}'
import requests
response = requests.post(
"http://localhost:8082/api/chunk",
json={
"text": "This is a long document that needs to be split into smaller chunks...",
"config": {
"model": "fixed",
"target_tokens": 500,
"overlap_tokens": 50,
"separator": "\n\n",
"max_chunks": 50,
"threshold": 0.5
}
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/chunk", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"text": "This is a long document that needs to be split into smaller chunks...",
"config": {
"model": "fixed",
"target_tokens": 500,
"overlap_tokens": 50,
"separator": "\n\n",
"max_chunks": 50,
"threshold": 0.5
}
})
});
const data = await response.json();
Responses
{
"chunks": [
{
"id": 0,
"text": "This is the first chunk...",
"start_char": 0,
"end_char": 100
},
{
"id": 1,
"text": "This is the second chunk...",
"start_char": 90,
"end_char": 190
}
],
"model": "fixed",
"cache_hit": false
}
{
"error": "string"
}
Rerank prompts by relevance
/rerank
Re-scores pre-rendered text prompts based on relevance to a query using ONNX reranking models.
Client Responsibilities
The client must:
- Extract relevant fields from documents
- Render any templates
- Send pre-rendered text strings as prompts
This design keeps Termite stateless and allows clients to customize rendering logic.
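The client-side steps can be sketched as two small helpers: one that renders selected document fields into a prompt string, and one that orders documents by the returned scores. The "field: value" template and the function names are hypothetical, and the sketch assumes scores come back in the same order as the submitted prompts.

```python
def render_prompt(doc: dict, fields: list) -> str:
    """Render the chosen, non-empty fields of a document as 'name: value' lines."""
    parts = [f"{name}: {doc[name]}" for name in fields if doc.get(name)]
    return "\n".join(parts)


def rank_by_score(docs: list, scores: list) -> list:
    """Pair each document with its score and sort from most to least relevant."""
    if len(docs) != len(scores):
        raise ValueError("expected one score per document")
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order]
```

A client would call render_prompt for each document, send the resulting strings as prompts, then pass the response scores to rank_by_score.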
Models
- Models are auto-discovered from models_dir/rerankers/
- Supports quantized models (model_quantized.onnx)
- Automatically prefers quantized variants if available
Example
{
"model": "BAAI/bge-reranker-v2-m3",
"query": "machine learning applications",
"prompts": [
"Introduction to Machine Learning: This guide covers...",
"Deep Learning Fundamentals: Neural networks are..."
]
}
For document-based reranking with field extraction, use the client-side lib/reranking package, which handles rendering before calling this endpoint.
Request Body
Example:
{
"model": "BAAI/bge-reranker-v2-m3",
"query": "machine learning applications",
"prompts": [
"Introduction to machine learning...",
"Deep learning fundamentals..."
]
}
Code Examples
curl -X POST "http://localhost:8082/api/rerank" \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"query": "machine learning applications",
"prompts": [
"Introduction to machine learning...",
"Deep learning fundamentals..."
]
}'
import requests
response = requests.post(
"http://localhost:8082/api/rerank",
json={
"model": "BAAI/bge-reranker-v2-m3",
"query": "machine learning applications",
"prompts": [
"Introduction to machine learning...",
"Deep learning fundamentals..."
]
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/rerank", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"model": "BAAI/bge-reranker-v2-m3",
"query": "machine learning applications",
"prompts": [
"Introduction to machine learning...",
"Deep learning fundamentals..."
]
})
});
const data = await response.json();
Responses
{
"model": "string",
"scores": [
0
]
}
{
"error": "string"
}
Generate text using LLM (OpenAI-compatible)
/generate
Generates text using local LLM models (e.g., Gemma 3). Fully compatible with the OpenAI Chat Completions API.
Models
Models are auto-discovered from models_dir/generators/ at startup.
Use the /api/models endpoint to list available models.
Streaming
Set stream: true to receive Server-Sent Events (SSE) with incremental
token deltas. Each event contains a ChatCompletionChunk object.
The stream ends with data: [DONE].
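A client can reassemble the streamed text by reading each data: line until the [DONE] marker. The sketch below assumes each chunk carries its text under choices[].delta.content, as in OpenAI's ChatCompletionChunk; the function name is hypothetical and the input is any iterable of already-decoded lines.

```python
import json


def collect_stream_text(lines):
    """Concatenate the content deltas from SSE lines, stopping at 'data: [DONE]'."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)
```

With the requests library, the lines could come from requests.post(url, json=body, stream=True).iter_lines(decode_unicode=True).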
Input Format
Uses OpenAI-compatible chat format with messages array:
{
"model": "google/gemma-3-1b-it",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"max_tokens": 256,
"stream": false
}
Example (Non-streaming)
curl -X POST http://localhost:8082/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"messages": [{"role": "user", "content": "What is machine learning?"}],
"max_tokens": 100
}'
Example (Streaming)
curl -X POST http://localhost:8082/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Request Body
Example:
{
"model": "google/gemma-3-1b-it",
"messages": [
{
"role": "system",
"content": "string",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"tool_call_id": "string"
}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0,
"top_k": 0,
"stream": true,
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
}
},
"required": [
"location"
]
},
"strict": true
}
}
],
"tool_choice": "auto"
}
Code Examples
curl -X POST "http://localhost:8082/api/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-3-1b-it",
"messages": [
{
"role": "system",
"content": "string",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"tool_call_id": "string"
}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0,
"top_k": 0,
"stream": true,
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
}
},
"required": [
"location"
]
},
"strict": true
}
}
],
"tool_choice": "auto"
}'
import requests
response = requests.post(
"http://localhost:8082/api/generate",
json={
"model": "google/gemma-3-1b-it",
"messages": [
{
"role": "system",
"content": "string",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"tool_call_id": "string"
}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0,
"top_k": 0,
"stream": False,  # response.json() below expects a single JSON body, not an SSE stream
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
}
},
"required": [
"location"
]
},
"strict": True
}
}
],
"tool_choice": "auto"
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"model": "google/gemma-3-1b-it",
"messages": [
{
"role": "system",
"content": "string",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"tool_call_id": "string"
}
],
"max_tokens": 256,
"temperature": 0.7,
"top_p": 0,
"top_k": 0,
"stream": false, // response.json() below expects a single JSON body, not an SSE stream
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
}
},
"required": [
"location"
]
},
"strict": true
}
}
],
"tool_choice": "auto"
})
});
const data = await response.json();
Responses
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704123456,
"model": "string",
"choices": [
{
"index": 0,
"message": {
"role": "system",
"content": "string",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
]
},
"finish_reason": "stop",
"logprobs": {}
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 0,
"total_tokens": 0
}
}
{
"error": "string"
}
Recognize named entities
/recognize
Recognizes named entities (persons, organizations, locations, etc.) from text using ONNX recognition models.
Entity Types
Standard CoNLL entity types:
- PER: Person names (e.g., "John Smith")
- ORG: Organizations (e.g., "Google", "Apple Inc.")
- LOC: Locations (e.g., "New York", "France")
- MISC: Miscellaneous entities
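Once entities are returned, a client will often group them by label. The sketch below assumes the response shape shown in the Responses example for this endpoint (one list of entities per input text, each with label, start, end, and score), and assumes start and end are character offsets with an exclusive end, which matches the "John Smith" sample (0 to 10). The min_score filter is a hypothetical client-side choice.

```python
from collections import defaultdict


def group_entities(texts, entities, min_score=0.5):
    """Group entity spans by label, slicing each span out of its source text."""
    grouped = defaultdict(list)
    for text, found in zip(texts, entities):
        for ent in found:
            if ent["score"] < min_score:
                continue  # drop low-confidence entities
            grouped[ent["label"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)
```

Slicing by offsets rather than trusting the returned text field also acts as a quick check that the offsets line up with the original input.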
Models
- Models are auto-discovered from models_dir/recognizers/
- Supports quantized variants (model_i8.onnx)
- Compatible with HuggingFace BERT-based recognition models
- GLiNER models support custom entity labels via the labels parameter
Example
{
"model": "dslim/bert-base-NER",
"texts": ["John Smith works at Google.", "Apple Inc. is in Cupertino."]
}
Request Body
Example:
{
"model": "dslim/bert-base-NER",
"texts": [
"John Smith works at Google.",
"Apple Inc. is in Cupertino."
],
"labels": [
"person",
"company",
"product",
"date"
],
"relation_labels": [
"founded",
"works_at",
"located_in"
]
}
Code Examples
curl -X POST "http://localhost:8082/api/recognize" \
-H "Content-Type: application/json" \
-d '{
"model": "dslim/bert-base-NER",
"texts": [
"John Smith works at Google.",
"Apple Inc. is in Cupertino."
],
"labels": [
"person",
"company",
"product",
"date"
],
"relation_labels": [
"founded",
"works_at",
"located_in"
]
}'
import requests
response = requests.post(
"http://localhost:8082/api/recognize",
json={
"model": "dslim/bert-base-NER",
"texts": [
"John Smith works at Google.",
"Apple Inc. is in Cupertino."
],
"labels": [
"person",
"company",
"product",
"date"
],
"relation_labels": [
"founded",
"works_at",
"located_in"
]
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/recognize", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"model": "dslim/bert-base-NER",
"texts": [
"John Smith works at Google.",
"Apple Inc. is in Cupertino."
],
"labels": [
"person",
"company",
"product",
"date"
],
"relation_labels": [
"founded",
"works_at",
"located_in"
]
})
});
const data = await response.json();
Responses
{
"model": "string",
"entities": [
[
{
"text": "John Smith",
"label": "PER",
"start": 0,
"end": 10,
"score": 0.99
}
]
],
"relations": [
[
{
"head": {
"text": "John Smith",
"label": "PER",
"start": 0,
"end": 10,
"score": 0.99
},
"tail": {
"text": "John Smith",
"label": "PER",
"start": 0,
"end": 10,
"score": 0.99
},
"label": "founded",
"score": 0.95
}
]
]
}
{
"error": "string"
}
Rewrite text using Seq2Seq models
/rewrite
Rewrites or transforms text using Seq2Seq models (T5, FLAN-T5, BART, etc.).
Models
- Models are auto-discovered from models_dir/rewriters/
- Seq2Seq models have encoder.onnx, decoder-init.onnx, and decoder.onnx files
- Compatible with LMQG question generation models
Use Cases
- Question Generation: Generate questions from answer-context pairs
- Query Generation: Generate search queries from documents
- Paraphrasing: Rewrite text in different words
- Translation: Translate text between languages
Example
For question generation with LMQG models:
{
"model": "lmqg/flan-t5-small-squad-qg",
"inputs": ["generate question: <hl> Beyonce <hl> Beyonce starred as Etta James in Cadillac Records."]
}
Request Body
Example:
{
"model": "lmqg/flan-t5-small-squad-qg",
"inputs": [
"Translate to German: Hello, how are you?"
]
}
Code Examples
curl -X POST "http://localhost:8082/api/rewrite" \
-H "Content-Type: application/json" \
-d '{
"model": "lmqg/flan-t5-small-squad-qg",
"inputs": [
"Translate to German: Hello, how are you?"
]
}'
import requests
response = requests.post(
"http://localhost:8082/api/rewrite",
json={
"model": "lmqg/flan-t5-small-squad-qg",
"inputs": [
"Translate to German: Hello, how are you?"
]
}
)
data = response.json()
const response = await fetch("http://localhost:8082/api/rewrite", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
"model": "lmqg/flan-t5-small-squad-qg",
"inputs": [
"Translate to German: Hello, how are you?"
]
})
});
const data = await response.json();
Responses
{
"model": "string",
"texts": [
[
"string"
]
]
}
{
"error": "string"
}
List available models
/models
Returns lists of available embedding, chunking, reranking, generator, NER, and rewriter models.
Embedders
- ONNX models from models_dir/embedders/
- Quantized variants have -i8 suffix
Chunkers
- Always includes "fixed" (built-in)
- Plus any ONNX models from models_dir/chunkers/
Rerankers
- ONNX models from models_dir/rerankers/
- Empty if no models configured
Generators
- LLM models from models_dir/generators/
- Empty if no models configured
Recognizers
- ONNX models from models_dir/recognizers/
- Includes GLiNER models for zero-shot recognition
Rewriters
- Seq2Seq models from models_dir/rewriters/
- T5, FLAN-T5, BART, and LMQG question generation models
Models are discovered at service startup and cached.
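A client can use the models listing to choose a recognizer that supports a given capability. This sketch assumes the recognizer_info structure shown in the Responses example for this endpoint (model name mapped to a capabilities list); the function name is hypothetical.

```python
def recognizers_with(models: dict, capability: str) -> list:
    """Return the names of recognizers that advertise the given capability, sorted."""
    info = models.get("recognizer_info", {})
    return sorted(
        name for name, meta in info.items()
        if capability in meta.get("capabilities", [])
    )
```

For example, filtering on "zeroshot" would select the GLiNER models, which accept custom labels.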
Code Examples
curl -X GET "http://localhost:8082/api/models"
import requests
response = requests.get("http://localhost:8082/api/models")
data = response.json()
const response = await fetch("http://localhost:8082/api/models", {
method: "GET"
});
const data = await response.json();
Responses
{
"chunkers": [
"fixed",
"mirth/chonky-mmbert-small-multilingual-1"
],
"rerankers": [
"BAAI/bge-reranker-v2-m3"
],
"embedders": [
"BAAI/bge-small-en-v1.5",
"BAAI/bge-small-en-v1.5:i8"
],
"generators": [
"google/gemma-3-1b-it",
"onnxruntime/Gemma-3-ONNX"
],
"recognizers": [
"dslim/bert-base-NER",
"dslim/bert-large-NER",
"onnx-community/gliner_small-v2.1"
],
"extractors": [
"onnx-community/gliner_small-v2.1",
"onnx-community/gliner-multitask"
],
"rewriters": [
"lmqg/flan-t5-small-squad-qg",
"lmqg/flan-t5-base-squad-qg"
],
"recognizer_info": {
"dslim/bert-base-NER": {
"capabilities": [
"labels"
]
},
"onnx-community/gliner_small-v2.1": {
"capabilities": [
"labels",
"zeroshot"
]
},
"onnx-community/gliner-multitask": {
"capabilities": [
"labels",
"zeroshot",
"relations",
"answers"
]
}
}
}
{
"error": "string"
}
Get version information
/version
Returns Termite version, git commit, build time, and Go runtime version.
Code Examples
curl -X GET "http://localhost:8082/api/version"
import requests
response = requests.get("http://localhost:8082/api/version")
data = response.json()
const response = await fetch("http://localhost:8082/api/version", {
method: "GET"
});
const data = await response.json();
Responses
{
"version": "v1.0.0",
"git_commit": "abc1234",
"build_time": "2024-01-15T10:30:00Z",
"go_version": "go1.25.0"
}
{
"error": "string"
}