Termite API#

Termite is an Ollama-like local inference server for ONNX-based ML models.

What is Termite#

Termite provides local ML inference with an Ollama-compatible API:

  • Embedding Generation: Text and multimodal (CLIP) embedding models
  • Text Chunking: Semantic chunking with ONNX models or fixed-size fallback
  • Reranking: Relevance re-scoring for search results
  • Named Entity Recognition: Extract persons, organizations, locations from text
  • Text Rewriting: Transform text using Seq2Seq models (question generation, query generation, etc.)

Download the latest release at https://antfly.io/docs/downloads

When to Use Termite#

Termite can run standalone or as part of an Antfly cluster. Use it when you need:

  • Local ONNX model inference without external API dependencies
  • Ollama-compatible /api/embed endpoint for embeddings
  • Semantic text chunking for RAG pipelines
  • Relevance reranking for improved search quality
  • Centralized model serving across distributed nodes
  • Privacy-preserving ML inference (data never leaves your infrastructure)

Features#

Embedding Generation#

  • Models: ONNX models auto-discovered from {models_dir}/embedders/
  • API: Ollama-compatible /api/embed endpoint
  • Response Formats: Binary (default), JSON

Multimodal Support (CLIP)#

  • Image Embeddings: CLIP models for joint text-image embedding space
  • Input Formats: Base64 data URIs (data:image/png;base64,...) or URLs
  • OpenAI-Compatible: Uses content parts format ({"type": "image_url", "image_url": {"url": "..."}})
  • Use Cases: Image search, cross-modal retrieval, visual similarity

Text Chunking#

  • Models: Fixed-size chunking (always available) + ONNX models
  • Model Discovery: Auto-discovers models from {models_dir}/chunkers/
  • Caching: 2-minute TTL memory cache
  • Fallback: Falls back to fixed chunking if model fails

Reranking#

  • Model Discovery: Auto-discovers ONNX models from {models_dir}/rerankers/
  • Quantization: Automatically uses quantized models if available
  • Input: Pre-rendered text prompts (client handles field extraction)

Generate embeddings#

POST /embed

Generates vector embeddings for input content using local ONNX models. This endpoint is compatible with Ollama's /api/embed format for text, and extends it with OpenAI-compatible multimodal support for CLIP models.

Models#

Models are auto-discovered from models_dir/embedders/ at startup. Use the /api/models endpoint to list available models.

  • Text-only models (e.g., BAAI/bge-small-en-v1.5): Accept text strings
  • Multimodal models (e.g., CLIP): Accept text and images via data URIs

Input Formats#

Three formats are supported:

  • Single text string: "hello world"
  • Array of text strings: ["hello", "world"] (Ollama-compatible)
  • Array of content parts: [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}] (OpenAI-compatible)
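As a rough illustration of the content parts format, the sketch below builds a multimodal request body from raw image bytes. The helper names (image_data_uri, multimodal_embed_payload) are hypothetical and not part of Termite, and the placeholder bytes are only the PNG file signature, not a decodable image.

```python
import base64
import json


def image_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def multimodal_embed_payload(model: str, text: str, image_bytes: bytes) -> dict:
    """Build a request body with one text part and one image part."""
    return {
        "model": model,
        "input": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_data_uri(image_bytes)}},
        ],
    }


# Placeholder input: the 8-byte PNG signature only (hypothetical stand-in for a real file).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
payload = multimodal_embed_payload(
    "openai/clip-vit-base-patch32", "a photo of a cat", PNG_SIGNATURE
)
print(json.dumps(payload, indent=2))
```

In practice the bytes would come from reading an image file in binary mode; the resulting dictionary can be serialized with json.dumps and sent as the POST body.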

Caching#

Results are cached in memory for 2 minutes. Concurrent identical requests are deduplicated using singleflight to prevent redundant work.
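To make the caching behaviour concrete, here is a minimal sketch of a TTL cache combined with singleflight-style deduplication. It is written in Python purely for illustration and does not show Termite's actual implementation: the class name SingleFlightCache, its get method, and the injectable clock are all hypothetical. Only the 2-minute TTL and the idea of collapsing concurrent identical requests come from the description above.

```python
import threading
import time


class SingleFlightCache:
    """TTL cache where concurrent callers for the same key share one computation."""

    def __init__(self, ttl_seconds=120.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._values = {}    # key -> (expires_at, value)
        self._inflight = {}  # key -> Event set when the leader finishes

    def get(self, key, compute):
        while True:
            with self._lock:
                entry = self._values.get(key)
                if entry is not None and entry[0] > self._clock():
                    return entry[1]  # fresh cache hit
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._inflight[key] = event
            if not leader:
                # Another caller is already computing this key: wait, then re-check.
                event.wait()
                continue
            try:
                value = compute()
                with self._lock:
                    self._values[key] = (self._clock() + self._ttl, value)
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()


cache = SingleFlightCache(ttl_seconds=120.0)
vector = cache.get(("BAAI/bge-small-en-v1.5", "hello world"), lambda: [0.0123, -0.0456])
```

If the leading computation raises, waiting callers loop back and one of them retries, so a single failure is not cached.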

Response Formats#

The response format is selected with the Accept header:

  • application/octet-stream: Binary serialization (default, most efficient)
  • application/json: JSON response with model name and embeddings
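Because binary is the default, a client that wants JSON has to ask for it. The following sketch builds such a request with Python's standard library; the function name build_embed_request is hypothetical, and the base URL simply mirrors the localhost:8082 address used in the curl examples on this page.

```python
import json
import urllib.request


def build_embed_request(model, inputs, base_url="http://localhost:8082"):
    """Build a POST request for /api/embed that asks for a JSON response."""
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",  # without this, the binary default applies
        },
    )


request = build_embed_request(
    "BAAI/bge-small-en-v1.5", ["hello world", "machine learning"]
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     embeddings = json.load(response)["embeddings"]
```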

Examples#

Text embedding (Ollama-compatible):

{
  "model": "BAAI/bge-small-en-v1.5",
  "input": ["hello world", "machine learning"]
}

Multimodal embedding (OpenAI-compatible):

{
  "model": "openai/clip-vit-base-patch32",
  "input": [
    {"type": "text", "text": "a photo of a cat"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}}
  ]
}

Request Body#

Example:

{
  "model": "BAAI/bge-small-en-v1.5",
  "input": [
    "hello world",
    "machine learning"
  ],
  "truncate": true
}

Code Examples#

curl -X POST "http://localhost:8082/api/embed" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
  "model": "BAAI/bge-small-en-v1.5",
  "input": [
    "hello world",
    "machine learning"
  ],
  "truncate": true
}'

Responses#

{
  "model": "BAAI/bge-small-en-v1.5",
  "embeddings": [
    [
      0.0123,
      -0.0456,
      0.0789
    ],
    [
      0.0234,
      -0.0567,
      0.089
    ]
  ]
}
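A common next step is comparing the returned vectors. The sketch below computes cosine similarity between the two embeddings from the JSON response above; the helper name cosine_similarity is our own, and the three-element vectors are the truncated illustrative values from this page rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


response = {
    "model": "BAAI/bge-small-en-v1.5",
    "embeddings": [[0.0123, -0.0456, 0.0789], [0.0234, -0.0567, 0.089]],
}
first, second = response["embeddings"]
similarity = cosine_similarity(first, second)
print(f"similarity: {similarity:.4f}")
```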

Chunk text into smaller segments#

POST /chunk

Splits text into smaller chunks using semantic or fixed-size chunking models.

Models#

Fixed Chunking (always available)#

  • Simple token-based splitting with overlap
  • Use model="fixed"
  • Fast and deterministic
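The idea behind fixed-size splitting with overlap can be sketched in a few lines. This is an illustration only: the function name fixed_chunks is hypothetical, whitespace-separated words stand in for tokens, and Termite's own tokenizer and handling of the separator option are not reproduced here.

```python
def fixed_chunks(tokens, target_tokens, overlap_tokens):
    """Split tokens into windows of target_tokens that share overlap_tokens."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not 0 <= overlap_tokens < target_tokens:
        raise ValueError("overlap_tokens must be in [0, target_tokens)")
    step = target_tokens - overlap_tokens
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target_tokens])
        if start + target_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks


# Whitespace splitting is an assumption made for this example.
tokens = "a b c d e f g h i j".split()
for chunk in fixed_chunks(tokens, target_tokens=4, overlap_tokens=1):
    print(" ".join(chunk))
```

Each window starts target_tokens minus overlap_tokens after the previous one, which is why the same input and settings always produce the same chunks.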

ONNX Models#

  • Semantic chunking based on content similarity
  • Models auto-discovered from models_dir/chunkers/
  • Falls back to fixed chunking if model fails

Caching#

Results are cached in memory for 2 minutes. Cache key includes both config and text content.
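One way to derive a cache key that covers both config and text is to hash a canonical serialization of the two. The sketch below is a guess at the general approach, not Termite's actual key format; the function name chunk_cache_key and the choice of SHA-256 are assumptions.

```python
import hashlib
import json


def chunk_cache_key(config: dict, text: str) -> str:
    """Hash the chunking config and the text into one hex digest."""
    # Sorting keys makes the key independent of the order fields were written in.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(canonical.encode("utf-8"))
    digest.update(b"\x00")  # separator so config and text cannot run together
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


key = chunk_cache_key(
    {"model": "fixed", "target_tokens": 500, "overlap_tokens": 50},
    "This is a long document...",
)
```

Under this scheme, changing either the text or any config field yields a different key, so a cached result is only reused for an identical request.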

Example#

{
  "text": "This is a long document...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n"
  }
}

Request Body#

Example:

{
  "text": "This is a long document that needs to be split into smaller chunks...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n",
    "max_chunks": 50,
    "threshold": 0.5
  }
}

Code Examples#

curl -X POST "http://localhost:8082/api/chunk" \
  -H "Content-Type: application/json" \
  -d '{
  "text": "This is a long document that needs to be split into smaller chunks...",
  "config": {
    "model": "fixed",
    "target_tokens": 500,
    "overlap_tokens": 50,
    "separator": "\n\n",
    "max_chunks": 50,
    "threshold": 0.5
  }
}'

Responses#

{
  "chunks": [
    {
      "id": 0,
      "text": "This is the first chunk...",
      "start_char": 0,
      "end_char": 100
    },
    {
      "id": 1,
      "text": "This is the second chunk...",
      "start_char": 90,
      "end_char": 190
    }
  ],
  "model": "fixed",
  "cache_hit": false
}
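The character offsets in the response make it possible to see how much neighbouring chunks overlap. The sketch below assumes that start_char is inclusive and end_char is exclusive, which the example is consistent with but does not state outright; the helper name chunk_overlaps is hypothetical.

```python
def chunk_overlaps(chunks):
    """Number of characters shared by each pair of consecutive chunks."""
    ordered = sorted(chunks, key=lambda chunk: chunk["start_char"])
    return [
        max(0, previous["end_char"] - following["start_char"])
        for previous, following in zip(ordered, ordered[1:])
    ]


response = {
    "chunks": [
        {"id": 0, "text": "This is the first chunk...", "start_char": 0, "end_char": 100},
        {"id": 1, "text": "This is the second chunk...", "start_char": 90, "end_char": 190},
    ],
    "model": "fixed",
    "cache_hit": False,
}
overlaps = chunk_overlaps(response["chunks"])
print(overlaps)
```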

Rerank prompts by relevance#

POST /rerank

Re-scores pre-rendered text prompts based on relevance to a query using ONNX reranking models.

Client Responsibilities#

The client must:

  1. Extract relevant fields from documents
  2. Render any templates
  3. Send pre-rendered text strings as prompts

This design keeps Termite stateless and allows clients to customize rendering logic.
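The three client-side steps can be sketched as follows. The document fields (title, body), the template string, and the function names are hypothetical choices made for this example; this is not the lib/reranking package, only an illustration of what rendering before the call might look like.

```python
import json


def render_prompts(documents, template="{title}: {body}"):
    """Steps 1 and 2: pick fields from each document and render the template."""
    return [template.format(**document) for document in documents]


def rerank_payload(model, query, documents):
    """Step 3: send only pre-rendered strings as prompts."""
    return {"model": model, "query": query, "prompts": render_prompts(documents)}


documents = [
    {"id": 1, "title": "Introduction to Machine Learning", "body": "This guide covers..."},
    {"id": 2, "title": "Deep Learning Fundamentals", "body": "Neural networks are..."},
]
payload = rerank_payload(
    "BAAI/bge-reranker-v2-m3", "machine learning applications", documents
)
print(json.dumps(payload, indent=2))
```

Fields that the template does not mention, such as id here, never reach the server.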

Models#

  • Models are auto-discovered from models_dir/rerankers/
  • Supports quantized models (model_quantized.onnx)
  • Automatically prefers quantized variants if available

Example#

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to Machine Learning: This guide covers...",
    "Deep Learning Fundamentals: Neural networks are..."
  ]
}

For document-based reranking with field extraction, use the client-side lib/reranking package which handles rendering before calling this endpoint.

Request Body#

Example:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to machine learning...",
    "Deep learning fundamentals..."
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/rerank" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "machine learning applications",
  "prompts": [
    "Introduction to machine learning...",
    "Deep learning fundamentals..."
  ]
}'

Responses#

{
  "model": "string",
  "scores": [
    0
  ]
}
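Turning the scores back into a ranking is left to the client. The sketch below assumes that scores are returned in the same order as the submitted prompts and that a higher score means more relevant; neither is spelled out in the schema above, and the score values are invented for the example.

```python
def rank_prompts(prompts, scores):
    """Pair each prompt with its score and sort from most to least relevant."""
    if len(prompts) != len(scores):
        raise ValueError("expected one score per prompt")
    return sorted(zip(prompts, scores), key=lambda pair: pair[1], reverse=True)


prompts = [
    "Introduction to machine learning...",
    "Deep learning fundamentals...",
]
scores = [0.42, 0.87]  # illustrative values, not real model output
for prompt, score in rank_prompts(prompts, scores):
    print(f"{score:.2f}  {prompt}")
```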

Generate text using LLM (OpenAI-compatible)#

POST /generate

Generates text using local LLM models (e.g., Gemma 3). The endpoint is fully compatible with the OpenAI Chat Completions API.

Models#

Models are auto-discovered from models_dir/generators/ at startup. Use the /api/models endpoint to list available models.

Streaming#

Set stream: true to receive Server-Sent Events (SSE) with incremental token deltas. Each event contains a ChatCompletionChunk object. The stream ends with data: [DONE].
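A client consuming the stream has to pick the data: lines out of the SSE framing and stop at [DONE]. The sketch below assumes that each chunk carries its text under choices[].delta.content, as in OpenAI's ChatCompletionChunk; the function name collect_stream_text and the sample lines are made up for illustration.

```python
import json


def collect_stream_text(lines):
    """Concatenate the content deltas from SSE lines until data: [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hand-written sample events, not captured server output.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample))
```

The same function works on a live response if its body is read line by line and decoded to text first.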

Input Format#

Uses OpenAI-compatible chat format with messages array:

{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "stream": false
}

Example (Non-streaming)#

curl -X POST http://localhost:8082/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 100
  }'

Example (Streaming)#

curl -X POST http://localhost:8082/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-3-1b-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Request Body#

Example:

{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {
      "role": "system",
      "content": "string",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"San Francisco, CA\"}"
          }
        }
      ],
      "tool_call_id": "string"
    }
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0,
  "top_k": 0,
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": [
            "location"
          ]
        },
        "strict": true
      }
    }
  ],
  "tool_choice": "auto"
}

Code Examples#

curl -X POST "http://localhost:8082/api/generate" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "google/gemma-3-1b-it",
  "messages": [
    {
      "role": "system",
      "content": "string",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"San Francisco, CA\"}"
          }
        }
      ],
      "tool_call_id": "string"
    }
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "top_p": 0,
  "top_k": 0,
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": [
            "location"
          ]
        },
        "strict": true
      }
    }
  ],
  "tool_choice": "auto"
}'

Responses#

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704123456,
  "model": "string",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "string",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\"}"
            }
          }
        ]
      },
      "finish_reason": "stop",
      "logprobs": {}
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}
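When the model requests a tool call, the arguments field arrives as a JSON-encoded string and has to be decoded separately. The sketch below walks a response shaped like the one above; the function name extract_tool_calls is hypothetical, and dispatching to a real get_weather implementation is left out.

```python
import json


def extract_tool_calls(completion):
    """Return (call id, function name, decoded arguments) for each tool call."""
    calls = []
    for choice in completion.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            function = call["function"]
            # arguments is a JSON string, not a nested object
            calls.append((call["id"], function["name"], json.loads(function["arguments"])))
    return calls


completion = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "content": "string",
                "tool_calls": [
                    {
                        "id": "call_abc123",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": '{"location": "San Francisco, CA"}',
                        },
                    }
                ],
            },
            "finish_reason": "stop",
        }
    ],
}
for call_id, name, arguments in extract_tool_calls(completion):
    print(call_id, name, arguments["location"])
```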

Recognize named entities#

POST /recognize

Recognizes named entities (persons, organizations, locations, etc.) from text using ONNX recognition models.

Entity Types#

Standard CoNLL entity types:

  • PER: Person names (e.g., "John Smith")
  • ORG: Organizations (e.g., "Google", "Apple Inc.")
  • LOC: Locations (e.g., "New York", "France")
  • MISC: Miscellaneous entities

Models#

  • Models are auto-discovered from models_dir/recognizers/
  • Supports quantized variants (model_i8.onnx)
  • Compatible with HuggingFace BERT-based recognition models
  • GLiNER models support custom entity labels via the labels parameter

Example#

{
  "model": "dslim/bert-base-NER",
  "texts": ["John Smith works at Google.", "Apple Inc. is in Cupertino."]
}

Request Body#

Example:

{
  "model": "dslim/bert-base-NER",
  "texts": [
    "John Smith works at Google.",
    "Apple Inc. is in Cupertino."
  ],
  "labels": [
    "person",
    "company",
    "product",
    "date"
  ],
  "relation_labels": [
    "founded",
    "works_at",
    "located_in"
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/recognize" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "dslim/bert-base-NER",
  "texts": [
    "John Smith works at Google.",
    "Apple Inc. is in Cupertino."
  ],
  "labels": [
    "person",
    "company",
    "product",
    "date"
  ],
  "relation_labels": [
    "founded",
    "works_at",
    "located_in"
  ]
}'

Responses#

{
  "model": "string",
  "entities": [
    [
      {
        "text": "John Smith",
        "label": "PER",
        "start": 0,
        "end": 10,
        "score": 0.99
      }
    ]
  ],
  "relations": [
    [
      {
        "head": {
          "text": "John Smith",
          "label": "PER",
          "start": 0,
          "end": 10,
          "score": 0.99
        },
        "tail": {
          "text": "John Smith",
          "label": "PER",
          "start": 0,
          "end": 10,
          "score": 0.99
        },
        "label": "founded",
        "score": 0.95
      }
    ]
  ]
}
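The entities array holds one list per input text, so results can be checked against the original strings. The sketch below assumes that start and end are character offsets into the corresponding text with an exclusive end, which matches the John Smith example (0 to 10) but is not stated explicitly; both helper names are hypothetical.

```python
from collections import defaultdict


def mismatched_spans(texts, entities):
    """Return (text index, entity) pairs whose offsets do not match their text."""
    problems = []
    for index, (text, found) in enumerate(zip(texts, entities)):
        for entity in found:
            if text[entity["start"]:entity["end"]] != entity["text"]:
                problems.append((index, entity))
    return problems


def group_by_label(entities):
    """Collect entity strings per label across all input texts."""
    grouped = defaultdict(list)
    for found in entities:
        for entity in found:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


texts = ["John Smith works at Google."]
entities = [
    [{"text": "John Smith", "label": "PER", "start": 0, "end": 10, "score": 0.99}]
]
print(mismatched_spans(texts, entities))
print(group_by_label(entities))
```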

Rewrite text using Seq2Seq models#

POST /rewrite

Rewrites or transforms text using Seq2Seq models (T5, FLAN-T5, BART, etc.).

Models#

  • Models are auto-discovered from models_dir/rewriters/
  • Seq2Seq models have encoder.onnx, decoder-init.onnx, and decoder.onnx files
  • Compatible with LMQG question generation models

Use Cases#

  • Question Generation: Generate questions from answer-context pairs
  • Query Generation: Generate search queries from documents
  • Paraphrasing: Rewrite text in different words
  • Translation: Translate text between languages

Example#

For question generation with LMQG models:

{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": ["generate question: <hl> Beyonce <hl> Beyonce starred as Etta James in Cadillac Records."]
}

Request Body#

Example:

{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": [
    "generate question: <hl> Beyonce <hl> Beyonce starred as Etta James in Cadillac Records."
  ]
}

Code Examples#

curl -X POST "http://localhost:8082/api/rewrite" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "lmqg/flan-t5-small-squad-qg",
  "inputs": [
    "generate question: <hl> Beyonce <hl> Beyonce starred as Etta James in Cadillac Records."
  ]
}'

Responses#

{
  "model": "string",
  "texts": [
    [
      "string"
    ]
  ]
}

List available models#

GET /models

Returns lists of available embedding, chunking, reranking, generation, recognition (NER), and rewriting models, along with per-recognizer capability information.

Embedders#

  • ONNX models from models_dir/embedders/
  • Quantized variants carry an i8 suffix (listed as BAAI/bge-small-en-v1.5:i8 in the example response)

Chunkers#

  • Always includes "fixed" (built-in)
  • Plus any ONNX models from models_dir/chunkers/

Rerankers#

  • ONNX models from models_dir/rerankers/
  • Empty if no models configured

Generators#

  • LLM models from models_dir/generators/
  • Empty if no models configured

Recognizers#

  • ONNX models from models_dir/recognizers/
  • Includes GLiNER models for zero-shot recognition

Rewriters#

  • Seq2Seq models from models_dir/rewriters/
  • T5, FLAN-T5, BART, and LMQG question generation models

Models are discovered at service startup and cached.

Code Examples#

curl -X GET "http://localhost:8082/api/models"

Responses#

{
  "chunkers": [
    "fixed",
    "mirth/chonky-mmbert-small-multilingual-1"
  ],
  "rerankers": [
    "BAAI/bge-reranker-v2-m3"
  ],
  "embedders": [
    "BAAI/bge-small-en-v1.5",
    "BAAI/bge-small-en-v1.5:i8"
  ],
  "generators": [
    "google/gemma-3-1b-it",
    "onnxruntime/Gemma-3-ONNX"
  ],
  "recognizers": [
    "dslim/bert-base-NER",
    "dslim/bert-large-NER",
    "onnx-community/gliner_small-v2.1"
  ],
  "extractors": [
    "onnx-community/gliner_small-v2.1",
    "onnx-community/gliner-multitask"
  ],
  "rewriters": [
    "lmqg/flan-t5-small-squad-qg",
    "lmqg/flan-t5-base-squad-qg"
  ],
  "recognizer_info": {
    "dslim/bert-base-NER": {
      "capabilities": [
        "labels"
      ]
    },
    "onnx-community/gliner_small-v2.1": {
      "capabilities": [
        "labels",
        "zeroshot"
      ]
    },
    "onnx-community/gliner-multitask": {
      "capabilities": [
        "labels",
        "zeroshot",
        "relations",
        "answers"
      ]
    }
  }
}
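The recognizer_info map can be used to choose a model by capability before calling the recognize endpoint. The sketch below filters the example response above; the function name models_with_capability is hypothetical, and the dictionary is a trimmed copy of that response.

```python
def models_with_capability(models_response, capability):
    """Names of recognizers whose capabilities list includes the given value."""
    info = models_response.get("recognizer_info", {})
    return sorted(
        name
        for name, details in info.items()
        if capability in details.get("capabilities", [])
    )


models_response = {
    "recognizer_info": {
        "dslim/bert-base-NER": {"capabilities": ["labels"]},
        "onnx-community/gliner_small-v2.1": {"capabilities": ["labels", "zeroshot"]},
        "onnx-community/gliner-multitask": {
            "capabilities": ["labels", "zeroshot", "relations", "answers"]
        },
    }
}
print(models_with_capability(models_response, "zeroshot"))
print(models_with_capability(models_response, "relations"))
```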

Get version information#

GET /version

Returns Termite version, git commit, build time, and Go runtime version.

Code Examples#

curl -X GET "http://localhost:8082/api/version"

Responses#

{
  "version": "v1.0.0",
  "git_commit": "abc1234",
  "build_time": "2024-01-15T10:30:00Z",
  "go_version": "go1.25.0"
}