Termite Kubernetes Operator#

The Termite Operator provides Kubernetes-native management for Termite inference pools with autoscaling, GPU/TPU support, and intelligent traffic routing.

Overview#

The Termite Operator automates the deployment and management of Termite inference servers on Kubernetes:

  • Autoscaling - Scale inference pools based on request load
  • GPU/TPU Support - Schedule workloads on accelerated hardware
  • Traffic Routing - Route requests to optimal model instances
  • Health Monitoring - Automatic health checks and recovery
  • Model Versioning - Manage multiple model versions

Installation#

Prerequisites#

  • Kubernetes 1.24+
  • kubectl configured with cluster access
  • Helm 3.0+ (optional, for Helm installation)

Install with kubectl#

kubectl apply -f https://github.com/antflydb/termite/releases/latest/download/termite-operator.yaml

Install with Helm#

helm repo add termite https://charts.termite.dev
helm install termite-operator termite/termite-operator

Quick Start#

Create an Inference Pool#

apiVersion: termite.antfly.io/v1
kind: InferencePool
metadata:
  name: embeddings
  namespace: default
spec:
  model: BAAI/bge-small-en-v1.5
  replicas: 2
  resources:
    limits:
      memory: 4Gi
      cpu: 2

Save the manifest as inference-pool.yaml and apply it:

kubectl apply -f inference-pool.yaml

Check Status#

kubectl get inferencepool embeddings

Configuration#

InferencePool Spec#

Field | Type | Description
----- | ---- | -----------
model | string | Model name to serve (e.g., BAAI/bge-small-en-v1.5)
replicas | int | Number of replicas
resources | ResourceRequirements | CPU/memory/GPU limits
autoscaling | AutoscalingSpec | Autoscaling configuration
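The spec fields above can be combined in a single manifest. As a sketch drawn only from the fields documented in this table — an autoscaled pool with explicit resource limits:

```yaml
apiVersion: termite.antfly.io/v1
kind: InferencePool
metadata:
  name: embeddings-autoscaled
  namespace: default
spec:
  model: BAAI/bge-small-en-v1.5
  replicas: 2                  # initial replica count; autoscaling adjusts it
  resources:
    limits:
      memory: 4Gi
      cpu: 2
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetQueueDepth: 5
```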

Autoscaling#

Enable autoscaling based on request queue depth. The pool scales between minReplicas and maxReplicas, adding replicas when the average number of queued requests per replica exceeds targetQueueDepth:

spec:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetQueueDepth: 5

GPU Support#

To schedule pods on GPU nodes, request a GPU in the resource limits and use a nodeSelector to target nodes with the appropriate accelerator:

spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-t4
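On clusters where GPU nodes carry a taint (common with managed GPU node pools), the pod also needs a matching toleration. A sketch, assuming the conventional nvidia.com/gpu taint key and assuming the InferencePool spec passes tolerations through to the pod spec (not confirmed by the spec table above):

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-t4
  tolerations:
    - key: nvidia.com/gpu    # conventional taint key for NVIDIA device-plugin setups
      operator: Exists
      effect: NoSchedule
```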

Architecture#

The Termite Operator consists of:

  1. Controller - Watches InferencePool resources and reconciles state
  2. Proxy - Routes requests to available model instances
  3. Metrics Server - Collects metrics for autoscaling decisions

Troubleshooting#

Common Issues#

Pods stuck in Pending

  • Check node resources with kubectl describe node
  • Verify GPU drivers are installed if using GPU resources

Model not loading

  • Check pod logs: kubectl logs -l app=termite
  • Verify that the model name in spec.model is spelled correctly

High latency

  • Enable autoscaling to handle load spikes
  • Consider using quantized models for faster inference

Next Steps#