Termite Kubernetes Operator#

The Termite Operator provides Kubernetes-native management for Termite inference pools with autoscaling, GPU/TPU support, and intelligent traffic routing.

Overview#

The Termite Operator automates the deployment and management of Termite inference servers on Kubernetes:

  • Autoscaling - Scale inference pools based on request load
  • GPU/TPU Support - Schedule workloads on accelerated hardware
  • Traffic Routing - Route requests to optimal model instances
  • Health Monitoring - Automatic health checks and recovery
  • Model Versioning - Manage multiple model versions

Installation#

Prerequisites#

  • Kubernetes 1.24+
  • kubectl configured with cluster access
  • Helm 3.0+ (optional, for Helm installation)

Install with kubectl#

kubectl apply -f https://github.com/antflydb/termite/releases/latest/download/termite-operator.yaml

Install with Helm#

helm repo add termite https://charts.termite.dev
helm install termite-operator termite/termite-operator

Quick Start#

Create an Inference Pool#

apiVersion: termite.antfly.io/v1
kind: InferencePool
metadata:
  name: embeddings
  namespace: default
spec:
  model: BAAI/bge-small-en-v1.5
  replicas: 2
  resources:
    limits:
      memory: 4Gi
      cpu: 2

Save the manifest as inference-pool.yaml and apply it:

kubectl apply -f inference-pool.yaml

Check Status#

kubectl get inferencepool embeddings

Configuration#

InferencePool Spec#

Field | Type | Description
----- | ---- | -----------
model | string | Model name to serve (e.g., BAAI/bge-small-en-v1.5)
replicas | int | Number of replicas
resources | ResourceRequirements | CPU/memory/GPU limits
autoscaling | AutoscalingSpec | Autoscaling configuration
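The spec fields above can be combined in a single manifest. As a sketch drawn only from the fields documented in this table — an autoscaled pool with explicit resource limits:

```yaml
apiVersion: termite.antfly.io/v1
kind: InferencePool
metadata:
  name: embeddings-autoscaled
  namespace: default
spec:
  model: BAAI/bge-small-en-v1.5
  replicas: 2                  # initial replica count; autoscaling adjusts it
  resources:
    limits:
      memory: 4Gi
      cpu: 2
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetQueueDepth: 5
```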

Autoscaling#

Enable autoscaling based on request queue depth. The pool scales between minReplicas and maxReplicas, adding replicas when the average number of queued requests per replica exceeds targetQueueDepth:

spec:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetQueueDepth: 5

GPU Support#

To schedule pods on GPU nodes, request a GPU in the resource limits and use a nodeSelector to target nodes with the appropriate accelerator:

spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-t4
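On clusters where GPU nodes carry a taint (common with managed GPU node pools), the pod also needs a matching toleration. A sketch, assuming the conventional nvidia.com/gpu taint key and assuming the InferencePool spec passes tolerations through to the pod spec (not confirmed by the spec table above):

```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-t4
  tolerations:
    - key: nvidia.com/gpu    # conventional taint key for NVIDIA device-plugin setups
      operator: Exists
      effect: NoSchedule
```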

Architecture#

The Termite Operator consists of:

  1. Controller - Watches InferencePool resources and reconciles state
  2. Proxy - Routes requests to available model instances
  3. Metrics Server - Collects metrics for autoscaling decisions

Troubleshooting#

Common Issues#

Pods stuck in Pending

  • Check node resources with kubectl describe node
  • Verify GPU drivers are installed if using GPU resources

Model not loading

  • Check pod logs: kubectl logs -l app=termite
  • Verify that the model name in spec.model is spelled correctly

High latency

  • Enable autoscaling to handle load spikes
  • Consider using quantized models for faster inference

Next Steps#