Architecture

How Termite works

From model loading through multi-backend inference and caching, to distributed Kubernetes deployments with model-aware routing.

End to End

The Inference Pipeline

From document reading through ML inference — the complete Termite processing pipeline.


Inference Pipeline

API requests flow through backpressure control into lazy-loaded model registries. The session manager selects the fastest available backend—ONNX Runtime, XLA, or pure Go—then runs tokenization, inference, and decoding with two-tier caching and singleflight deduplication.

Backpressure Queue · Multi-Backend · Two-Tier Cache · Singleflight · ONNX Runtime · Lazy Loading
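The singleflight deduplication and in-memory cache tier described above can be sketched in Go. This is a minimal illustration with hypothetical names (`Cache`, `demo`, the `"embed:hello"` key), not Termite's actual implementation; a real two-tier cache would add a second, persistent tier behind the map.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight computation so concurrent
// requests for the same key can wait on a single result.
type call struct {
	wg  sync.WaitGroup
	val string
	err error
}

// Cache is a minimal in-memory (L1) cache with
// singleflight-style deduplication of concurrent misses.
type Cache struct {
	mu       sync.Mutex
	l1       map[string]string
	inflight map[string]*call
}

func NewCache() *Cache {
	return &Cache{l1: map[string]string{}, inflight: map[string]*call{}}
}

// Get returns the cached value for key, running compute at
// most once even when many callers miss at the same time.
func (c *Cache) Get(key string, compute func() (string, error)) (string, error) {
	c.mu.Lock()
	if v, ok := c.l1[key]; ok { // cache hit: no work
		c.mu.Unlock()
		return v, nil
	}
	if cl, ok := c.inflight[key]; ok { // dedupe: join the in-flight call
		c.mu.Unlock()
		cl.wg.Wait()
		return cl.val, cl.err
	}
	cl := &call{} // first miss: this caller owns the computation
	cl.wg.Add(1)
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = compute() // e.g. run model inference
	c.mu.Lock()
	delete(c.inflight, key)
	if cl.err == nil {
		c.l1[key] = cl.val
	}
	c.mu.Unlock()
	cl.wg.Done()
	return cl.val, cl.err
}

// demo fires 8 concurrent requests for the same key and
// reports how many times compute actually ran.
func demo() (int64, string) {
	var calls atomic.Int64
	c := NewCache()
	var wg sync.WaitGroup
	var mu sync.Mutex
	var last string
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _ := c.Get("embed:hello", func() (string, error) {
				calls.Add(1)
				return "vector-for-hello", nil
			})
			mu.Lock()
			last = v
			mu.Unlock()
		}()
	}
	wg.Wait()
	return calls.Load(), last
}

func main() {
	n, v := demo()
	fmt.Printf("compute ran %d time(s); value %q\n", n, v)
}
```

All eight callers receive the same value, but the expensive computation runs once: the first miss installs an in-flight record, and every later caller waits on it instead of recomputing.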

Ready to get started?

Run Termite locally in one command or deploy distributed inference pools with the Kubernetes operator.

$ termite run --models-dir ./models