Architecture
How Termite works
From model loading through multi-backend inference and caching, to distributed Kubernetes deployments with model-aware routing.
End to End
The Inference Pipeline
From document reading through ML inference — the complete Termite processing pipeline.
Prepare
Analyze
The complete ML inference pipeline — from document reading to generated answers.
01
Inference Pipeline
API requests flow through backpressure control into lazy-loaded model registries. The session manager selects the fastest available backend—ONNX Runtime, XLA, or pure Go—then runs tokenization, inference, and decoding with two-tier caching and singleflight deduplication.
Backpressure Queue · Multi-Backend · Two-Tier Cache · Singleflight · ONNX Runtime · Lazy Loading
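To make the admission-control step concrete, here is a minimal, stdlib-only Go sketch of a backpressure gate: a buffered channel used as a counting semaphore that rejects new requests once capacity is full, rather than queueing them without bound. The `Gate` type, its methods, and the capacity value are hypothetical illustrations, not Termite's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded is returned when the gate has no free slots.
var ErrOverloaded = errors.New("server overloaded")

// Gate is a counting semaphore built on a buffered channel.
// (Hypothetical sketch; Termite's real backpressure queue may differ.)
type Gate struct {
	slots chan struct{}
}

// NewGate creates a gate admitting at most capacity concurrent requests.
func NewGate(capacity int) *Gate {
	return &Gate{slots: make(chan struct{}, capacity)}
}

// Acquire takes a slot, failing fast when the gate is full so the
// caller can shed load instead of letting latency grow unbounded.
func (g *Gate) Acquire() error {
	select {
	case g.slots <- struct{}{}:
		return nil
	default:
		return ErrOverloaded
	}
}

// Release frees a slot taken by a successful Acquire.
func (g *Gate) Release() {
	<-g.slots
}

func main() {
	g := NewGate(2)
	fmt.Println(g.Acquire()) // <nil>
	fmt.Println(g.Acquire()) // <nil>
	fmt.Println(g.Acquire()) // server overloaded
	g.Release()
	fmt.Println(g.Acquire()) // <nil>
}
```

Failing fast at admission is what lets the downstream stages (model loading, tokenization, inference) run at a bounded concurrency instead of collapsing under a request spike.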
Ready to get started?
Run Termite locally in one command or deploy distributed inference pools with the Kubernetes operator.
$
termite run --models-dir ./models