This guide explains the architecture and key concepts of the Antfly Operator.

Architecture Overview#

The Antfly database uses a two-tier architecture with separate node types for coordination and data storage:

                ┌───────────────────────────────────────────────────┐
                │                Kubernetes Cluster                 │
                │                                                   │
┌───────────┐   │  ┌─────────────────────────────────────────────┐  │
│  Clients  │◄──┼──│             Public API Service              │  │
└───────────┘   │  │           (LoadBalancer/NodePort)           │  │
                │  └──────────────────────┬──────────────────────┘  │
                │                         │                         │
                │                         ▼                         │
                │  ┌─────────────────────────────────────────────┐  │
                │  │        Metadata Nodes (StatefulSet)         │  │
                │  │   ┌────────┐   ┌────────┐   ┌────────┐      │  │
                │  │   │ Node 0 │◄──│ Node 1 │◄──│ Node 2 │      │  │
                │  │   │ Leader │   │Follower│   │Follower│      │  │
                │  │   └────┬───┘   └────┬───┘   └────┬───┘      │  │
                │  │        │    Raft    │            │          │  │
                │  │        │  Consensus │            │          │  │
                │  └────────┼────────────┼────────────┼──────────┘  │
                │           │            │            │             │
                │           ▼            ▼            ▼             │
                │  ┌─────────────────────────────────────────────┐  │
                │  │          Data Nodes (StatefulSet)           │  │
                │  │   ┌────────┐   ┌────────┐   ┌────────┐ ...  │  │
                │  │   │ Node 0 │   │ Node 1 │   │ Node 2 │      │  │
                │  │   └────────┘   └────────┘   └────────┘      │  │
                │  │             Data Replication                │  │
                │  └─────────────────────────────────────────────┘  │
                │                                                   │
                └───────────────────────────────────────────────────┘

Node Types#

Metadata Nodes#

Metadata nodes handle cluster coordination and client requests:

| Responsibility       | Description                           |
|----------------------|---------------------------------------|
| Raft Consensus       | Maintain cluster state consistency    |
| Client API           | Handle client connections and queries |
| Cluster Coordination | Manage data node membership           |
| Schema Management    | Store table and index definitions     |

Key Characteristics:

  • Fixed replica count: Always 3 or 5 (odd number for Raft quorum)
  • Not autoscaled: Replica count is static for consensus stability
  • Not recommended for Spot Pods: Raft leader stability is critical
  • Ports: 12377 (API), 9017 (Raft), 4200 (Health)
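
The fixed, odd replica count can be checked in a couple of lines. This is an illustrative sketch with hypothetical function names, not the operator's actual admission logic:

```python
def validate_metadata_replicas(replicas: int) -> None:
    """Reject replica counts that cannot form a stable Raft quorum.

    Illustrative only; the operator's real validation may differ.
    """
    if replicas not in (3, 5):
        raise ValueError(
            f"metadata replicas must be 3 or 5 for Raft quorum, got {replicas}"
        )

def failures_tolerated(replicas: int) -> int:
    """A Raft cluster of n nodes tolerates (n - 1) // 2 failures."""
    return (replicas - 1) // 2
```

A 3-node quorum tolerates one node failure; a 5-node quorum tolerates two, at the cost of more replication traffic.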

Data Nodes#

Data nodes store and replicate actual data:

| Responsibility   | Description                            |
|------------------|----------------------------------------|
| Data Storage     | Store table data on persistent volumes |
| Replication      | Replicate data across nodes            |
| Query Processing | Execute queries on local data          |

Key Characteristics:

  • Autoscalable: Can scale based on CPU/memory metrics
  • Spot-compatible: Safe to use Spot Pods/Instances with 3+ replicas
  • Horizontal scaling: Add nodes to increase capacity
  • Ports: 12380 (API), 9021 (Raft), 4200 (Health)

Custom Resource Definitions (CRDs)#

The operator manages three CRDs:

AntflyCluster#

The primary resource defining a database cluster:

apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: my-cluster
spec:
  image: ghcr.io/antflydb/antfly:latest
  metadataNodes:
    replicas: 3
    resources: {...}
  dataNodes:
    replicas: 3
    resources: {...}
  storage: {...}
  config: |
    {...}

AntflyBackup#

Defines scheduled backup operations:

apiVersion: antfly.io/v1
kind: AntflyBackup
metadata:
  name: daily-backup
spec:
  clusterRef:
    name: my-cluster
  schedule: "0 2 * * *"  # Daily at 2am
  destination:
    location: s3://my-bucket/backups

AntflyRestore#

Defines restore operations:

apiVersion: antfly.io/v1
kind: AntflyRestore
metadata:
  name: restore-from-backup
spec:
  clusterRef:
    name: my-cluster
  source:
    backupId: "backup-20250101-020000"
    location: s3://my-bucket/backups

Kubernetes Resources Created#

When you create an AntflyCluster, the operator creates:

| Resource    | Name Pattern           | Purpose                            |
|-------------|------------------------|------------------------------------|
| StatefulSet | {cluster}-metadata     | Metadata nodes                     |
| StatefulSet | {cluster}-data         | Data nodes                         |
| Service     | {cluster}-metadata     | Internal metadata service          |
| Service     | {cluster}-data         | Internal data service              |
| Service     | {cluster}-public-api   | External API service               |
| ConfigMap   | {cluster}-config       | Antfly configuration               |
| PVCs        | data-{cluster}-*-{n}   | Persistent storage                 |
| PDB         | {cluster}-metadata-pdb | Pod disruption budget (if enabled) |
| PDB         | {cluster}-data-pdb     | Pod disruption budget (if enabled) |
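
Because these patterns are deterministic, the child-resource names can be derived directly from the cluster name. A minimal sketch following the table above (hypothetical helper, not the operator's source):

```python
def resource_names(cluster: str) -> dict:
    """Derive child-resource names from an AntflyCluster name,
    following the documented naming patterns."""
    return {
        "metadata_statefulset": f"{cluster}-metadata",
        "data_statefulset": f"{cluster}-data",
        "metadata_service": f"{cluster}-metadata",
        "data_service": f"{cluster}-data",
        "public_api_service": f"{cluster}-public-api",
        "configmap": f"{cluster}-config",
        "metadata_pdb": f"{cluster}-metadata-pdb",
        "data_pdb": f"{cluster}-data-pdb",
    }
```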

Reconciliation Loop#

The operator continuously reconciles the desired state with actual state:

┌──────────────────────────────────────────────────────────────────────┐
│                        Reconciliation Loop                           │
│                                                                      │
│  ┌─────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────────┐ │
│  │  Watch  │───►│   Compare   │───►│   Update    │───►│  Status  │ │
│  │ Events  │    │   Desired   │    │  Resources  │    │  Update  │ │
│  │         │    │  vs Actual  │    │             │    │          │ │
│  └─────────┘    └─────────────┘    └─────────────┘    └──────────┘ │
│       ▲                                                      │      │
│       └──────────────────────────────────────────────────────┘      │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Reconciliation Steps:

  1. Apply default values to cluster spec
  2. Reconcile ConfigMap (Antfly configuration)
  3. Reconcile Services (internal + public API)
  4. Reconcile Metadata StatefulSet
  5. Reconcile Data StatefulSet
  6. Evaluate autoscaling (if enabled)
  7. Update cluster status
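
The steps above can be sketched as a single reconciliation pass. Function and spec-key names here are hypothetical; the real controller code is not shown in this guide:

```python
def reconcile(spec: dict) -> list:
    """One reconciliation pass; returns the ordered steps performed.

    Illustrative sketch of the documented flow: autoscaling is only
    evaluated when enabled on the data nodes.
    """
    steps = [
        "apply_defaults",
        "reconcile_configmap",
        "reconcile_services",
        "reconcile_metadata_statefulset",
        "reconcile_data_statefulset",
    ]
    if spec.get("dataNodes", {}).get("autoScaling", {}).get("enabled"):
        steps.append("evaluate_autoscaling")
    steps.append("update_status")
    return steps
```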

Status and Conditions#

The cluster status provides operational information:

status:
  phase: Running
  metadataNodesReady: 3
  dataNodesReady: 3
  conditions:
    - type: ConfigurationValid
      status: "True"
    - type: SecretsReady
      status: "True"
  autoScalingStatus:
    currentReplicas: 3
    desiredReplicas: 3

Phases#

| Phase    | Description              |
|----------|--------------------------|
| Pending  | Cluster is being created |
| Running  | All nodes are ready      |
| Degraded | Some nodes are not ready |
| Failed   | Critical error           |
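
One plausible mapping from readiness counts to these phases (an illustrative sketch; the operator's actual phase logic may differ):

```python
def derive_phase(meta_ready: int, meta_want: int,
                 data_ready: int, data_want: int,
                 failed: bool = False) -> str:
    """Map node readiness onto the documented cluster phases.

    Assumption: a cluster with no ready nodes is still being created.
    """
    if failed:
        return "Failed"
    if meta_ready == 0 and data_ready == 0:
        return "Pending"
    if meta_ready == meta_want and data_ready == data_want:
        return "Running"
    return "Degraded"
```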

Condition Types#

| Condition          | Description                                 |
|--------------------|---------------------------------------------|
| ConfigurationValid | Configuration passes validation             |
| SecretsReady       | Referenced secrets exist                    |
| ServiceMeshReady   | Service mesh sidecars injected (if enabled) |

Port Defaults#

The operator uses these default ports:

| Service       | Port  | Protocol |
|---------------|-------|----------|
| Metadata API  | 12377 | TCP      |
| Metadata Raft | 9017  | TCP      |
| Data API      | 12380 | TCP      |
| Data Raft     | 9021  | TCP      |
| Health Check  | 4200  | HTTP     |
| Public API    | 80    | TCP      |

Health Checks#

The operator configures health probes for all pods:

| Probe     | Endpoint      | Purpose                |
|-----------|---------------|------------------------|
| Startup   | :4200/healthz | Allow slow starts      |
| Liveness  | :4200/healthz | Restart unhealthy pods |
| Readiness | :4200/readyz  | Traffic routing        |

Probe Configuration:

  • Startup: 30s initial delay, 10s period, 30 failure threshold
  • Liveness: 15s period, 3 failure threshold
  • Readiness: 5s period, 5 failure threshold
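
Expressed as standard Kubernetes probe fields, those settings correspond roughly to the following pod-spec fragment (illustrative, not operator-generated output):

```yaml
# Illustrative pod-spec fragment matching the documented probe settings.
startupProbe:
  httpGet:
    path: /healthz
    port: 4200
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30
livenessProbe:
  httpGet:
    path: /healthz
    port: 4200
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 4200
  periodSeconds: 5
  failureThreshold: 5
```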

Configuration#

Antfly configuration is passed via spec.config as JSON:

spec:
  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      },
      "enable_metrics": true,
      "replication_factor": 3
    }

The operator:

  1. Parses the user-provided JSON
  2. Merges with auto-generated network configuration
  3. Stores in a ConfigMap
  4. Mounts at /config/config.json in all pods

Scaling#

Manual Scaling#

Update the replica count in the spec:

spec:
  dataNodes:
    replicas: 5  # Changed from 3

Autoscaling#

Enable metrics-based autoscaling for data nodes:

spec:
  dataNodes:
    replicas: 3
    autoScaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70

Autoscaling Behavior:

  • Metrics collected every 30 seconds
  • Scale-up: Max 50% increase or +2 replicas
  • Scale-down: Max 25% decrease or -1 replica
  • Cooldown periods prevent flapping

See Autoscaling for details.

Storage#

Each node gets a Persistent Volume Claim:

spec:
  storage:
    storageClass: "standard"    # Omit to use the cluster default
    metadataStorage: "1Gi"      # Per metadata node
    dataStorage: "10Gi"         # Per data node

Important Notes:

  • PVCs are retained when pods restart
  • PVCs are retained when scaling down (data is preserved)
  • Storage class must support dynamic provisioning
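
Since every node gets its own PVC, total provisioned storage is replicas × per-node size for each tier; the example above requests 3 × 1Gi + 3 × 10Gi = 33Gi. A trivial sketch (hypothetical helper):

```python
def total_storage_gib(meta_replicas: int, meta_gib: int,
                      data_replicas: int, data_gib: int) -> int:
    """Total PVC storage requested: one volume per node in each tier."""
    return meta_replicas * meta_gib + data_replicas * data_gib
```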

Cloud Provider Integration#

GKE Autopilot#

spec:
  gke:
    autopilot: true
    autopilotComputeClass: "Balanced"
    podDisruptionBudget:
      enabled: true

See GKE Guide for details.

AWS EKS#

spec:
  eks:
    enabled: true
    useSpotInstances: true
    ebsVolumeType: "gp3"
    irsaRoleARN: "arn:aws:iam::123456789:role/antfly"

See EKS Guide for details.

Next Steps#