This guide covers deploying Antfly clusters on any Kubernetes distribution, including minikube, kind, on-premises clusters, and other cloud providers.

Overview#

The Antfly Operator works on any Kubernetes cluster that meets the prerequisites below. This guide is for deployments that don't use GKE Autopilot or AWS EKS-specific features.

Prerequisites#

  • Kubernetes 1.20+ cluster
  • kubectl installed and configured
  • Storage class with dynamic provisioning
  • (Optional) metrics-server for autoscaling

Supported Distributions#

The operator has been tested on:

| Distribution | Notes |
| --- | --- |
| minikube | Local development |
| kind | Local testing |
| k3s | Lightweight production |
| kubeadm | Standard Kubernetes |
| Rancher | Enterprise Kubernetes |
| OpenShift | May require SCC configuration |
| DigitalOcean Kubernetes | Works out of the box |
| Linode Kubernetes Engine | Works out of the box |
| Azure AKS | Works out of the box |

Installation#

# Install the operator
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

# Verify installation
kubectl get pods -n antfly-operator-namespace

Basic Deployment#

Minimal Cluster#

For local development or testing:

apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: dev-cluster
  namespace: default
spec:
  image: ghcr.io/antflydb/antfly:latest

  metadataNodes:
    replicas: 1  # Single node for development
    metadataAPI:
      port: 12377
    metadataRaft:
      port: 9017
    resources:
      cpu: "100m"
      memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

  dataNodes:
    replicas: 1  # Single node for development
    api:
      port: 12380
    raft:
      port: 9021
    resources:
      cpu: "100m"
      memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

  storage:
    storageClass: "standard"  # Use your cluster's default
    metadataStorage: "500Mi"
    dataStorage: "1Gi"

  config: |
    {
      "log": {
        "level": "debug",
        "style": "terminal"
      }
    }

Production Cluster#

For production environments:

apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: prod-cluster
  namespace: production
spec:
  image: ghcr.io/antflydb/antfly:latest  # Pin a specific version tag in production

  metadataNodes:
    replicas: 3
    metadataAPI:
      port: 12377
    metadataRaft:
      port: 9017
    resources:
      cpu: "500m"
      memory: "512Mi"
      limits:
        cpu: "1000m"
        memory: "1Gi"

  dataNodes:
    replicas: 3
    api:
      port: 12380
    raft:
      port: 9021
    resources:
      cpu: "1000m"
      memory: "2Gi"
      limits:
        cpu: "2000m"
        memory: "4Gi"
    autoScaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70

  storage:
    storageClass: "fast-ssd"  # Use your high-performance storage class
    metadataStorage: "5Gi"
    dataStorage: "50Gi"

  publicAPI:
    enabled: true
    serviceType: LoadBalancer
    port: 80

  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      },
      "enable_metrics": true,
      "replication_factor": 3
    }

Storage Configuration#

Determine Available Storage Classes#

kubectl get storageclass

Common Storage Classes#

| Provider | Storage Class | Description |
| --- | --- | --- |
| minikube | standard | Default hostPath storage |
| kind | standard | Default local storage |
| k3s | local-path | Local path provisioner |
| DigitalOcean | do-block-storage | Block storage |
| Linode | linode-block-storage | Block storage |
| Azure AKS < 1.29 | managed-csi or managed-csi-premium | Premium SSD with WaitForFirstConsumer (LRS, AZ-bound) |
| Azure AKS >= 1.29 | managed-csi (default) | Multi-zone clusters auto-use ZRS (zone-redundant), eliminating the AZ topology issue |

Multi-AZ Storage Considerations#

For multi-AZ deployments with zone-bound storage (EBS, GCE PD, Azure Disk LRS), verify your StorageClass uses WaitForFirstConsumer binding mode:

kubectl get storageclass <name> -o yaml | grep volumeBindingMode

If it shows Immediate, volumes may be provisioned in a different AZ than your pods, causing volume node affinity conflict errors. See the Troubleshooting guide for details.

| Provider | Recommended StorageClass | volumeBindingMode | Notes |
| --- | --- | --- | --- |
| EKS < 1.30 | gp3 (custom) or gp2 | WaitForFirstConsumer | Must use ebs.csi.aws.com provisioner for gp3 |
| EKS >= 1.30 | gp3 (must create) | WaitForFirstConsumer | No default StorageClass on EKS 1.30+ |
| GKE Standard | standard-rwo or premium-rwo | WaitForFirstConsumer | Default standard uses Immediate; avoid for multi-AZ |
| GKE Autopilot | standard-rwo (default) | WaitForFirstConsumer | Autopilot handles topology internally |
| AKS < 1.29 | managed-csi | WaitForFirstConsumer | LRS disks are AZ-bound |
| AKS >= 1.29 | managed-csi | WaitForFirstConsumer | ZRS for multi-zone; AZ problem eliminated |
| Generic | Must verify | Must be WaitForFirstConsumer | Check with kubectl get sc <name> -o yaml |
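If your default class binds immediately, one remedy is a custom topology-aware class. A minimal sketch for the EBS CSI driver (the name `gp3-topology` is illustrative; adjust the provisioner and parameters for your platform):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology  # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # delay provisioning until the pod is scheduled
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Reference the class from spec.storage.storageClass in the AntflyCluster manifest.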

Using Custom Storage Class#

spec:
  storage:
    storageClass: "your-storage-class"
    metadataStorage: "1Gi"
    dataStorage: "10Gi"

Local Development#

minikube Setup#

# Start minikube with sufficient resources
minikube start --cpus=4 --memory=7500 --disk-size=50g

# Enable metrics-server for autoscaling
minikube addons enable metrics-server

# Install operator
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

# Deploy development cluster (600m total CPU: 3x100m metadata + 3x100m data)
kubectl apply -f https://antfly.io/examples/development-cluster.yaml

# Access the cluster
kubectl port-forward svc/antfly-dev-cluster-metadata -n antfly-dev 12377:12377

Docker Desktop users: If you pass --memory values above 7500, make sure Docker Desktop's memory allocation is at least 8GB under Settings > Resources.

Minikube Docker driver: With the Docker driver, NodePort services are not directly accessible from the host. Use kubectl port-forward (recommended for development), minikube service <service-name> to open in a browser, or minikube tunnel for LoadBalancer external IPs.

kind Setup#

# Create cluster
kind create cluster --name antfly-dev

# Install metrics-server (optional)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Patch metrics-server for kind (insecure TLS)
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Install operator
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

# Deploy cluster
kubectl apply -f examples/small-dev-cluster.yaml

k3s Setup#

# Install k3s (already includes metrics-server)
curl -sfL https://get.k3s.io | sh -

# Configure kubectl
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Install operator
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

# Deploy cluster
kubectl apply -f examples/production-cluster.yaml

Service Exposure#

ClusterIP (Internal Only)#

spec:
  publicAPI:
    enabled: true
    serviceType: ClusterIP
    port: 80

Access via port-forward:

kubectl port-forward svc/<cluster>-public-api 8080:80

NodePort#

spec:
  publicAPI:
    enabled: true
    serviceType: NodePort
    port: 80
    nodePort: 30100  # Optional: specify port (30000-32767)

Access via any node IP:

curl http://<node-ip>:30100

LoadBalancer#

spec:
  publicAPI:
    enabled: true
    serviceType: LoadBalancer
    port: 80

Works on cloud providers with LoadBalancer support. For bare metal, use MetalLB:

# Install MetalLB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.7/config/manifests/metallb-native.yaml

# Configure IP pool (example)
kubectl apply -f - <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
EOF

Ingress#

For custom Ingress configuration, disable the public API service:

spec:
  publicAPI:
    enabled: false

Then create your own Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: antfly-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  rules:
  - host: antfly.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-cluster-metadata
            port:
              number: 12377
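If you terminate TLS at the ingress, the same Ingress can carry a tls section. A sketch (the secret name `antfly-tls` is an assumption; create it from your certificate and key first, for example with `kubectl create secret tls`):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: antfly-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  tls:
  - hosts:
    - antfly.example.com
    secretName: antfly-tls  # assumed secret holding the TLS cert and key
  rules:
  - host: antfly.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-cluster-metadata
            port:
              number: 12377
```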

Resource Considerations#

Minimum Requirements#

| Node Type | CPU Request | Memory Request |
| --- | --- | --- |
| Metadata | 100m | 128Mi |
| Data | 100m | 256Mi |

Recommended Production Resources#

| Node Type | CPU Request | Memory Request | CPU Limit | Memory Limit |
| --- | --- | --- | --- | --- |
| Metadata | 500m | 512Mi | 1000m | 1Gi |
| Data | 1000m | 2Gi | 2000m | 4Gi |

Resource Quotas#

If your namespace has resource quotas, ensure sufficient allocation:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: antfly-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    persistentvolumeclaims: "10"

Autoscaling#

Prerequisites#

Install metrics-server if not present:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify metrics are available:

kubectl top nodes
kubectl top pods

Enable Autoscaling#

spec:
  dataNodes:
    replicas: 3
    resources:
      cpu: "500m"       # Required for CPU-based scaling
      memory: "1Gi"     # Required for memory-based scaling
    autoScaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70
      targetMemoryUtilizationPercentage: 80

Network Policies#

If your cluster uses NetworkPolicies, allow traffic between Antfly pods and permit DNS lookups — without a DNS egress rule, pods cannot resolve peer addresses:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-antfly-internal
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: antfly
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: antfly
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: antfly
  # Allow DNS resolution (kube-dns/CoreDNS; most distributions use the k8s-app: kube-dns label)
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Pod Security#

Pod Security Standards#

For clusters that enforce Pod Security Standards (PSS), the Antfly containers run as non-root by default and should work under the restricted profile.
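If your cluster enforces PSS via namespace labels, apply the standard Pod Security admission labels to the Antfly namespace. A sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: antfly-prod  # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```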

OpenShift Security Context Constraints#

For OpenShift, create an SCC and grant it to the service account used by the Antfly pods (for example with `oc adm policy add-scc-to-user`):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: antfly-scc
allowPrivilegedContainer: false
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
volumes:
  - configMap
  - emptyDir
  - persistentVolumeClaim
  - secret

Troubleshooting#

Storage Issues#

# Check PVCs
kubectl get pvc -l app=antfly

# Check storage provisioner
kubectl get pods -n kube-system | grep -E "(provisioner|csi)"

# Describe PVC for errors
kubectl describe pvc <pvc-name>

Pods Not Scheduling#

# Check pod events
kubectl describe pod <pod-name>

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check resource quotas
kubectl describe resourcequota

Networking Issues#

# Test pod connectivity (ping may be absent from minimal images; try wget or nc instead)
kubectl exec -it <metadata-pod> -- ping <data-pod-ip>

# Check services
kubectl get svc -l app=antfly
kubectl describe svc <service-name>

# Check endpoints
kubectl get endpoints -l app=antfly

Metrics Server Issues#

# Check metrics-server
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server

# Test metrics API
kubectl top pods

Best Practices#

  1. Use Namespaces: Isolate Antfly clusters in dedicated namespaces
  2. Set Resource Limits: Prevent runaway resource consumption
  3. Enable Autoscaling: For production workloads with variable load
  4. Configure Storage: Use appropriate storage class for your workload
  5. Monitor Resources: Set up monitoring for cluster health
  6. Backup Regularly: Configure AntflyBackup for data protection
  7. Test Failover: Verify high availability works as expected

Example Configurations#

See the examples/ directory for ready-to-use configurations:

| Example | Use Case |
| --- | --- |
| small-dev-cluster.yaml | Minimal resources for development |
| development-cluster.yaml | Development with debug logging |
| production-cluster.yaml | Production-ready configuration |
| autoscaling-cluster.yaml | With autoscaling enabled |

Next Steps#