This guide covers deploying Antfly clusters on Google Kubernetes Engine (GKE) Autopilot with cost optimization using Spot Pods.
## Overview
GKE Autopilot is a fully managed Kubernetes service that handles cluster infrastructure, node management, and scaling. The Antfly Operator has built-in support for GKE Autopilot including:
- **Spot Pods**: Up to 71% cost savings for fault-tolerant workloads
- **Pod Disruption Budgets**: Automatic protection during cluster maintenance
- **Compute Classes**: Optimized node selection for different workload types
- **Automatic Scaling**: Works seamlessly with GKE's node auto-provisioning
## Prerequisites
- Google Cloud account with billing enabled
- `gcloud` CLI installed and configured
- `kubectl` installed
- A GKE Autopilot cluster
## Creating a GKE Autopilot Cluster
```shell
# Set your project
gcloud config set project YOUR_PROJECT_ID

# Create an Autopilot cluster
gcloud container clusters create-auto antfly-cluster \
  --region=us-central1 \
  --release-channel=regular

# Get credentials
gcloud container clusters get-credentials antfly-cluster --region=us-central1
```

## Deploying the Operator
```shell
# Deploy the Antfly operator
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

# Verify the operator is running
kubectl get pods -n antfly-operator-namespace
```

## GKE Configuration
Add the `gke` section to your `AntflyCluster` spec:
```yaml
spec:
  gke:
    # Enable Autopilot optimizations
    autopilot: true
    # Specify compute class (optional, defaults to "Balanced")
    autopilotComputeClass: "Balanced" # or "autopilot-spot", "Performance", "Scale-Out", "Accelerator"
    # Enable Pod Disruption Budgets (recommended)
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1 # Allow max 1 pod unavailable during maintenance
```

## Spot Pods Configuration
For GKE Autopilot, enable Spot Pods via the compute class:
```yaml
spec:
  gke:
    autopilot: true
    autopilotComputeClass: "autopilot-spot" # Spot Pods via compute class
    # Do NOT set useSpotPods when autopilot=true - they conflict
```

For standard GKE clusters (non-Autopilot), use `useSpotPods`:
```yaml
spec:
  gke:
    autopilot: false # Standard GKE
  metadataNodes:
    useSpotPods: false # NOT recommended for metadata nodes
  dataNodes:
    useSpotPods: true # Safe for data nodes with replication
```

**Important Notes:**

- **GKE Autopilot**: Use `autopilotComputeClass` for Spot Pods (NOT `useSpotPods`)
- **Standard GKE**: Use the `useSpotPods` field
- **Metadata nodes**: Should NOT use Spot Pods (to maintain Raft consensus stability)
- **Data nodes**: Can safely use Spot Pods with proper replication (3+ replicas)
- Spot Pods can be evicted at any time with 25-second notice
- The operator sets `terminationGracePeriodSeconds: 15` automatically
- **Immutability**: The `autopilot` and `autopilotComputeClass` fields cannot be changed after deployment
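To confirm the grace period the operator applied, you can inspect a running pod directly (the pod name below is a placeholder to substitute with one of your own):

```shell
# Print the termination grace period of a data pod; the operator
# is documented to set this to 15.
kubectl get pod <data-pod-name> \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```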
## Compute Classes
GKE Autopilot offers different compute classes for workload optimization:
| Compute Class | Use Case | Characteristics |
|---|---|---|
| `Balanced` | Default, general-purpose workloads | Standard CPU/memory ratio (default) |
| `autopilot-spot` | Cost-optimized Spot Pods | Up to 71% savings, preemptible |
| `Performance` | CPU/memory-intensive workloads | Higher limits, premium hardware |
| `Scale-Out` | Large-scale distributed workloads | Optimized for horizontal scaling |
| `Accelerator` | GPU/TPU workloads | Requires GPU resources in pod spec |
| `autopilot` | Standard Autopilot scheduling | Default Autopilot behavior |
**Default Behavior:** If `autopilotComputeClass` is not specified and `autopilot: true`, the operator defaults to `"Balanced"`.
Specify in your cluster configuration:
```yaml
spec:
  gke:
    autopilotComputeClass: "Balanced"
```

## Pod Disruption Budgets
PodDisruptionBudgets protect your cluster during GKE maintenance:
```yaml
spec:
  gke:
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1 # Recommended: allows rolling updates while maintaining availability
      # OR
      # minAvailable: 2 # Alternative: ensure minimum pods always available
```

**Best Practices:**

- Use `maxUnavailable` (recommended by Google); it scales automatically with replica count
- For 3-replica clusters, `maxUnavailable: 1` ensures 2 pods are always available
- For larger clusters, adjust based on replication factor and load requirements
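For reference, the PDB the operator creates for data nodes is expected to look roughly like this sketch. The name follows the `<cluster-name>-data-pdb` convention used later in this guide; the selector labels (`app=antfly`, `role=data`) match those used in the monitoring commands but are an assumption about the operator's internals:

```yaml
# Sketch of the PodDisruptionBudget the operator generates for data nodes
# (names and labels are assumptions, not a literal dump).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gke-antfly-cluster-data-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: antfly
      role: data
```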
## Storage Configuration
GKE Autopilot uses specific storage classes:
```yaml
spec:
  storage:
    storageClass: "standard-rwo" # Default GKE Autopilot storage class
    # OR
    # storageClass: "premium-rwo" # For higher IOPS requirements
    metadataStorage: "1Gi"
    dataStorage: "10Gi"
```

**Available Storage Classes:**

- `standard-rwo`: Balanced persistent disks (`pd-balanced`) — recommended for multi-AZ
- `premium-rwo`: SSD persistent disks (`pd-ssd`) — recommended for multi-AZ with high IOPS
**Warning (GKE Standard only):** The default `standard` StorageClass uses the `Immediate` volume binding mode, which provisions PVs before pods are scheduled. This can place PVs in the wrong AZ, causing `volume node affinity conflict` errors. Always use `standard-rwo` or `premium-rwo` (CSI-based, `WaitForFirstConsumer`) for multi-AZ deployments.
GKE Autopilot defaults to `standard-rwo`, which uses `WaitForFirstConsumer` — no action needed.
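If you are on GKE Standard and want an explicit CSI-backed class rather than relying on the built-in ones, a StorageClass along these lines uses the GKE persistent disk CSI driver with delayed binding (the class name is illustrative; `standard-rwo` already provides equivalent behavior on most clusters):

```yaml
# Illustrative StorageClass for GKE Standard using the PD CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: balanced-wffc
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
# Delay PV provisioning until the pod is scheduled, so the disk
# lands in the same zone as the pod.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```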
## Example Deployments

### Standard GKE Autopilot Cluster
```yaml
apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: gke-antfly-cluster
  namespace: default
spec:
  image: ghcr.io/antflydb/antfly:latest
  gke:
    autopilot: true
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
  metadataNodes:
    replicas: 3
    resources:
      cpu: "500m"
      memory: "512Mi"
      limits:
        cpu: "1000m"
        memory: "1Gi"
  dataNodes:
    replicas: 3
    resources:
      cpu: "1000m"
      memory: "2Gi"
  storage:
    storageClass: "standard-rwo"
    metadataStorage: "1Gi"
    dataStorage: "10Gi"
  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      },
      "enable_metrics": true
    }
```

### Cost-Optimized with Spot Pods
```yaml
apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: gke-spot-cluster
  namespace: default
spec:
  image: ghcr.io/antflydb/antfly:latest
  gke:
    autopilot: true
    autopilotComputeClass: "autopilot-spot" # Use compute class for Spot Pods
    podDisruptionBudget:
      enabled: true
      maxUnavailable: 1
  metadataNodes:
    replicas: 3
    resources:
      cpu: "500m"
      memory: "512Mi"
  dataNodes:
    replicas: 5 # More replicas for better fault tolerance
    resources:
      cpu: "1000m"
      memory: "2Gi"
    # Optional: Autoscaling with Spot Pods
    autoScaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70
  storage:
    storageClass: "standard-rwo"
    dataStorage: "10Gi"
  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      },
      "replication_factor": 3
    }
```

## Deploying from Examples
```shell
# Standard GKE Autopilot deployment
kubectl apply -f examples/gke-autopilot-cluster.yaml

# Cost-optimized with Spot Pods
kubectl apply -f examples/gke-autopilot-spot-cluster.yaml

# Check cluster status
kubectl get antflyclusters
kubectl describe antflycluster gke-antfly-cluster
```

## Monitoring and Verification
### Verify Spot Pod Usage
```shell
# Check if pods are running on Spot nodes
kubectl get pods -o wide -l app=antfly,role=data

# Check node labels
kubectl get nodes --show-labels | grep gke-spot
```

### Check Pod Disruption Budgets
```shell
# List PDBs
kubectl get pdb

# Check PDB details
kubectl describe pdb <cluster-name>-data-pdb
kubectl describe pdb <cluster-name>-metadata-pdb
```

### Monitor Resource Usage
```shell
# View pod resource usage
kubectl top pods -l app=antfly

# View node resource usage
kubectl top nodes
```

## Cost Optimization Tips
- **Use Spot Pods for Data Nodes**: Save up to 71% on compute costs
- **Right-size Resources**: GKE Autopilot charges for actual resource requests
- **Use Standard Storage**: Unless you need high IOPS, `standard-rwo` is more cost-effective
- **Enable Autoscaling**: Scale down during low-traffic periods
- **Use the Scale-Out Compute Class**: For high replica counts, consider the `Scale-Out` class
## Migration from `useSpotPods` to Compute Classes

If you have existing clusters using `useSpotPods` on GKE Autopilot, migrate to the compute class approach:

### Before (Old Configuration - May Not Work)
```yaml
spec:
  gke:
    autopilot: true
  dataNodes:
    useSpotPods: true # This conflicts with Autopilot mode
```

### After (New Configuration)
```yaml
spec:
  gke:
    autopilot: true
    autopilotComputeClass: "autopilot-spot" # Use compute class instead
    # Do NOT set useSpotPods when autopilot=true
```

### Migration Steps

1. **For existing clusters**: Delete and recreate (the fields are immutable):

   ```shell
   kubectl delete antflycluster my-cluster
   # Update the manifest to use autopilotComputeClass
   kubectl apply -f updated-cluster.yaml
   ```

2. **For new clusters**: Use compute classes from the start.
3. **Validation**: The operator will reject configurations that mix `autopilot: true` with `useSpotPods: true`.

`useSpotPods` still works for standard GKE clusters (non-Autopilot).
## Troubleshooting

### Pods Pending with "Unschedulable"
GKE Autopilot automatically provisions nodes, but it may take 2-5 minutes:
```shell
# Check pod events
kubectl describe pod <pod-name>

# Check cluster autoscaling status
kubectl get events --sort-by=.metadata.creationTimestamp
```

### Spot Pod Evictions
Spot Pods can be evicted with 25-second notice:
```shell
# Check pod eviction events
kubectl get events --field-selector reason=Evicted

# Verify data node count remains adequate
kubectl get pods -l app=antfly,role=data
```

The operator automatically handles evictions by:
- Maintaining minimum replica count
- Restarting evicted pods on available nodes
- Using PodDisruptionBudgets to limit concurrent evictions
### Storage Class Issues
```shell
# List available storage classes
kubectl get storageclass

# Verify PVC binding
kubectl get pvc -l app=antfly
kubectl describe pvc <pvc-name>
```

### Configuration Validation Errors
If the webhook rejects your configuration:
```shell
# Check webhook logs
kubectl logs -n antfly-operator-namespace -l app.kubernetes.io/name=antfly-operator | grep -i webhook

# Common errors:
# - "useSpotPods conflicts with autopilot=true"
# - "autopilotComputeClass requires autopilot=true"
# - "invalid compute class value"
```

## Best Practices
- **Always Enable PodDisruptionBudgets**: Protects against excessive disruption during maintenance
- **Don't Use Spot Pods for Metadata Nodes**: Raft consensus requires stable metadata nodes
- **Maintain 3+ Data Replicas with Spot**: Ensures availability during evictions
- **Set Appropriate Resource Requests**: GKE Autopilot bills based on requests
- **Use Resource Limits**: Prevents pods from consuming excessive resources
- **Monitor Spot Eviction Rates**: High eviction rates may indicate an undersized cluster
- **Test Failover**: Verify the cluster handles Spot evictions gracefully
## Pricing Comparison
Example pricing for us-central1 region (estimates):
| Configuration | Monthly Cost* | Savings |
|---|---|---|
| Standard (no Spot) | $250 | Baseline |
| Data nodes on Spot | $100 | 60% |
| All nodes on Spot (not recommended) | $72 | 71% (risky) |
\*Estimates for 3 metadata + 5 data nodes, standard resources
## Workload Identity
For accessing Google Cloud services (e.g. GCS backups), use Workload Identity instead of static credentials:
```shell
# Create GCP service account
gcloud iam service-accounts create antfly-backup-sa \
  --project=YOUR_PROJECT_ID

# Grant permissions
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:antfly-backup-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Bind to Kubernetes service account
gcloud iam service-accounts add-iam-policy-binding \
  antfly-backup-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:YOUR_PROJECT_ID.svc.id.goog[NAMESPACE/K8S_SA_NAME]"
```

Then reference in your cluster:
```yaml
spec:
  serviceAccountName: antfly-k8s-sa # Kubernetes SA bound to GCP SA
```
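Workload Identity also requires annotating the Kubernetes service account with the GCP service account it maps to. Using the names from the commands above (`NAMESPACE` and `YOUR_PROJECT_ID` are placeholders to substitute):

```shell
# Create the Kubernetes SA (if it does not already exist) and annotate
# it with the GCP service account it should impersonate.
kubectl create serviceaccount antfly-k8s-sa --namespace NAMESPACE
kubectl annotate serviceaccount antfly-k8s-sa \
  --namespace NAMESPACE \
  iam.gke.io/gcp-service-account=antfly-backup-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
```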