Common issues and solutions for the Antfly Operator.
Quick Diagnostics
Run these commands to gather diagnostic information:
# Check operator status
kubectl get pods -n antfly-operator-namespace
kubectl logs -n antfly-operator-namespace deployment/antfly-operator --tail=100
# Check cluster status
kubectl get antflycluster -A
kubectl describe antflycluster <name> -n <namespace>
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Check pods
kubectl get pods -n <namespace> -l app.kubernetes.io/name=antfly-database
Operator Issues
Operator Not Starting
Symptoms: Operator pod is not running or is in CrashLoopBackOff.
Check:
kubectl get pods -n antfly-operator-namespace
kubectl describe pod -n antfly-operator-namespace -l app.kubernetes.io/name=antfly-operator
kubectl logs -n antfly-operator-namespace -l app.kubernetes.io/name=antfly-operator
Common causes:
| Issue | Solution |
|---|---|
| ImagePullBackOff | Check image name, registry access, pull secrets |
| CrashLoopBackOff | Check logs for errors, verify CRDs installed |
| Insufficient resources | Increase node resources or operator limits |
RBAC Permission Errors
Symptoms: Operator logs show "forbidden" errors.
Example:
poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:antfly-operator-namespace:WRONG-NAME" cannot list resource
Solution:
- Verify ServiceAccount name matches:
  kubectl get deployment antfly-operator -n antfly-operator-namespace \
    -o jsonpath='{.spec.template.spec.serviceAccountName}'
- Check ClusterRoleBinding:
  kubectl get clusterrolebinding antfly-operator-cluster-role-binding -o yaml
- Test permissions:
  kubectl auth can-i list poddisruptionbudgets \
    --as=system:serviceaccount:antfly-operator-namespace:antfly-operator-service-account
See RBAC for detailed RBAC configuration.
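The single permission test above can be repeated over several verbs and resources to spot gaps quickly. The sketch below is a hypothetical helper, not part of the operator: the function name `check_operator_rbac` and the resource list are illustrative assumptions, so adjust them to whatever your operator logs report as forbidden.

```shell
# Hypothetical helper: report which verb/resource pairs the operator's
# ServiceAccount may perform. The resource list is an illustrative
# assumption, not the operator's documented permission set.
check_operator_rbac() {
  ns="${1:-antfly-operator-namespace}"
  sa="${2:-antfly-operator-service-account}"
  for resource in poddisruptionbudgets statefulsets services persistentvolumeclaims; do
    for verb in get list watch; do
      # "kubectl auth can-i" prints yes or no and exits non-zero on "no",
      # so ignore the exit status and fall back to "no" on empty output.
      answer=$(kubectl auth can-i "$verb" "$resource" \
        --as="system:serviceaccount:${ns}:${sa}" 2>/dev/null) || true
      printf '%s\t%s\t%s\n' "$verb" "$resource" "${answer:-no}"
    done
  done
}
```

Calling `check_operator_rbac` with no arguments uses the namespace and ServiceAccount names shown in this guide; any row ending in `no` points at a missing rule in the ClusterRole.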
CRDs Not Found
Symptoms: error: the server doesn't have a resource type "antflycluster"
Solution:
# Check if CRDs are installed
kubectl get crd | grep antfly
# Reinstall if missing
kubectl apply -f https://antfly.io/antfly-operator-install.yaml
Cluster Issues
Cluster Stuck in Pending
Symptoms: AntflyCluster shows Phase: Pending for extended time.
Check:
kubectl describe antflycluster <name>
kubectl get pods -l app.kubernetes.io/name=antfly-database -o wide
kubectl get events --field-selector involvedObject.name=<name>
Common causes:
| Issue | Solution |
|---|---|
| Insufficient resources | Add nodes or reduce resource requests |
| Storage class issues | Verify storage class exists and provisions |
| Image pull issues | Check image name and registry access |
| Secret not found | Create referenced secrets |
Pods Not Scheduling
Symptoms: Pods stuck in Pending state.
Check:
kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>
Common causes:
| Cause | Message | Solution |
|---|---|---|
| No nodes | 0/3 nodes are available | Add nodes, reduce requests |
| Taints | node(s) had taints that the pod didn't tolerate | Add tolerations or untaint nodes |
| Affinity | node(s) didn't match Pod's node affinity | Fix affinity rules |
| Resources | Insufficient cpu/memory | Scale cluster or reduce requests |
Pods in CrashLoopBackOff
Symptoms: Pods restart repeatedly.
Check:
kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>
Common causes:
| Issue | Solution |
|---|---|
| Configuration error | Check spec.config JSON validity |
| Port conflict | Verify ports are not in use |
| Storage issues | Check PVC binding and permissions |
| OOMKilled | Increase memory limits |
CreateContainerConfigError
Symptoms: Pods stuck with CreateContainerConfigError.
Check:
kubectl describe pod <pod-name>
Common cause: Secret referenced in envFrom doesn't exist.
Solution:
# Check referenced secrets
kubectl get antflycluster <name> -o jsonpath='{.spec.dataNodes.envFrom}'
# Create missing secret
kubectl create secret generic backup-credentials \
--from-literal=AWS_ACCESS_KEY_ID='...' \
--from-literal=AWS_SECRET_ACCESS_KEY='...'
Configuration Validation Failed
Symptoms: Cluster shows ConfigurationValid: False.
Check:
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="ConfigurationValid")'
Common causes:
| Reason | Fix |
|---|---|
| ConflictingSettings | Remove useSpotPods when autopilot=true |
| InvalidComputeClass | Use valid compute class value |
| InvalidEBSVolumeType | Use valid EBS volume type |
| ImmutableFieldChanged | Delete and recreate cluster |
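Several checks in this guide filter `.status.conditions` for one condition type at a time. As a rough sketch, the hypothetical helper below (the name `failing_conditions` is an assumption) lists every condition whose status is `False` in one pass, using kubectl's built-in JSONPath filter so that `jq` is not required.

```shell
# Hypothetical helper: print the type and reason of every condition on an
# AntflyCluster whose status is "False".
# Usage: failing_conditions <cluster-name> [namespace]
failing_conditions() {
  name="$1"
  ns="${2:-default}"
  kubectl get antflycluster "$name" -n "$ns" \
    -o jsonpath='{range .status.conditions[?(@.status=="False")]}{.type}{"\t"}{.reason}{"\n"}{end}'
}
```

An empty result means no condition currently reports `False`; otherwise the reason column can be matched against the tables in the relevant section.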
Storage Issues
PVC/AZ Topology Mismatch
Symptoms: Pods stuck in Pending with volume node affinity conflict. The StorageHealthy condition on the AntflyCluster shows False with reason PVCAZMismatch.
Root cause: PersistentVolumes backed by zone-bound storage (EBS, GCE PD, Azure Disk LRS) are tied to the availability zone where they were provisioned. If a node autoscaler creates nodes in a different AZ than existing PVCs, pods cannot mount their volumes.
Check:
# Check StorageHealthy condition
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="StorageHealthy")'
# Check pod events for volume affinity issues
kubectl describe pod <pending-pod-name>
# Look for: "volume node affinity conflict"
# Check which AZ the PVC's PV is in
kubectl get pv $(kubectl get pvc <pvc-name> -o jsonpath='{.spec.volumeName}') -o jsonpath='{.spec.nodeAffinity}'
Solutions:
- Verify StorageClass uses WaitForFirstConsumer:
  kubectl get storageclass <name> -o yaml | grep volumeBindingMode
  If it shows Immediate, switch to a StorageClass with WaitForFirstConsumer. See the cross-cloud StorageClass table below.
- Delete stale PVCs and let new ones be provisioned:
  # Scale down the StatefulSet first
  kubectl scale statefulset <name> --replicas=0
  # Delete the mismatched PVCs
  kubectl delete pvc <pvc-name>
  # Scale back up; new PVCs will be provisioned in the correct AZ
  kubectl scale statefulset <name> --replicas=3
- Use Karpenter instead of cluster-autoscaler on EKS: Karpenter can be configured with explicit AZ topology requirements, avoiding the ASG-from-zero AZ mismatch entirely. See AWS EKS.
Cross-cloud StorageClass reference:
| Provider | Recommended StorageClass | volumeBindingMode | Notes |
|---|---|---|---|
| EKS < 1.30 | gp3 (custom) or default gp2 | WaitForFirstConsumer | Must use ebs.csi.aws.com provisioner for gp3 |
| EKS >= 1.30 | gp3 (custom, must create) | WaitForFirstConsumer | No default StorageClass on EKS 1.30+ |
| GKE Standard | standard-rwo or premium-rwo | WaitForFirstConsumer | Default standard uses Immediate — do NOT use for multi-AZ |
| GKE Autopilot | standard-rwo (default) | WaitForFirstConsumer | Autopilot handles topology internally |
| AKS < 1.29 | managed-csi or managed-csi-premium | WaitForFirstConsumer | LRS disks are AZ-bound |
| AKS >= 1.29 | managed-csi (default) | WaitForFirstConsumer | Multi-zone clusters auto-use ZRS — AZ problem eliminated |
| Generic | Must verify | Must be WaitForFirstConsumer | Check with kubectl get sc <name> -o yaml |
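When several PVCs are involved, checking each PV's node affinity by hand gets tedious. The following is a minimal sketch, assuming the PVCs carry the `app.kubernetes.io/name` and `app.kubernetes.io/instance` labels used elsewhere in this guide; the function name `pvc_zones` is hypothetical. It prints each PVC, its bound PV, and the values from that PV's node-affinity terms, which for zone-bound storage normally include the availability zone.

```shell
# Hypothetical helper: for each PVC of a cluster, print the bound PV and the
# values of its node-affinity match expressions (typically the zone).
# Usage: pvc_zones <cluster-name> [namespace]
pvc_zones() {
  cluster="$1"
  ns="${2:-default}"
  kubectl get pvc -n "$ns" \
    -l "app.kubernetes.io/name=antfly-database,app.kubernetes.io/instance=${cluster}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.volumeName}{"\n"}{end}' |
  while read -r pvc pv; do
    # A PVC that is still Pending has no volumeName yet.
    if [ -z "$pv" ]; then
      printf '%s\t(unbound)\n' "$pvc"
      continue
    fi
    affinity=$(kubectl get pv "$pv" \
      -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}' </dev/null)
    printf '%s\t%s\t%s\n' "$pvc" "$pv" "$affinity"
  done
}
```

Comparing the last column against the zones of your schedulable nodes shows which PVCs are stranded in an AZ that no longer has capacity.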
Stale PVCs After Cluster Recreation
Symptoms: After deleting an AntflyCluster and recreating one with the same name, pods go Pending with volume node affinity conflict or bind to PVCs containing data from the old cluster.
Root cause: Kubernetes retains PVCs by default after StatefulSet deletion. When a new cluster reuses the same name, the new StatefulSet binds to old PVCs that may be in different AZs or contain stale data.
Solutions:
- Use pvcRetentionPolicy.whenDeleted: Delete to automatically clean up PVCs on cluster deletion:
  spec:
    storage:
      pvcRetentionPolicy:
        whenDeleted: Delete
        whenScaled: Retain
- Manually delete PVCs before recreating:
  kubectl delete pvc -l app.kubernetes.io/name=antfly-database,app.kubernetes.io/instance=<cluster-name>
- Use a different cluster name when recreating to avoid binding to old PVCs.
Stuck Finalizer
Symptoms: AntflyCluster deletion hangs. The resource has an antfly.io/pvc-cleanup finalizer that is not being removed.
Root cause: The finalizer-based cleanup (cleanupStorageResources) deletes StatefulSets, waits for pods to terminate, then deletes PVCs. If this process gets stuck (e.g., pod stuck in Terminating, PVC stuck in Released), the finalizer prevents CR deletion.
Solution: Manually remove the finalizer:
kubectl edit antflycluster <name>
# Remove "antfly.io/pvc-cleanup" from metadata.finalizers
Then manually clean up any remaining resources:
kubectl delete statefulset <name>-metadata <name>-data
kubectl delete pvc -l app.kubernetes.io/name=antfly-database,app.kubernetes.io/instance=<name>
PVCs Not Binding
Symptoms: PVCs stuck in Pending state.
Check:
kubectl get pvc -l app.kubernetes.io/name=antfly-database
kubectl describe pvc <pvc-name>
kubectl get storageclass
Solutions:
- Verify storage class exists:
  kubectl get storageclass <storage-class-name>
- Check for provisioner issues:
  kubectl get pods -n kube-system | grep -E "(provisioner|csi)"
- Use default storage class:
  spec:
    storage:
      storageClass: "" # Use cluster default
Storage Quota Exceeded
Symptoms: PVC creation fails with quota error.
Check:
kubectl describe resourcequota -n <namespace>
Solution: Increase quota or reduce storage requests.
Networking Issues
Services Not Accessible
Symptoms: Cannot connect to cluster services.
Check:
kubectl get svc -l app.kubernetes.io/name=antfly-database
kubectl get endpoints -l app.kubernetes.io/name=antfly-database
Solutions:
- Check endpoints have addresses:
  kubectl get endpoints <service-name>
- Verify pods are ready:
  kubectl get pods -l app.kubernetes.io/name=antfly-database -o wide
- Test connectivity from another pod:
  kubectl run debug --rm -it --image=busybox -- nc -zv <service-name> <port>
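The first step above, checking that endpoints have addresses, can be automated across all of a cluster's services. This is only a sketch: the function name `services_without_endpoints` is hypothetical, and it assumes the Endpoints objects carry the same `app.kubernetes.io/name=antfly-database` label used in the check commands.

```shell
# Hypothetical helper: print the name of every labelled Endpoints object
# that currently has no ready addresses.
# Usage: services_without_endpoints [namespace]
services_without_endpoints() {
  ns="${1:-default}"
  kubectl get endpoints -n "$ns" -l app.kubernetes.io/name=antfly-database \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.subsets[*].addresses[*].ip}{"\n"}{end}' |
  while read -r svc ips; do
    # Skip blank lines; report services with an empty address list.
    [ -n "$svc" ] || continue
    if [ -z "$ips" ]; then
      printf '%s\n' "$svc"
    fi
  done
}
```

Any service printed here has no ready backing pods, so the next step is to check pod readiness rather than network connectivity.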
LoadBalancer Pending
Symptoms: External IP shows <pending>.
Check:
kubectl describe svc <cluster>-public-api
Solutions:
| Environment | Solution |
|---|---|
| Cloud | Check cloud provider quotas and permissions |
| On-premises | Install MetalLB or use NodePort |
| minikube | Run minikube tunnel (for LoadBalancer) or minikube service <service-name> (for NodePort) |
| kind | Use NodePort or port-forward |
Minikube Docker Driver Access
Symptoms: Services are not accessible from the host when using Minikube with the Docker driver. NodePort services cannot be reached via localhost:<nodePort>.
With Minikube's Docker driver, the Kubernetes node runs inside a Docker container, so NodePort services are not directly accessible on the host network.
Solutions (in order of simplicity):
- kubectl port-forward (simplest, works with any driver):
  kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
- minikube service (opens browser automatically):
  minikube service <service-name> -n <namespace>
- minikube tunnel (assigns external IPs to LoadBalancer services):
  minikube tunnel
  This runs in the foreground and requires sudo access. It assigns real external IPs to LoadBalancer-type services.
Autoscaling Issues
Autoscaling Not Working
Symptoms: Replicas don't scale despite high utilization.
Check:
# Verify metrics-server
kubectl top pods -l app.kubernetes.io/name=antfly-database
# Check autoscaling status
kubectl get antflycluster <name> -o jsonpath='{.status.autoScalingStatus}'
# Check operator logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator | grep -i autoscal
Common causes:
| Issue | Solution |
|---|---|
| metrics-server not installed | Install metrics-server |
| No resource requests | Add CPU/memory requests to pods |
| Cooldown period active | Wait for cooldown to expire |
| At max/min replicas | Adjust limits |
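Missing resource requests are easy to overlook because pods run normally without them. As a hedged sketch (the function name `pods_missing_requests` is an assumption, and it only flags pods where no container declares any requests), the helper below lists the affected pods.

```shell
# Hypothetical helper: print pods in which no container declares
# resources.requests. Pods where only some containers lack requests
# are not detected by this simple check.
# Usage: pods_missing_requests [namespace]
pods_missing_requests() {
  ns="${1:-default}"
  kubectl get pods -n "$ns" -l app.kubernetes.io/name=antfly-database \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].resources.requests}{"\n"}{end}' |
  while read -r pod requests; do
    [ -n "$pod" ] || continue
    if [ -z "$requests" ]; then
      printf '%s\n' "$pod"
    fi
  done
}
```

If this prints any pod names, add CPU and memory requests to the corresponding node spec before investigating cooldowns or replica limits.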
Metrics Not Available
Symptoms: kubectl top pods returns error.
Solution:
# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# For kind, add insecure TLS flag
kubectl patch deployment metrics-server -n kube-system --type=json \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Backup/Restore Issues
Backup Failing
Symptoms: AntflyBackup shows Phase: Failed.
Check:
kubectl describe antflybackup <name>
kubectl logs -l job-name=<backup-job-name>
Common causes:
| Issue | Solution |
|---|---|
| Invalid credentials | Verify secret contents |
| Bucket doesn't exist | Create S3/GCS bucket |
| Network issues | Check egress rules |
| Timeout | Increase backupTimeout |
Restore Failing
Symptoms: AntflyRestore shows Phase: Failed.
Check:
kubectl describe antflyrestore <name>
kubectl get antflyrestore <name> -o jsonpath='{.status.tables}'
Common causes:
| Issue | Solution |
|---|---|
| Backup not found | Verify backupId and location |
| Table exists | Use skip_if_exists or overwrite mode |
| Cluster not ready | Wait for cluster to be Running |
| Timeout | Increase restoreTimeout |
Credentials Issues
Check:
# Verify secret exists
kubectl get secret backup-credentials
# Check SecretsReady condition
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="SecretsReady")'
# Test credentials (from debug pod)
kubectl run debug --rm -it --image=amazon/aws-cli --command -- aws s3 ls s3://bucket/
Service Mesh Issues
Partial Sidecar Injection
Symptoms: ServiceMeshReady: False with PartialInjection.
Check:
kubectl get antflycluster <name> -o jsonpath='{.status.serviceMeshStatus}'
Solutions:
- Check mesh control plane:
  # Istio
  istioctl analyze -n <namespace>
  # Linkerd
  linkerd check
- Verify annotations:
  kubectl get pods -l app.kubernetes.io/name=antfly-database -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
- Force pod recreation:
  kubectl rollout restart statefulset/<cluster>-metadata
  kubectl rollout restart statefulset/<cluster>-data
High Latency with Mesh
Symptoms: Database operations slow after enabling mesh.
Solution: Exclude Raft ports from mesh:
spec:
  serviceMesh:
    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"
Cloud-Specific Issues
GKE Autopilot
Pods pending for extended time:
- GKE Autopilot provisions nodes on-demand
- Wait 2-5 minutes for node provisioning
- Check events for provisioning status
Compute class conflicts:
- Don't use useSpotPods with autopilot=true
- Use autopilotComputeClass: "autopilot-spot" instead
AWS EKS
EBS CSI driver issues:
kubectl get csidriver ebs.csi.aws.com
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
IRSA not working:
# Verify OIDC provider
aws eks describe-cluster --name <cluster> --query "cluster.identity.oidc"
# Test from pod
kubectl exec -it <pod> -- aws sts get-caller-identity
Debugging Commands
Operator Logs
# Recent logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator --tail=100
# Follow logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator -f
# Filter for specific cluster
kubectl logs -n antfly-operator-namespace deployment/antfly-operator | grep <cluster-name>
Pod Inspection
# Full pod details
kubectl describe pod <pod-name>
# Container status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses}' | jq
# Previous container logs
kubectl logs <pod-name> --previous
Resource Status
# All Antfly resources
kubectl get antflycluster,antflybackup,antflyrestore -A
# Detailed cluster status
kubectl get antflycluster <name> -o yaml | yq '.status'
# Conditions only
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq
Network Debugging
# Service endpoints
kubectl get endpoints -l app.kubernetes.io/name=antfly-database
# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>
# Port connectivity
kubectl run debug --rm -it --image=busybox -- nc -zv <service-name> <port>
Getting Help
If you can't resolve an issue:
- Check existing issues: GitHub Issues
- Gather diagnostics:
  kubectl get antflycluster -A -o yaml > cluster-status.yaml
  kubectl logs -n antfly-operator-namespace deployment/antfly-operator > operator-logs.txt
  kubectl get events -A --sort-by='.lastTimestamp' > events.txt
- Open a new issue with:
  - Kubernetes version (kubectl version)
  - Operator version
  - Cloud provider (if applicable)
  - Cluster configuration (sanitized)
  - Error messages and logs
  - Steps to reproduce
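The diagnostics commands above can be bundled into one function so that every report contains the same files. This is a sketch rather than an official tool: the function name `collect_antfly_diagnostics` and the default output directory are assumptions, while the kubectl commands are the ones listed in this section.

```shell
# Hypothetical helper: write the diagnostics listed above into one directory.
# Usage: collect_antfly_diagnostics [output-dir] [operator-namespace]
collect_antfly_diagnostics() {
  out_dir="${1:-antfly-diagnostics}"
  operator_ns="${2:-antfly-operator-namespace}"
  mkdir -p "$out_dir" || return 1
  # Capture stderr as well so that failures are visible in the bundle.
  kubectl version > "$out_dir/kubectl-version.txt" 2>&1
  kubectl get antflycluster -A -o yaml > "$out_dir/cluster-status.yaml" 2>&1
  kubectl logs -n "$operator_ns" deployment/antfly-operator > "$out_dir/operator-logs.txt" 2>&1
  kubectl get events -A --sort-by='.lastTimestamp' > "$out_dir/events.txt" 2>&1
  printf 'Diagnostics written to %s\n' "$out_dir"
}
```

Review the generated files and remove credentials or other sensitive values before attaching them to an issue.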