Common issues and solutions for the Antfly Operator.

Quick Diagnostics

Run these commands to gather diagnostic information:

# Check operator status
kubectl get pods -n antfly-operator-namespace
kubectl logs -n antfly-operator-namespace deployment/antfly-operator --tail=100

# Check cluster status
kubectl get antflycluster -A
kubectl describe antflycluster <name> -n <namespace>

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check pods
kubectl get pods -n <namespace> -l app.kubernetes.io/name=antfly-database

Operator Issues

Operator Not Starting

Symptoms: Operator pod is not running or is in CrashLoopBackOff.

Check:

kubectl get pods -n antfly-operator-namespace
kubectl describe pod -n antfly-operator-namespace -l app.kubernetes.io/name=antfly-operator
kubectl logs -n antfly-operator-namespace -l app.kubernetes.io/name=antfly-operator

Common causes:

| Issue | Solution |
|---|---|
| ImagePullBackOff | Check image name, registry access, pull secrets |
| CrashLoopBackOff | Check logs for errors, verify CRDs installed |
| Insufficient resources | Increase node resources or operator limits |

RBAC Permission Errors

Symptoms: Operator logs show "forbidden" errors.

Example:

poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:antfly-operator-namespace:WRONG-NAME" cannot list resource

Solution:

  1. Verify ServiceAccount name matches:

    kubectl get deployment antfly-operator -n antfly-operator-namespace \
      -o jsonpath='{.spec.template.spec.serviceAccountName}'
  2. Check ClusterRoleBinding:

    kubectl get clusterrolebinding antfly-operator-cluster-role-binding -o yaml
  3. Test permissions:

    kubectl auth can-i list poddisruptionbudgets \
      --as=system:serviceaccount:antfly-operator-namespace:antfly-operator-service-account

See RBAC for detailed RBAC configuration.
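The single `can-i` test above can be repeated across several resources and verbs. The following is a minimal sketch: the function name `check_operator_rbac` is hypothetical, and the resource and verb lists are illustrative assumptions rather than the operator's actual RBAC requirements.

```shell
# Sketch: report which permissions the operator ServiceAccount is missing.
# The resources and verbs below are illustrative assumptions; adjust them
# to match the ClusterRole shipped with your operator version.
check_operator_rbac() {
  local sa="system:serviceaccount:antfly-operator-namespace:antfly-operator-service-account"
  local missing=0 resource verb
  for resource in poddisruptionbudgets statefulsets services persistentvolumeclaims; do
    for verb in get list watch; do
      # "kubectl auth can-i" exits non-zero when the answer is "no"
      if ! kubectl auth can-i "$verb" "$resource" --as="$sa" --all-namespaces >/dev/null 2>&1; then
        echo "MISSING: $verb $resource"
        missing=$((missing + 1))
      fi
    done
  done
  echo "missing permissions: $missing"
  [ "$missing" -eq 0 ]
}
```

Run `check_operator_rbac` after sourcing the function. Any `MISSING:` line names a verb and resource to compare against the ClusterRole and its binding.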

CRDs Not Found

Symptoms: error: the server doesn't have a resource type "antflycluster"

Solution:

# Check if CRDs are installed
kubectl get crd | grep antfly

# Reinstall if missing
kubectl apply -f https://antfly.io/antfly-operator-install.yaml

Cluster Issues

Cluster Stuck in Pending

Symptoms: AntflyCluster shows Phase: Pending for an extended time.

Check:

kubectl describe antflycluster <name>
kubectl get pods -l app.kubernetes.io/name=antfly-database -o wide
kubectl get events --field-selector involvedObject.name=<name>

Common causes:

| Issue | Solution |
|---|---|
| Insufficient resources | Add nodes or reduce resource requests |
| Storage class issues | Verify the storage class exists and provisions volumes |
| Image pull issues | Check image name and registry access |
| Secret not found | Create referenced secrets |

Pods Not Scheduling

Symptoms: Pods stuck in Pending state.

Check:

kubectl describe pod <pod-name>
kubectl get events --field-selector involvedObject.name=<pod-name>

Common causes:

| Cause | Message | Solution |
|---|---|---|
| No nodes | 0/3 nodes are available | Add nodes, reduce requests |
| Taints | node(s) had taints that the pod didn't tolerate | Add tolerations or untaint nodes |
| Affinity | node(s) didn't match Pod's node affinity | Fix affinity rules |
| Resources | Insufficient cpu/memory | Scale cluster or reduce requests |

Pods in CrashLoopBackOff

Symptoms: Pods restart repeatedly.

Check:

kubectl logs <pod-name> --previous
kubectl describe pod <pod-name>

Common causes:

| Issue | Solution |
|---|---|
| Configuration error | Check spec.config JSON validity |
| Port conflict | Verify ports are not in use |
| Storage issues | Check PVC binding and permissions |
| OOMKilled | Increase memory limits |
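To tell an OOM kill apart from an application crash without reading the full `describe` output, the last termination state of each container can be printed directly. This is a small sketch; the function name `last_termination` is hypothetical, and it relies only on standard pod status fields.

```shell
# Sketch: print name, last termination reason, and exit code per container.
# Arguments: pod name, optional namespace (defaults to "default").
last_termination() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A reason of `OOMKilled` (typically with exit code 137) points at memory limits. A reason of `Error` with another exit code usually means the process itself failed, so the `--previous` logs are the next place to look.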

CreateContainerConfigError

Symptoms: Pods stuck with CreateContainerConfigError.

Check:

kubectl describe pod <pod-name>

Common cause: Secret referenced in envFrom doesn't exist.

Solution:

# Check referenced secrets
kubectl get antflycluster <name> -o jsonpath='{.spec.dataNodes.envFrom}'

# Create missing secret
kubectl create secret generic backup-credentials \
  --from-literal=AWS_ACCESS_KEY_ID='...' \
  --from-literal=AWS_SECRET_ACCESS_KEY='...'
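The two steps above can be combined to find which referenced secrets are actually missing. This sketch assumes that entries under `spec.dataNodes.envFrom` follow the core Kubernetes `envFrom` shape (`secretRef.name`) and that the secrets live in the same namespace as the cluster. The function name `missing_envfrom_secrets` is hypothetical.

```shell
# Sketch: list secrets referenced by spec.dataNodes.envFrom that do not exist.
# Assumes envFrom entries use the standard secretRef.name field.
# Arguments: cluster name, optional namespace (defaults to "default").
missing_envfrom_secrets() {
  local cluster="$1" ns="${2:-default}" secret
  for secret in $(kubectl get antflycluster "$cluster" -n "$ns" \
      -o jsonpath='{.spec.dataNodes.envFrom[*].secretRef.name}'); do
    # "kubectl get secret" exits non-zero when the secret is not found
    kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1 || echo "missing secret: $secret"
  done
}
```

No output means every referenced secret was found. Each `missing secret:` line names one to create before the pods can start.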

Configuration Validation Failed

Symptoms: Cluster shows ConfigurationValid: False.

Check:

kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="ConfigurationValid")'

Common causes:

| Reason | Fix |
|---|---|
| ConflictingSettings | Remove useSpotPods when autopilot=true |
| InvalidComputeClass | Use valid compute class value |
| InvalidEBSVolumeType | Use valid EBS volume type |
| ImmutableFieldChanged | Delete and recreate cluster |
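When it is unclear which condition is failing, all conditions with status False can be listed at once without `jq`. This is a sketch using kubectl's built-in JSONPath filter; the function name `failing_conditions` is hypothetical, and it assumes the conditions carry the conventional `type`, `status`, `reason`, and `message` fields.

```shell
# Sketch: print type, reason, and message for every condition that is False.
# Arguments: cluster name, optional namespace (defaults to "default").
failing_conditions() {
  local cluster="$1" ns="${2:-default}"
  kubectl get antflycluster "$cluster" -n "$ns" \
    -o jsonpath='{range .status.conditions[?(@.status=="False")]}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

The second column is the reason string to look up in the table above.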

Storage Issues

PVC/AZ Topology Mismatch

Symptoms: Pods stuck in Pending with volume node affinity conflict. The StorageHealthy condition on the AntflyCluster shows False with reason PVCAZMismatch.

Root cause: PersistentVolumes backed by zone-bound storage (EBS, GCE PD, Azure Disk LRS) are tied to the availability zone where they were provisioned. If a node autoscaler creates nodes in a different AZ than existing PVCs, pods cannot mount their volumes.

Check:

# Check StorageHealthy condition
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="StorageHealthy")'

# Check pod events for volume affinity issues
kubectl describe pod <pending-pod-name>
# Look for: "volume node affinity conflict"

# Check which AZ the PVC's PV is in
kubectl get pv $(kubectl get pvc <pvc-name> -o jsonpath='{.spec.volumeName}') -o jsonpath='{.spec.nodeAffinity}'

Solutions:

  1. Verify StorageClass uses WaitForFirstConsumer:

    kubectl get storageclass <name> -o yaml | grep volumeBindingMode

    If it shows Immediate, switch to a StorageClass with WaitForFirstConsumer. See the cross-cloud StorageClass table below.

  2. Delete stale PVCs and let new ones be provisioned:

    # Scale down the StatefulSet first
    kubectl scale statefulset <name> --replicas=0
    # Delete the mismatched PVCs
    kubectl delete pvc <pvc-name>
    # Scale back up — new PVCs will be provisioned in the correct AZ
    kubectl scale statefulset <name> --replicas=3
  3. Use Karpenter instead of cluster-autoscaler on EKS. Karpenter can be configured with explicit AZ topology requirements, which avoids the ASG-from-zero AZ mismatch entirely. See AWS EKS.

Cross-cloud StorageClass reference:

| Provider | Recommended StorageClass | volumeBindingMode | Notes |
|---|---|---|---|
| EKS < 1.30 | gp3 (custom) or default gp2 | WaitForFirstConsumer | Must use ebs.csi.aws.com provisioner for gp3 |
| EKS >= 1.30 | gp3 (custom, must create) | WaitForFirstConsumer | No default StorageClass on EKS 1.30+ |
| GKE Standard | standard-rwo or premium-rwo | WaitForFirstConsumer | Default standard uses Immediate; do NOT use for multi-AZ |
| GKE Autopilot | standard-rwo (default) | WaitForFirstConsumer | Autopilot handles topology internally |
| AKS < 1.29 | managed-csi or managed-csi-premium | WaitForFirstConsumer | LRS disks are AZ-bound |
| AKS >= 1.29 | managed-csi (default) | WaitForFirstConsumer | Multi-zone clusters auto-use ZRS; AZ problem eliminated |
| Generic | Must verify | Must be WaitForFirstConsumer | Check with kubectl get sc <name> -o yaml |
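To see at a glance where each volume lives, the per-PVC check shown earlier in this section can be looped over every Antfly PVC in a namespace. This is a sketch: the function name `pvc_zones` is hypothetical, and it assumes the PV's node affinity values are zone names, which is typical for zone-bound CSI volumes but not guaranteed.

```shell
# Sketch: print "<pvc> <pv> <zone(s)>" (tab-separated) for each Antfly PVC.
# Argument: optional namespace (defaults to "default").
pvc_zones() {
  local ns="${1:-default}" pvc pv zones
  kubectl get pvc -n "$ns" -l app.kubernetes.io/name=antfly-database \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumeName}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r pvc pv; do
    # A PVC with no volumeName has not been bound to a PV yet
    [ -n "$pv" ] || { printf '%s\t(unbound)\n' "$pvc"; continue; }
    zones=$(kubectl get pv "$pv" \
      -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}')
    printf '%s\t%s\t%s\n' "$pvc" "$pv" "$zones"
  done
}
```

Compare the output with the zones of the schedulable nodes, for example from `kubectl get nodes -L topology.kubernetes.io/zone`. A PVC whose zone has no matching node is the one causing the affinity conflict.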

Stale PVCs After Cluster Recreation

Symptoms: After deleting an AntflyCluster and recreating one with the same name, pods go Pending with volume node affinity conflict or bind to PVCs containing data from the old cluster.

Root cause: Kubernetes retains PVCs by default after StatefulSet deletion. When a new cluster reuses the same name, the new StatefulSet binds to old PVCs that may be in different AZs or contain stale data.

Solutions:

  1. Use pvcRetentionPolicy.whenDeleted: Delete to automatically clean up PVCs on cluster deletion:

    spec:
      storage:
        pvcRetentionPolicy:
          whenDeleted: Delete
          whenScaled: Retain
  2. Manually delete PVCs before recreating:

    kubectl delete pvc -l app.kubernetes.io/name=antfly-database,app.kubernetes.io/instance=<cluster-name>
  3. Use a different cluster name when recreating to avoid binding to old PVCs.

Stuck Finalizer

Symptoms: AntflyCluster deletion hangs. The resource has an antfly.io/pvc-cleanup finalizer that is not being removed.

Root cause: The finalizer-based cleanup (cleanupStorageResources) deletes StatefulSets, waits for pods to terminate, then deletes PVCs. If this process gets stuck (e.g., pod stuck in Terminating, PVC stuck in Released), the finalizer prevents CR deletion.

Solution: Manually remove the finalizer:

kubectl edit antflycluster <name>
# Remove "antfly.io/pvc-cleanup" from metadata.finalizers

Then manually clean up any remaining resources:

kubectl delete statefulset <name>-metadata <name>-data
kubectl delete pvc -l app.kubernetes.io/name=antfly-database,app.kubernetes.io/instance=<name>
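As a non-interactive alternative to the `kubectl edit` step, the finalizers can be cleared with a JSON merge patch. This is a sketch with a hypothetical function name, `remove_antfly_finalizers`. Note that it removes every finalizer on the resource, not only `antfly.io/pvc-cleanup`, so it is only appropriate when no other controller has added its own.

```shell
# Sketch: clear metadata.finalizers on an AntflyCluster without an editor.
# WARNING: this removes ALL finalizers on the resource, and the operator's
# storage cleanup will not run afterwards.
# Arguments: cluster name, optional namespace (defaults to "default").
remove_antfly_finalizers() {
  local cluster="$1" ns="${2:-default}"
  kubectl patch antflycluster "$cluster" -n "$ns" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

Because the operator's cleanup is skipped, the manual StatefulSet and PVC deletion shown above is still required.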

PVCs Not Binding

Symptoms: PVCs stuck in Pending state.

Check:

kubectl get pvc -l app.kubernetes.io/name=antfly-database
kubectl describe pvc <pvc-name>
kubectl get storageclass

Solutions:

  1. Verify storage class exists:

    kubectl get storageclass <storage-class-name>
  2. Check for provisioner issues:

    kubectl get pods -n kube-system | grep -E "(provisioner|csi)"
  3. Use default storage class:

    spec:
      storage:
        storageClass: ""  # Use cluster default

Storage Quota Exceeded

Symptoms: PVC creation fails with a quota error.

Check:

kubectl describe resourcequota -n <namespace>

Solution: Increase quota or reduce storage requests.

Networking Issues

Services Not Accessible

Symptoms: Cannot connect to cluster services.

Check:

kubectl get svc -l app.kubernetes.io/name=antfly-database
kubectl get endpoints -l app.kubernetes.io/name=antfly-database

Solutions:

  1. Check endpoints have addresses:

    kubectl get endpoints <service-name>
  2. Verify pods are ready:

    kubectl get pods -l app.kubernetes.io/name=antfly-database -o wide
  3. Test connectivity from another pod:

    kubectl run debug --rm -it --image=busybox -- nc -zv <service-name> <port>

LoadBalancer Pending

Symptoms: External IP shows <pending>.

Check:

kubectl describe svc <cluster>-public-api

Solutions:

| Environment | Solution |
|---|---|
| Cloud | Check cloud provider quotas and permissions |
| On-premises | Install MetalLB or use NodePort |
| minikube | Run minikube tunnel (for LoadBalancer) or minikube service <service-name> (for NodePort) |
| kind | Use NodePort or port-forward |

Minikube Docker Driver Access

Symptoms: Services are not accessible from the host when using Minikube with the Docker driver. NodePort services cannot be reached via localhost:<nodePort>.

With Minikube's Docker driver, the Kubernetes node runs inside a Docker container, so NodePort services are not directly accessible on the host network.

Solutions (in order of simplicity):

  1. kubectl port-forward (simplest, works with any driver):

    kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
  2. minikube service (opens browser automatically):

    minikube service <service-name> -n <namespace>
  3. minikube tunnel (assigns external IPs to LoadBalancer services):

    minikube tunnel

    This runs in the foreground and requires sudo access. It assigns real external IPs to LoadBalancer-type services.

Autoscaling Issues

Autoscaling Not Working

Symptoms: Replicas don't scale despite high utilization.

Check:

# Verify metrics-server
kubectl top pods -l app.kubernetes.io/name=antfly-database

# Check autoscaling status
kubectl get antflycluster <name> -o jsonpath='{.status.autoScalingStatus}'

# Check operator logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator | grep -i autoscal

Common causes:

| Issue | Solution |
|---|---|
| metrics-server not installed | Install metrics-server |
| No resource requests | Add CPU/memory requests to pods |
| Cooldown period active | Wait for cooldown to expire |
| At max/min replicas | Adjust limits |

Metrics Not Available

Symptoms: kubectl top pods returns an error.

Solution:

# Install metrics-server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# For kind, add insecure TLS flag
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

Backup/Restore Issues

Backup Failing

Symptoms: AntflyBackup shows Phase: Failed.

Check:

kubectl describe antflybackup <name>
kubectl logs -l job-name=<backup-job-name>

Common causes:

| Issue | Solution |
|---|---|
| Invalid credentials | Verify secret contents |
| Bucket doesn't exist | Create S3/GCS bucket |
| Network issues | Check egress rules |
| Timeout | Increase backupTimeout |

Restore Failing

Symptoms: AntflyRestore shows Phase: Failed.

Check:

kubectl describe antflyrestore <name>
kubectl get antflyrestore <name> -o jsonpath='{.status.tables}'

Common causes:

| Issue | Solution |
|---|---|
| Backup not found | Verify backupId and location |
| Table exists | Use skip_if_exists or overwrite mode |
| Cluster not ready | Wait for cluster to be Running |
| Timeout | Increase restoreTimeout |
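For the "cluster not ready" case, waiting for the Running phase can be scripted instead of polling by hand. This sketch assumes the phase is exposed at `.status.phase` (an assumption based on the Phase values shown in this guide) and requires a kubectl version that supports `--for=jsonpath` (1.23 or later). The function name `wait_for_running` is hypothetical.

```shell
# Sketch: block until the AntflyCluster reports phase Running, or time out.
# Assumes the phase lives at .status.phase; verify against your CRD.
# Arguments: cluster name, optional namespace, optional timeout (default 10m).
wait_for_running() {
  local cluster="$1" ns="${2:-default}" timeout="${3:-10m}"
  kubectl wait "antflycluster/$cluster" -n "$ns" \
    --for=jsonpath='{.status.phase}'=Running --timeout="$timeout"
}
```

The function returns non-zero if the timeout expires, so it can gate the creation of an AntflyRestore in a script.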

Credentials Issues

Check:

# Verify secret exists
kubectl get secret backup-credentials

# Check SecretsReady condition
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="SecretsReady")'

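# Test credentials (from debug pod)
# The amazon/aws-cli image already uses "aws" as its entrypoint, so pass only the subcommand.
# The debug pod does not receive backup-credentials unless you inject them (for example with --env).
kubectl run debug --rm -it --image=amazon/aws-cli -- s3 ls s3://bucket/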

Service Mesh Issues

Partial Sidecar Injection

Symptoms: ServiceMeshReady: False with PartialInjection.

Check:

kubectl get antflycluster <name> -o jsonpath='{.status.serviceMeshStatus}'

Solutions:

  1. Check mesh control plane:

    # Istio
    istioctl analyze -n <namespace>
    
    # Linkerd
    linkerd check
  2. Verify annotations:

    kubectl get pods -l app.kubernetes.io/name=antfly-database -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
  3. Force pod recreation:

    kubectl rollout restart statefulset/<cluster>-metadata
    kubectl rollout restart statefulset/<cluster>-data

High Latency with Mesh

Symptoms: Database operations slow after enabling mesh.

Solution: Exclude Raft ports from mesh:

spec:
  serviceMesh:
    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"

Cloud-Specific Issues

GKE Autopilot

Pods pending for an extended time:

  • GKE Autopilot provisions nodes on-demand
  • Wait 2-5 minutes for node provisioning
  • Check events for provisioning status

Compute class conflicts:

  • Don't use useSpotPods with autopilot=true
  • Use autopilotComputeClass: "autopilot-spot" instead

AWS EKS

EBS CSI driver issues:

kubectl get csidriver ebs.csi.aws.com
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

IRSA not working:

# Verify OIDC provider
aws eks describe-cluster --name <cluster> --query "cluster.identity.oidc"

# Test from pod
kubectl exec -it <pod> -- aws sts get-caller-identity

Debugging Commands

Operator Logs

# Recent logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator --tail=100

# Follow logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator -f

# Filter for specific cluster
kubectl logs -n antfly-operator-namespace deployment/antfly-operator | grep <cluster-name>

Pod Inspection

# Full pod details
kubectl describe pod <pod-name>

# Container status
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses}' | jq

# Previous container logs
kubectl logs <pod-name> --previous

Resource Status

# All Antfly resources
kubectl get antflycluster,antflybackup,antflyrestore -A

# Detailed cluster status
kubectl get antflycluster <name> -o yaml | yq '.status'

# Conditions only
kubectl get antflycluster <name> -o jsonpath='{.status.conditions}' | jq

Network Debugging

# Service endpoints
kubectl get endpoints -l app.kubernetes.io/name=antfly-database

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup <service-name>

# Port connectivity
kubectl run debug --rm -it --image=busybox -- nc -zv <service-name> <port>

Getting Help

If you can't resolve an issue:

  1. Check existing issues: GitHub Issues

  2. Gather diagnostics:

    kubectl get antflycluster -A -o yaml > cluster-status.yaml
    kubectl logs -n antfly-operator-namespace deployment/antfly-operator > operator-logs.txt
    kubectl get events -A --sort-by='.lastTimestamp' > events.txt
  3. Open a new issue with:

    • Kubernetes version (kubectl version)
    • Operator version
    • Cloud provider (if applicable)
    • Cluster configuration (sanitized)
    • Error messages and logs
    • Steps to reproduce
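The diagnostics commands from step 2 can be wrapped into one function that writes everything to a directory and archives it. This is a sketch with a hypothetical function name, `collect_antfly_diagnostics`; the output file names follow the ones used above, plus an extra file for `kubectl version`.

```shell
# Sketch: gather the diagnostics listed above into a directory and a tarball.
# Argument: optional output directory name (defaults to "antfly-diagnostics").
collect_antfly_diagnostics() {
  local out="${1:-antfly-diagnostics}"
  mkdir -p "$out"
  # Capture stderr too, so a failing command leaves its error in the file
  kubectl version > "$out/kubectl-version.txt" 2>&1
  kubectl get antflycluster -A -o yaml > "$out/cluster-status.yaml" 2>&1
  kubectl logs -n antfly-operator-namespace deployment/antfly-operator > "$out/operator-logs.txt" 2>&1
  kubectl get events -A --sort-by='.lastTimestamp' > "$out/events.txt" 2>&1
  # Print the archive path on success
  tar -czf "$out.tar.gz" "$out" && echo "$out.tar.gz"
}
```

Review the collected files and remove credentials, hostnames, or other sensitive values before attaching the archive to an issue.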