Warning: Service mesh integration is experimental. APIs and behavior may change in future releases.

The Antfly Operator provides native support for service mesh integration, enabling automatic mTLS encryption and traffic management for your database clusters.

Overview#

Service mesh integration provides:

  • Automatic mTLS encryption between all Antfly pods
  • Traffic observability through service mesh telemetry
  • Advanced traffic management (circuit breaking, retries, timeouts)
  • Zero-trust security with automatic certificate rotation
  • Network policy enforcement at the sidecar level

The operator automatically detects sidecar injection and updates cluster status accordingly.

Supported Service Meshes#

The Antfly Operator is designed to work with any Kubernetes service mesh that uses sidecar injection:

| Mesh | Status | Notes |
|------|--------|-------|
| Istio | Recommended | Best tested integration |
| Linkerd | Supported | Lightweight option |
| Consul Connect | Supported | HashiCorp ecosystem |

Quick Start#

Prerequisites#

  1. Antfly Operator installed in your cluster
  2. Service mesh control plane installed (e.g., Istio, Linkerd)
  3. Service mesh sidecar injection configured (namespace-level or pod-level)

Enable Service Mesh on a New Cluster#

apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: my-cluster
  namespace: production
spec:
  image: ghcr.io/antflydb/antfly:latest
  serviceMesh:
    enabled: true
    annotations:
      sidecar.istio.io/inject: "true"
  metadataNodes:
    replicas: 3
    resources:
      cpu: "500m"
      memory: "512Mi"
  dataNodes:
    replicas: 3
    resources:
      cpu: "1000m"
      memory: "2Gi"
  storage:
    storageClass: "standard"
    metadataStorage: "1Gi"
    dataStorage: "10Gi"

Enable Service Mesh on an Existing Cluster#

Patch an existing cluster to enable service mesh:

kubectl patch antflycluster my-cluster -n production --type='merge' -p='
{
  "spec": {
    "serviceMesh": {
      "enabled": true,
      "annotations": {
        "sidecar.istio.io/inject": "true"
      }
    }
  }
}'

The operator will perform a rolling restart, injecting sidecars into each pod while maintaining cluster availability.
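To follow the rolling restart from a script, one option is to poll the cluster status until injection is reported as complete. The sketch below is a minimal example, not part of the operator: the function name `wait_for_sidecars` and its retry parameters are hypothetical, and the field path is the `sidecarInjectionStatus` field described under Status Fields.

```shell
# Poll the AntflyCluster status until sidecar injection reports "Complete".
# Usage: wait_for_sidecars <cluster> <namespace> [max_attempts] [sleep_seconds]
# (hypothetical helper; defaults of 60 attempts and 5 seconds are arbitrary)
wait_for_sidecars() {
  cluster=$1
  namespace=$2
  attempts=${3:-60}
  interval=${4:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Field path taken from the Status Fields section of this guide.
    inj_status=$(kubectl get antflycluster "$cluster" -n "$namespace" \
      -o jsonpath='{.status.serviceMeshStatus.sidecarInjectionStatus}')
    echo "sidecarInjectionStatus=${inj_status:-<empty>}"
    if [ "$inj_status" = "Complete" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for sidecar injection on $cluster" >&2
  return 1
}
```

For example, `wait_for_sidecars my-cluster production` returns 0 once the status reads `Complete` and 1 if it never does within the attempt budget. A status of `Partial` is expected while pods are still being replaced, so this helper keeps polling rather than failing on it.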

Configuration#

Spec Fields#

spec:
  serviceMesh:
    enabled: true              # Enable service mesh integration
    annotations:               # Mesh-specific annotations
      key: value

enabled (boolean, optional, default: false)#

Controls whether service mesh sidecar injection is enabled for the cluster.

annotations (map[string]string, optional)#

Mesh-specific annotations to apply to pod templates. These annotations trigger sidecar injection and configure mesh behavior.

Status Fields#

The operator automatically populates the following status fields:

status:
  serviceMeshStatus:
    enabled: true                        # Reflects spec.serviceMesh.enabled
    sidecarInjectionStatus: "Complete"   # Complete | Partial | None | Unknown
    podsWithSidecars: 6                  # Number of pods with sidecars
    totalPods: 6                         # Total number of pods
    lastTransitionTime: "2025-10-04T..."
  conditions:
  - type: ServiceMeshReady
    status: "True"
    reason: SidecarInjectionComplete
    message: "All 6 pods have sidecars injected"

Sidecar Injection Status Values#

| Status | Description |
|--------|-------------|
| Complete | All pods have sidecars injected |
| Partial | Some pods have sidecars, others don't (blocks reconciliation) |
| None | No pods have sidecars |
| Unknown | Pod count is zero or status cannot be determined |
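The relationship between the two pod counters and the status value can be written out as a small function. This is only an illustration of the table above, assuming the status follows directly from `podsWithSidecars` and `totalPods`; the operator's actual logic may consider more than these two numbers, and the name `injection_status` is hypothetical.

```shell
# Map pod counts to the status values listed in the table above.
# Usage: injection_status <podsWithSidecars> <totalPods>
# (illustrative only; not the operator's implementation)
injection_status() {
  with_sidecars=$1
  total=$2
  if [ "$total" -eq 0 ]; then
    echo "Unknown"      # no pods to inspect
  elif [ "$with_sidecars" -eq 0 ]; then
    echo "None"         # pods exist, none have a sidecar
  elif [ "$with_sidecars" -lt "$total" ]; then
    echo "Partial"      # mixed state; blocks reconciliation
  else
    echo "Complete"     # every pod has a sidecar
  fi
}
```

With the counts from the status example, `injection_status 6 6` prints `Complete`, while `injection_status 4 6` prints `Partial`.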

Mesh-Specific Configuration#

Istio#

spec:
  serviceMesh:
    enabled: true
    annotations:
      sidecar.istio.io/inject: "true"
      # Exclude Raft ports from proxy (recommended for performance)
      traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"
      # Resource limits for sidecar (optional)
      sidecar.istio.io/proxyCPU: "100m"
      sidecar.istio.io/proxyMemory: "128Mi"

Important Ports:

| Port | Service | Recommendation |
|------|---------|----------------|
| 12377 | Metadata API | Include in mesh |
| 9017 | Metadata Raft | Exclude from mesh |
| 12380 | Data API | Include in mesh |
| 9021 | Data Raft | Exclude from mesh |

Consider excluding Raft ports (9017, 9021) from the service mesh to reduce latency for consensus traffic.

Linkerd#

spec:
  serviceMesh:
    enabled: true
    annotations:
      linkerd.io/inject: enabled
      # Skip Raft ports (recommended)
      config.linkerd.io/skip-outbound-ports: "9017,9021"
      config.linkerd.io/skip-inbound-ports: "9017,9021"

Consul Connect#

spec:
  serviceMesh:
    enabled: true
    annotations:
      consul.hashicorp.com/connect-inject: "true"
      consul.hashicorp.com/connect-service-upstreams: "antfly-metadata:12377,antfly-data:12380"

Observability#

Check Service Mesh Status#

View the current service mesh status:

kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.serviceMeshStatus}' | jq

Check ServiceMeshReady Condition#

kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.conditions[?(@.type=="ServiceMeshReady")]}' | jq

View Operator Logs#

Monitor service mesh integration events:

kubectl logs -n antfly-operator-namespace deployment/antfly-operator -f | grep -i "service mesh"

View Cluster Events#

Check for service mesh-related events:

kubectl get events --field-selector involvedObject.name=my-cluster -n production
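When the event list is long, it can help to narrow it to warnings only. The sketch below wraps the command above in a hypothetical helper named `mesh_warnings`; it relies on the standard `type` field selector for events and sorts by last occurrence.

```shell
# List only Warning events recorded against the cluster object.
# Usage: mesh_warnings <cluster> <namespace>
# (hypothetical helper around the kubectl command shown above)
mesh_warnings() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}
```

Running `mesh_warnings my-cluster production` should surface events such as the `PartialSidecarInjection` warning described under Troubleshooting, if any have been recorded.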

Performance Optimization#

Exclude Raft Ports#

Raft consensus traffic is latency-sensitive. Exclude Raft ports from the mesh:

# Istio
annotations:
  traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"

# Linkerd
annotations:
  config.linkerd.io/skip-outbound-ports: "9017,9021"
  config.linkerd.io/skip-inbound-ports: "9017,9021"

Tune Sidecar Resources#

Set appropriate resource limits for sidecars:

annotations:
  sidecar.istio.io/proxyCPU: "100m"
  sidecar.istio.io/proxyMemory: "128Mi"
  sidecar.istio.io/proxyCPULimit: "500m"
  sidecar.istio.io/proxyMemoryLimit: "512Mi"

Sidecar Concurrency#

Tune proxy concurrency based on workload:

annotations:
  sidecar.istio.io/concurrency: "2"

Security Configuration#

Strict mTLS#

For maximum security, use strict mTLS mode:

# Istio PeerAuthentication
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: antfly-mtls
  namespace: production
spec:
  selector:
    matchLabels:
      app: antfly   # must match the labels actually set on your Antfly pods
  mtls:
    mode: STRICT
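STRICT mode interacts with the Raft port exclusions recommended elsewhere in this guide. The Istio examples here set only `traffic.sidecar.istio.io/excludeOutboundPorts`, so Raft traffic leaves the sending pod in plaintext but is still redirected to the receiving pod's sidecar, which under STRICT mode should reject it. One possible adjustment, offered as an assumption rather than a tested Antfly recommendation, is to exclude the same ports on the inbound side with Istio's `traffic.sidecar.istio.io/excludeInboundPorts` annotation. The helper name `raft_exclusion_patch` is hypothetical.

```shell
# Print a merge patch that excludes the Raft ports from sidecar redirection
# in both directions. Usage: raft_exclusion_patch [comma-separated-ports]
# (hypothetical helper; default ports are the Raft ports listed in this guide)
raft_exclusion_patch() {
  ports=${1:-9017,9021}
  cat <<EOF
{
  "spec": {
    "serviceMesh": {
      "annotations": {
        "traffic.sidecar.istio.io/excludeOutboundPorts": "$ports",
        "traffic.sidecar.istio.io/excludeInboundPorts": "$ports"
      }
    }
  }
}
EOF
}
```

The output can be passed to the same patch command used earlier, for example `kubectl patch antflycluster my-cluster -n production --type=merge -p "$(raft_exclusion_patch)"`. Note that excluded ports bypass the proxy entirely, so Raft traffic on them is not encrypted by the mesh.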

Network Policies#

Combine service mesh with Kubernetes NetworkPolicies for defense in depth:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: antfly-mesh-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: antfly   # must match the labels actually set on your Antfly pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: antfly

Troubleshooting#

Partial Sidecar Injection#

Problem: The operator detects partial sidecar injection and blocks reconciliation.

Symptoms:

  • ServiceMeshReady condition is False with reason PartialInjection
  • Operator logs show: "Blocking reconciliation" ... "partial sidecar injection"
  • Kubernetes events show: Warning PartialSidecarInjection

Solutions:

  1. Check mesh control plane:

    # Istio
    istioctl analyze -n production
    
    # Linkerd
    linkerd check
  2. Verify pod annotations:

    kubectl get pods -n production -l app.kubernetes.io/name=antfly-database \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
  3. Check admission webhooks:

    kubectl get mutatingwebhookconfigurations | grep -i istio
  4. Force pod recreation:

    kubectl delete pod <pod-name> -n production
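Before deleting pods, it helps to know exactly which ones lack a sidecar. The following sketch lists each pod's container names and filters out pods that already have a proxy container. It assumes the conventional sidecar container names `istio-proxy` and `linkerd-proxy`, and it inspects only `spec.containers`, so meshes that run the proxy as an init container (for example Istio's native sidecar mode) would be reported as missing. Both function names are hypothetical; the label selector is the one used in step 2.

```shell
# Print "<pod>\t<container names>" for each Antfly pod in a namespace.
# Usage: list_pod_containers <namespace>
list_pod_containers() {
  kubectl get pods -n "$1" -l app.kubernetes.io/name=antfly-database \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}'
}

# Read "<pod>\t<container names>" lines on stdin and print the pods that
# have neither an istio-proxy nor a linkerd-proxy container.
pods_missing_sidecar() {
  awk -F '\t' '$2 !~ /(^| )(istio-proxy|linkerd-proxy)( |$)/ { print $1 }'
}
```

Combined as `list_pod_containers production | pods_missing_sidecar`, this prints one pod name per line, which can then be fed to the `kubectl delete pod` command from step 4.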

Sidecars Not Injected#

Problem: Service mesh is enabled but sidecars are not being injected.

Solutions:

  1. Verify annotations are correct:

    kubectl get antflycluster my-cluster -n production -o yaml | grep -A 5 serviceMesh
  2. Check namespace labels (if using namespace-level injection):

    kubectl get namespace production --show-labels
  3. Verify StatefulSet pod template:

    kubectl get statefulset my-cluster-metadata -n production -o jsonpath='{.spec.template.metadata.annotations}' | jq
  4. Test manual injection (debugging):

    # Istio
    istioctl kube-inject -f examples/service-mesh-istio-cluster.yaml
    
    # Linkerd
    linkerd inject examples/service-mesh-linkerd-cluster.yaml

High Latency After Enabling Mesh#

Problem: Database latency increases significantly after enabling service mesh.

Solutions:

  1. Exclude Raft ports from mesh (see Performance Optimization above)

  2. Tune sidecar resource limits:

    annotations:
      sidecar.istio.io/proxyCPU: "200m"
      sidecar.istio.io/proxyMemory: "256Mi"
  3. Check mTLS overhead:

    # Istio - list the endpoints known to the pod's sidecar proxy
    istioctl proxy-config endpoint <pod-name> -n production

Rolling Restart Failures#

Problem: Pods fail to restart with sidecars during rolling update.

Solutions:

  1. Check resource quotas:

    kubectl describe resourcequota -n production
  2. Verify PodDisruptionBudget (if using GKE):

    kubectl get pdb -n production
  3. Check StatefulSet events:

    kubectl describe statefulset my-cluster-metadata -n production
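To see which StatefulSet is stuck, `kubectl rollout status` can be run against each one in turn. The helper below is a sketch with a hypothetical name, `check_rollouts`, and an arbitrary five-minute timeout. StatefulSet names are passed in as arguments because this guide only shows `my-cluster-metadata`; supply the data-node StatefulSet name used in your cluster.

```shell
# Wait for each named StatefulSet to finish rolling out; stop at the first
# one that does not complete within the timeout.
# Usage: check_rollouts <namespace> <statefulset>...
check_rollouts() {
  namespace=$1
  shift
  for sts in "$@"; do
    if ! kubectl rollout status "statefulset/$sts" -n "$namespace" --timeout=5m; then
      echo "rollout of $sts did not complete" >&2
      return 1
    fi
  done
}
```

For example, `check_rollouts production my-cluster-metadata` blocks until the metadata StatefulSet has rolled out or the timeout expires.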

Best Practices#

Production Deployments#

  1. Start with data nodes: Enable service mesh on data nodes first, verify stability, then enable for metadata nodes

  2. Use resource limits: Set appropriate sidecar resource limits to prevent OOM

    annotations:
      sidecar.istio.io/proxyCPU: "100m"
      sidecar.istio.io/proxyMemory: "128Mi"
      sidecar.istio.io/proxyCPULimit: "500m"
      sidecar.istio.io/proxyMemoryLimit: "512Mi"
  3. Exclude Raft ports: Reduce latency by excluding consensus traffic from mesh

    annotations:
      traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"
  4. Monitor during rollout: Watch cluster status during rolling restart

    watch kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.serviceMeshStatus}'

Security Considerations#

  1. mTLS mode: Use STRICT mode for maximum security
  2. Network policies: Combine service mesh with Kubernetes NetworkPolicies
  3. Certificate rotation: The service mesh rotates certificates automatically; no operator action is needed

Connection Pooling#

Configure at service mesh level:

# Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: antfly-connection-pool
  namespace: production
spec:
  host: my-cluster-metadata
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100

Limitations#

  1. Metadata nodes: Service mesh adds latency to Raft consensus. Consider excluding Raft ports or disabling mesh on metadata nodes for latency-sensitive workloads.

  2. Partial injection: The operator blocks reconciliation when partial injection is detected to prevent split-brain scenarios. Resolve the injection issue before proceeding.

  3. Mesh upgrades: Upgrade the service mesh control plane independently. The operator will detect sidecar version changes but does not manage mesh upgrades.
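After a control plane upgrade, pods keep their old sidecar until they are restarted, so a cluster can temporarily run mixed proxy versions. A rough way to check this for Istio is to tally the `istio-proxy` image per pod. The sketch below assumes the sidecar container is named `istio-proxy` and runs as a regular container; the function names are hypothetical and the label selector is the one used in the Troubleshooting section.

```shell
# Print "<pod>\t<istio-proxy image>" for each Antfly pod in a namespace.
# Usage: sidecar_images <namespace>
sidecar_images() {
  kubectl get pods -n "$1" -l app.kubernetes.io/name=antfly-database \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="istio-proxy")].image}{"\n"}{end}'
}

# Read the lines above on stdin and print "<pod count> <image>" per
# distinct image; pods without a sidecar image are skipped.
count_sidecar_images() {
  awk -F '\t' '$2 != "" { n[$2]++ } END { for (img in n) print n[img], img }'
}
```

Running `sidecar_images production | count_sidecar_images` and seeing more than one line of output suggests that some pods still need a restart to pick up the new proxy version.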

Examples#

See the examples/ directory for complete configuration examples:

  • examples/service-mesh-istio-cluster.yaml - Istio integration
  • examples/service-mesh-linkerd-cluster.yaml - Linkerd integration
