Service Mesh - Antfly Documentation

Warning: Service mesh integration is experimental. APIs and behavior may change in future releases.

The Antfly Operator provides native support for service mesh integration, enabling automatic mTLS encryption and traffic management for your database clusters.

Overview#

Service mesh integration provides:

Automatic mTLS encryption between all Antfly pods
Traffic observability through service mesh telemetry
Advanced traffic management (circuit breaking, retries, timeouts)
Zero-trust security with automatic certificate rotation
Network policy enforcement at the sidecar level

The operator automatically detects sidecar injection and updates cluster status accordingly.

Supported Service Meshes#

The Antfly Operator is designed to work with any Kubernetes service mesh that uses sidecar injection:

Mesh	Status	Notes
Istio	Recommended	Best tested integration
Linkerd	Supported	Lightweight option
Consul Connect	Supported	HashiCorp ecosystem

Quick Start#

Prerequisites#

Antfly Operator installed in your cluster
Service mesh control plane installed (e.g., Istio, Linkerd)
Service mesh sidecar injection configured (namespace-level or pod-level)

Enable Service Mesh on a New Cluster#

apiVersion: antfly.io/v1
kind: AntflyCluster
metadata:
  name: my-cluster
  namespace: production
spec:
  image: ghcr.io/antflydb/antfly:latest
  serviceMesh:
    enabled: true
    annotations:
      sidecar.istio.io/inject: "true"
  metadataNodes:
    replicas: 3
    resources:
      cpu: "500m"
      memory: "512Mi"
  dataNodes:
    replicas: 3
    resources:
      cpu: "1000m"
      memory: "2Gi"
  storage:
    storageClass: "standard"
    metadataStorage: "1Gi"
    dataStorage: "10Gi"

Enable Service Mesh on an Existing Cluster#

Patch an existing cluster to enable service mesh:

kubectl patch antflycluster my-cluster -n production --type='merge' -p='
{
  "spec": {
    "serviceMesh": {
      "enabled": true,
      "annotations": {
        "sidecar.istio.io/inject": "true"
      }
    }
  }
}'

The operator will perform a rolling restart, injecting sidecars into each pod while maintaining cluster availability.
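When the same change has to be scripted for several clusters or for a different mesh, the patch body can be generated rather than typed by hand. The following is a minimal sketch: `mesh_patch` is a hypothetical helper name, not part of the operator or kubectl, and it assumes the annotation key and value contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print a JSON merge patch that enables service mesh
# with a single injection annotation.
# Assumes key and value need no JSON escaping (true for the annotations in this guide).
mesh_patch() {
  key="$1"
  value="$2"
  printf '{"spec":{"serviceMesh":{"enabled":true,"annotations":{"%s":"%s"}}}}\n' "$key" "$value"
}

# Example (requires cluster access):
#   kubectl patch antflycluster my-cluster -n production --type=merge \
#     -p "$(mesh_patch linkerd.io/inject enabled)"
```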

Configuration#

Spec Fields#

spec:
  serviceMesh:
    enabled: true              # Enable service mesh integration
    annotations:               # Mesh-specific annotations
      key: value

`enabled` (boolean, optional, default: `false`)#

Controls whether service mesh sidecar injection is enabled for the cluster.

`annotations` (map[string]string, optional)#

Mesh-specific annotations to apply to pod templates. These annotations trigger sidecar injection and configure mesh behavior.

Status Fields#

The operator automatically populates the following status fields:

status:
  serviceMeshStatus:
    enabled: true                        # Reflects spec.serviceMesh.enabled
    sidecarInjectionStatus: "Complete"   # Complete | Partial | None | Unknown
    podsWithSidecars: 6                  # Number of pods with sidecars
    totalPods: 6                         # Total number of pods
    lastTransitionTime: "2025-10-04T..."
  conditions:
  - type: ServiceMeshReady
    status: "True"
    reason: SidecarInjectionComplete
    message: "All 6 pods have sidecars injected"

Sidecar Injection Status Values#

Status	Description
`Complete`	All pods have sidecars injected
`Partial`	Some pods have sidecars, others don't (blocks reconciliation)
`None`	No pods have sidecars
`Unknown`	Pod count is zero or status cannot be determined
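
The relationship between `podsWithSidecars`, `totalPods`, and the status value can be sketched as a small shell function. This only illustrates the table above; `injection_status` is a hypothetical name and the code is not taken from the operator, which may also report `Unknown` in cases this sketch does not model.

```shell
# Illustrative only: map pod counts to the status values in the table above.
# Arguments: number of pods with sidecars, total number of pods.
injection_status() {
  with_sidecars="$1"
  total="$2"
  if [ "$total" -eq 0 ]; then
    echo "Unknown"      # no pods to inspect
  elif [ "$with_sidecars" -eq 0 ]; then
    echo "None"         # no pod has a sidecar
  elif [ "$with_sidecars" -eq "$total" ]; then
    echo "Complete"     # every pod has a sidecar
  else
    echo "Partial"      # mixed state, which blocks reconciliation
  fi
}
```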

Mesh-Specific Configuration#

Istio#

spec:
  serviceMesh:
    enabled: true
    annotations:
      sidecar.istio.io/inject: "true"
      # Exclude Raft ports from proxy (recommended for performance)
      traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"
      # Resource limits for sidecar (optional)
      sidecar.istio.io/proxyCPU: "100m"
      sidecar.istio.io/proxyMemory: "128Mi"

Important Ports:

Port	Service	Recommendation
12377	Metadata API	Include in mesh
9017	Metadata Raft	Exclude from mesh
12380	Data API	Include in mesh
9021	Data Raft	Exclude from mesh

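Consider excluding Raft ports (9017, 9021) from the service mesh to reduce latency for consensus traffic. Note that `traffic.sidecar.istio.io/excludeOutboundPorts` affects only outbound connections: inbound connections on those ports still pass through the receiving pod's sidecar unless `traffic.sidecar.istio.io/excludeInboundPorts` is set as well. This matters when STRICT mTLS is enforced, because traffic that bypassed the sender's proxy arrives as plaintext.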

Linkerd#

spec:
  serviceMesh:
    enabled: true
    annotations:
      linkerd.io/inject: enabled
      # Skip Raft ports (recommended)
      config.linkerd.io/skip-outbound-ports: "9017,9021"
      config.linkerd.io/skip-inbound-ports: "9017,9021"

Consul Connect#

spec:
  serviceMesh:
    enabled: true
    annotations:
      consul.hashicorp.com/connect-inject: "true"
      consul.hashicorp.com/connect-service-upstreams: "antfly-metadata:12377,antfly-data:12380"

Observability#

Check Service Mesh Status#

View the current service mesh status:

kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.serviceMeshStatus}' | jq

Check ServiceMeshReady Condition#

kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.conditions[?(@.type=="ServiceMeshReady")]}' | jq
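
In scripts or CI jobs it can be more convenient to block until the condition becomes true than to query it repeatedly. A minimal sketch using `kubectl wait`, which works against any resource that reports `status.conditions`; `wait_mesh_ready` is a hypothetical helper name and the 600s default timeout is an arbitrary assumption.

```shell
# Hypothetical helper: block until ServiceMeshReady is True or the timeout expires.
# Arguments: cluster name, namespace, optional timeout (defaults to 600s).
wait_mesh_ready() {
  kubectl wait "antflycluster/$1" -n "$2" \
    --for=condition=ServiceMeshReady --timeout="${3:-600s}"
}

# Example (requires cluster access):
#   wait_mesh_ready my-cluster production 300s
```

The function returns the exit status of `kubectl wait`, so a non-zero result indicates that the condition was not reached within the timeout.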

View Operator Logs#

Monitor service mesh integration events:

kubectl logs -n antfly-operator-namespace deployment/antfly-operator -f | grep -i "service mesh"

View Cluster Events#

Check for service mesh-related events:

kubectl get events --field-selector involvedObject.name=my-cluster -n production

Performance Optimization#

Exclude Raft Ports#

Raft consensus traffic is latency-sensitive. Exclude Raft ports from the mesh:

# Istio
annotations:
  traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"

# Linkerd
annotations:
  config.linkerd.io/skip-outbound-ports: "9017,9021"
  config.linkerd.io/skip-inbound-ports: "9017,9021"

Tune Sidecar Resources#

Set appropriate resource limits for sidecars:

annotations:
  sidecar.istio.io/proxyCPU: "100m"
  sidecar.istio.io/proxyMemory: "128Mi"
  sidecar.istio.io/proxyCPULimit: "500m"
  sidecar.istio.io/proxyMemoryLimit: "512Mi"

Sidecar Concurrency#

Tune proxy concurrency based on workload:

annotations:
  sidecar.istio.io/concurrency: "2"

Security Configuration#

Strict mTLS#

For maximum security, use strict mTLS mode:

# Istio PeerAuthentication
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: antfly-mtls
  namespace: production
spec:
  selector:
    matchLabels:
      app: antfly
  mtls:
    mode: STRICT

Network Policies#

Combine service mesh with Kubernetes NetworkPolicies for defense in depth:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: antfly-mesh-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: antfly
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: antfly

Troubleshooting#

Partial Sidecar Injection#

Problem: The operator detects partial sidecar injection and blocks reconciliation.

Symptoms:

ServiceMeshReady condition is False with reason PartialInjection
Operator logs show: "Blocking reconciliation" ... "partial sidecar injection"
Kubernetes events show: Warning PartialSidecarInjection

Solutions:

Check mesh control plane:

# Istio
istioctl analyze -n production

# Linkerd
linkerd check

Verify pod annotations:

kubectl get pods -n production -l app.kubernetes.io/name=antfly-database \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'

Check admission webhooks:

kubectl get mutatingwebhookconfigurations | grep -i istio

Force pod recreation:

kubectl delete pod <pod-name> -n production
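
To find which pods need recreating, it helps to list the pods that lack a sidecar container. The sketch below filters `kubectl` output by container name; `pods_missing_sidecar` is a hypothetical helper, and it assumes the proxy appears in `.spec.containers` under the mesh's usual container name (`istio-proxy` for Istio, `linkerd-proxy` for Linkerd).

```shell
# Hypothetical helper: read "pod<TAB>container names" lines on stdin and print
# the pods that have no container with the given sidecar name.
pods_missing_sidecar() {
  awk -F '\t' -v sidecar="$1" '
    {
      n = split($2, names, " ")
      found = 0
      for (i = 1; i <= n; i++) if (names[i] == sidecar) found = 1
      if (!found) print $1
    }'
}

# Example (requires cluster access):
#   kubectl get pods -n production -l app.kubernetes.io/name=antfly-database \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | pods_missing_sidecar istio-proxy
```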

Sidecars Not Injected#

Problem: Service mesh is enabled but sidecars are not being injected.

Solutions:

Verify annotations are correct:

kubectl get antflycluster my-cluster -n production -o yaml | grep -A 5 serviceMesh

Check namespace labels (if using namespace-level injection):
```
kubectl get namespace production --show-labels
```

Verify StatefulSet pod template:

kubectl get statefulset my-cluster-metadata -n production -o jsonpath='{.spec.template.metadata.annotations}' | jq

Test manual injection (debugging):

# Istio
istioctl kube-inject -f examples/service-mesh-istio-cluster.yaml

# Linkerd
linkerd inject examples/service-mesh-linkerd-cluster.yaml

High Latency After Enabling Mesh#

Problem: Database latency increases significantly after enabling service mesh.

Solutions:

Exclude Raft ports from mesh (see Performance Optimization above)

Tune sidecar resource limits:

annotations:
  sidecar.istio.io/proxyCPU: "200m"
  sidecar.istio.io/proxyMemory: "256Mi"

Check mTLS overhead:

# Istio - list the endpoints known to the pod's sidecar proxy
istioctl proxy-config endpoint <pod-name> -n production

Rolling Restart Failures#

Problem: Pods fail to restart with sidecars during rolling update.

Solutions:

Check resource quotas:

kubectl describe resourcequota -n production

Verify PodDisruptionBudget (if using GKE):
```
kubectl get pdb -n production
```

Check StatefulSet events:

kubectl describe statefulset my-cluster-metadata -n production

Best Practices#

Production Deployments#

Start with data nodes: Enable service mesh on data nodes first, verify stability, then enable for metadata nodes

Use resource limits: Set appropriate sidecar resource limits to prevent OOM

annotations:
  sidecar.istio.io/proxyCPU: "100m"
  sidecar.istio.io/proxyMemory: "128Mi"
  sidecar.istio.io/proxyCPULimit: "500m"
  sidecar.istio.io/proxyMemoryLimit: "512Mi"

Exclude Raft ports: Reduce latency by excluding consensus traffic from mesh

annotations:
  traffic.sidecar.istio.io/excludeOutboundPorts: "9017,9021"

Monitor during rollout: Watch cluster status during rolling restart

watch kubectl get antflycluster my-cluster -n production -o jsonpath='{.status.serviceMeshStatus}'
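
For a more compact view during the rollout, the status fields can be reduced to a single progress line. This is a sketch only: `mesh_progress` is a hypothetical helper name, and it relies on the `serviceMeshStatus` field names listed under Status Fields.

```shell
# Hypothetical helper: print "podsWithSidecars/totalPods (sidecarInjectionStatus)".
# Arguments: cluster name, namespace.
mesh_progress() {
  kubectl get antflycluster "$1" -n "$2" \
    -o jsonpath='{.status.serviceMeshStatus.podsWithSidecars}/{.status.serviceMeshStatus.totalPods} ({.status.serviceMeshStatus.sidecarInjectionStatus}){"\n"}'
}

# Example (requires cluster access): poll every 5 seconds until interrupted.
#   while true; do mesh_progress my-cluster production; sleep 5; done
```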

Security Considerations#

mTLS mode: Use STRICT mode for maximum security
Network policies: Combine service mesh with Kubernetes NetworkPolicies
Certificate rotation: Service mesh handles automatic rotation - no operator action needed

Connection Pooling#

Configure at service mesh level:

# Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: antfly-connection-pool
  namespace: production
spec:
  host: my-cluster-metadata
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100

Limitations#

Metadata nodes: Service mesh adds latency to Raft consensus. Consider excluding Raft ports or disabling mesh on metadata nodes for latency-sensitive workloads.
Partial injection: The operator blocks reconciliation when partial injection is detected to prevent split-brain scenarios. Resolve the injection issue before proceeding.
Mesh upgrades: Upgrade the service mesh control plane independently. The operator will detect sidecar version changes but does not manage mesh upgrades.

Examples#

See the examples/ directory for complete configuration examples:

examples/service-mesh-istio-cluster.yaml - Istio integration
examples/service-mesh-linkerd-cluster.yaml - Linkerd integration
