This guide covers health checks, metrics, and observability for Antfly clusters.
Overview
The Antfly Operator provides multiple monitoring capabilities:
- Health Check Endpoints: Kubernetes probes for pod health
- Operator Metrics: Prometheus metrics from the operator
- Cluster Status: Status conditions and fields on CRDs
- Kubernetes Events: Operational events for troubleshooting
- Logging: Configurable log formats and levels
Health Check Endpoints
All Antfly pods expose health endpoints on port 4200:
| Endpoint | Purpose | Used By |
|---|---|---|
| /healthz | Liveness check | Startup and liveness probes |
| /readyz | Readiness check | Readiness probe |
Probe Configuration
The operator automatically configures probes for all pods:
Startup Probe:
- Endpoint: :4200/healthz
- Initial delay: 30 seconds
- Period: 10 seconds
- Failure threshold: 30 (allows up to 5 minutes of failed checks after the initial delay)
Liveness Probe:
- Endpoint: :4200/healthz
- Period: 15 seconds
- Failure threshold: 3 (pod restarted after 45 seconds of failures)
Readiness Probe:
- Endpoint: :4200/readyz
- Period: 5 seconds
- Failure threshold: 5 (pod removed from service after 25 seconds of failures)
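The time budgets quoted above follow from period × failure threshold (plus the initial delay for the startup probe). A small sketch of that arithmetic; probe_budget is a hypothetical helper name used only for illustration:

```shell
# probe_budget PERIOD_SECONDS FAILURE_THRESHOLD [INITIAL_DELAY_SECONDS]
# Prints the approximate number of seconds a pod may keep failing a probe
# before Kubernetes acts on it.
probe_budget() {
  period="$1"
  threshold="$2"
  delay="${3:-0}"
  echo $(( delay + period * threshold ))
}

probe_budget 10 30 30   # startup probe:   prints 330
probe_budget 15 3       # liveness probe:  prints 45
probe_budget 5 5        # readiness probe: prints 25
```

The startup figure includes the 30-second initial delay, which is why it is slightly above the 5 minutes of failed checks.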
Custom Health Port
Configure a custom health check port:
spec:
  metadataNodes:
    health:
      port: 4200 # Default
  dataNodes:
    health:
      port: 4200 # Default
Manual Health Checks
Check pod health manually:
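A minimal sketch of a manual check through a temporary port-forward. It assumes the endpoints answer HTTP 200 when healthy and that curl is installed on your workstation; check_pod_health is a hypothetical helper name, not part of Antfly:

```shell
# check_pod_health POD [PORT]
# Forwards the health port locally, queries /healthz and /readyz,
# and returns non-zero if either does not answer with HTTP 200.
check_pod_health() {
  pod="$1"
  port="${2:-4200}"   # default health port from this guide

  kubectl port-forward "pod/${pod}" "${port}:${port}" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2             # give the tunnel a moment to come up

  rc=0
  for path in healthz readyz; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:${port}/${path}")
    echo "${path}: HTTP ${code}"
    [ "$code" = "200" ] || rc=1
  done

  # Tear down the port-forward regardless of the result.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}

# Usage: check_pod_health my-cluster-metadata-0
```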
Cluster Status
Viewing Status
# Quick overview
kubectl get antflycluster my-cluster
# NAME PHASE METADATA DATA AGE
# my-cluster Running 3 3 24h
# Detailed status
kubectl get antflycluster my-cluster -o yaml
Status Fields
status:
  phase: Running
  metadataNodesReady: 3
  dataNodesReady: 3
  conditions:
    - type: ConfigurationValid
      status: "True"
      reason: ValidationPassed
      lastTransitionTime: "2025-01-15T00:00:00Z"
    - type: SecretsReady
      status: "True"
      reason: AllSecretsFound
      lastTransitionTime: "2025-01-15T00:00:00Z"
  autoScalingStatus:
    currentReplicas: 3
    desiredReplicas: 3
    lastScaleTime: "2025-01-15T10:30:00Z"
    currentCPUUtilizationPercentage: 45
Cluster Phases
| Phase | Description |
|---|---|
| Pending | Cluster is being created |
| Running | All nodes are ready |
| Degraded | Some nodes are not ready |
| Failed | Critical error |
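In scripts it is often useful to block until a cluster reaches Running. The following is a sketch only: wait_for_running is a hypothetical helper, and the timeout and polling interval are arbitrary defaults, not values defined by the operator:

```shell
# wait_for_running CLUSTER [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls .status.phase until it is Running (exit 0), Failed (exit 1),
# or the timeout expires (exit 2).
wait_for_running() {
  cluster="$1"
  timeout="${2:-300}"
  interval="${3:-5}"
  elapsed=0
  phase=""
  while [ "$elapsed" -lt "$timeout" ]; do
    phase=$(kubectl get antflycluster "$cluster" -o jsonpath='{.status.phase}')
    case "$phase" in
      Running) echo "Running"; return 0 ;;
      Failed)  echo "Failed";  return 1 ;;
    esac
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "timeout (last phase: ${phase:-unknown})"
  return 2
}

# Usage: wait_for_running my-cluster 600 10
```

Pending and Degraded are treated as transient here, so the loop keeps polling through them.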
Condition Types
| Condition | Description |
|---|---|
| ConfigurationValid | Configuration passes validation |
| SecretsReady | All referenced secrets exist |
| ServiceMeshReady | Service mesh sidecars injected (if enabled) |
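To see at a glance which of these conditions are not satisfied, the whole conditions array can be listed and filtered. A sketch using kubectl's jsonpath range syntax; failing_conditions is a hypothetical helper name:

```shell
# failing_conditions CLUSTER
# Prints one "Type=Status" line for every condition whose status is not True.
# Prints nothing when all conditions are True.
failing_conditions() {
  cluster="$1"
  kubectl get antflycluster "$cluster" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
    | grep -v '=True$' || true
}

# Usage: failing_conditions my-cluster
```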
Check Specific Conditions
# Configuration validation status
kubectl get antflycluster my-cluster -o jsonpath='{.status.conditions[?(@.type=="ConfigurationValid")]}' | jq
# Secrets status
kubectl get antflycluster my-cluster -o jsonpath='{.status.conditions[?(@.type=="SecretsReady")]}' | jq
# Service mesh status
kubectl get antflycluster my-cluster -o jsonpath='{.status.serviceMeshStatus}' | jq
Operator Metrics
The operator exposes Prometheus metrics on port 8080.
Accessing Metrics
# Port-forward to operator
kubectl port-forward -n antfly-operator-namespace deployment/antfly-operator 8080:8080
# View metrics
curl http://localhost:8080/metrics
Key Metrics
| Metric | Type | Description |
|---|---|---|
| controller_runtime_reconcile_total | Counter | Total reconciliations |
| controller_runtime_reconcile_errors_total | Counter | Reconciliation errors |
| controller_runtime_reconcile_time_seconds | Histogram | Reconciliation duration |
| workqueue_depth | Gauge | Reconciliation queue depth |
| workqueue_adds_total | Counter | Items added to queue |
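Without a Prometheus server at hand, a counter can be read straight from the /metrics output. The sketch below sums one metric across its label sets for a given controller; sum_metric is a hypothetical helper, the controller label value antflycluster is taken from the PromQL examples in this guide, and the samples are assumed to carry no trailing timestamp:

```shell
# sum_metric METRIC CONTROLLER
# Reads Prometheus text exposition on stdin and prints the sum of all
# samples of METRIC whose labels include controller="CONTROLLER".
sum_metric() {
  metric="$1"
  controller="$2"
  awk -v m="$metric" -v c="controller=\"$controller\"" '
    # Match only sample lines for exactly this metric name (not # HELP/# TYPE).
    index($0, m "{") == 1 && index($0, c) { total += $NF }
    END { print total + 0 }
  '
}

# Usage (with the port-forward from "Accessing Metrics" running):
#   curl -s http://localhost:8080/metrics |
#     sum_metric controller_runtime_reconcile_errors_total antflycluster
```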
Prometheus ServiceMonitor
For Prometheus Operator, create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: antfly-operator
  namespace: antfly-operator-namespace
spec:
  selector:
    matchLabels:
      control-plane: antfly-operator
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Operator Health Endpoints
The operator also exposes health endpoints on port 8081 (port-forward it the same way as the metrics port before running these commands):
# Health check
curl http://localhost:8081/healthz
# Readiness check
curl http://localhost:8081/readyz
Kubernetes Events
The operator emits events for important operations:
# View cluster events
kubectl get events --field-selector involvedObject.name=my-cluster
# Watch events in real-time
kubectl get events -w --field-selector involvedObject.name=my-cluster
Common Events
| Event | Type | Description |
|---|---|---|
| ClusterCreated | Normal | Cluster resources created |
| ClusterUpdated | Normal | Cluster configuration updated |
| ScalingUp | Normal | Autoscaling increased replicas |
| ScalingDown | Normal | Autoscaling decreased replicas |
| ValidationFailed | Warning | Configuration validation failed |
| SecretNotFound | Warning | Referenced secret not found |
| ReconcileError | Warning | Error during reconciliation |
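When troubleshooting, the Warning rows in this table are usually the interesting ones. Events support a type field selector, so they can be filtered server-side. A sketch; warning_events is a hypothetical helper name:

```shell
# warning_events CLUSTER
# Lists only Warning events for the named cluster object, oldest first.
warning_events() {
  cluster="$1"
  kubectl get events \
    --field-selector "involvedObject.name=${cluster},type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage: warning_events my-cluster
```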
Logging Configuration
Antfly Database Logs
Configure logging via the cluster config:
spec:
  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      }
    }
Log Styles:
- terminal: Colorized console output with stack traces (development)
- json: Structured JSON output (production, log aggregation)
- noop: No logging output (benchmarking)
Log Levels:
- debug: Verbose debugging information
- info: Normal operational information
- warn: Warnings that don't affect operation
- error: Errors that may affect operation
View Pod Logs
# Metadata node logs
kubectl logs my-cluster-metadata-0
# Data node logs
kubectl logs my-cluster-data-0
# Follow logs
kubectl logs -f my-cluster-metadata-0
# Previous container logs (after restart)
kubectl logs --previous my-cluster-metadata-0
Operator Logs
# View operator logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator
# Follow operator logs
kubectl logs -n antfly-operator-namespace deployment/antfly-operator -f
# Filter for specific cluster
kubectl logs -n antfly-operator-namespace deployment/antfly-operator | grep my-cluster
Log Aggregation
JSON Logging for Aggregation
For production environments, use JSON logging:
spec:
  config: |
    {
      "log": {
        "level": "info",
        "style": "json"
      }
    }
JSON logs integrate well with:
- Google Cloud Logging (Stackdriver)
- AWS CloudWatch
- Elastic Stack (ELK)
- Splunk
- Datadog
Fluent Bit Example
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/my-cluster-*.log
        Parser docker
    [FILTER]
        Name kubernetes
        Match *
        Merge_Log On
    [OUTPUT]
        Name es
        Match *
        Host elasticsearch
        Port 9200
        Index antfly-logs
Dashboard Integration
Grafana Dashboards
Key panels for an Antfly monitoring dashboard:
- Cluster Overview
  - Cluster phase
  - Metadata nodes ready
  - Data nodes ready
  - Autoscaling status
- Pod Health
  - Pod restart count
  - Container ready status
  - Resource utilization
- Operator Health
  - Reconciliation rate
  - Reconciliation errors
  - Queue depth
Sample PromQL Queries
# Reconciliation error rate
rate(controller_runtime_reconcile_errors_total{controller="antflycluster"}[5m])
# Reconciliation latency (p99)
histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket{controller="antflycluster"}[5m]))
# Queue depth
workqueue_depth{name="antflycluster"}
# Pod restarts
kube_pod_container_status_restarts_total{pod=~"my-cluster-.*"}
Alerting
Common Alerts
groups:
  - name: antfly-alerts
    rules:
      # Cluster not running
      - alert: AntflyClusterDegraded
        # Assumes kube-state-metrics exports the phase as a label with value 1
        # for the active phase; PromQL cannot compare a series to a string.
        expr: |
          kube_customresource_antfly_cluster_status_phase{phase!="Running"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Antfly cluster {{ $labels.name }} is not running"
      # High reconciliation error rate
      - alert: AntflyOperatorReconcileErrors
        expr: |
          rate(controller_runtime_reconcile_errors_total{controller="antflycluster"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Antfly operator has high reconciliation error rate"
      # Pod restarts
      - alert: AntflyPodRestarting
        expr: |
          increase(kube_pod_container_status_restarts_total{pod=~".*-metadata-.*|.*-data-.*"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Antfly pod {{ $labels.pod }} has restarted multiple times"
Debugging
Enable Debug Logging
For troubleshooting, enable debug endpoints and logging:
spec:
  config: |
    {
      "log": {
        "level": "debug",
        "style": "terminal"
      },
      "enable_debug_endpoints": true
    }
Inspect Pod Details
# Describe pod for events and conditions
kubectl describe pod my-cluster-metadata-0
# Check container status
kubectl get pod my-cluster-metadata-0 -o jsonpath='{.status.containerStatuses}' | jq
# Check resource usage
kubectl top pod my-cluster-metadata-0
Network Debugging
# Check service endpoints
kubectl get endpoints my-cluster-metadata
# Test connectivity between pods
kubectl exec my-cluster-data-0 -- nc -zv my-cluster-metadata-0.my-cluster-metadata 12377
Best Practices
- Use JSON Logging in Production: Enable structured logging for log aggregation
- Set Up Alerting: Configure alerts for degraded clusters and operator errors
- Monitor Resource Usage: Track CPU and memory for capacity planning
- Review Events Regularly: Check Kubernetes events for operational issues
- Retain Logs: Configure log retention for troubleshooting historical issues
- Dashboard Visibility: Create dashboards for cluster health visibility
- Test Monitoring: Verify alerts fire correctly during failure scenarios
See Also
- Autoscaling: Autoscaling configuration and monitoring
- Troubleshooting: Common issues and solutions
- AntflyCluster API Reference: Complete status reference