High Availability Setup
This guide covers deploying CronJob Guardian in a highly available configuration.
Requirements
An HA deployment requires:
- External database: PostgreSQL or MySQL (SQLite is single-writer and cannot be shared across replicas)
- Multiple replicas: 2+ operator pods
- Leader election: enabled, so only one replica runs the controllers at a time
Architecture
┌─────────────────┐   ┌─────────────────┐
│   Replica 1     │   │   Replica 2     │
│   (Leader)      │   │   (Standby)     │
│                 │   │                 │
│ • Controllers   │   │ • Watches       │
│ • Schedulers    │   │ • Health checks │
│ • API server    │   │ • API server    │
└────────┬────────┘   └────────┬────────┘
         │                     │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │     PostgreSQL      │
         │   (shared state)    │
         └─────────────────────┘
Configuration
Minimal HA Setup
values-ha.yaml
replicaCount: 2

leaderElection:
  enabled: true

config:
  storage:
    type: postgres
    postgres:
      host: postgres.database.svc
      database: guardian
      username: guardian
      existingSecret: postgres-credentials
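To apply the values, assuming the chart is available from a Helm repository registered as cronjob-guardian (the repository and chart names here are illustrative):

# Install or upgrade with the HA values
helm upgrade --install cronjob-guardian cronjob-guardian/cronjob-guardian \
  --namespace cronjob-guardian --create-namespace \
  -f values-ha.yaml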
Full Production HA
values-ha-production.yaml
replicaCount: 3

leaderElection:
  enabled: true
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s

config:
  storage:
    type: postgres
    postgres:
      host: postgres.database.svc.cluster.local
      port: 5432
      database: cronjob_guardian
      username: guardian
      existingSecret: postgres-credentials
      sslMode: require
      maxOpenConns: 25
      maxIdleConns: 10

resources:
  limits:
    cpu: 500m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

podDisruptionBudget:
  enabled: true
  minAvailable: 1

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: cronjob-guardian
          topologyKey: kubernetes.io/hostname

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: cronjob-guardian
Leader Election
How It Works
- All replicas compete for a Kubernetes Lease lock (a coordination.k8s.io Lease object)
- One replica becomes the leader
- The leader runs the controllers and schedulers
- Standby replicas monitor health and serve the API
- If the leader fails, a standby takes over (you can check the current holder with the command below)
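To see which replica currently holds the lease (the lease name matches the one used under Testing HA below):

# Print the identity of the current leader
kubectl get lease -n cronjob-guardian cronjob-guardian-leader \
  -o jsonpath='{.spec.holderIdentity}'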
Configuration
leaderElection:
  enabled: true
  leaseDuration: 15s   # How long a lease is valid before candidates may claim it
  renewDeadline: 10s   # How long the leader keeps retrying renewal before giving up
  retryPeriod: 2s      # Wait between acquire/renew attempts
Failover Timing
With the default settings:
- The leader must renew its lease within the 10s renewDeadline
- If it fails, the lease expires after the 15s leaseDuration
- A standby then wins the election within ~5s
- Total failover: ~20-30 seconds, including detection and controller start-up
Aggressive Settings (Faster Failover)
leaderElection:
  enabled: true
  leaseDuration: 10s
  renewDeadline: 8s
  retryPeriod: 1s

This brings failover down to roughly 15 seconds, at the cost of more frequent Lease writes to the API server.
Pod Disruption Budget
Prevent voluntary disruptions, such as node drains during upgrades, from evicting all replicas at once:
podDisruptionBudget:
  enabled: true
  minAvailable: 1   # At least 1 pod always running
Or as a percentage (Kubernetes rounds up, so 50% of 2 replicas means 1 pod):

podDisruptionBudget:
  enabled: true
  minAvailable: 50%
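To confirm the budget is in effect (the resource name and values shown are illustrative):

kubectl get pdb -n cronjob-guardian
# NAME               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# cronjob-guardian   1               N/A               1                     5m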
Anti-Affinity
Spread replicas across nodes:
affinity:
  podAntiAffinity:
    # Hard requirement: replicas must land on different nodes
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: cronjob-guardian
        topologyKey: kubernetes.io/hostname
Or a soft preference, which still schedules every replica when there are fewer nodes than pods:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: cronjob-guardian
          topologyKey: kubernetes.io/hostname
Zone Spreading
Spread across availability zones:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: cronjob-guardian
Database HA
Ensure your database is also highly available:
PostgreSQL HA Options
- Managed services: RDS, Cloud SQL, Azure Database for PostgreSQL
- Patroni: self-managed PostgreSQL HA
- Zalando PostgreSQL Operator: Kubernetes-native (minimal example below)
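As a sketch of the Kubernetes-native route, here is a minimal two-instance cluster using the Zalando operator's postgresql custom resource. The names, namespace, and sizes are illustrative, and the operator must already be installed:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: guardian-postgres        # illustrative name
  namespace: database
spec:
  teamId: "guardian"
  numberOfInstances: 2           # one primary plus one streaming replica
  volume:
    size: 10Gi
  postgresql:
    version: "16"
  users:
    guardian: []                 # role for CronJob Guardian to connect as
  databases:
    guardian: guardian           # database name: owner role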
Connection Handling
Tune the connection pool so the operator reconnects promptly when the database fails over:

config:
  storage:
    postgres:
      maxOpenConns: 25
      connMaxLifetime: 5m        # Recycle connections so a failover is picked up
      healthCheckInterval: 30s
Monitoring HA
Metrics
Key metrics for HA monitoring:
# Leader status
cronjob_guardian_leader_status
# Replica count
sum(up{job="cronjob-guardian"})
# Database connections
cronjob_guardian_db_connections_open
Alerts
groups:
  - name: cronjob-guardian-ha
    rules:
      - alert: CronJobGuardianNoLeader
        expr: sum(cronjob_guardian_leader_status) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No CronJob Guardian leader elected
      - alert: CronJobGuardianReplicasLow
        expr: sum(up{job="cronjob-guardian"}) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: CronJob Guardian has fewer than 2 replicas
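It can also be worth alerting on leadership churn. This rule (appended to the rules list above) is a sketch: it assumes the cronjob_guardian_leader_status gauge flips between 0 and 1 per replica, and the threshold of 4 transitions per hour is arbitrary:

      - alert: CronJobGuardianLeaderChurn
        expr: sum(changes(cronjob_guardian_leader_status[1h])) > 4
        labels:
          severity: warning
        annotations:
          summary: CronJob Guardian leadership is changing frequently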
Health Checks
Liveness Probe
Checks if the process is healthy:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20
Readiness Probe
Checks whether the pod is ready to receive traffic:
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
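If start-up can be slow, for example while waiting on database migrations, a startup probe keeps the liveness probe from killing the pod prematurely. A sketch that reuses the same health endpoint (the thresholds are illustrative):

startupProbe:
  httpGet:
    path: /healthz
    port: 8081
  periodSeconds: 5
  failureThreshold: 30   # allows up to 30 x 5s = 150s to start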
Graceful Shutdown
Ensure graceful termination:
terminationGracePeriodSeconds: 30
During shutdown, the operator:
- Stops accepting new work
- Completes in-flight operations
- Releases the leader lease, so a standby can take over without waiting for it to expire
- Closes database connections
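If API requests are still being routed to a terminating pod, a short preStop delay gives endpoints time to update before the process receives SIGTERM. This is a general Kubernetes pattern rather than something the chart requires, and it assumes the container image ships a sleep binary:

lifecycle:
  preStop:
    exec:
      command: ["sleep", "5"]   # illustrative delay; not possible on shell-less images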
Testing HA
Simulate Leader Failure
# Find the leader
kubectl get lease -n cronjob-guardian cronjob-guardian-leader -o yaml
# Delete the leader pod
kubectl delete pod -n cronjob-guardian <leader-pod>
# Watch failover
kubectl logs -n cronjob-guardian -l app.kubernetes.io/name=cronjob-guardian -f
Rolling Update
# Trigger rolling update
kubectl rollout restart -n cronjob-guardian deploy/cronjob-guardian
# Watch status
kubectl rollout status -n cronjob-guardian deploy/cronjob-guardian
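To watch leadership move while pods are replaced (same lease name as above):

# Stream the lease holder as it changes hands
kubectl get lease -n cronjob-guardian cronjob-guardian-leader -w \
  -o custom-columns=HOLDER:.spec.holderIdentity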
Troubleshooting
No Leader Elected
- Check that all pods are running
- Verify the pods can reach the Kubernetes API server (the election is coordinated through a Lease there)
- Check that the lease resource exists
- Review pod logs for election errors (see the commands below)
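Two commands that cover the last two checks:

# Confirm the lease exists and see who holds it
kubectl describe lease -n cronjob-guardian cronjob-guardian-leader

# Scan all replicas for election errors (pod name prefixed to each line)
kubectl logs -n cronjob-guardian -l app.kubernetes.io/name=cronjob-guardian --prefix | grep -i leader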
Split Brain
Prevented by:
- Single database as source of truth
- Lease-based leader election
- Fencing (standby replicas don't run controllers)
Slow Failover
- Reduce leaseDuration (at the cost of more frequent lease renewals and API-server load)
- Check pod readiness probes
- Verify database connectivity
Related
- PostgreSQL Storage - Database setup
- Production Setup - Full production guide
- Prometheus Integration - Monitoring