Real-World Use Cases

Practical examples of CronJob Guardian configurations for common scenarios.

Database Backups

Monitor critical database backup jobs that must never be missed.

database-backup-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: database-backups
spec:
  selector:
    matchLabels:
      app: backup
      type: database
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 15m
      missedScheduleThreshold: 1
  sla:
    minSuccessRate: 100
    windowDays: 7
  alerting:
    includeContext:
      logs: true
      events: true
    channelRefs:
      - name: pagerduty-critical
      - name: ops-slack

Why This Configuration

100% success rate: Backups are critical; any failure is unacceptable
Auto-detect schedule: Works with any backup frequency
Single miss threshold: Alert immediately if backup doesn't run
Full context: Include logs to diagnose backup failures quickly

ETL Pipelines

Monitor data processing jobs that feed analytics and reporting.

etl-pipeline-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: etl-pipelines
spec:
  selector:
    matchLabels:
      type: etl
  sla:
    minSuccessRate: 95
    windowDays: 7
    maxDuration: 2h
    durationRegressionThreshold: 30
    durationBaselineWindowDays: 14
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
  alerting:
    suppressDuplicatesFor: 1h
    severityOverrides:
      durationRegression: warning
    channelRefs:
      - name: data-team-slack

Why This Configuration

Duration monitoring: ETL jobs getting slower indicates data growth issues
30% regression threshold: Catch performance degradation early
Duplicate suppression: Prevent alert fatigue during incident response
Warning severity for regression: Duration issues are important but not critical

Report Generation

Monitor scheduled report generation for business users.

report-generation-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: business-reports
spec:
  selector:
    matchLabels:
      type: report
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 25h
  sla:
    minSuccessRate: 90
    windowDays: 30
  maintenanceWindows:
    - name: monthly-maintenance
      schedule: "0 2 1 * *"
      duration: 4h
      timezone: America/New_York
      suppressAlerts: true
  alerting:
    channelRefs:
      - name: ops-slack

Why This Configuration

25-hour window: Daily reports with buffer for occasional delays
90% success rate: Some failures acceptable for less critical reports
Maintenance window: Suppress alerts during monthly maintenance
30-day window: Longer evaluation period for occasional jobs

Multi-Tenant Batch Jobs

Monitor jobs across multiple namespaces with different SLAs.

multi-tenant-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: tenant-jobs-premium
spec:
  selector:
    namespaceSelector:
      matchLabels:
        tier: premium
    matchLabels:
      managed-by: platform
  sla:
    minSuccessRate: 99
    windowDays: 7
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      missedScheduleThreshold: 1
  alerting:
    channelRefs:
      - name: pagerduty-premium
      - name: customer-slack
---
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: tenant-jobs-standard
spec:
  selector:
    namespaceSelector:
      matchLabels:
        tier: standard
    matchLabels:
      managed-by: platform
  sla:
    minSuccessRate: 95
    windowDays: 7
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      missedScheduleThreshold: 3
  alerting:
    channelRefs:
      - name: ops-slack

Why This Configuration

Separate monitors by tier: Different SLAs for different service levels
Namespace selector: Automatically covers all tenant namespaces
Stricter thresholds for premium: Higher success rate, faster alerts
More tolerance for standard: 3 consecutive misses before alerting

Infrastructure Maintenance Jobs

Monitor cluster maintenance tasks like certificate rotation and cleanup.

infrastructure-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: infra-maintenance
spec:
  selector:
    matchLabels:
      category: infrastructure
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      buffer: 1h
  sla:
    minSuccessRate: 100
  alerting:
    includeContext:
      logs: true
      events: true
      podStatus: true
    suggestedFixPatterns:
      - name: cert-rotation-failed
        match:
          logPattern: "certificate.*expired|x509.*certificate"
        suggestion: |
          Certificate rotation failed. Check:
          1. kubectl get secrets -n cert-manager
          2. kubectl describe certificate <name>
          3. Manually trigger: kubectl cert-manager renew <name>
        priority: 150
    channelRefs:
      - name: platform-pagerduty

Why This Configuration

100% success required: Infrastructure jobs are critical
Custom fix patterns: Domain-specific troubleshooting guidance
Full context: All diagnostic information included
Hour buffer: Allow for scheduling variance in cluster

Compliance Audit Jobs

Monitor jobs required for regulatory compliance with audit trail.

compliance-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: compliance-audits
spec:
  selector:
    matchLabels:
      compliance: required
  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      missedScheduleThreshold: 1
  sla:
    minSuccessRate: 100
    windowDays: 90
  dataRetention:
    retentionDays: 365
    storeLogs: true
  alerting:
    channelRefs:
      - name: compliance-email
      - name: security-slack
      - name: pagerduty-critical

Why This Configuration

365-day retention: Keep audit trail for compliance requirements
90-day SLA window: Quarterly compliance reporting
Multiple channels: Ensure alerts reach compliance team
Store logs: Full execution history for audits

High-Frequency Jobs

Monitor jobs that run every few minutes.

high-frequency-monitor.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: health-checks
spec:
  selector:
    matchLabels:
      type: health-check
  deadManSwitch:
    enabled: true
    maxTimeSinceLastSuccess: 10m
  sla:
    minSuccessRate: 99
    windowDays: 1
  alerting:
    alertDelay: 5m
    suppressDuplicatesFor: 15m
    channelRefs:
      - name: ops-slack

Why This Configuration

Short dead-man window: 10 minutes for jobs running every 5 minutes
1-day SLA window: Appropriate for high-frequency jobs
Alert delay: Wait 5 minutes to avoid transient failures
Duplicate suppression: Prevent alert storms

Monitor Examples - More CronJobMonitor configurations
Channel Examples - AlertChannel configurations
Production Setup - Production best practices

Database Backups​

Why This Configuration​

ETL Pipelines​

Why This Configuration​

Report Generation​

Why This Configuration​

Multi-Tenant Batch Jobs​

Why This Configuration​

Infrastructure Maintenance Jobs​

Why This Configuration​

Compliance Audit Jobs​

Why This Configuration​

High-Frequency Jobs​

Why This Configuration​

Related​

Database Backups

Why This Configuration

ETL Pipelines

Why This Configuration

Report Generation

Why This Configuration

Multi-Tenant Batch Jobs

Why This Configuration

Infrastructure Maintenance Jobs

Why This Configuration

Compliance Audit Jobs

Why This Configuration

High-Frequency Jobs

Why This Configuration

Related