Skip to main content

CronJobMonitor Examples

Collection of CronJobMonitor configurations for common scenarios.

Basic Monitor

Monitor all CronJobs in a namespace:

basic.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: all-jobs
namespace: production
spec:
selector: {}

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 2

alerting:
channelRefs:
- name: team-slack

Critical Tier Monitor

Strict monitoring for critical jobs:

critical-tier.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: critical-jobs
namespace: production
spec:
selector:
matchLabels:
tier: critical

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 1
buffer: 10m

sla:
minSuccessRate: 99.9
windowDays: 30
maxDuration: 30m
durationRegressionThreshold: 25
durationBaselineWindowDays: 14

alerting:
channelRefs:
- name: pagerduty-critical
severities:
- critical
- name: ops-slack

alertDelay: 0s
suppressDuplicatesFor: 15m

severityOverrides:
jobFailed: critical
deadManTriggered: critical
slaBreached: critical

includeContext:
logs: true
events: true
podStatus: true
logLines: 100

Database Backup Monitor

Monitor database backup jobs with strict SLA:

database-backups.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: database-backups
namespace: production
spec:
selector:
matchLabels:
app: postgres-backup

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 1
buffer: 30m

sla:
minSuccessRate: 100
windowDays: 7
maxDuration: 2h

maintenanceWindows:
- name: sunday-maintenance
schedule: "0 10 * * 0" # Sunday 10 AM
duration: 2h
timezone: America/New_York

alerting:
channelRefs:
- name: dba-pagerduty
severities:
- critical
- name: dba-slack

severityOverrides:
jobFailed: critical
deadManTriggered: critical

suggestedFixPatterns:
- name: disk-space
match:
logPattern: "no space left on device"
suggestion: |
Backup volume is full. Check PVC usage:
kubectl exec -n {{.Namespace}} deploy/postgres -- df -h /backups
priority: 150

dataRetention:
retentionDays: 365
storeLogs: true
logRetentionDays: 90

ETL Pipeline Monitor

Monitor data pipelines with duration tracking:

data-pipeline.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: etl-pipeline
namespace: data
spec:
selector:
matchLabels:
pipeline: etl

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 2

sla:
minSuccessRate: 98
windowDays: 14
maxDuration: 45m
durationRegressionThreshold: 30
durationBaselineWindowDays: 7

alerting:
channelRefs:
- name: data-team-slack
- name: ops-pagerduty
severities:
- critical

alertDelay: 5m
suppressDuplicatesFor: 1h

severityOverrides:
jobFailed: warning
deadManTriggered: critical
slaBreached: warning
durationRegression: warning

suggestedFixPatterns:
- name: source-unavailable
match:
logPattern: "connection refused.*source-db|ECONNREFUSED.*5432"
suggestion: "Source database unavailable. Check source DB health."
priority: 150
- name: destination-full
match:
logPattern: "disk quota exceeded|ENOSPC"
suggestion: "Destination storage full. Increase PVC or clean up old data."
priority: 145

Multi-Namespace Monitor

Watch CronJobs across multiple namespaces:

multi-namespace.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: all-production
namespace: cronjob-guardian
spec:
namespaces:
- production
- staging
- batch

selector:
matchLabels:
monitored: "true"

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true

sla:
minSuccessRate: 95
windowDays: 7

alerting:
channelRefs:
- name: ops-slack

Cluster-Wide Monitor

Watch critical jobs across all namespaces:

cluster-wide.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: cluster-critical
namespace: cronjob-guardian
spec:
selector:
allNamespaces: true
matchLabels:
tier: critical
matchExpressions:
- key: skip-monitoring
operator: DoesNotExist

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 1

sla:
minSuccessRate: 99
windowDays: 30

alerting:
channelRefs:
- name: pagerduty-critical
- name: ops-slack

Low-Priority Batch Monitor

Relaxed monitoring for non-critical batch jobs:

batch-jobs.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: batch-jobs
namespace: batch
spec:
selector:
matchLabels:
tier: low

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 5

sla:
minSuccessRate: 80
windowDays: 7

alerting:
channelRefs:
- name: batch-slack

alertDelay: 30m
suppressDuplicatesFor: 4h

severityOverrides:
jobFailed: warning
deadManTriggered: warning
slaBreached: warning

dataRetention:
retentionDays: 30
storeLogs: false

Financial Reports Monitor

Business-critical reports with maintenance windows:

financial-reports.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
name: financial-reports
namespace: finance
spec:
selector:
matchLabels:
type: financial-report

deadManSwitch:
enabled: true
autoFromSchedule:
enabled: true
missedScheduleThreshold: 1

sla:
minSuccessRate: 100
windowDays: 30
maxDuration: 1h

maintenanceWindows:
# Month-end processing
- name: month-end
schedule: "0 0 1 * *"
duration: 6h
timezone: America/New_York
# Quarter-end processing
- name: quarter-end
schedule: "0 0 1 1,4,7,10 *"
duration: 12h
timezone: America/New_York

alerting:
channelRefs:
- name: finance-pagerduty
severities:
- critical
- name: finance-slack

severityOverrides:
jobFailed: critical
deadManTriggered: critical

dataRetention:
retentionDays: 365
onCronJobDeletion: retain