Alert Configuration

Configure how CronJob Guardian sends alerts for monitored CronJobs.

Channel References

Link monitors to alert channels:

spec:
  alerting:
    channelRefs:
      - name: team-slack
      - name: ops-pagerduty

Each alert is sent to every referenced channel, subject to that channel's own configuration (for example, its severity filter).

Severity Levels

CronJob Guardian uses three severity levels:

| Severity | Description | Default Use |
|----------|-------------|-------------|
| critical | Immediate attention required | Configurable |
| warning | Attention needed soon | Configurable |
| info | Informational only | Status updates |

Severity Overrides

Customize severity per alert type:

spec:
  alerting:
    severityOverrides:
      jobFailed: warning           # Job failures
      deadManTriggered: critical   # Missed schedules
      slaBreached: warning         # SLA breaches
      durationRegression: info     # Performance degradation

Routing by Severity

Send different severities to different channels:

spec:
  alerting:
    channelRefs:
      - name: pagerduty-critical
        severities:
          - critical
      - name: team-slack
        severities:
          - critical
          - warning
          - info
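
The reference table at the end of this page lists All as the default for severities, so a channel that omits the list should receive every severity. A minimal sketch under that reading, reusing the channel names from the example above:

```yaml
spec:
  alerting:
    channelRefs:
      # Only critical alerts page the on-call engineer
      - name: pagerduty-critical
        severities:
          - critical
      # No severities list: falls back to the documented default (All)
      - name: team-slack
```

This is equivalent in intent to listing critical, warning, and info explicitly; spelling the list out may still be preferable when the routing needs to be obvious at a glance.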

Alert Suppression

Duplicate Suppression

Prevent alert storms for recurring failures:

spec:
  alerting:
    suppressDuplicatesFor: 1h   # Suppress same alert for 1 hour

Alert Delay

Wait before alerting so that transient issues have time to resolve:

spec:
  alerting:
    alertDelay: 5m   # Wait 5 min before sending

Useful for flaky jobs that often recover on retry.

Combined Example

spec:
  alerting:
    alertDelay: 5m
    suppressDuplicatesFor: 1h
    channelRefs:
      - name: team-slack
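
How the two settings interact is easiest to see on a timeline. The annotated sketch below assumes that alertDelay holds a new alert and drops it if the job recovers in the meantime, and that suppressDuplicatesFor starts counting when an alert is actually sent. The source does not spell out either detail, and the timestamps are invented for illustration.

```yaml
spec:
  alerting:
    alertDelay: 5m
    suppressDuplicatesFor: 1h
    channelRefs:
      - name: team-slack

# Assumed behavior, hypothetical timestamps:
# 02:00  run fails               -> alert held for 5m
# 02:03  retry succeeds          -> held alert dropped, nothing sent
# 03:00  run fails, no recovery  -> alert sent at 03:05
# 03:30  run fails again         -> same alert, suppressed until 04:05
```

If the controller behaves differently, for example by starting the suppression window at the time of the failure rather than the time of sending, the last line shifts accordingly.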

Alert Context

Logs and Events

Control what context is included in alerts:

spec:
  alerting:
    includeContext:
      logs: true        # Include pod logs
      events: true      # Include Kubernetes events
      podStatus: true   # Include pod status details
      logLines: 50      # Number of log lines

Suggested Fixes

Enable intelligent fix suggestions:

spec:
  alerting:
    includeSuggestedFixes: true
    suggestedFixPatterns:
      - name: custom-pattern
        match:
          logPattern: "connection refused"
        suggestion: "Check database connectivity"
        priority: 150

See Suggested Fixes for details.

Complete Examples

Standard Team Monitor

team-alerting.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: team-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      team: platform

  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true

  sla:
    minSuccessRate: 95
    windowDays: 7

  alerting:
    channelRefs:
      - name: platform-slack

    alertDelay: 2m
    suppressDuplicatesFor: 30m

    severityOverrides:
      jobFailed: warning
      deadManTriggered: critical
      slaBreached: warning

    includeContext:
      logs: true
      events: true
      logLines: 30

    includeSuggestedFixes: true

Critical with Escalation

critical-escalation.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical

  deadManSwitch:
    enabled: true
    autoFromSchedule:
      enabled: true
      missedScheduleThreshold: 1

  sla:
    minSuccessRate: 99.9
    windowDays: 30

  alerting:
    channelRefs:
      # Critical goes to PagerDuty
      - name: pagerduty-critical
        severities:
          - critical
      # All severities to Slack
      - name: ops-slack
        severities:
          - critical
          - warning
          - info

    # No delay for critical jobs
    alertDelay: 0s
    suppressDuplicatesFor: 15m

    severityOverrides:
      jobFailed: critical
      deadManTriggered: critical
      slaBreached: critical

    includeContext:
      logs: true
      events: true
      podStatus: true
      logLines: 100

Low-Priority with Aggregation

low-priority.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: batch-jobs
  namespace: batch
spec:
  selector:
    matchLabels:
      tier: low

  sla:
    minSuccessRate: 80
    windowDays: 7

  alerting:
    channelRefs:
      - name: batch-slack
        severities:
          - critical
          - warning

    # Generous delays for low-priority
    alertDelay: 15m
    suppressDuplicatesFor: 4h

    severityOverrides:
      jobFailed: info             # Failures are just info
      deadManTriggered: warning   # Missing is warning
      slaBreached: warning

    includeContext:
      logs: true
      logLines: 20
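
One consequence of combining overrides with severity filters is worth checking in this example: jobFailed is downgraded to info, while batch-slack only accepts critical and warning, so individual job failures would presumably not be delivered anywhere. If that is not intended, a second channel that accepts info is one option. The fragment below is a sketch; batch-digest is a hypothetical AlertChannel name, not one from this page.

```yaml
spec:
  alerting:
    channelRefs:
      - name: batch-slack
        severities:
          - critical
          - warning
      # Hypothetical low-noise channel that also receives info alerts
      - name: batch-digest
        severities:
          - info
    severityOverrides:
      jobFailed: info             # Would now be routed to batch-digest
      deadManTriggered: warning
      slaBreached: warning
```

Leaving the original configuration as is remains a reasonable choice when failures should only surface through the SLA breach alert.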

Configuration Reference

| Field | Type | Description | Default |
|-------|------|-------------|---------|
| channelRefs | []ChannelRef | Alert channels to notify | Required |
| channelRefs[].name | string | AlertChannel resource name | Required |
| channelRefs[].severities | []string | Severities to send to this channel | All |
| alertDelay | duration | Wait before sending alert | 0s |
| suppressDuplicatesFor | duration | Suppress duplicate alerts | 0s |
| severityOverrides | map | Override default severities | - |
| includeContext | object | What to include in alerts | - |
| includeSuggestedFixes | bool | Include fix suggestions | true |
| suggestedFixPatterns | []Pattern | Custom fix patterns | - |
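
Put together, the fields in the table map onto an alerting block like the following. Values marked as defaults come from the table; the remaining values are taken from the examples on this page and are illustrative only.

```yaml
spec:
  alerting:
    channelRefs:                   # Required
      - name: team-slack           # Required: AlertChannel resource name
        severities:                # Default: all severities
          - critical
          - warning
          - info
    alertDelay: 0s                 # Default
    suppressDuplicatesFor: 0s      # Default
    severityOverrides:             # No default; example value
      jobFailed: warning
    includeContext:                # No default; example values
      logs: true
      logLines: 50
    includeSuggestedFixes: true    # Default
    suggestedFixPatterns:          # No default; example pattern
      - name: custom-pattern
        match:
          logPattern: "connection refused"
        suggestion: "Check database connectivity"
        priority: 150
```

Since every field except channelRefs and its name has a default or is optional, the shortest valid block is likely just a channelRefs list with one named channel.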