SLA Configuration
Configure success rate thresholds and duration monitoring for your CronJobs.
Success Rate
Track the percentage of successful job executions:
spec:
  sla:
    minSuccessRate: 95
    windowDays: 7
| Field | Description | Default |
|---|---|---|
| minSuccessRate | Minimum acceptable success percentage (0-100) | 95 |
| windowDays | Rolling window for calculation, in days | 7 |
How It Works
- Guardian tracks all job executions within the window
- Calculates successful / total * 100
- Alerts when rate falls below threshold
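The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Guardian's implementation: the `Execution` type, the function names, and the handling of an empty window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Execution:
    """Hypothetical record of one job run (not a Guardian type)."""
    finished_at: datetime  # when the run completed
    succeeded: bool        # True if the run finished successfully


def success_rate(executions, window_days, now):
    """Return successful / total * 100 over the rolling window, or None if empty."""
    cutoff = now - timedelta(days=window_days)
    in_window = [e for e in executions if e.finished_at >= cutoff]
    if not in_window:
        # Assumption: with no runs in the window there is nothing to evaluate.
        return None
    successful = sum(1 for e in in_window if e.succeeded)
    return successful / len(in_window) * 100


def breaches_sla(rate, min_success_rate):
    """True only when the rate falls below the threshold (equal is still compliant)."""
    return rate is not None and rate < min_success_rate
```

For instance, 7 successes out of 8 runs in the window gives 87.5, which breaches a `minSuccessRate` of 95.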
Example: 99% SLA
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
spec:
  selector:
    matchLabels:
      tier: critical
  sla:
    minSuccessRate: 99
    windowDays: 30
  alerting:
    channelRefs:
      - name: pagerduty-critical
Duration Thresholds
Monitor job execution time:
spec:
  sla:
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
| Field | Description | Default |
|---|---|---|
| maxDuration | Maximum acceptable duration | - |
| durationRegressionThreshold | Percentage increase that triggers alert | 50 |
| durationBaselineWindowDays | Window for baseline calculation, in days | 14 |
Absolute Duration
Alert when any execution exceeds a fixed time:
spec:
  sla:
    maxDuration: 1h
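As a rough illustration of the check, the sketch below parses values such as `30m`, `1h`, or `2h` and compares a run's duration against the limit. It assumes `maxDuration` uses hour/minute/second suffixes as in the examples on this page. The `parse_duration` and `exceeds_max_duration` helpers are hypothetical names and only cover that subset.

```python
import re
from datetime import timedelta

# Seconds per unit suffix; only the suffixes seen in this page's examples, plus seconds.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Parse strings such as '30m', '1h', or '1h30m' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input that contains anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def exceeds_max_duration(run_duration, max_duration):
    """True when a single run took strictly longer than the configured limit."""
    return run_duration > parse_duration(max_duration)
```

With `maxDuration: 1h`, a 61-minute run would be flagged while a run of exactly 60 minutes would not, under the strict comparison assumed here.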
Regression Detection
Detect when jobs are getting slower over time:
spec:
  sla:
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14
This configuration:
- Calculates baseline from the last 14 days
- Alerts if current duration exceeds baseline by 50%
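The comparison can be sketched as follows. The page does not say which statistic Guardian uses for the baseline, so the median here is purely an assumption for illustration, as are the function names.

```python
from statistics import median


def regression_percent(baseline_durations, current_duration):
    """Percentage by which current_duration exceeds the baseline (negative if faster).

    Durations are plain numbers in the same unit, for example seconds.
    """
    # Assumption: the median of runs in the baseline window stands in for the baseline.
    baseline = median(baseline_durations)
    return (current_duration - baseline) * 100 / baseline


def is_regression(baseline_durations, current_duration, threshold_percent=50):
    """True when the increase over the baseline is greater than the threshold."""
    return regression_percent(baseline_durations, current_duration) > threshold_percent
```

If runs over the last 14 days took around 100 seconds, a 160-second run is a 60% increase and would alert at a threshold of 50, while a 150-second run sits exactly at the threshold and would not under this strict comparison.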
Combined Example
full-sla.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: etl-pipelines
spec:
  selector:
    matchLabels:
      type: etl
  sla:
    # Success rate
    minSuccessRate: 98
    windowDays: 7
    # Duration
    maxDuration: 2h
    durationRegressionThreshold: 30
    durationBaselineWindowDays: 14
  alerting:
    channelRefs:
      - name: ops-slack
      - name: pagerduty-critical
        severities: [critical]
SLA Dashboard
View SLA metrics in the dashboard:
- Go to the SLA page
- See compliance percentages per monitor
- Drill down to individual CronJobs
- View historical trends
Alert Types
SLA-related alert types:
| Type | Triggered When |
|---|---|
| SLABreach | Success rate drops below threshold |
| DurationExceeded | Job takes longer than maxDuration |
| DurationRegression | Duration increases beyond baseline |
Best Practices
- Start lenient: Begin with lower thresholds, tighten over time
- Use appropriate windows: Longer windows for less frequent jobs
- Consider maintenance: Factor in planned downtime
- Set per-criticality: Critical jobs need tighter SLAs
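To see why window length matters for infrequent jobs, it can help to work out how many failures a window can absorb before the rate drops below the threshold. The helper below is a hypothetical back-of-the-envelope calculation. It assumes a fixed number of runs per day and an integer `minSuccessRate`.

```python
def allowed_failures(runs_per_day, window_days, min_success_rate):
    """Most failures the window can absorb while the rate stays at or above the threshold."""
    total_runs = runs_per_day * window_days
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return total_runs * (100 - min_success_rate) // 100


# A daily job over 7 days has only 7 runs: one failure is already a breach at 95.
daily_week = allowed_failures(runs_per_day=1, window_days=7, min_success_rate=95)

# The same daily job over 30 days can absorb a single failure at 95.
daily_month = allowed_failures(runs_per_day=1, window_days=30, min_success_rate=95)

# An hourly job over 30 days has 720 runs, leaving some headroom even at 99.
hourly_month = allowed_failures(runs_per_day=24, window_days=30, min_success_rate=99)
```

A result of zero means any single failure inside the window triggers an alert, which is usually a sign that the window is too short or the threshold too strict for that schedule.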
Related
- Selectors - CronJob selection patterns
- Alerting - Alert configuration
- SLA Tracking Feature - Feature overview