SLA Configuration

Configure success rate thresholds and duration monitoring for your CronJobs.

Success Rate

Track the percentage of successful job executions:

spec:
  sla:
    minSuccessRate: 95
    windowDays: 7

Field	Description	Default
`minSuccessRate`	Minimum acceptable success percentage (0-100)	95
`windowDays`	Rolling window, in days, used for the calculation	7

How It Works

Guardian tracks all job executions within the rolling window.
It calculates the success rate as successful / total * 100.
It alerts when the rate falls below `minSuccessRate`.
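
The calculation above can be sketched in a few lines of Python. This is an illustration only, not Guardian's implementation: the function names, the `(finished_at, succeeded)` tuple shape, and the choice to return `None` when the window contains no runs are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def success_rate(executions, window_days, now):
    """Return the success percentage over the rolling window, or None if empty.

    executions: iterable of (finished_at, succeeded) pairs (hypothetical shape).
    """
    cutoff = now - timedelta(days=window_days)
    in_window = [ok for finished_at, ok in executions if finished_at >= cutoff]
    if not in_window:
        return None  # assumption: no runs means nothing to evaluate
    return sum(in_window) / len(in_window) * 100


def breaches_sla(rate, min_success_rate):
    """True when a computed rate falls below the configured threshold."""
    return rate is not None and rate < min_success_rate


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
# One run per day for ten days; the run from three days ago failed.
runs = [(now - timedelta(days=d, hours=1), d != 3) for d in range(10)]

rate = success_rate(runs, window_days=7, now=now)
print(round(rate, 1), breaches_sla(rate, 95))
```

With `windowDays: 7` only the seven most recent runs count, so a single failure puts the rate at 6 / 7, which is below a threshold of 95.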

Example: 99% SLA

apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
spec:
  selector:
    matchLabels:
      tier: critical
  sla:
    minSuccessRate: 99
    windowDays: 30
  alerting:
    channelRefs:
      - name: pagerduty-critical

Duration Thresholds

Monitor job execution time:

spec:
  sla:
    maxDuration: 30m
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14

Field	Description	Default
`maxDuration`	Maximum acceptable duration for a single execution (for example, 30m or 1h)	-
`durationRegressionThreshold`	Percentage increase over the baseline that triggers an alert	50
`durationBaselineWindowDays`	Window, in days, used to calculate the baseline	14

Absolute Duration

Alert when any execution exceeds a fixed time:

spec:
  sla:
    maxDuration: 1h
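
As a rough sketch of what an absolute check amounts to, the Python below parses values such as `30m`, `1h`, or `2h` and compares them with an elapsed time. It assumes Go-style duration strings limited to hours, minutes, and seconds, and it assumes that a run lasting exactly `maxDuration` does not count as exceeding it. Neither assumption is confirmed by this page, and `parse_duration` and `exceeds_max_duration` are hypothetical helper names.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; the formats Guardian accepts may differ.
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text):
    """Parse strings such as '30m', '1h', or '1h30m' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input that has anything other than number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def exceeds_max_duration(elapsed, max_duration):
    """True when an execution ran strictly longer than max_duration."""
    return elapsed > parse_duration(max_duration)


print(exceeds_max_duration(timedelta(minutes=61), "1h"))
print(exceeds_max_duration(timedelta(minutes=45), "1h"))
```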

Regression Detection

Detect when jobs are getting slower over time:

spec:
  sla:
    durationRegressionThreshold: 50
    durationBaselineWindowDays: 14

This configuration:

Calculates a baseline duration from executions in the last 14 days
Alerts if the current duration exceeds that baseline by 50%
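
The comparison can be illustrated with a short Python sketch. This page does not say how Guardian derives the baseline, so the median used here is an assumption chosen for the example (a mean or a percentile would be equally plausible), and `duration_regressed` is a hypothetical name.

```python
from statistics import median


def duration_regressed(current_seconds, baseline_seconds, threshold_percent):
    """True when the current run exceeds the baseline by more than the threshold.

    baseline_seconds: durations of runs inside the baseline window.
    """
    if not baseline_seconds:
        return False  # assumption: no history means no regression alert
    baseline = median(baseline_seconds)  # assumed baseline statistic
    limit = baseline * (1 + threshold_percent / 100)
    return current_seconds > limit


# Durations in seconds from the last 14 days; the median is 600.
history = [600, 620, 580, 610, 590]

print(duration_regressed(850, history, 50))  # limit is 900 seconds
print(duration_regressed(950, history, 50))
```

With a threshold of 50 and a 600-second baseline, a run has to take longer than 900 seconds before it counts as a regression.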

Combined Example

full-sla.yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: etl-pipelines
spec:
  selector:
    matchLabels:
      type: etl
  sla:
    # Success rate
    minSuccessRate: 98
    windowDays: 7
    # Duration
    maxDuration: 2h
    durationRegressionThreshold: 30
    durationBaselineWindowDays: 14
  alerting:
    channelRefs:
      - name: ops-slack
      - name: pagerduty-critical
        severities: [critical]

SLA Dashboard

View SLA metrics in the dashboard:

Go to the SLA page
See compliance percentages for each monitor
Drill down to individual CronJobs
View historical trends

Alert Types

SLA-related alert types:

Type	Triggered When
`SLABreach`	Success rate drops below threshold
`DurationExceeded`	Job takes longer than `maxDuration`
`DurationRegression`	Duration exceeds the baseline by more than `durationRegressionThreshold` percent

Best Practices

Start lenient: Begin with lower thresholds, tighten over time
Use appropriate windows: Longer windows for less frequent jobs
Consider maintenance: Factor in planned downtime
Set per-criticality: Critical jobs need tighter SLAs
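
The advice about window length follows from simple arithmetic: the fewer runs a window contains, the more a single failure moves the rate. The sketch below uses a hypothetical helper, not part of Guardian, to compute how many runs a window needs before one failure still meets a given `minSuccessRate`. It assumes integer thresholds and that a rate exactly equal to the threshold passes.

```python
import math


def min_runs_to_tolerate_one_failure(min_success_rate):
    """Smallest run count n for which (n - 1) / n * 100 >= min_success_rate."""
    if min_success_rate >= 100:
        return None  # a 100% threshold never tolerates a failure
    # (n - 1) / n * 100 >= r  is equivalent to  n >= 100 / (100 - r)
    return math.ceil(100 / (100 - min_success_rate))


for threshold in (95, 98, 99):
    print(threshold, min_runs_to_tolerate_one_failure(threshold))

# A daily job in a 7-day window has only 7 runs, so one failure gives:
print(round(6 / 7 * 100, 1))
```

A daily job with `minSuccessRate: 95` needs roughly a 20-day window before a single failure stops triggering an alert, while an hourly job reaches 20 runs in under a day.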

Related

Selectors - CronJob selection patterns
Alerting - Alert configuration
SLA Tracking Feature - Feature overview
