# Suggested Fixes
CronJob Guardian analyzes failure context and provides actionable fix suggestions in alerts. This helps on-call engineers quickly understand what went wrong and how to fix it.
## Overview
When a job fails, CronJob Guardian:
- Analyzes exit codes, termination reasons, logs, and events
- Matches against built-in and custom patterns
- Includes relevant fix suggestions in the alert
## Built-in Patterns
| Pattern | Trigger | Suggestion |
|---|---|---|
| OOMKilled | Reason: OOMKilled | Increase resources.limits.memory |
| SIGKILL (137) | Exit code 137 | Check for OOM, inspect pod state |
| SIGTERM (143) | Exit code 143 | Check activeDeadlineSeconds or eviction |
| ImagePullBackOff | Reason match | Verify image name and imagePullSecrets |
| CrashLoopBackOff | Reason match | Check application startup logs |
| ConfigError | Reason: CreateContainerConfigError | Verify Secret/ConfigMap references |
| DeadlineExceeded | Reason match | Increase deadline or optimize job |
| BackoffLimitExceeded | Reason match | Check logs from failed attempts |
| Evicted | Reason match | Check node pressure, set pod priority |
| FailedScheduling | Event pattern | Check resources, taints, affinity |
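These triggers correspond to fields Kubernetes records on the failed pod. For example, the OOMKilled and exit code 137 rows map to a terminated container state that looks roughly like this (container name and timestamps are illustrative):

```yaml
status:
  containerStatuses:
    - name: backup                        # illustrative container name
      lastState:
        terminated:
          exitCode: 137                   # 128 + 9 (SIGKILL)
          reason: OOMKilled
          startedAt: "2025-01-01T02:00:00Z"
          finishedAt: "2025-01-01T02:03:41Z"
```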
## Custom Patterns
Define custom patterns to match application-specific failures:
```yaml
spec:
  alerting:
    suggestedFixPatterns:
      - name: db-connection-failed
        match:
          logPattern: "connection refused.*:5432|ECONNREFUSED"
        suggestion: |
          PostgreSQL connection failed. Check:
          kubectl get pods -n {{.Namespace}} -l app=postgres
        priority: 150
      - name: s3-access-denied
        match:
          logPattern: "AccessDenied|NoCredentialProviders"
        suggestion: |
          S3 access denied. Verify:
          1. IAM role attached to service account
          2. Bucket policy allows access
        priority: 140
      - name: redis-timeout
        match:
          logPattern: "redis.*timeout|ETIMEDOUT.*:6379"
        suggestion: |
          Redis connection timeout. Check Redis health:
          kubectl exec -n {{.Namespace}} deploy/redis -- redis-cli ping
        priority: 130
```
## Match Conditions
Patterns can match on one or more of the following conditions:
### Exit Code
```yaml
match:
  exitCode: 1
```
### Termination Reason
```yaml
match:
  reason: "OOMKilled"
```
### Log Pattern (Regex)
```yaml
match:
  logPattern: "FATAL.*database connection failed"
```
### Event Pattern (Regex)
```yaml
match:
  eventPattern: "FailedScheduling.*Insufficient memory"
```
### Combined Conditions
All specified conditions must match:
```yaml
match:
  exitCode: 1
  logPattern: "connection refused"
```
## Priority System
Patterns are matched in priority order:
- Built-in patterns: priority 1-100
- Custom patterns: use priority 101+ to override built-ins
Higher-priority patterns are checked first.
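For example, a custom pattern with a priority above 100 takes precedence over the built-in OOMKilled suggestion and can carry environment-specific advice. A minimal sketch under `spec.alerting` (the pattern name and suggestion text are illustrative):

```yaml
suggestedFixPatterns:
  - name: custom-oom                # illustrative name
    match:
      reason: "OOMKilled"
    suggestion: |
      Backup jobs in this cluster usually need ~2Gi.
      Raise resources.limits.memory on the job's container.
    priority: 110                   # above the built-in range (1-100), so it is checked first
```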
## Template Variables
Suggestions support Go template variables:
| Variable | Description | Example |
|---|---|---|
| {{.Namespace}} | CronJob namespace | production |
| {{.Name}} | CronJob name | daily-backup |
| {{.JobName}} | Job name (with timestamp) | daily-backup-28374658 |
| {{.ExitCode}} | Container exit code | 137 |
| {{.Reason}} | Termination reason | OOMKilled |
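A suggestion can interpolate these variables to produce commands that are ready to paste. A sketch under `spec.alerting` (the pattern below is illustrative, not a built-in):

```yaml
suggestedFixPatterns:
  - name: inspect-failed-job        # illustrative pattern
    match:
      exitCode: 1
    suggestion: |
      Job {{.JobName}} in {{.Namespace}} exited with code {{.ExitCode}} ({{.Reason}}).
      Pull the logs from the failed attempt:
      kubectl logs -n {{.Namespace}} job/{{.JobName}} --all-containers
    priority: 120
```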
## Pattern Tester
Test patterns in the UI before deploying them:
1. Go to Settings > Pattern Tester
2. Enter match criteria (exit code, reason, log sample)
3. Define your pattern
4. Click Test to see if it matches
This helps validate patterns without waiting for real failures.
## Example: Complete Pattern Configuration
full-pattern-example.yaml
```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: production-jobs
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  alerting:
    channelRefs:
      - name: team-slack
    suggestedFixPatterns:
      # Database patterns
      - name: postgres-connection
        match:
          logPattern: "could not connect to server|connection refused.*:5432"
        suggestion: |
          PostgreSQL unreachable. Debug steps:
          1. Check postgres pod: kubectl get pods -n {{.Namespace}} -l app=postgres
          2. Check pg_isready: kubectl exec deploy/postgres -- pg_isready
          3. Check secrets: kubectl get secret postgres-credentials -o yaml
        priority: 150
      - name: postgres-auth-failed
        match:
          logPattern: "password authentication failed|FATAL.*authentication"
        suggestion: |
          PostgreSQL authentication failed. Verify credentials match:
          kubectl get secret postgres-credentials -n {{.Namespace}}
        priority: 149
      # API patterns
      - name: api-rate-limited
        match:
          logPattern: "429|rate limit|too many requests"
        suggestion: |
          External API rate limit hit. Consider:
          1. Reduce batch size in job config
          2. Add delays between API calls
          3. Contact API provider for limit increase
        priority: 140
      # Infrastructure patterns
      - name: disk-full
        match:
          logPattern: "no space left on device|ENOSPC"
        suggestion: |
          Disk space exhausted. Check PVC usage:
          kubectl exec -n {{.Namespace}} job/{{.JobName}} -- df -h
        priority: 160
```
## Alert Example
When a job fails and a pattern matches, the alert includes the suggestion:
```text
🚨 Job Failed: daily-backup

Exit Code: 137 (SIGKILL)
Reason: OOMKilled

💡 Suggested Fix:
Container was killed due to OOM. Increase memory limit:

kubectl patch cronjob daily-backup -p '
spec:
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            resources:
              limits:
                memory: 2Gi
'

Current limit: 512Mi
Recommended: 1-2Gi based on recent usage

Pod Logs (last 10 lines):
...
```
## Best Practices
- Start with built-ins: Built-in patterns cover most common failures
- Add app-specific patterns: Create patterns for your known failure modes
- Use specific matches: More specific patterns prevent false matches (see the sketch after this list)
- Include actionable commands: Give exact kubectl commands when possible
- Test before deploying: Use the Pattern Tester to validate
- Review periodically: Add patterns for recurring issues
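To illustrate the "use specific matches" point, compare an overly broad pattern with a targeted one (both patterns below are illustrative sketches, not built-ins):

```yaml
suggestedFixPatterns:
  # Too broad: "error" appears in almost any failed job's logs, so this fires constantly
  - name: generic-error
    match:
      logPattern: "error"
    suggestion: "Something went wrong."
    priority: 105

  # Specific: combines an exit code with distinctive log text, so it only fires on one failure mode
  - name: migration-lock-held
    match:
      exitCode: 1
      logPattern: "could not obtain advisory lock|migration table is locked"
    suggestion: |
      Another migration appears to hold the lock. Check for a stuck job:
      kubectl get jobs -n {{.Namespace}} -l app=db-migrate
    priority: 145
```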
## Related
- Alerting Configuration - Configure alert channels
- Dashboard - View failure details
- Examples - Complete monitor examples