Cron Job Monitoring: Complete Guide to Preventing Silent Failures (2026)
Published: March 2026
Category: Guides
Reading time: 12 minutes
Your cron jobs run every hour, every day, every week. But how do you know they're actually working?
Silent failures are the nightmare scenario: your backup script hasn't run in 3 weeks, your data sync stopped 5 days ago, or your report generation failed last Monday — and nobody noticed until production broke.
In this guide, we'll show you exactly how to monitor cron jobs to catch failures before they cause damage.
What Is Cron Job Monitoring?
Cron job monitoring is a dead man's switch for scheduled tasks. Instead of checking if a process is running, you check if it completes successfully.
How It Works
Traditional approach (doesn't work):
# This only tells you if cron is running, not if YOUR job succeeds
ps aux | grep cron
Cron job monitoring (actually works):
# Your job pings a monitoring service when it completes
0 2 * * * /path/to/backup.sh && curl -X POST https://cronmonitor.example/ping/abc123
If the ping doesn't arrive by 2:05 AM, you get an alert. Simple.
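A bare curl can itself hang or fail silently. A hardened variant of the same crontab line (the flags below are standard curl options; the URL is the same placeholder as above) times out and retries:

```shell
# -f: treat HTTP errors as failures, -sS: quiet but show errors,
# -m 10: give up after 10 seconds, --retry 3: retry transient failures
0 2 * * * /path/to/backup.sh && curl -fsS -m 10 --retry 3 -X POST https://cronmonitor.example/ping/abc123
```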
Why You Need Cron Job Monitoring
1. Silent Failures Are Common
Cron jobs fail silently for dozens of reasons:
- Disk full - script exits early, no error shown
- Permission denied - changed file ownership, job can't write
- Dependency missing - package updated, broke your script
- API rate limit - external service rejects your request
- Network timeout - slow connection kills the job
- Database locked - another process holding a lock
- Configuration drift - environment variable changed
- Resource exhaustion - server ran out of memory
The problem: cron doesn't alert you. It just logs to a file nobody reads.
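A cheap first mitigation while you set up real monitoring: redirect each job's output to a log you can actually read (the paths here are illustrative):

```shell
# Append stdout and stderr to a per-job log instead of local mail
0 2 * * * /path/to/backup.sh >> /var/log/backup.log 2>&1
```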
2. Production Disasters Start Small
Real-world example:
A SaaS company ran a nightly database backup cron job. It failed 12 days ago when the backup directory filled up. Nobody noticed until their primary database crashed.
Cost of that failure:
- 12 days of data lost
- 4 hours of downtime
- £15,000 in refunds
- Reputational damage
Prevention cost: £8/month for cron monitoring.
3. You Can't Manually Check Everything
Typical production server:
- 10-50 cron jobs running
- Different schedules (hourly, daily, weekly)
- Different owners (dev, ops, data team)
- Different criticality levels
Manual checks don't scale. Monitoring does.
How to Monitor Cron Jobs (Step by Step)
Step 1: Add a Ping Endpoint to Your Script
Every cron job should ping a monitoring service when it completes:
#!/bin/bash
# backup.sh
# Your actual backup logic
pg_dump mydb > /backups/mydb_$(date +%Y%m%d).sql
# If backup succeeds, ping the monitor
if [ $? -eq 0 ]; then
    curl -X POST https://cronmonitor.swiftlabs.dev/api/ping/YOUR_TOKEN_HERE
fi
Key points:
- Only ping after the work completes
- Use $? to check the exit code (0 = success)
- Use POST or GET (both work)
- Keep it at the end so failures don't trigger the ping
Step 2: Set the Expected Interval
Your monitoring service needs to know when to expect the ping:
- Daily job at 2 AM → expect ping every 24 hours
- Hourly job → expect ping every 60 minutes
- Weekly job → expect ping every 7 days
Grace period: Add 5-10 minutes buffer for slow scripts.
Example:
- Schedule: 0 2 * * * (daily at 2 AM)
- Expected interval: 24 hours
- Grace period: 10 minutes
- Alert if no ping by: 2:10 AM the next day
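The arithmetic a monitor applies here is simple; a sketch in shell (the timestamp is illustrative):

```shell
# Alert deadline = last ping + expected interval + grace period (all in seconds)
LAST_PING=1700000000            # Unix timestamp of the last ping
INTERVAL=$((24 * 3600))         # daily job
GRACE=$((10 * 60))              # 10-minute buffer
DEADLINE=$((LAST_PING + INTERVAL + GRACE))
echo $((DEADLINE - LAST_PING))  # prints 87000: seconds until an alert would fire
```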
Step 3: Configure Alerts
When a cron job misses its expected check-in, you need to know immediately.
Alert channels:
- Email - reliable, works everywhere
- Slack - team visibility, threaded discussion
- Discord - developer communities
- Webhook - integrate with PagerDuty, Opsgenie, etc.
- SMS - critical jobs only (costs per message)
Recovery alerts: When a missed job finally runs, get a "recovered" notification.
Step 4: Test Your Monitoring
Don't wait for a real failure to find out monitoring doesn't work.
Test process:
- Set up a test cron job (runs every 5 minutes)
- Let it ping successfully 2-3 times
- Stop the job (comment out the crontab line)
- Wait for the alert (should arrive within grace period)
- Restart the job
- Verify you get a "recovered" alert
Red flags:
- Alert didn't arrive (check email filters, webhook config)
- Alert arrived late (grace period too long)
- Multiple false alerts (grace period too short)
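The throwaway job for step 1 of the test process can be as small as one crontab line (the token is a placeholder):

```shell
# Temporary test job: succeeds trivially, pings every 5 minutes
*/5 * * * * /bin/true && curl -fsS https://cronmonitor.example/ping/test_token
```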
Advanced Monitoring Patterns
1. Monitor Script Exit Codes
Different exit codes mean different failures:
#!/bin/bash
# backup.sh
pg_dump mydb > /backups/mydb_$(date +%Y%m%d).sql
EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
    # Success - ping the monitor
    curl -X POST https://cronmonitor.example/ping/abc123
else
    # Failure - ping with error code
    curl -X POST "https://cronmonitor.example/ping/abc123?exit_code=$EXIT_CODE"
fi
Your monitoring service can track how jobs fail, not just that they fail.
2. Track Job Duration
Slow jobs often indicate problems:
#!/bin/bash
START_TIME=$(date +%s)
# Your job logic here
/path/to/heavy_task.sh
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
curl -X POST "https://cronmonitor.example/ping/abc123?duration=$DURATION"
Why this matters:
- Job that normally takes 5 minutes suddenly takes 45 minutes → database performance issue
- Incremental trend (5min → 6min → 8min → 12min) → data volume growing, optimization needed
3. Monitor Multi-Step Jobs
Complex cron jobs have multiple stages:
#!/bin/bash
# data-pipeline.sh
# Stage 1: Extract
curl https://api.example.com/data > /tmp/data.json || exit 1
# Stage 2: Transform
python3 /scripts/transform.py /tmp/data.json > /tmp/transformed.csv || exit 2
# Stage 3: Load
psql -c "COPY mytable FROM '/tmp/transformed.csv'" || exit 3
# All stages complete - ping monitor
curl -X POST https://cronmonitor.example/ping/abc123
Different exit codes (1, 2, 3) tell you which stage failed.
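An alternative to writing `|| exit N` on every stage is a bash ERR trap that reports the failing command's exit status automatically. A minimal sketch with the ping stubbed out as a function (`ping_monitor` is a name invented here; in production its body would be the curl call):

```shell
#!/bin/bash
# Sketch: report any stage failure via a single ERR trap.
ping_monitor() {
    # In production, something like:
    # curl -fsS -X POST "https://cronmonitor.example/ping/abc123?exit_code=$1"
    echo "ping exit_code=$1"
}
trap 'ping_monitor $?' ERR

false   # simulate a failing stage: the trap fires with exit code 1
```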
4. Create Job Groups
Related jobs should be monitored together:
Example: E-commerce nightly batch
- generate_reports.sh - must complete by 6 AM
- send_reports.sh - depends on reports, must complete by 7 AM
- cleanup_temp_files.sh - runs after reports, must complete by 8 AM
Group monitoring shows dependencies:
- If generate_reports fails, send_reports will also fail
- If cleanup fails but reports succeed, it's low priority
Common Cron Job Monitoring Mistakes
❌ Mistake 1: Ping at the Start, Not the End
# WRONG - pings before work is done
curl -X POST https://monitor.example/ping/abc123
/path/to/backup.sh # If this fails, monitor thinks it succeeded
Fix: Always ping after the work completes.
❌ Mistake 2: No Grace Period
Scenario:
- Cron job scheduled: 0 2 * * *
- Expected interval: exactly 24 hours
- Job takes 3 minutes to complete
Problem: Job runs at 2:00 AM but doesn't ping until 2:03 AM. Monitor sees this as "3 minutes late" and sends a false alert.
Fix: Add grace period (5-10 minutes for normal jobs, 30+ minutes for heavy jobs).
❌ Mistake 3: Monitoring the Wrong Thing
# WRONG - monitors if the cron daemon is running
*/5 * * * * systemctl is-active cron && curl https://monitor.example/ping/abc123
This tells you if cron itself is running, not if your job succeeds.
Fix: Monitor the actual work, not the scheduler.
❌ Mistake 4: Same Token for Multiple Jobs
# WRONG - both jobs use the same token
0 2 * * * /backup.sh && curl https://monitor.example/ping/abc123
0 3 * * * /cleanup.sh && curl https://monitor.example/ping/abc123
Problem: You can't tell which job failed.
Fix: One unique token per job.
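The fix in crontab form (tokens are placeholders):

```shell
# RIGHT - each job has its own token, so alerts identify the job
0 2 * * * /backup.sh && curl https://monitor.example/ping/backup_abc
0 3 * * * /cleanup.sh && curl https://monitor.example/ping/cleanup_def
```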
❌ Mistake 5: No Alert Testing
You set up monitoring, assume it works, never test it.
3 months later: Job fails, no alert arrives, you find out when a customer complains.
Fix: Test alerts every time you set up a new monitor.
Cron Job Monitoring Checklist
Before you call a cron job "production ready," verify:
- Script pings monitor on success (curl/wget at the end)
- Expected interval configured (matches cron schedule)
- Grace period set (5-10 min for fast jobs, 30+ for slow)
- Alert destination tested (email/Slack/webhook works)
- Recovery alert enabled (know when it starts working again)
- Exit codes logged (helps with debugging)
- Duration tracked (catch performance degradation early)
- Documentation exists (who owns this job? what does it do?)
Choosing a Cron Job Monitoring Service
What to Look For
1. Simple setup
- Webhook/ping URL (not agent installation)
- Works with any language (bash, Python, Node, etc.)
- No code changes to existing scripts
2. Flexible scheduling
- Handles irregular intervals (weekly, monthly, custom)
- Grace period configuration
- Timezone support
3. Reliable alerting
- Multiple channels (email, Slack, webhook)
- No missed alerts (99.9%+ uptime)
- Clear "down" vs "recovered" notifications
4. Useful history
- Shows last 10-50 pings
- Tracks duration trends
- Logs exit codes
5. Fair pricing
- Free tier for small projects (3-5 monitors)
- Affordable paid tier (£5-15/month)
- No per-alert billing
Popular Options
| Service | Free Tier | Price | Best For |
|---|---|---|---|
| CronMonitor | 3 monitors | £8/month unlimited | Simple ping-based monitoring |
| Healthchecks.io | 20 monitors | $5/month (80 checks) | Open source, self-hostable |
| Cronitor | 5 monitors | $10/month | Advanced features, integrations |
| Better Uptime | 10 monitors | $20/month | Enterprise, incident management |
| Dead Man's Snitch | 0 (paid only) | $5/month (5 snitches) | Minimal, focused |
Recommendation: Start with a free tier, test it for a week, then upgrade if it works for you.
Self-Hosted Cron Monitoring
Don't want to pay for a service? You can build your own.
Minimal Self-Hosted Monitor (20 Lines)
# monitor.py - run this as a web service
from flask import Flask
import time

app = Flask(__name__)
last_ping = {}

@app.route('/ping/<token>', methods=['GET', 'POST'])
def ping(token):
    last_ping[token] = time.time()
    return "OK", 200

@app.route('/check/<token>')
def check(token):
    if token not in last_ping:
        return "Never pinged", 404
    age = time.time() - last_ping[token]
    if age > 86400:  # 24 hours
        return f"LATE: {age/3600:.1f} hours since last ping", 500
    return f"OK: Last ping {age/60:.0f} minutes ago", 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Usage:
# Your cron job pings it
curl http://localhost:5000/ping/backup_job
# Check status manually
curl http://localhost:5000/check/backup_job
Limitations:
- No alerts (add email/Slack integration)
- No persistence (restarts lose data - add SQLite)
- No grace periods (all jobs expect 24h interval)
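One low-effort way to add alerts to the sketch above is to poll the /check endpoint from cron itself. curl's -f flag turns the HTTP 500 "late" response into a shell failure; the mail address is a placeholder:

```shell
# Every 10 minutes: if /check reports late (or the monitor is down), email the team
*/10 * * * * curl -fsS http://localhost:5000/check/backup_job > /dev/null || echo "backup_job is late" | mail -s "Cron alert: backup_job" ops@example.com
```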
When to self-host:
- You have <5 jobs
- You don't need alerts
- You already run your own servers
- You want full control
When to use a service:
- You need reliability (99.9% uptime)
- You want alerts without building them
- Your time is worth more than £8/month
Real-World Cron Job Monitoring Examples
1. Database Backups
Schedule: Daily at 2 AM
#!/bin/bash
# /scripts/backup_db.sh
BACKUP_DIR="/backups"
DATE=$(date +%Y%m%d)
MONITOR_URL="https://cronmonitor.swiftlabs.dev/api/ping/db_backup_token"
# Create backup
pg_dump production > "$BACKUP_DIR/production_$DATE.sql"
if [ $? -eq 0 ]; then
    # Verify backup is not empty
    if [ -s "$BACKUP_DIR/production_$DATE.sql" ]; then
        # Backup successful and non-empty
        curl -X POST "$MONITOR_URL?status=success"
    else
        # Backup file is empty - this is a failure
        curl -X POST "$MONITOR_URL?status=failure&reason=empty_backup"
        exit 1
    fi
else
    # pg_dump failed
    curl -X POST "$MONITOR_URL?status=failure&reason=pgdump_error"
    exit 1
fi
Monitor settings:
- Expected interval: 24 hours
- Grace period: 15 minutes
- Alert: Email + Slack
2. API Data Sync
Schedule: Every hour
#!/bin/bash
# /scripts/sync_api_data.sh
START=$(date +%s)
MONITOR_URL="https://cronmonitor.swiftlabs.dev/api/ping/api_sync_token"
# Fetch data from API
curl -s https://api.example.com/data | jq '.' > /tmp/api_data.json
if [ ${PIPESTATUS[0]} -eq 0 ]; then
    # Process data
    python3 /scripts/process_data.py /tmp/api_data.json
    END=$(date +%s)
    DURATION=$((END - START))
    curl -X POST "$MONITOR_URL?duration=$DURATION"
else
    curl -X POST "$MONITOR_URL?status=failure&reason=api_fetch_failed"
    exit 1
fi
Monitor settings:
- Expected interval: 60 minutes
- Grace period: 5 minutes
- Alert: Slack only (high frequency job)
3. Report Generation
Schedule: Weekly on Monday at 9 AM
#!/bin/bash
# /scripts/generate_weekly_report.sh
MONITOR_URL="https://cronmonitor.swiftlabs.dev/api/ping/weekly_report_token"
# Generate report
Rscript /scripts/weekly_report.R --output /reports/weekly_$(date +%Y%m%d).pdf
if [ $? -eq 0 ]; then
    # Email report to stakeholders
    echo "Weekly report attached" | mail -s "Weekly Report" -A /reports/weekly_*.pdf team@example.com
    # Ping monitor
    curl -X POST "$MONITOR_URL"
else
    curl -X POST "$MONITOR_URL?status=failure&reason=report_generation_failed"
    exit 1
fi
Monitor settings:
- Expected interval: 7 days
- Grace period: 30 minutes
- Alert: Email + SMS (critical business report)
Debugging Failed Cron Jobs
When monitoring alerts you to a failure, here's how to debug:
1. Check Cron Logs
On Linux:
# View recent cron activity
grep CRON /var/log/syslog | tail -20
# Check mail (cron sends output here by default)
mail
On macOS:
# Cron logs to system log
log show --predicate 'process == "cron"' --last 1h
2. Run the Job Manually
# Run as the same user cron uses
sudo -u cronuser /path/to/script.sh
# Check exit code
echo $?
Common manual-vs-cron differences:
- Different $PATH (cron has a minimal PATH)
- Different $HOME (cron might run as a different user)
- Different environment variables
- Different working directory
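To reproduce cron's sparse environment when testing by hand, env -i starts from a clean slate (the values shown are typical of cron's defaults, not exact):

```shell
# Run a command with only the variables cron would typically provide
env -i HOME=/tmp PATH=/usr/bin:/bin SHELL=/bin/sh /bin/sh -c 'echo "PATH is $PATH"'
# prints: PATH is /usr/bin:/bin
```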
3. Add Verbose Logging
#!/bin/bash
# Add at the top of your script
exec 2>> /var/log/myscript_errors.log
set -x # Print every command before executing
# Your script continues...
This logs all errors and shows exactly which command failed.
4. Check Resource Limits
# Check disk space
df -h
# Check memory
free -h
# Check inode usage (can run out even with free space)
df -i
5. Verify Permissions
# Check file ownership
ls -la /path/to/script.sh
# Check directory permissions
ls -lad /path/to/output_directory
# Run with explicit user context
sudo -u cronuser touch /path/to/output_directory/test.txt
Key Takeaways
1. Silent failures are the biggest risk
- Cron doesn't alert you when jobs fail
- Production disasters start with one missed backup
- Monitoring prevents weeks of undetected failures
2. Ping-based monitoring is simple and reliable
- Add one curl line to your script
- No agents, no complicated setup
- Works with any language or platform
3. Test your monitoring
- Don't wait for a real failure
- Stop a job intentionally and verify alerts work
- Test recovery notifications too
4. Set appropriate grace periods
- Too short → false alerts
- Too long → delayed detection
- Start with 5-10 minutes, adjust based on job duration
5. Monitor what matters
- Focus on critical jobs first (backups, data sync, billing)
- Add monitoring to new jobs immediately
- Review monitoring coverage quarterly
Next Steps:
- List all your cron jobs (crontab -l)
- Identify critical jobs (what breaks if this fails?)
- Add monitoring to top 3 critical jobs
- Test alerts
- Expand to remaining jobs