This automation monitors OpenClaw cron jobs over SSH to catch silent failures, and applies only safe, non-destructive repairs that never modify schedules.

Deployment & Ops 📅 2026/03/18
#deployment #developer #fully-automated #medium-risk #SSH #event-triggered #reminders #logs #in-production #monitoring
cron is the quietest failure point in openclaw.

no errors, no alerts. you think the 7am daily briefing is running.
it stopped two days ago. if you don't check, you never know.

so on top of gateway maintenance, i added a second automation just for cron.

this one does 4 things:
1. SSH in, discover all jobs dynamically, classify recurring vs one-shot
2. separate real failures from normal behavior (quiet-hours skip, retry backoff, one-shot auto-delete are not failures)
3. smallest safe repair only. restart gateway, fix residue, re-enable accidentally disabled jobs. never touch schedules, prompts, or secrets
4. incidents get an alert. warnings get logged. same issue doesn't alert twice
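
to make points 2 and 4 concrete, here is a minimal python sketch of the triage and dedup logic. it is not what the automation actually runs (that is the prompt), just one way to picture it. the `Job` shape, the status values, and the reason strings are all hypothetical; the real `openclaw cron list --all --json` output may look nothing like this.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# normal behavior per the list above; these never count as failures
# (the reason strings themselves are made up for this sketch)
BENIGN_REASONS = {"quiet_hours_skip", "retry_backoff", "one_shot_auto_delete"}


@dataclass
class Job:
    # hypothetical shape, not the real openclaw JSON schema
    name: str
    recurring: bool               # False means one-shot
    last_status: str              # assumed values: "ok", "skipped", "failed"
    reason: Optional[str] = None


def is_real_failure(job: Job) -> bool:
    """A recurring job that failed for a reason not on the benign list."""
    return (
        job.recurring
        and job.last_status == "failed"
        and job.reason not in BENIGN_REASONS
    )


def triage(jobs: List[Job]) -> Tuple[str, List[str]]:
    """Return (severity, sorted names of failing jobs)."""
    failing = sorted(j.name for j in jobs if is_real_failure(j))
    if len(failing) >= 2:
        return "incident", failing   # several recurring jobs failing together
    if len(failing) == 1:
        return "warning", failing    # gets logged, not alerted
    return "healthy", failing


def should_alert(severity: str, failing: List[str], memory: Set[str]) -> bool:
    """Alert on incidents only, and only once per unchanged issue."""
    fingerprint = severity + ":" + ",".join(failing)
    if severity != "incident" or fingerprint in memory:
        return False
    memory.add(fingerprint)
    return True
```

the fingerprint is just severity plus the failing job names, so the same unchanged incident stays quiet on the next run, while a different set of failing jobs alerts again.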

full prompt (sanitized, replace with your own server address):

Maintain OpenClaw cron reliability with a single conservative automation. Read local docs before making any claims about commands or fixes. SSH to your server on its configured SSH port. Use the live service owner’s OpenClaw context as cron truth, backed by system service status and recent journal logs. Run openclaw status, openclaw gateway status, openclaw cron status --json, openclaw cron list --all --json, inspect recent cron runs, and discover jobs dynamically from the machine. Treat cron disabled, missing next wake, timer tick failures, unhealthy gateway runtime, or multiple recurring jobs failing together as incidents. Do not misclassify retry backoff, one-shot auto-delete, one-shot terminal disable, quiet-hours skips, duplicate delivery suppression, or intentionally paused jobs as scheduler failures. Apply only the smallest safe non-destructive repair such as restarting the gateway, running safe diagnostics, repairing the canonical symlink, fixing accidental root-owned residue, or re-enabling a recurring job only when strong evidence shows accidental disable. If a warning becomes chronic, alert once; otherwise alert only for incidents, and use automation memory to avoid repeat alerts for unchanged issues. If a significant issue is found, first write an incident markdown with severity, impact, evidence, repair attempted, current status, and next action, then send a short alert through the existing notification path. On Sunday morning, also run drift checks for backups, root-owned residue, and recent journal error patterns. Leave one inbox summary with healthy state, repaired issues, incidents, alerts sent, warnings, intentionally paused jobs, and blockers requiring human judgment. Never expose secrets, never weaken auth or access policy.
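
the prompt asks for an incident markdown before any alert goes out. a rough python sketch of what that file could contain is below. the field names come straight from the prompt; the `Incident` class, the layout, and the alert format are assumptions for illustration, not something openclaw defines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    # fields mirror the prompt; everything else here is hypothetical
    title: str
    severity: str
    impact: str
    evidence: List[str]
    repair_attempted: str
    current_status: str
    next_action: str


def render_incident(inc: Incident) -> str:
    """Render the incident as a small markdown document."""
    lines = [f"# {inc.title}", ""]
    lines.append(f"- severity: {inc.severity}")
    lines.append(f"- impact: {inc.impact}")
    lines.append("- evidence:")
    lines.extend(f"  - {item}" for item in inc.evidence)
    lines.append(f"- repair attempted: {inc.repair_attempted}")
    lines.append(f"- current status: {inc.current_status}")
    lines.append(f"- next action: {inc.next_action}")
    return "\n".join(lines) + "\n"


def short_alert(inc: Incident) -> str:
    """One-line text for the existing notification path."""
    return f"[{inc.severity}] {inc.title}: {inc.current_status}"
```

writing the markdown first means the short alert can stay short, since the evidence already lives in the incident file.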

gateway maintenance checks if the system is alive.
cron maintenance makes sure everything keeps running while you're not looking.