Daily Operations
This playbook is aimed at operators and SREs who need a repeatable cadence for keeping Lyft Data healthy. Use it alongside the detailed runbooks in Operate & Scale.
Daily checklist (15 minutes)
- Review the dashboard for worker status, queue depth, and recent alerts.
- Scan the job status feed for stalled deployments or retry storms.
- Check licensing state in the UI or via /api/license/show so Community Edition limits or expiring keys are flagged early.
- Spot-check server and worker logs for new error signatures (journalctl -u lyftdata-server, worker logs).
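The log spot-check can be partly automated. The following is a minimal sketch, not a Lyft Data feature: it assumes error lines contain one of the keywords ERROR, FATAL, or panic (adjust these to whatever your logs actually emit), and it treats numbers and long hex strings as variable parts of a message. The function names are hypothetical.

```python
import re
from collections import Counter

# Parts of a message that usually differ between otherwise identical errors.
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b|\b[0-9a-f]{8,}\b")
_NUM = re.compile(r"\d+")


def signature(line: str) -> str:
    """Collapse variable parts of a log line into placeholders."""
    line = _HEX.sub("<hex>", line)
    return _NUM.sub("<n>", line).strip()


def error_signatures(lines, keywords=("ERROR", "FATAL", "panic")):
    """Count normalized signatures for lines containing any keyword.

    The default keywords are an assumption, not a documented log format.
    """
    counts = Counter()
    for line in lines:
        if any(k in line for k in keywords):
            counts[signature(line)] += 1
    return counts


def new_signatures(today, baseline):
    """Return signatures seen today that are absent from the baseline set."""
    return {sig: n for sig, n in today.items() if sig not in baseline}
```

One way to feed it is to capture message text with journalctl -u lyftdata-server --since today -o cat (the -o cat output mode omits timestamps, so identical errors normalize to the same signature) and pass the resulting lines to error_signatures. Keeping yesterday's signatures as the baseline makes genuinely new errors stand out.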
Weekly tasks
- Review worker utilization trends using the monitoring guide, and plan a scale-up if CPU or queue depth is trending high.
- Validate backups and retention using the backup & recovery checklist.
- Audit user accounts and API keys (rotate stale credentials, remove unused workers).
- Capture notable changes (new connectors, job migrations) in your team runbook.
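"Trending high" in the utilization review can be made concrete with a simple projection. The sketch below is illustrative only: it assumes you have exported evenly spaced samples (for example, one daily average of queue depth or CPU percent) from your monitoring system, fits a least-squares line, and checks whether that line reaches a threshold within a chosen horizon. The function names, threshold, and horizon are hypothetical choices, not Lyft Data defaults.

```python
def fit_line(samples):
    """Least-squares fit over evenly spaced samples.

    Returns (slope, intercept), with x taken as the sample index 0..n-1.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_x


def trending_high(samples, threshold, horizon):
    """True if the fitted line reaches `threshold` within `horizon` more samples."""
    slope, intercept = fit_line(samples)
    projected = intercept + slope * (len(samples) - 1 + horizon)
    return projected >= threshold


# Seven daily queue-depth averages; ask whether depth 100 is reached within a week.
week = [10, 20, 30, 40, 50, 60, 70]
needs_scale_up = trending_high(week, threshold=100, horizon=7)
```

A straight-line fit ignores weekly seasonality and bursts, so treat the result as a prompt to look at the dashboard rather than as a scaling decision.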
Before deploying changes
- Stage jobs and verify via Run & Trace in lower environments.
- Check the release notes for upgrade guidance or known issues.
- Ensure alerting/metrics dashboards reflect any new jobs or channels.
Incident drills & readiness
- Rehearse the escalation path for worker failures (who owns remediation?).
- Test the process for draining jobs and restarting workers safely.
- Validate that error budgets or SLIs are defined and monitored (pair with scaling guidance).
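When validating error budgets, it helps to have the arithmetic written down. This is a generic sketch of the usual calculation, assuming an event-based SLI (good events divided by total events over a window). What counts as an event, such as a job run or a delivered batch, and the SLO target itself are choices your team makes; nothing here is prescribed by Lyft Data, and the function name is hypothetical.

```python
def error_budget(slo_target, total_events, bad_events):
    """Summarize error-budget use for an event-based SLI.

    slo_target is a fraction such as 0.999; the events are whatever your
    team has agreed to count (an assumption of this sketch).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo_target) * total_events
    return {
        "sli": (total_events - bad_events) / total_events,
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_events,
        "budget_consumed": bad_events / allowed_bad,
    }


# Example: a 99% target over 10,000 job runs with 25 failures.
report = error_budget(slo_target=0.99, total_events=10_000, bad_events=25)
```

In this example the target allows about 100 failed runs, so 25 failures consume roughly a quarter of the budget. A budget_consumed value above 1 means the SLO has been missed for the window.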
Resources
- Monitoring runbook: /operate/monitoring
- Scaling playbook: /operate/scaling
- Backup & recovery: /operate/backup
- Security hardening: /operate/security
Keep this page bookmarked as the starting point for day-to-day operations and link it in your incident response handbook.