
Daily Operations

This playbook is aimed at operators and SREs who need a repeatable cadence for keeping Lyft Data healthy. Use it alongside the detailed runbooks in Operate & Scale.

Daily checklist (15 minutes)

  • Review the dashboard for worker status, queue depth, and recent alerts.
  • Scan the job status feed for stalled deployments or retry storms.
  • Check licensing state in the UI or via /api/license/show so Community Edition limits or expiring keys are flagged early.
  • Spot-check server and worker logs for new error signatures (for example, journalctl -u lyftdata-server for the server, plus the worker logs).
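
The license and log checks above can be scripted. The following is a minimal Python sketch, not an official client. The base URL, the optional bearer token, and the convention that error lines contain the text ERROR are all assumptions. The response shape of /api/license/show is not described here, so the script simply prints whatever JSON comes back.

```python
import json
import re
import subprocess
import urllib.request
from collections import Counter

BASE_URL = "http://localhost:3000"  # hypothetical address; use your server's


def license_url(base_url):
    """Join the server base URL with the /api/license/show path."""
    return base_url.rstrip("/") + "/api/license/show"


def fetch_license(base_url, token=None, timeout=10.0):
    """Fetch the license state as parsed JSON."""
    request = urllib.request.Request(license_url(base_url))
    if token:
        # Assumption: bearer-token auth; adjust to how your deployment authenticates.
        request.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def signature(line):
    """Collapse hex ids and numbers so repeated errors group together."""
    line = re.sub(r"0x[0-9a-fA-F]+|\b[0-9a-fA-F]{8,}\b", "<id>", line)
    return re.sub(r"\d+", "<n>", line).strip()


def error_signatures(lines, marker="ERROR"):
    """Count distinct error signatures among lines containing the marker."""
    return Counter(signature(line) for line in lines if marker in line)


def recent_server_logs(unit="lyftdata-server", since="24 hours ago"):
    """Read recent journal messages for the server unit (message text only)."""
    result = subprocess.run(
        ["journalctl", "-u", unit, "--since", since, "--no-pager", "-o", "cat"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(json.dumps(fetch_license(BASE_URL), indent=2))
    for sig, count in error_signatures(recent_server_logs()).most_common(10):
        print(count, sig)
```

Comparing today's top signatures with yesterday's output is a quick way to notice a signature that was not there before.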

Weekly tasks

  • Review worker utilization trends (see the monitoring guide) and plan a scale-up if CPU usage or queue depth is trending upward.
  • Validate backups and retention using the backup & recovery checklist.
  • Audit user accounts and API keys (rotate stale credentials, remove unused workers).
  • Capture notable changes (new connectors, job migrations) in your team runbook.
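
One way to make "trending high" concrete in the weekly review is to fit a straight line to recent samples and project it forward. The sketch below assumes you can export evenly spaced samples (for example, one daily CPU average per worker) from your monitoring system. The function names, the limit, and the horizon are illustrative choices, not product settings.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples, in units per sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def projected(values, horizon):
    """Value of the fitted line `horizon` samples after the last sample."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    return mean_y + slope_per_sample(values) * ((n - 1 + horizon) - mean_x)


def needs_scale_up(values, limit, horizon):
    """True if the linear projection crosses `limit` within `horizon` samples."""
    return projected(values, horizon) > limit


if __name__ == "__main__":
    daily_cpu_percent = [10, 20, 30, 40]  # made-up sample data
    print(needs_scale_up(daily_cpu_percent, limit=80, horizon=5))
```

A linear fit is deliberately crude. It is meant to prompt a closer look at the dashboards, not to replace capacity planning.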

Before deploying changes

  • Stage jobs in lower environments and verify them with Run & Trace.
  • Check the release notes for upgrade guidance or known issues.
  • Ensure alerting/metrics dashboards reflect any new jobs or channels.
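
The last check above amounts to a set difference: anything you are about to deploy that no alert or dashboard references yet. A minimal sketch follows, assuming you can list job names from both sides. The job names and the function name are made up for illustration.

```python
def missing_coverage(deployed_jobs, monitored_jobs):
    """Return jobs that are deployed but not referenced by any alert or dashboard."""
    return sorted(set(deployed_jobs) - set(monitored_jobs))


if __name__ == "__main__":
    deployed = ["ingest-syslog", "ship-to-s3", "enrich-geo"]  # hypothetical names
    monitored = ["ingest-syslog", "ship-to-s3"]               # hypothetical names
    for job in missing_coverage(deployed, monitored):
        print("no alert or dashboard covers:", job)
```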

Incident drills & readiness

  • Rehearse the escalation path for worker failures (who owns remediation?).
  • Test the process for draining jobs and restarting workers safely.
  • Validate that error budgets or SLIs are defined and monitored (pair this with the scaling guidance).
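
For teams that have not yet defined an error budget, the arithmetic is short. This sketch uses the common SRE convention that the budget is the share of events allowed to fail under the SLO target. The 99.9% target and the event counts are example values, not recommendations for Lyft Data.

```python
def error_budget_remaining(slo_target, total_events, failed_events):
    """Fraction of the error budget still unspent (negative means overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events == 0:
        return 1.0  # nothing processed, so nothing spent
    allowed_failures = (1 - slo_target) * total_events
    return 1 - failed_events / allowed_failures


if __name__ == "__main__":
    # Example: 99.9% target, 1,000,000 events processed, 250 failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
```

A value that drops quickly between weekly reviews is a reasonable trigger for pausing risky deployments and rehearsing the escalation path.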

Resources

Keep this page bookmarked as the starting point for day-to-day operations and link it in your incident response handbook.