Operate and Scale
Lyft Data operations focus on keeping the control plane healthy, the job fleet productive, and telemetry flowing to the right places. Use this page as the jumping-off point for your runbooks.
Daily checklist
- Watch the live job status feed to confirm deployments and executions are progressing.
- Track worker health in the UI or metrics stack; investigate high queue depth or stalled workers immediately.
- Review alerts and recent log spikes; most production issues surface first through
/var/log/lyftdata.log or your central logging sink.
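As one way to make the log-spike check repeatable, the sketch below counts warning and error lines per minute and flags minutes that stand well above the typical rate. It rests on assumptions: the line shape (an ISO-8601 timestamp followed by a level name) is hypothetical and not taken from the Lyft Data log format, the function names `count_by_minute` and `find_spikes` are illustrative, and the thresholds are placeholders to tune against your own baseline.

```python
import re
from collections import Counter

# ASSUMPTION: lines look like "2024-05-01T12:34:56Z ERROR message ...".
# This shape is hypothetical; adjust the pattern to the real log format.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+(\w+)")


def count_by_minute(lines, levels=("ERROR", "WARN")):
    """Count lines per minute whose level is in `levels`."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2).upper() in levels:
            counts[match.group(1)] += 1  # key is the timestamp truncated to the minute
    return counts


def find_spikes(counts, factor=3.0, min_count=5):
    """Return minutes whose count is at least max(min_count, factor * median)."""
    if not counts:
        return []
    values = sorted(counts.values())
    median = values[len(values) // 2]  # upper median for even-sized inputs
    threshold = max(min_count, factor * median)
    return sorted(minute for minute, n in counts.items() if n >= threshold)
```

To use it, pass any iterable of lines, such as an open handle on `/var/log/lyftdata.log`, to `count_by_minute` and feed the result to `find_spikes`. Only minutes with at least one matching line are counted, so the median reflects active minutes rather than the whole day.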
Runbooks by theme
- Daily operations - Daily operations playbook keeps the control plane healthy with checklists and drills.
- Observability & alerts - Monitoring Lyft Data covers metrics, dashboards, and alert wiring.
- Resilience & recovery - Backup & recovery explains snapshot cadence, restores, and disaster recovery tests.
- Worker provisioning - Worker auto enrollment covers shared-secret bootstrap flows and what to disable afterwards.
- Capacity planning - Scaling Lyft Data walks through worker sizing, channel fan-out strategies, and deployment hygiene.
- Security posture - Security hardening documents TLS, secret rotation, and RBAC guidance.
Releases and change management
- Watch Operate -> Releases for published changelogs and upgrade notes.
- Capture outstanding operational work in the shared trackers under
docs/operate/ to keep improvements visible to the team.
Where to go next
- Follow the Daily operations playbook for your everyday checklist and weekly reviews.
- Set up dashboards using the Monitoring guide and plan capacity with the Scaling runbook.
- Harden your deployment via Security guidance and Backup & recovery.
- Track upcoming changes in the release notes and communicate upgrades to stakeholders.