Operate and Scale

Lyft Data operations focus on keeping the control plane healthy, the job fleet productive, and telemetry flowing to the right places. Use this page as the jumping-off point for your runbooks.

Daily checklist

  • Watch the live job status feed to confirm deployments and executions are progressing.
  • Track worker health in the UI or metrics stack; investigate high queue depth or stalled workers immediately.
  • Review alerts and recent log spikes. Most production issues surface first in /var/log/lyftdata.log or your central logging sink.
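
As a rough aid for the log review step, the sketch below counts error and warning lines in a log file and flags when errors pass a threshold. It assumes the log lines contain plain-text level tokens such as ERROR and WARN; the actual format of /var/log/lyftdata.log is not documented here, so adjust the tokens to match. The function names and the threshold value are hypothetical and not part of Lyft Data.

```python
from collections import Counter
from typing import Iterable

# Assumed level tokens; change these to match your actual log format.
LEVELS = ("ERROR", "WARN")


def count_levels(lines: Iterable[str], levels=LEVELS) -> Counter:
    """Count lines per level; each line counts once, for the first token in `levels` it contains."""
    counts = Counter()
    for line in lines:
        for level in levels:
            if level in line:
                counts[level] += 1
                break
    return counts


def exceeds_threshold(counts: Counter, level: str = "ERROR", threshold: int = 10) -> bool:
    """Return True when the count for `level` is strictly above `threshold`."""
    return counts[level] > threshold


def scan_file(path: str = "/var/log/lyftdata.log") -> Counter:
    """Read a log file and return per-level counts (undecodable bytes are replaced)."""
    with open(path, errors="replace") as handle:
        return count_levels(handle)
```

Calling scan_file() and passing the result to exceeds_threshold() gives a quick yes/no signal for a daily check. For anything beyond a spot check, the alerting in your central logging sink is the better place for this logic.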

Runbooks by theme

  • Daily operations - Daily operations playbook keeps the control plane healthy with checklists and drills.
  • Observability & alerts - Monitoring Lyft Data covers metrics, dashboards, and alert wiring.
  • Resilience & recovery - Backup & recovery explains snapshot cadence, restores, and disaster recovery tests.
  • Worker provisioning - Worker auto enrollment covers shared-secret bootstrap flows and what to disable afterwards.
  • Capacity planning - Scaling Lyft Data walks through worker sizing, channel fan-out strategies, and deployment hygiene.
  • Security posture - Security hardening documents TLS, secret rotation, and RBAC guidance.

Releases and change management

  • Watch Operate -> Releases for published changelogs and upgrade notes.
  • Capture outstanding operational work in the shared trackers under docs/operate/ to keep improvements visible to the team.

Where to go next