Operate and Scale

LyftData operations focus on keeping the control plane healthy, the job fleet productive, and telemetry flowing to the right places. Use this page as the jumping-off point for your runbooks.

Daily checklist

  • Confirm the server is reachable (for example, with GET /api/liveness) and that you can sign in.
  • Watch the live job status feed for stalled deploys, long retries, or sudden error spikes.
  • Track worker health in the UI; investigate offline workers and growing backlogs quickly.
  • Review errors and warnings in Logs & Issues and in your host logging system (systemd journal, Windows Event Log, or your central logging sink).
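The reachability check from the first item can be scripted so it runs from cron or a monitoring job. The following is a minimal sketch, not an official client: the function name check_liveness is hypothetical, and because this page does not describe the response body of /api/liveness, the sketch treats any 2xx status as "alive" and ignores the body.

```python
import urllib.error
import urllib.request


def check_liveness(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/api/liveness answers with a 2xx status.

    Assumption: only the status code is inspected, since the shape of the
    response body is not documented on this page.
    """
    url = base_url.rstrip("/") + "/api/liveness"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, timeouts, and
        # non-2xx HTTP statuses (urlopen raises HTTPError for 4xx/5xx).
        return False
```

Call it with the base URL of your own server, for example check_liveness("https://lyftdata.example.internal") (a placeholder hostname), and alert when it returns False. A passing check only shows the server answers HTTP; it does not replace the sign-in check.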

Runbooks by theme

  • Daily operations - The Daily operations playbook keeps the control plane healthy with checklists and drills.
  • Observability & alerts - Monitoring LyftData covers metrics, dashboards, and alert wiring.
  • Logs and live events - Logs & Issues and Messages are your first stop for triage.
  • Resilience & recovery - Backup & recovery explains snapshot cadence, restores, and disaster recovery tests.
  • Worker provisioning - Worker auto enrollment covers shared-secret bootstrap flows and what to disable afterwards.
  • Capacity planning - Scaling LyftData walks through worker sizing, channel fan-out strategies, and deployment hygiene.
  • Security posture - Security hardening documents TLS, secret rotation, and RBAC guidance.
  • Telemetry - Telemetry explains what LyftData collects locally and how to access it.

Releases and change management

  • Before upgrades, note your current version (lyftdata --version) and review the release notes.
  • Use the downloads portal for current builds and checksums.
  • Keep a simple change log for your environment (what changed, who approved it, and how to roll back).
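One lightweight way to keep such a change log is an append-only file with one JSON object per line. The sketch below is an illustration, not a LyftData feature: the ChangeEntry fields, the helper names, and the file layout are all assumptions chosen to mirror the three questions above (what changed, who approved it, how to roll back), plus an optional slot for the version you noted before the upgrade.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ChangeEntry:
    """One change-log record; all field names are illustrative."""
    what_changed: str
    approved_by: str
    rollback_plan: str
    version_before: str = ""  # e.g. the output of `lyftdata --version`
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat(timespec="seconds")
    )


def append_change(log_path: Path, entry: ChangeEntry) -> None:
    """Append one entry as a single JSON line; the file is created if missing."""
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(asdict(entry)) + "\n")


def read_changes(log_path: Path) -> list[dict]:
    """Return all recorded entries, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]
```

Appending rather than rewriting keeps earlier entries intact, and the one-object-per-line layout stays readable with ordinary text tools. Keep the file under version control or in your backup set so the history survives a host rebuild.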

Where to go next