Monitoring LyftData

Keeping LyftData healthy in production means watching the control plane, the workers, and the jobs they execute. Use the practices below to get fast feedback when something drifts from normal.

Built-in observability (UI)

  • Dashboard: high-level health, recent job activity, and key charts.
  • Jobs: per-job deploy state, run history, and issues.
  • Workers: which workers are online, what they’re running, and their current limits.
  • Metrics Explorer: query stored metrics by job/worker and time range.
  • Observe → Logs / Problems / Messages: historical logs, “what’s broken right now”, and a live event stream.

Quick checks from the shell

# Liveness (unauthenticated). Add `-k` if you are using the default self-signed cert.
curl -fsS https://<server>:3000/api/liveness
# Health summary (admin auth required)
curl -fsS -H "Authorization: Bearer <admin-token>" \
  https://<server>:3000/api/health | jq '{status: .status, version: .version}'

If you have the CLI configured (see CLI reference), you can also run:

lyftdata doctor
lyftdata workers list
lyftdata jobs list
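
For cron-driven monitoring, the liveness probe above can be wrapped in a small script. A minimal sketch, assuming the default port 3000 and a self-signed certificate (hence `-k`); the `LYFTDATA_SERVER` environment variable is a placeholder, not a documented setting:

```shell
# Liveness probe wrapper suitable for cron. LYFTDATA_SERVER is a placeholder
# env var; the /api/liveness path comes from the quick checks above.
set -u

# Map an HTTP status code to a short verdict (pure helper).
verdict() {
  case "$1" in
    200) echo "OK" ;;
    *)   echo "FAIL" ;;
  esac
}

# Only hit the network when a server is actually configured.
if [ -n "${LYFTDATA_SERVER:-}" ]; then
  code=$(curl -ksS -o /dev/null -w '%{http_code}' \
    "https://${LYFTDATA_SERVER}:3000/api/liveness" || true)
  echo "liveness: $(verdict "${code:-000}") (HTTP ${code:-000})"
fi
```

A non-`OK` verdict from this script is a good trigger for the alerts described below.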

Key signals to alert on

| Signal | Investigate when | Typical response |
| --- | --- | --- |
| Workers offline | Any production worker is unexpectedly offline | Check connectivity/TLS, restart worker, or replace host |
| Backlog growing | Queue depth or “pending work” trends up over several minutes | Add workers, reduce job cadence, or tune expensive steps |
| Deployments stuck | Jobs sit in staged/deploying states unusually long | Check worker availability, review Issues, redeploy |
| Error spikes | Errors or retries rise suddenly | Investigate downstream systems before scaling |
| Disk pressure | Staging/log storage approaches your limits | Increase disk, tighten retention, or move staging to a larger volume |
| License risk | License nearing expiry or limits being hit | Resolve licensing before it blocks production runs |

Most teams start with the UI dashboards and add alerts as the “normal” baseline becomes clear for their workloads.
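
Once that baseline is clear, a signal like “backlog growing” can be checked by polling the health endpoint. A hedged sketch: the `.queue.pending` field is an assumption about the `/api/health` payload shape, so adjust it to whatever your deployment actually returns.

```shell
# Alert when pending work exceeds a threshold. The JSON field `.queue.pending`
# is a guess at the /api/health payload -- verify against your server first.
THRESHOLD=100

over_threshold() {  # over_threshold <value> <limit> -> true if value > limit
  [ "$1" -gt "$2" ]
}

if [ -n "${LYFTDATA_SERVER:-}" ] && [ -n "${LYFTDATA_ADMIN_TOKEN:-}" ]; then
  pending=$(curl -ksS -H "Authorization: Bearer ${LYFTDATA_ADMIN_TOKEN}" \
    "https://${LYFTDATA_SERVER}:3000/api/health" | jq -r '.queue.pending // 0')
  if over_threshold "${pending}" "${THRESHOLD}"; then
    echo "ALERT: backlog ${pending} exceeds ${THRESHOLD}"
  fi
fi
```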

Alerting and integrations

  • Forward host-level logs to your central platform (Loki/ELK/Splunk/etc.) for long-term retention and correlation.
  • For LyftData telemetry (logs/issues/metrics), prefer the UI surfaces and CLI (lyftdata workers logs <worker-id>, lyftdata workers metrics <worker-id>).
  • If you run collectors from fixed IPs, consider the server allowlist (`--whitelist` / `LYFTDATA_API_WHITELIST`), which grants read-only access to selected worker telemetry endpoints when the collector sends `Authorization: WHITELIST` (see Security hardening).
  • Notify on-call channels when pipelines stall, workers flap, or error budgets are exceeded.
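
A collector-side sketch of the allowlist mode above: the server must have been started with `--whitelist` (or `LYFTDATA_API_WHITELIST`) covering this host's IP, and the telemetry path shown is an assumption, not a documented route.

```shell
# Hypothetical read-only telemetry probe from an allowlisted collector host.
# The /api/workers/<id>/metrics path is an assumption; the Authorization:
# WHITELIST header is the mode described in Security hardening.
auth_header() { echo "Authorization: WHITELIST"; }

if [ -n "${LYFTDATA_SERVER:-}" ] && [ -n "${WORKER_ID:-}" ]; then
  curl -ksS -H "$(auth_header)" \
    "https://${LYFTDATA_SERVER}:3000/api/workers/${WORKER_ID}/metrics"
fi
```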

Health checks and diagnostics

  • HTTP probes: GET /api/liveness for basic reachability; GET /api/health for a richer status summary.
  • CLI: lyftdata doctor and lyftdata server health for guided checks.
  • Synthetic jobs: schedule a tiny canary job that runs every few minutes and alerts if it fails.
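
A canary can be as simple as `lyftdata doctor` on a cron schedule. A sketch; `notify()` is a stand-in for your on-call integration, not part of the product:

```shell
# Run the CLI's guided checks on a schedule and flag failures.
# notify() is a placeholder -- swap in a real webhook POST for your channel.
notify() {
  echo "canary: lyftdata doctor failed at $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
  # e.g. curl -fsS -X POST -d '{"text":"LyftData canary failed"}' "$WEBHOOK_URL"
}

if command -v lyftdata >/dev/null 2>&1; then
  lyftdata doctor >/tmp/lyftdata-doctor.log 2>&1 || notify
fi

# Example crontab entry (every 5 minutes):
# */5 * * * * /usr/local/bin/lyftdata-canary.sh
```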

Troubleshooting signals

  • Jobs stuck in Staged: verify target workers are online and the scheduling queue is clear.
  • Rising retries on a connector: inspect connector logs and downstream APIs for throttling or auth errors.
  • High backlog with low CPU: scale out workers or increase job concurrency; see the Scaling guide.

What’s next

  • Hardening operations continues in the rest of the Operate section.
  • For immediate triage, pair this guide with the Troubleshooting reference.