Monitoring Lyft Data
Keeping Lyft Data healthy in production means watching the control plane, the workers, and the jobs they execute. Use the practices below to get fast feedback when something drifts from normal.
Built-in observability
- Server dashboard (`http://<server>:3000`) exposes CPU, memory, disk, worker status, job history, and recent alerts.
- Run & Trace shows event payloads and validation issues without leaving the UI.
- Messages (Operate -> Messages) streams the runtime feed in real time; subscribe to job, worker, or warning channels to catch changes immediately.
- The Issues panel on each job surfaces validation errors, connector warnings, and retry details. See Logs & Issues for an end-to-end view of how runtime messages reach the UI.
Quick checks from the shell
```bash
# Add an Authorization header or session cookie if your deployment requires it.

# Server heartbeat and version info
curl -s http://<server>:3000/api/health | jq '{status: .status, version: .version}'

# Registered workers and their state
curl -s http://<server>:3000/api/workers | jq '[.[] | {id: .id, status: .status, last_seen: .last_seen}]'

# Jobs with recent activity (limit to the last 20 runs)
curl -s "http://<server>:3000/api/jobs?limit=20" | jq '.[].name'
```
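If you run these checks repeatedly during an incident, a small wrapper keeps the output in one place. The sketch below uses only the endpoints above; `SERVER` and the optional `AUTH_HEADER` variable are placeholders to adapt to your deployment.

```bash
#!/usr/bin/env bash
# One-shot status summary for on-call checks (sketch, not a shipped tool).
set -euo pipefail

SERVER="${SERVER:-http://<server>:3000}"
AUTH_HEADER="${AUTH_HEADER:-}"   # e.g. "Authorization: Bearer <token>" if required

api() {
  # Apply the optional auth header consistently to every request.
  if [ -n "$AUTH_HEADER" ]; then
    curl -fsS -H "$AUTH_HEADER" "$SERVER$1"
  else
    curl -fsS "$SERVER$1"
  fi
}

echo "== Server =="
api /api/health | jq '{status: .status, version: .version}'

echo "== Workers =="
api /api/workers | jq '[.[] | {id: .id, status: .status, last_seen: .last_seen}]'

echo "== Recent jobs =="
api "/api/jobs?limit=20" | jq '.[].name'
```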
Key metrics to watch

| Metric | Healthy target | Investigate when |
|---|---|---|
| CPU usage | < 70% sustained | > 85% for 10+ minutes |
| Memory usage | < 80% sustained | > 90% or swap activity |
| Worker queue length | < 100 pending jobs | Grows steadily or spikes |
| Job success rate | > 99% | Drops below 95% |
| Error rate | < 1% | > 5% over 5 minutes |
Collect these metrics via the REST API, your monitoring stack (Prometheus, CloudWatch, etc.), or dashboards surfaced in the UI.
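You can approximate the job success rate directly from the jobs API. The sketch below assumes each record returned by `/api/jobs` has a `status` field whose successful value is `"succeeded"`; inspect one record first and adjust the filter to match your deployment.

```bash
# Success rate over the most recent 100 runs.
# The `status` field and the "succeeded" value are assumptions — verify them
# against a real job record before alerting on this number.
curl -s "http://<server>:3000/api/jobs?limit=100" | jq '
  if length == 0 then "no recent jobs"
  else
    (map(select(.status == "succeeded")) | length) as $ok
    | {succeeded: $ok, total: length, success_rate_pct: (100 * $ok / length | round)}
  end
'
```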
Alerting and integrations
- Scrape server metrics from `/metrics`. Run `curl -s http://<server>:3000/metrics | grep lyftdata_` to list the exported series and confirm exact names before you write alert rules.
- Forward logs to your central platform (e.g. Loki, ELK, Splunk) for long-term retention and correlation.
- Watch worker health by polling `/api/workers` and raising an alert if `status` is not `online` or `last_seen` grows stale. Example:

  ```bash
  curl -s http://<server>:3000/api/workers \
    | jq 'map(select(.status != "online"))'
  ```

- Notify on-call channels (PagerDuty, Slack, email) when error budgets are exceeded or pipelines stall; a fuller polling script that posts to Slack follows this list.
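As one way to wire these pieces together, the worker poll above can be extended into a cron-friendly script that notifies a Slack incoming webhook. This is a sketch rather than a built-in integration: the webhook URL, the 120-second staleness threshold, and the assumption that `last_seen` is an ISO-8601 timestamp all need adapting to your deployment.

```bash
#!/usr/bin/env bash
# Alert when any worker is offline or has not checked in recently (sketch).
# Assumes `last_seen` is an ISO-8601 timestamp; SLACK_WEBHOOK is a placeholder.
set -euo pipefail

SERVER="${SERVER:-http://<server>:3000}"
SLACK_WEBHOOK="${SLACK_WEBHOOK:?set a Slack incoming-webhook URL}"
MAX_AGE_SECONDS=120

now=$(date +%s)
unhealthy=$(curl -fsS "$SERVER/api/workers" | jq --argjson now "$now" --argjson max "$MAX_AGE_SECONDS" '
  map(select(
    .status != "online"
    or (($now - (.last_seen | fromdateiso8601)) > $max)
  ))
')

if [ "$(echo "$unhealthy" | jq 'length')" -gt 0 ]; then
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(jq -n --argjson workers "$unhealthy" \
      '{text: ("Lyft Data workers unhealthy: " + ($workers | map(.id) | join(", ")))}')" \
    "$SLACK_WEBHOOK"
fi
```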
Health checks and diagnostics
- API endpoints: `GET /api/health` confirms server readiness; `GET /api/workers` lists registered workers.
- Synthetic jobs: schedule a tiny “canary” job that runs every few minutes and alerts if it fails.
- Audit context: capture a snapshot from the Context workspace or call `/api/contexts/global` during incidents to record configuration changes (a shell example follows this list).
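For the audit-context step, a snapshot can also be captured from the shell and stored next to your incident notes. A minimal sketch; the output directory is an arbitrary choice:

```bash
# Save a timestamped copy of the global context during an incident.
mkdir -p ./incident-snapshots
curl -s "http://<server>:3000/api/contexts/global" \
  | jq . > "./incident-snapshots/context-$(date -u +%Y%m%dT%H%M%SZ).json"
```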
Troubleshooting signals
- Jobs stuck in Staged: verify that the target workers are online and the scheduling queue is clear; a quick shell check follows this list.
- Rising retries on a connector: inspect connector logs and downstream APIs for throttling or auth errors.
- High backlog with low CPU: scale out workers or increase job concurrency; see the Scaling guide.
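To check the first signal from the shell, filter the jobs API for staged runs. The `status` field and the `"staged"` value below are assumptions; verify them against a real job record before relying on the filter.

```bash
# List jobs that are still waiting to be picked up by a worker.
curl -s "http://<server>:3000/api/jobs?limit=100" | jq '
  map(select((.status // "") | ascii_downcase == "staged"))
  | map({name: .name, status: .status})
'
```

If staged jobs keep accumulating while `/api/workers` reports every worker online, look at the scheduling queue and job concurrency settings next.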
What’s next
- Hardening operations continues in the rest of the Operate section.
- For immediate triage, pair this guide with the Troubleshooting reference.