Monitoring Lyft Data
Keeping Lyft Data healthy in production means watching the control plane, the workers, and the jobs they execute. Use the practices below to get fast feedback when something drifts from normal.
Built-in observability
- Server dashboard (`http://<server>:3000`) exposes CPU, memory, disk, worker status, job history, and recent alerts.
- Run & Trace shows event payloads and validation issues without leaving the UI.
- Messages (Operate -> Messages) streams the runtime feed in real time; subscribe to job, worker, or warning channels to catch changes immediately.
- The Issues panel on each job surfaces validation errors, connector warnings, and retry details. See Logs & Issues for an end-to-end view of how runtime messages reach the UI.
Quick checks from the shell
```bash
# Add an Authorization header or session cookie if your deployment requires it.

# Server heartbeat and version info
curl -s http://<server>:3000/api/health | jq '{status: .status, version: .version}'

# Registered workers and their state
curl -s http://<server>:3000/api/workers | jq '[.[] | {id: .id, status: .status, last_seen: .last_seen}]'

# Jobs with recent activity (limit to the last 20 runs)
curl -s "http://<server>:3000/api/jobs?limit=20" | jq '.[].name'
```
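If you run these checks repeatedly during an incident, a small wrapper keeps the output in one place. The sketch below uses only the endpoints above; `SERVER` and the optional `AUTH_HEADER` variable are placeholders to adapt to your deployment.

```bash
#!/usr/bin/env bash
# One-shot status summary for on-call checks (sketch, not a shipped tool).
set -euo pipefail

SERVER="${SERVER:-http://<server>:3000}"
AUTH_HEADER="${AUTH_HEADER:-}"   # e.g. "Authorization: Bearer <token>" if required

api() {
  # Apply the optional auth header consistently to every request.
  if [ -n "$AUTH_HEADER" ]; then
    curl -fsS -H "$AUTH_HEADER" "$SERVER$1"
  else
    curl -fsS "$SERVER$1"
  fi
}

echo "== Server =="
api /api/health | jq '{status: .status, version: .version}'

echo "== Workers =="
api /api/workers | jq '[.[] | {id: .id, status: .status, last_seen: .last_seen}]'

echo "== Recent jobs =="
api "/api/jobs?limit=20" | jq '.[].name'
```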
Key metrics to watch

| Metric | Healthy target | Investigate when |
|---|---|---|
| CPU usage | < 70% sustained | > 85% for 10+ minutes |
| Memory usage | < 80% sustained | > 90% or swap activity |
| Worker queue length | < 100 pending jobs | Grows steadily or spikes |
| Job success rate | > 99% | Drops below 95% |
| Error rate | < 1% | > 5% over 5 minutes |
Collect these metrics via the REST API, your monitoring stack (Prometheus, CloudWatch, etc.), or dashboards surfaced in the UI.
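You can approximate the job success rate directly from the jobs API. The sketch below assumes each record returned by `/api/jobs` has a `status` field whose successful value is `"succeeded"`; inspect one record first and adjust the filter to match your deployment.

```bash
# Success rate over the most recent 100 runs.
# The `status` field and the "succeeded" value are assumptions — verify them
# against a real job record before alerting on this number.
curl -s "http://<server>:3000/api/jobs?limit=100" | jq '
  if length == 0 then "no recent jobs"
  else
    (map(select(.status == "succeeded")) | length) as $ok
    | {succeeded: $ok, total: length, success_rate_pct: (100 * $ok / length | round)}
  end
'
```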
Alerting and integrations
- Scrape server metrics from `/metrics`. Run `curl -s http://<server>:3000/metrics | grep lyftdata_` to list the exported series and confirm exact names before you write alert rules.
- Forward logs to your central platform (e.g. Loki, ELK, Splunk) for long-term retention and correlation.
- Watch worker health by polling `/api/workers` and raising an alert if `status` is not `online` or `last_seen` grows stale. Example:

  ```bash
  curl -s http://<server>:3000/api/workers \
    | jq 'map(select(.status != "online"))'
  ```

- Notify on-call channels (PagerDuty, Slack, email) when error budgets are exceeded or pipelines stall; a fuller polling script that posts to Slack follows this list.
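As one way to wire these pieces together, the worker poll above can be extended into a cron-friendly script that notifies a Slack incoming webhook. This is a sketch rather than a built-in integration: the webhook URL, the 120-second staleness threshold, and the assumption that `last_seen` is an ISO-8601 timestamp all need adapting to your deployment.

```bash
#!/usr/bin/env bash
# Alert when any worker is offline or has not checked in recently (sketch).
# Assumes `last_seen` is an ISO-8601 timestamp; SLACK_WEBHOOK is a placeholder.
set -euo pipefail

SERVER="${SERVER:-http://<server>:3000}"
SLACK_WEBHOOK="${SLACK_WEBHOOK:?set a Slack incoming-webhook URL}"
MAX_AGE_SECONDS=120

now=$(date +%s)
unhealthy=$(curl -fsS "$SERVER/api/workers" | jq --argjson now "$now" --argjson max "$MAX_AGE_SECONDS" '
  map(select(
    .status != "online"
    or (($now - (.last_seen | fromdateiso8601)) > $max)
  ))
')

if [ "$(echo "$unhealthy" | jq 'length')" -gt 0 ]; then
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(jq -n --argjson workers "$unhealthy" \
      '{text: ("Lyft Data workers unhealthy: " + ($workers | map(.id) | join(", ")))}')" \
    "$SLACK_WEBHOOK"
fi
```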
Health checks and diagnostics
- API endpoints: `GET /api/health` confirms server readiness; `GET /api/workers` lists registered workers.
- Synthetic jobs: schedule a tiny “canary” job that runs every few minutes and alerts if it fails.
- Audit context: capture a snapshot from the Context workspace or call `/api/contexts/global` during incidents to record configuration changes (a shell example follows this list).
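For the audit-context step, a snapshot can also be captured from the shell and stored next to your incident notes. A minimal sketch; the output directory is an arbitrary choice:

```bash
# Save a timestamped copy of the global context during an incident.
mkdir -p ./incident-snapshots
curl -s "http://<server>:3000/api/contexts/global" \
  | jq . > "./incident-snapshots/context-$(date -u +%Y%m%dT%H%M%SZ).json"
```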
Troubleshooting signals
- Jobs stuck in Staged: verify that the target workers are online and the scheduling queue is clear; a quick shell check follows this list.
- Rising retries on a connector: inspect connector logs and downstream APIs for throttling or auth errors.
- High backlog with low CPU: scale out workers or increase job concurrency; see the Scaling guide.
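To check the first signal from the shell, filter the jobs API for staged runs. The `status` field and the `"staged"` value below are assumptions; verify them against a real job record before relying on the filter.

```bash
# List jobs that are still waiting to be picked up by a worker.
curl -s "http://<server>:3000/api/jobs?limit=100" | jq '
  map(select((.status // "") | ascii_downcase == "staged"))
  | map({name: .name, status: .status})
'
```

If staged jobs keep accumulating while `/api/workers` reports every worker online, look at the scheduling queue and job concurrency settings next.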
What’s next
- Hardening operations continues in the rest of the Operate section.
- For immediate triage, pair this guide with the Troubleshooting reference.