Telemetry

Lyft Data emits a rich telemetry stream so you can understand what the control plane and workers are doing without SSH’ing into hosts. Telemetry covers job lifecycle events, worker heartbeats, system metrics, and the messages that drive the UI status panes. This page explains how the telemetry pipeline works, what data it contains, and how to manage or scope the signals for your environment.

Signals at a Glance

Lyft Data produces several categories of telemetry:

  • Job lifecycle messages – every major job transition (staged, deployed, run started/completed, errors) is published on the internal message bus. UI components consume /api/jobs/subscription today, and the richer /api/messages/subscribe feed will take over once job/worker identifiers are added (ui.job_status.legacy_feed.v1).
  • Worker heartbeats & state – workers send regular heartbeats, system information, and thread-state updates so the control plane can surface online/offline status and basic telemetry (workers.runtime.heartbeat_reporting.v1, workers.runtime.job_coordination.v1).
  • Metrics export – server and worker metrics are exposed both through the built-in Prometheus endpoint (/metrics) and the REST endpoints under /api/workers/metrics and /api/workers/volumes-report, as shown in the sketch after this list. See Monitoring for collection examples.
  • Structured logs & alerts – user-facing notifications (message actions, alert banners) ride the same internal message feed that internal telemetry uses (jobs.actions.message_alerts_usage.v1).
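
To make the metrics split concrete, here is a minimal Python sketch that pulls both surfaces: the Prometheus text exposition from /metrics and the per-worker snapshots from /api/workers/metrics. The base URL, the Bearer-token header, and the JSON response shape are assumptions for illustration; only the endpoint paths come from this page.

```python
import requests

BASE_URL = "http://localhost:8080"                       # assumption: adjust to your deployment
HEADERS = {"Authorization": "Bearer <api_read-token>"}   # assumption: Bearer auth scheme

# Prometheus text exposition from the built-in endpoint.
prom = requests.get(f"{BASE_URL}/metrics", headers=HEADERS, timeout=10)
prom.raise_for_status()
print("\n".join(prom.text.splitlines()[:5]))             # first few metric lines

# Per-worker snapshots from the REST endpoint.
workers = requests.get(f"{BASE_URL}/api/workers/metrics", headers=HEADERS, timeout=10)
workers.raise_for_status()
print(workers.json())                                    # assumption: JSON payload
```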

Telemetry Pipeline

  1. Workers publish events to the aggregator: job runtime updates, system info, error counts, and heartbeats.
  2. The server aggregates those messages, keeps a short-lived event history, and stores long-term metrics in the metrics database.
  3. Consumers pull data:
    • UI pages subscribe to /api/jobs/subscription and /api/messages/subscribe for live job/worker status (see the sketch after this list).
    • Observability tooling scrapes /metrics or /api/workers/metrics for Prometheus-friendly counters.
    • The optional phone-home telemetry client periodically batches anonymised metrics if telemetry is enabled (TelemetryConfig default in server-state).
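
As an illustration of step 3, the sketch below is a hypothetical consumer that streams the live message feed and prints each event. The wire format (newline-delimited JSON over a long-lived HTTP response) and the field names are assumptions; the real feed may use SSE or WebSockets, so check the API reference before relying on this shape.

```python
import json
import requests

BASE_URL = "http://localhost:8080"                       # assumption
HEADERS = {"Authorization": "Bearer <api_read-token>"}   # assumption

# Keep the connection open and read one message per line.
with requests.get(f"{BASE_URL}/api/messages/subscribe",
                  headers=HEADERS, stream=True, timeout=(5, None)) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue                                     # skip keep-alive blanks
        event = json.loads(line)
        print(event.get("type"), event)                  # assumption: a "type" field exists
```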

Access & Security Controls

Telemetry endpoints honour the same RBAC controls as the rest of the API:

  • Use admin tokens for management routes and api_read tokens for read-only telemetry (see server.security.rbac_roles.v1).
  • The server’s --whitelist/LYFTDATA_API_WHITELIST flag allows trusted IP ranges to reach /api/workers/metrics, /api/workers/volumes-report, and /api/workers/logs without admin tokens – handy for Prometheus or log shippers (server.security.api_whitelist.v1). Both access paths are sketched after this list.
  • UI telemetry panels (Job Status feed, Monitoring, Telemetry charts) also honour RBAC and will show a permissions warning if the viewer only has limited scopes.
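
A minimal sketch of those two access paths, assuming a Bearer-token header and a local server; adjust the base URL and auth scheme to match your deployment:

```python
import requests

BASE_URL = "http://localhost:8080"                          # assumption

# 1) With an api_read token (works from any allowed client).
with_token = requests.get(
    f"{BASE_URL}/api/workers/volumes-report",
    headers={"Authorization": "Bearer <api_read-token>"},   # assumption: Bearer scheme
    timeout=10,
)
with_token.raise_for_status()

# 2) From an address covered by --whitelist / LYFTDATA_API_WHITELIST, the same
#    route can be read without a token (server.security.api_whitelist.v1).
no_token = requests.get(f"{BASE_URL}/api/workers/volumes-report", timeout=10)
print(no_token.status_code)  # 200 from a whitelisted address; otherwise rejected
```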

Configuration & Management

Telemetry collection can be tuned or disabled entirely:

  • Server flag/environment variable – pass --disable-telemetry or set LYFTDATA_DISABLE_TELEMETRY=true on the server to stop phone-home collection and background telemetry tasks (crates/server/src/args.rs).
  • Telemetry configuration – the server’s TelemetryConfig controls whether usage/system/job/worker metrics are shipped, the collection interval, and the endpoint (crates/server-state/src/telemetry.rs). Use the API (PhoneHomeCoordinator::update_telemetry_config) or config files to change defaults; an illustrative payload shape appears below.
  • Community Edition – when running without a license, phone-home telemetry stays disabled automatically, but local metrics, heartbeats, and logs still flow so you can monitor the built-in worker (community_edition.license_state_behavior.v1).

Changes to telemetry settings are hot-loaded: the server updates its configuration without requiring a restart, and workers continue to publish their signals.
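
The shape below illustrates the kind of settings TelemetryConfig covers. Every field name here is a hypothetical stand-in; the authoritative definitions live in crates/server-state/src/telemetry.rs, so check that file (or the API schema) before building against these names.

```python
import json

# Every field name below is a hypothetical stand-in for whatever
# TelemetryConfig actually defines in crates/server-state/src/telemetry.rs.
telemetry_config = {
    "enabled": True,                  # hypothetical: master phone-home switch
    "ship_usage_metrics": True,       # hypothetical
    "ship_system_metrics": True,      # hypothetical
    "ship_job_metrics": False,        # hypothetical
    "ship_worker_metrics": False,     # hypothetical
    "collection_interval_secs": 900,  # hypothetical: batching interval
    "endpoint": "https://telemetry.example.invalid/v1/ingest",  # hypothetical
}

# Because telemetry settings are hot-loaded, applying an update like this
# takes effect without restarting the server or the workers.
print(json.dumps(telemetry_config, indent=2))
```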

Working With the Data

  • Monitoring dashboards – scrape /metrics (server-wide counters) and /api/workers/metrics (per-worker snapshots) into Prometheus, Datadog, or your preferred system. Pair that with the alert guidance in Monitoring.
  • Job- and worker-level insights – combine job lifecycle messages with worker heartbeats to detect stuck deployments or crashing workers. The filters on /api/messages/subscribe let you focus on specific message types.
  • Historical analysis – the metrics database stores longer-term aggregates. Use the Explorer APIs (/api/db/explorer/metrics) when you need to query telemetry without scraping raw endpoints.
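
For example, here is a hedged sketch of a historical query against the Explorer API. The query parameters and the response shape are assumptions; only the endpoint path is documented on this page.

```python
import requests

BASE_URL = "http://localhost:8080"                       # assumption
HEADERS = {"Authorization": "Bearer <api_read-token>"}   # assumption

resp = requests.get(
    f"{BASE_URL}/api/db/explorer/metrics",
    headers=HEADERS,
    params={                                 # hypothetical query parameters
        "metric": "job_runtime_seconds",
        "from": "2024-01-01T00:00:00Z",
        "to": "2024-01-02T00:00:00Z",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())                           # assumption: JSON response
```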

Operational Best Practices

  • Keep a dedicated api_read token (or IP whitelist) for observability agents so they can poll telemetry endpoints without admin credentials.
  • Watch the Community Edition rate limiter metrics if you only run the built-in worker – they surface daily consumption, remaining quota, and whether rate limiting is active (runtime-common/src/rate_limiter.rs, RuntimeJobState::update_rate_limit_metrics). A filtering sketch follows this list.
  • Use the Job Status feed as the source of truth until execution telemetry grows richer; once /api/messages/subscribe includes job/worker identifiers, plan to migrate consumers to the new feed (see ui.job_status.legacy_feed.v1).
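
One quick way to keep an eye on those rate limiter series is to filter the Prometheus text output. The "rate_limit" substring below is an assumption about how the series are named; confirm the exact names against your own /metrics output.

```python
import requests

BASE_URL = "http://localhost:8080"                       # assumption
HEADERS = {"Authorization": "Bearer <api_read-token>"}   # assumption

text = requests.get(f"{BASE_URL}/metrics", headers=HEADERS, timeout=10).text
for line in text.splitlines():
    # "rate_limit" is an assumed naming pattern for the limiter series.
    if "rate_limit" in line and not line.startswith("#"):
        print(line)  # e.g. daily consumption, remaining quota, limiter state
```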

Telemetry is designed to be safe by default, with clear levers for when you need to scale up observability or dial it down. Use the controls above to keep the signal-to-noise ratio aligned with your operational requirements.