Scaling Lyft Data

Use this playbook when jobs start to backlog, worker utilisation spikes, or new workloads arrive. Lyft Data scales horizontally by adding workers and vertically by increasing resources per node.

Know when to scale

Monitor these signals (see Monitoring Lyft Data for how to collect them):

Metric              | Action threshold | Typical response
Worker CPU usage    | > 80% sustained  | Add workers or increase CPU
Worker memory usage | > 85% sustained  | Increase memory or adjust batch size
Job queue length    | > 100 pending    | Add workers / tune scheduling
Job runtime         | 2× baseline      | Profile actions, add capacity
Error/retry rate    | > 5%             | Investigate downstream systems before scaling
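
Between full monitoring runs you can spot-check these signals straight from the API. The jq selectors below are illustrative; the exact field names depend on your Lyft Data version, so adjust them to whatever /api/workers and /api/health return in your deployment.

  Terminal window
  # List registered workers; the id field is shown elsewhere in this guide, other fields vary.
  curl -s https://lyftdata.example.com/api/workers | jq '.[].id'
  # Dump the health payload and eyeball queue depth and worker utilisation.
  curl -s https://lyftdata.example.com/api/health | jq '.'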

Horizontal scaling (add workers)

  • Provision new worker hosts or containers as close to your data sources as possible to minimise latency.
  • Point the worker at the control plane and supply credentials:
    Terminal window
    lyftdata run worker \
      --url https://lyftdata.example.com \
      --worker-id worker-02 \
      --worker-api-key "$LYFTDATA_WORKER_API_KEY"
  • For fleets that rely on auto enrollment, set LYFTDATA_AUTO_ENROLLMENT_KEY on the server once and provide the same value to each worker. Use it only on trusted networks; a sketch of this setup follows the list.
  • Verify the worker registers successfully via the UI or curl -s https://lyftdata.example.com/api/workers | jq '.[].id'.
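
A minimal auto-enrollment setup might look like the block below. It assumes the worker reads its settings from the environment variables listed in the checklist that follows; the key value and worker name are placeholders.

  Terminal window
  # Control plane (trusted network only): set the shared enrollment key once.
  export LYFTDATA_AUTO_ENROLLMENT_KEY="change-me-shared-secret"

  # Each new worker: same key, plus an optional display name.
  export LYFTDATA_URL="https://lyftdata.example.com"
  export LYFTDATA_AUTO_ENROLLMENT_KEY="change-me-shared-secret"
  export LYFTDATA_WORKER_NAME="worker-03"
  lyftdata run worker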

Worker configuration checklist

  • LYFTDATA_URL – HTTPS URL for the control plane.
  • Identity – either LYFTDATA_WORKER_ID + LYFTDATA_WORKER_API_KEY, or LYFTDATA_AUTO_ENROLLMENT_KEY with an optional LYFTDATA_WORKER_NAME.
  • LYFTDATA_WORKER_LABELS – optional comma-separated descriptors surfaced in the UI for filtering and documentation.
  • --limit-job-capabilities – start the worker with this flag when it should only accept basic connectors (handy for hardened ingress zones).
  • Systemd or your orchestrator should manage the worker process so it restarts automatically after upgrades or crashes.
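
As a sketch of that last point, the unit below keeps a worker running under systemd. The unit path, binary location, environment file, and service user are assumptions; adapt them to your hosts.

  Example unit file (path, binary location, and user are examples):

  # /etc/systemd/system/lyftdata-worker.service
  [Unit]
  Description=Lyft Data worker
  After=network-online.target
  Wants=network-online.target

  [Service]
  # LYFTDATA_URL, identity variables, and labels live in this env file.
  EnvironmentFile=/etc/lyftdata/worker.env
  ExecStart=/usr/local/bin/lyftdata run worker
  Restart=always
  RestartSec=5
  User=lyftdata

  [Install]
  WantedBy=multi-user.target

  Terminal window
  sudo systemctl daemon-reload
  sudo systemctl enable --now lyftdata-worker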

Workers advertise their capabilities back to the server. Community Edition limits concurrent jobs to five per worker; licensed deployments default to ten.

Vertical scaling (bigger boxes)

When horizontal scaling is not practical:

  • Increase CPU/memory allocations for the server or existing workers.
  • Ensure disks that host the staging directory have headroom; adjust --db-retention-days or --disk-usage-max-percentage if cleanup is too aggressive (a quick headroom check is sketched after this list).
  • Revisit batch sizes and action efficiency so additional resources translate into throughput.
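
Checking staging headroom needs nothing beyond standard tooling; the staging path below is a placeholder for wherever your deployment stages data.

  Terminal window
  # Replace /var/lib/lyftdata/staging with your actual staging directory.
  df -h /var/lib/lyftdata/staging
  # Largest entries, to estimate what retention tuning would reclaim.
  du -sh /var/lib/lyftdata/staging/* | sort -rh | head -n 10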

Automate worker scaling

  • Poll /api/health and /api/workers on a schedule to collect utilisation, queue depth, and last-seen timestamps. Persist the readings so you can compare moving averages instead of reacting to single spikes (a minimal polling sketch follows this list).
  • Feed those metrics into your autoscaling tool (Kubernetes HPA, AWS ASG, Nomad autoscaler, etc.). A typical policy adds a worker when queue depth stays above your threshold for several minutes.
  • During scale-in, drain jobs first: pause new assignments for the worker, wait for active runs to complete, then stop the process.
  • Alert whenever automation fails (for example, API calls start returning errors or workers flap repeatedly) so operators can intervene manually.
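
A minimal polling sketch is below; run it from cron or as a sidecar. It assumes /api/health exposes a queue-depth field (called queue_length here purely for illustration) and leaves the actual scale-up call to your autoscaling tool.

  #!/usr/bin/env bash
  # Append queue depth to a log so decisions use a moving average, not a single spike.
  set -euo pipefail

  URL="https://lyftdata.example.com"
  THRESHOLD=100                     # matches the queue-length threshold above
  LOG="/var/log/lyftdata-queue-depth.log"

  # The queue_length field is illustrative; use whatever /api/health returns.
  depth=$(curl -s "$URL/api/health" | jq -r '.queue_length // 0')
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) queue_depth=$depth" >> "$LOG"

  # Average the last five readings before signalling the autoscaler.
  avg=$(tail -n 5 "$LOG" | awk -F'=' '{ sum += $2 } END { printf "%d", sum / NR }')
  if [ "$avg" -gt "$THRESHOLD" ]; then
    echo "Average queue depth $avg exceeds $THRESHOLD; trigger your autoscaler here."
  fi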

Placement strategies

  • Region-aware: co-locate workers with data sources to reduce egress and latency.
  • Capability-aware: dedicate workers to specialised connectors or sensitive networks and label them so runbooks show where to deploy specific jobs (see the example after this list).
  • Redundancy: keep spare capacity in another availability zone for failover scenarios.
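
For capability-aware placement, labels and the capability flag combine as in the sketch below; the label values are examples only, and the API key is assumed to be exported already.

  Terminal window
  # Hardened ingress-zone worker: descriptive labels plus basic connectors only.
  export LYFTDATA_URL="https://lyftdata.example.com"
  export LYFTDATA_WORKER_ID="worker-euw1-dmz-01"
  export LYFTDATA_WORKER_LABELS="eu-west-1,dmz,basic-connectors"
  # LYFTDATA_WORKER_API_KEY is assumed to be exported already.
  lyftdata run worker --limit-job-capabilities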

Keep jobs efficient

Scaling is easier when jobs are lean:

  • Split large multi-purpose jobs into smaller stages connected by worker channels.
  • Use filter early to drop unnecessary records.
  • Profile Lua scripts and external lookups; cache results where possible.
  • Adjust scheduling cadence (see Advanced scheduling) to smooth bursty workloads.

Review after scaling

  • Validate that job runtimes and queue lengths return to normal.
  • Update documentation/runbooks with the new topology.
  • Revisit resource budgets monthly; scale down when demand shrinks to save cost.

Pair this guide with the backup and security runbooks to keep operations predictable as the platform grows.