Scaling Lyft Data
Use this playbook when jobs start to backlog, worker utilisation spikes, or new workloads arrive. Lyft Data scales horizontally by adding workers and vertically by increasing resources per node.
Know when to scale
Monitor these signals (see Monitoring Lyft Data for how to collect them):
| Metric | Action threshold | Typical response |
|---|---|---|
| Worker CPU usage | > 80% sustained | Add workers or increase CPU |
| Worker memory usage | > 85% sustained | Increase memory or adjust batch size |
| Job queue length | > 100 pending | Add workers / tune scheduling |
| Job runtime | 2× baseline | Profile actions, add capacity |
| Error/retry rate | > 5% | Investigate downstream systems before scaling |
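For a quick host-level spot check of the CPU and memory rows above, outside your monitoring stack, something like the following works on Linux workers; it assumes the sysstat package provides `mpstat`, and the thresholds are the ones from the table.

```shell
# Sample CPU for one minute (12 x 5 s) and report average utilisation,
# then current memory usage, to compare against the table's thresholds.
mpstat 5 12 | awk '/Average/ {print 100 - $NF "% CPU used over the sample window"}'
free -m | awk '/Mem:/ {printf "%.0f%% memory used\n", $3/$2*100}'
```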
Horizontal scaling (add workers)
- Provision new worker hosts or containers as close to your data sources as possible to minimise latency.
- Point the worker at the control plane and supply credentials:
  ```shell
  lyftdata run worker \
    --url https://lyftdata.example.com \
    --worker-id worker-02 \
    --worker-api-key "$LYFTDATA_WORKER_API_KEY"
  ```
- For fleets that rely on auto enrollment, set `LYFTDATA_AUTO_ENROLLMENT_KEY` on the server once and provide the same value to each worker (see the sketch after this list). Use it only on trusted networks.
- Verify the worker registers successfully via the UI or `curl -s https://lyftdata.example.com/api/workers | jq '.[].id'`.
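If you use auto enrollment rather than per-worker API keys, the flow looks roughly like the sketch below. It assumes the server and each worker read `LYFTDATA_AUTO_ENROLLMENT_KEY` from their environment as described above; generating the secret with `openssl` and deriving the optional `LYFTDATA_WORKER_NAME` from the hostname are illustrative choices, not requirements.

```shell
# On the control plane (set once, e.g. in the server's environment file):
export LYFTDATA_AUTO_ENROLLMENT_KEY="$(openssl rand -hex 32)"

# On each new worker host: supply the same key plus an optional display name.
export LYFTDATA_AUTO_ENROLLMENT_KEY="<same value as the server>"   # distribute via your secret store
export LYFTDATA_WORKER_NAME="worker-$(hostname -s)"
lyftdata run worker --url https://lyftdata.example.com
```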
Worker configuration checklist
- `LYFTDATA_URL` – HTTPS URL for the control plane.
- Identity – either `LYFTDATA_WORKER_ID` + `LYFTDATA_WORKER_API_KEY`, or `LYFTDATA_AUTO_ENROLLMENT_KEY` with an optional `LYFTDATA_WORKER_NAME`.
- `LYFTDATA_WORKER_LABELS` – optional comma-separated descriptors surfaced in the UI for filtering and documentation.
- `--limit-job-capabilities` – start the worker with this flag when it should only accept basic connectors (handy for hardened ingress zones).
- Systemd or your orchestrator should manage the worker process so it restarts automatically after upgrades or crashes (an environment-file sketch follows this list).
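One way to wire the checklist together is an environment file that your supervisor loads before starting the worker. A minimal sketch, assuming the path `/etc/lyftdata/worker.env` and example label values:

```shell
# Example path and values only; keep real API keys in your secret store.
cat > /etc/lyftdata/worker.env <<'EOF'
LYFTDATA_URL=https://lyftdata.example.com
LYFTDATA_WORKER_ID=worker-02
LYFTDATA_WORKER_API_KEY=replace-with-value-from-your-secret-store
LYFTDATA_WORKER_LABELS=region=us-east-1,zone=ingress
EOF

# Export the variables and start the worker; --limit-job-capabilities keeps
# this instance restricted to basic connectors for hardened ingress zones.
set -a; . /etc/lyftdata/worker.env; set +a
lyftdata run worker --limit-job-capabilities
```

Under systemd the same file can be referenced with `EnvironmentFile=` so restarts and upgrades pick the configuration up automatically.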
Workers advertise their capabilities back to the server. Community Edition limits concurrent jobs to five per worker; licensed deployments default to ten.
Vertical scaling (bigger boxes)
When horizontal scale is not practical:
- Increase CPU/memory allocations for the server or existing workers.
- Ensure disks that host the staging directory have headroom; adjust `--db-retention-days` or `--disk-usage-max-percentage` if cleanup is too aggressive (a headroom spot check follows this list).
- Revisit batch sizes and action efficiency so additional resources translate into throughput.
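Before loosening the retention or disk-usage flags, confirm how much headroom the staging disk actually has. A minimal spot check, assuming GNU `df` and an example staging path:

```shell
# Example path only; point this at the directory your deployment actually uses.
STAGING_DIR=/var/lib/lyftdata/staging
USED_PCT=$(df --output=pcent "$STAGING_DIR" | tail -1 | tr -dc '0-9')
if [ "$USED_PCT" -ge 80 ]; then
  echo "staging disk at ${USED_PCT}% - grow the volume before relaxing cleanup flags"
fi
```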
Automate worker scaling
- Poll `/api/health` and `/api/workers` on a schedule to collect utilisation, queue depth, and last-seen timestamps. Persist the readings so you can compare moving averages instead of reacting to single spikes (see the polling sketch after this list).
- Feed those metrics into your autoscaling tool (Kubernetes HPA, AWS ASG, Nomad autoscaler, etc.). A typical policy adds a worker when queue depth stays above your threshold for several minutes.
- During scale-in, drain jobs first: pause new assignments for the worker, wait for active runs to complete, then stop the process.
- Alert whenever automation fails (for example, API calls start returning errors or workers flap repeatedly) so operators can intervene manually.
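A polling step for the first bullet might look like the sketch below. The `/api/health` and `/api/workers` endpoints are the ones referenced in this guide, but the queue-depth field name (`pending_jobs`) is an assumption; check it against your server's actual payload before wiring this into an autoscaler.

```shell
BASE=https://lyftdata.example.com

# Collect the signals: queue depth from /api/health (field name assumed),
# worker count from /api/workers.
QUEUE=$(curl -s "$BASE/api/health" | jq -r '.pending_jobs // 0')
WORKERS=$(curl -s "$BASE/api/workers" | jq 'length')

# Persist readings so decisions can use moving averages rather than single spikes.
echo "$(date -Is) queue=$QUEUE workers=$WORKERS" >> /var/log/lyftdata-scale.log

# Example trigger matching the table above; smoothing over several minutes
# belongs in your autoscaling tool or a wrapper around this script.
if [ "$QUEUE" -gt 100 ]; then
  echo "queue depth $QUEUE above threshold - consider adding a worker"
fi
```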
Placement strategies
- Region-aware: co-locate workers with data sources to reduce egress and latency.
- Capability-aware: dedicate workers to specialised connectors or sensitive networks and label them so runbooks show where to deploy specific jobs (label examples follow this list).
- Redundancy: keep spare capacity in another availability zone for failover scenarios.
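Labels are how the region- and capability-aware strategies stay visible in the UI and in runbooks. `LYFTDATA_WORKER_LABELS` is the documented variable; the key=value naming scheme and the `labels` field assumed in the `/api/workers` response below are illustrative and worth verifying against your deployment.

```shell
# Start a standby worker in another region, tagged for warehouse connectors.
export LYFTDATA_WORKER_LABELS="region=eu-west-1,capability=warehouse,tier=standby"
lyftdata run worker --url https://lyftdata.example.com

# List workers for a given region (assumes a string `labels` field in the API).
curl -s https://lyftdata.example.com/api/workers |
  jq -r '.[] | select((.labels // "") | contains("region=eu-west-1")) | .id'
```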
Keep jobs efficient
Scaling is easier when jobs are lean:
- Split large multi-purpose jobs into smaller stages connected by worker channels.
- Use `filter` early to drop unnecessary records.
- Profile Lua scripts and external lookups; cache results where possible.
- Adjust scheduling cadence (see Advanced scheduling) to smooth bursty workloads.
Review after scaling
- Validate that job runtimes and queue lengths return to normal.
- Update documentation/runbooks with the new topology.
- Revisit resource budgets monthly; scale down when demand shrinks to save cost.
Pair this guide with the backup and security runbooks to keep operations predictable as the platform grows.