Scaling Lyft Data
Use this playbook when jobs start to backlog, worker utilisation spikes, or new workloads arrive. Lyft Data scales horizontally by adding workers and vertically by increasing resources per node.
Know when to scale
Monitor these signals (see Monitoring Lyft Data for how to collect them):
| Metric | Action threshold | Typical response |
|---|---|---|
| Worker CPU usage | > 80% sustained | Add workers or increase CPU |
| Worker memory usage | > 85% sustained | Increase memory or adjust batch size |
| Job queue length | > 100 pending | Add workers / tune scheduling |
| Job runtime | 2× baseline | Profile actions, add capacity |
| Error/retry rate | > 5% | Investigate downstream systems before scaling |
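For a quick host-level spot check of the CPU and memory rows above, outside your monitoring stack, something like the following works on Linux workers; it assumes the sysstat package provides `mpstat`, and the thresholds are the ones from the table.

```shell
# Sample CPU for one minute (12 x 5 s) and report average utilisation,
# then current memory usage, to compare against the table's thresholds.
mpstat 5 12 | awk '/Average/ {print 100 - $NF "% CPU used over the sample window"}'
free -m | awk '/Mem:/ {printf "%.0f%% memory used\n", $3/$2*100}'
```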
Horizontal scaling (add workers)
- Provision new worker hosts or containers as close to your data sources as possible to minimise latency.
- Point the worker at the control plane and supply credentials:
  ```shell
  lyftdata run worker \
    --url https://lyftdata.example.com \
    --worker-id worker-02 \
    --worker-api-key "$LYFTDATA_WORKER_API_KEY"
  ```
- For fleets that rely on auto enrollment, set `LYFTDATA_AUTO_ENROLLMENT_KEY` on the server once and provide the same value to each worker (see the sketch after this list). Use it only on trusted networks.
- Verify the worker registers successfully via the UI or `curl -s https://lyftdata.example.com/api/workers | jq '.[].id'`.
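If you use auto enrollment rather than per-worker API keys, the flow looks roughly like the sketch below. It assumes the server and each worker read `LYFTDATA_AUTO_ENROLLMENT_KEY` from their environment as described above; generating the secret with `openssl` and deriving the optional `LYFTDATA_WORKER_NAME` from the hostname are illustrative choices, not requirements.

```shell
# On the control plane (set once, e.g. in the server's environment file):
export LYFTDATA_AUTO_ENROLLMENT_KEY="$(openssl rand -hex 32)"

# On each new worker host: supply the same key plus an optional display name.
export LYFTDATA_AUTO_ENROLLMENT_KEY="<same value as the server>"   # distribute via your secret store
export LYFTDATA_WORKER_NAME="worker-$(hostname -s)"
lyftdata run worker --url https://lyftdata.example.com
```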
Worker configuration checklist
- `LYFTDATA_URL` – HTTPS URL for the control plane.
- Identity – either `LYFTDATA_WORKER_ID` + `LYFTDATA_WORKER_API_KEY`, or `LYFTDATA_AUTO_ENROLLMENT_KEY` with an optional `LYFTDATA_WORKER_NAME`.
- `LYFTDATA_WORKER_LABELS` – optional comma-separated descriptors surfaced in the UI for filtering and documentation.
- `--limit-job-capabilities` – start the worker with this flag when it should only accept basic connectors (handy for hardened ingress zones).
- Systemd or your orchestrator should manage the worker process so it restarts automatically after upgrades or crashes (an environment-file sketch follows this list).
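One way to wire the checklist together is an environment file that your supervisor loads before starting the worker. A minimal sketch, assuming the path `/etc/lyftdata/worker.env` and example label values:

```shell
# Example path and values only; keep real API keys in your secret store.
cat > /etc/lyftdata/worker.env <<'EOF'
LYFTDATA_URL=https://lyftdata.example.com
LYFTDATA_WORKER_ID=worker-02
LYFTDATA_WORKER_API_KEY=replace-with-value-from-your-secret-store
LYFTDATA_WORKER_LABELS=region=us-east-1,zone=ingress
EOF

# Export the variables and start the worker; --limit-job-capabilities keeps
# this instance restricted to basic connectors for hardened ingress zones.
set -a; . /etc/lyftdata/worker.env; set +a
lyftdata run worker --limit-job-capabilities
```

Under systemd the same file can be referenced with `EnvironmentFile=` so restarts and upgrades pick the configuration up automatically.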
Workers advertise their capabilities back to the server. Community Edition limits concurrent jobs to five per worker; licensed deployments default to ten.
Vertical scaling (bigger boxes)
When horizontal scale is not practical:
- Increase CPU/memory allocations for the server or existing workers.
- Ensure disks that host the staging directory have headroom; adjust `--db-retention-days` or `--disk-usage-max-percentage` if cleanup is too aggressive (a headroom spot check follows this list).
- Revisit batch sizes and action efficiency so additional resources translate into throughput.
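Before loosening the retention or disk-usage flags, confirm how much headroom the staging disk actually has. A minimal spot check, assuming GNU `df` and an example staging path:

```shell
# Example path only; point this at the directory your deployment actually uses.
STAGING_DIR=/var/lib/lyftdata/staging
USED_PCT=$(df --output=pcent "$STAGING_DIR" | tail -1 | tr -dc '0-9')
if [ "$USED_PCT" -ge 80 ]; then
  echo "staging disk at ${USED_PCT}% - grow the volume before relaxing cleanup flags"
fi
```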
Automate worker scaling
- Poll `/api/health` and `/api/workers` on a schedule to collect utilisation, queue depth, and last-seen timestamps. Persist the readings so you can compare moving averages instead of reacting to single spikes (see the polling sketch after this list).
- Feed those metrics into your autoscaling tool (Kubernetes HPA, AWS ASG, Nomad autoscaler, etc.). A typical policy adds a worker when queue depth stays above your threshold for several minutes.
- During scale-in, drain jobs first: pause new assignments for the worker, wait for active runs to complete, then stop the process.
- Alert whenever automation fails (for example, API calls start returning errors or workers flap repeatedly) so operators can intervene manually.
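A polling step for the first bullet might look like the sketch below. The `/api/health` and `/api/workers` endpoints are the ones referenced in this guide, but the queue-depth field name (`pending_jobs`) is an assumption; check it against your server's actual payload before wiring this into an autoscaler.

```shell
BASE=https://lyftdata.example.com

# Collect the signals: queue depth from /api/health (field name assumed),
# worker count from /api/workers.
QUEUE=$(curl -s "$BASE/api/health" | jq -r '.pending_jobs // 0')
WORKERS=$(curl -s "$BASE/api/workers" | jq 'length')

# Persist readings so decisions can use moving averages rather than single spikes.
echo "$(date -Is) queue=$QUEUE workers=$WORKERS" >> /var/log/lyftdata-scale.log

# Example trigger matching the table above; smoothing over several minutes
# belongs in your autoscaling tool or a wrapper around this script.
if [ "$QUEUE" -gt 100 ]; then
  echo "queue depth $QUEUE above threshold - consider adding a worker"
fi
```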
Placement strategies
- Region-aware: co-locate workers with data sources to reduce egress and latency.
- Capability-aware: dedicate workers to specialised connectors or sensitive networks and label them so runbooks show where to deploy specific jobs (label examples follow this list).
- Redundancy: keep spare capacity in another availability zone for failover scenarios.
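Labels are how the region- and capability-aware strategies stay visible in the UI and in runbooks. `LYFTDATA_WORKER_LABELS` is the documented variable; the key=value naming scheme and the `labels` field assumed in the `/api/workers` response below are illustrative and worth verifying against your deployment.

```shell
# Start a standby worker in another region, tagged for warehouse connectors.
export LYFTDATA_WORKER_LABELS="region=eu-west-1,capability=warehouse,tier=standby"
lyftdata run worker --url https://lyftdata.example.com

# List workers for a given region (assumes a string `labels` field in the API).
curl -s https://lyftdata.example.com/api/workers |
  jq -r '.[] | select((.labels // "") | contains("region=eu-west-1")) | .id'
```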
Keep jobs efficient
Scaling is easier when jobs are lean:
- Split large multi-purpose jobs into smaller stages connected by worker channels.
- Use `filter` early to drop unnecessary records.
- Profile Lua scripts and external lookups; cache results where possible.
- Adjust scheduling cadence (see Advanced scheduling) to smooth bursty workloads.
Review after scaling
- Validate that job runtimes and queue lengths return to normal.
- Update documentation/runbooks with the new topology.
- Revisit resource budgets monthly; scale down when demand shrinks to save cost.
Pair this guide with the backup and security runbooks to keep operations predictable as the platform grows.