
Reducing Data Volume

Edge workers can discard noise and compress payloads before data leaves a site. The actions below mirror the Hotrod guidance but use Lyft Data’s DSL and runtime semantics so remote jobs can keep bandwidth and licensing costs in check.

Drop events early

Use the filter action to keep only events that match fixed patterns or Lua conditions. Multiple filters can be chained—one to match fields, another to gate by a numeric threshold.

actions:
  - filter:
      field-pattern-pairs:
        - severity: 'high'
        - source: '^GAUTENG-'
  - filter:
      condition: speed > 1

If you only want specific keys to survive, switch the filter into schema mode so that any other fields are dropped in place:

actions:
  - filter:
      schema:
        - source
        - destination
        - sent_kilobytes_per_sec

Combine this with the remove and rename actions to strip temporary fields or shorten key names before handing events to the next hop.
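
For instance, a schema filter can keep a helper field just long enough for a numeric check, after which remove strips it and rename shortens what remains. The following is only a rough sketch that chains the actions in sequence and reuses the illustrative field names from the examples on this page:

actions:
  - filter:
      schema:
        - source
        - destination
        - sent_kilobytes_per_sec
        - speed                   # helper kept only for the threshold check below
  - filter:
      condition: speed > 1
  - remove:
      fields: ["speed"]           # strip the helper once it has served its purpose
  - rename:
      key-value-pairs:
        - sent_kilobytes_per_sec=sent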

Forward only changes

The stream action keeps a running value and emits deltas. Set only-changes so that identical samples disappear, and Lyft Data will also include an elapsed time field for context.

actions:
  - stream:
      delta: true
      watch: throughput
      only-changes: true
      output-field: delta
      elapsed-field: elapsed_ms
  - filter:
      condition: delta != 0

This pattern is ideal for forwarding counters or gauges that rarely change but must be monitored continuously.

Trim payload fields

After filtering, the remove action drops helper keys (optionally via regular expressions) and rename shortens long field names so JSON payloads shrink on the wire:

actions:
  - remove:
      fields: ["_raw", "debug"]
  - rename:
      key-value-pairs:
        - source=s
        - destination=d
        - sent_kilobytes_per_sec=sent

Compact payloads for transport

If your downstream systems accept batches, the biggest wins usually come from batching and compressing at the output boundary:

  • Use the output's batch.wrap-as-json setting to send JSON arrays instead of many small JSON documents.
  • Use preprocessors: [gzip] on object-store outputs to compress each payload before upload.

Example: batch + gzip to S3

output:
  s3:
    bucket-name: logs-archive
    object-name:
      name: "@{job}/run-${stat|_BATCH_NUMBER}.json.gz"
    preprocessors:
      - gzip
    batch:
      mode: fixed
      fixed-size: 500
      timeout: 1s
      wrap-as-json: true
    retry:
      timeout: 30s
      retries: 5

With these primitives, edge workers can minimize bandwidth and storage costs while still delivering complete records to central collectors.
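
Putting the pieces together, a single job can run the actions in sequence and hand the survivors to the batched, gzipped S3 output shown above. The sketch below assumes actions and output can sit in the same job file and reuses the illustrative field names from earlier sections; the input side of the job is omitted:

actions:
  - filter:
      field-pattern-pairs:
        - severity: 'high'     # drop noise as early as possible
  - remove:
      fields: ["_raw", "debug"]
  - rename:
      key-value-pairs:
        - sent_kilobytes_per_sec=sent
output:
  s3:
    bucket-name: logs-archive
    object-name:
      name: "@{job}/run-${stat|_BATCH_NUMBER}.json.gz"
    preprocessors:
      - gzip                   # compress each batch before upload
    batch:
      mode: fixed
      fixed-size: 500
      timeout: 1s
      wrap-as-json: true       # one JSON array per object instead of many documents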