Working With Data

Lyft Data assumes every event is a JSON object. Inputs that read plain text wrap each line in a _raw field, and actions mutate that JSON in place. This guide shows how to pull structure out of raw streams, convert values into meaningful units, and emit alternate formats when a downstream system insists on CSV or key/value text.
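
For example, a text input that reads the line error: disk full produces the event below, and every later action adds, renames, or removes keys on that same object (a sketch; the exact input type does not matter):

    { "_raw": "error: disk full" }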

JSON as the baseline

  • Each exec input line becomes { "_raw": "..." } unless you set json: true. Turn long command output into a single event with ignore-linebreaks: true.
  • Capture stdout and stderr separately by naming the fields up front:
    input:
      exec:
        command: uptime
        result-stdout-field: uptime_raw
        result-stderr-field: uptime_errors
        ignore-linebreaks: true
  • When these convenience fields exist you can feed them directly into parsing actions without touching _raw.
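
With that configuration, each run of uptime lands in a single event whose uptime_raw field holds the command's stdout, for example (illustrative; the exact text varies by system):

    { "uptime_raw": "10:15:02 up 3 days,  1 user,  load average: 0.52, 0.58, 0.59" }

Anything the command writes to stderr would land in uptime_errors instead.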

Extract unstructured text

Use extract when data only becomes structured after a regular expression match. Named capture groups become fields; unnamed groups map to output-fields in order. The action is tolerant by default—events that do not match fall through unchanged.

- extract:
    input-field: uptime_raw
    pattern: 'load average: (\S+), (\S+), (\S+)'
    output-fields: [load1, load5, load15]
    remove: true     # drop uptime_raw after parsing
    warnings: true   # attach a warning if the pattern fails
    drop: false      # keep non-matching events moving
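
Run against the uptime event sketched earlier, the match adds three string fields (values are illustrative):

    { "load1": "0.52", "load5": "0.58", "load15": "0.59" }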

Chaining convert immediately after extract turns string captures into real numbers:

- convert:
    conversions:
      - field: load1
        conversion: num
      - field: load5
        conversion: num
      - field: load15
        conversion: num
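
After the chain runs, the same fields hold real numbers, so later comparisons and aggregations behave as expected:

    { "load1": 0.52, "load5": 0.58, "load15": 0.59 }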

Convert values and units

convert can coerce many fields at once. Mix simple type conversions with unit-aware transforms so you do not carry around “512K” or “250ms” strings.

- convert:
    conversions:
      - field: retries
        conversion: num
    units:
      - field: mem_bytes
        from: K
        to: M
      - field: latency
        from: ms
        to: s
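
A rough before/after, assuming the unit transforms rewrite each value in place in the target unit (the exact output formatting depends on the runtime):

    # before
    { "retries": "3", "mem_bytes": "512K", "latency": "250ms" }
    # after
    { "retries": 3, "mem_bytes": 0.5, "latency": 0.25 }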

If a conversion runs into empty values, pick a null-behaviour (keep or default) to avoid surprises.

Expand structured text

CSV and delimited tables

Turn delimited rows into JSON by pointing expand at the field and enabling CSV mode features:

- expand:
    input-field: _raw
    remove: true
    header: true        # first line is a header row
    delim: ','
    name-type-pairs:
      - port:num
      - throughput:num
  • Set gen-headers: true when files omit a header; the runtime emits _0, _1, … with inferred types.
  • Provide explicit schemas with name-type-pairs or field-file (one name:type entry per line).
  • Use header-field, header-field-types, and header-field-on-change to read or emit schema rows alongside the data so downstream jobs can rehydrate the headers later.
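
As an illustration, rows such as these (a header plus two data lines):

    port,throughput
    8080,1250
    9090,3400

come out as one typed JSON event per data row (a sketch of the resulting shape):

    { "port": 8080, "throughput": 1250 }
    { "port": 9090, "throughput": 3400 }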

Key/value logs

Key/value formats become JSON with one flag:

- expand:
    input-field: _raw
    key-value-pairs: true
    autoconvert: true
    delim: ' '
    key-value-delim: '='
    remove: true
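
For example, a log line of host=web1 status=200 bytes=512 in _raw becomes the event below, assuming autoconvert infers numeric types as its name suggests:

    { "host": "web1", "status": 200, "bytes": 512 }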

If the producer repeats keys, pick a strategy with multiple: array (collect all values), first, or last. Supply name-type-pairs to enforce types on specific keys.

Multi-line records

Some tools print blocks of text where each block represents a single event. Capture them with ignore-linebreaks: true on the input or configure expand’s multiline collector:

- expand:
    input-field: _raw
    remove: true
    multiline-var-patterns:
      - 'summary:^Summary: (.*)$'
      - 'details:^Details:$'
    multiline-reset-pattern: '^--+$'

The collector buffers lines until the next pattern matches, then emits a JSON object built from the captured groups.
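
As a rough sketch with the patterns above, a block such as the one below ends when the dashed reset line matches and yields an event that carries at least { "summary": "nightly backup completed" } from the first capture group (illustrative values; exactly what the details field collects depends on how the collector gathers the lines after its marker):

    Summary: nightly backup completed
    Details:
      copied 1.2 GB in 45s
    --------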

Embedded JSON strings

When a field contains quoted JSON, set json: true and let the action merge the decoded object into the event:

- expand:
    input-field: payload
    json: true
    remove: true
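
For example, an event whose payload field holds the string {"user": "alice", "status": 200} ends up as:

    { "user": "alice", "status": 200 }

with the original payload string dropped because of remove: true.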

Splitting one record into many

expand can also fan out events:

  • Delimited strings: set event-field to the destination field name and provide a delimiter (a sample fan-out follows this list).

    - expand:
        input-field: hosts
        event-field: host
        delim: '\n'
        remove: true
        # Emits one event per host while preserving other fields
  • JSON arrays: enable event-expand: true and optionally list JSON Pointers in event-skip-list to keep nested arrays intact.

    - expand:
        input-field: results
        event-expand: true
        event-skip-list:
          - /metadata
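
With the delimited-string settings above, an event such as { "hosts": "web1\nweb2", "region": "us-east" } fans out into two events, each keeping the untouched fields (illustrative values):

    { "host": "web1", "region": "us-east" }
    { "host": "web2", "region": "us-east" }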

Preparing non-JSON output

Hotrod’s raw action is gone; instead, use collapse or templates to stage formatted payloads before sending them to exec, HTTP, or message outputs.

  • Build CSV or key/value strings with collapse (a sample payload follows this list):

    - collapse:
        output-field: payload
        format: kv
        delim: ' '
        key-value-delim: '='
        input-fields: [tenant, host, status]
    output:
      http-post:
        url: https://collector.example.com/ingest
        body-field: payload
  • Prefix structured logs (for example, @cee:) with add templates:

    - add:
        template-result-field: cee_line
        template: |
          @cee: {"status":"${status}","detail":"${detail}"}
    output:
      exec:
        command: 'logger -t lyftdata'
        input-field: cee_line
  • For multi-line bodies, render them into a template-result-field and point http-post.body-field or exec.input-field at that field so the action sends your custom text untouched.
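
With the collapse settings in the first bullet, an event of { "tenant": "acme", "host": "web1", "status": "ok" } would be staged as the payload string below before the HTTP output sends it (a sketch, assuming kv output joins the listed fields in order):

    tenant=acme host=web1 status=ok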

Handy toggles and safeguards

  • remove: drop source fields after expansion so you do not keep both _raw and the parsed JSON.
  • warnings / suppress-warnings: control whether parsing failures are recorded as job attachments.
  • document-mode (inputs) and document-start (actions): reset state when reading batched files or archives.
  • drop on extract: stop events cold when the pattern fails, which is useful for validation pipelines.

Mastering these primitives gives you the same reach Hotrod offered, but with Lyft Data’s structured runtime. Parse raw feeds, reshape them into clean JSON, and emit exactly what your downstream systems expect.