
Transforming Data

Lyft Data jobs turn raw payloads into analytics-ready events by chaining actions between the input and output. This guide shows how to choose the right transformation tools, keep schemas predictable, and test changes safely.

Plan the transformation graph

Jobs can run without any actions at all, but most pipelines layer several transformations. Sketch the desired event shape first: which fields must be present, which optional attributes can fall back to defaults, and how the output consumer expects the data to be partitioned. That model determines which actions you need and the order they should run.

Useful questions before you start coding:

  • Does the input already emit JSON, or do you need to parse delimited text?
  • Are nested arrays acceptable downstream, or should they be flattened into individual events?
  • Which fields require type conversions (strings to numbers, epoch timestamps to ISO-8601)?
  • What validation rules should block a deployment versus simply dropping the event?

Parse raw payloads

Choose the parser that matches the incoming format:

  • json promotes a JSON string field into event fields without writing a script.
  • csv turns comma-delimited rows into structured fields and can infer numeric types when autoconvert is enabled.
  • key-value handles log lines like foo=bar level=warn by letting you specify separators.
  • extract applies regular expressions with capture groups when the payload follows a predictable pattern but lacks delimiters.

When working with documents that contain arrays or nested objects, follow up with expand-events to break the array into multiple events or flatten to move nested keys to dotted field names (user.id, user.email).
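
For instance, a delimited feed whose rows carry an embedded array could be parsed and then split into per-item events as sketched below. The csv and expand-events option names follow the shape of the pattern library at the end of this guide but are illustrative, so confirm them against the action reference:

actions:
  - csv:
      input-field: raw_row        # hypothetical source field holding the delimited text
      autoconvert: true           # infer numeric types, as noted above
  - expand-events:
      input-field: line_items     # assumed key: emit one event per array element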

Add and reshape fields

Once the base structure is in place, layer field edits:

  • add inserts literals, context placeholders ({{environment}}), or runtime expansions (${msg|...}) without overwriting existing data unless you set overwrite: true.
  • copy and rename move fields around the schema without scripting.
  • remove drops fields that downstream systems do not need, keeping payloads lean.
  • convert coerces strings into numbers, booleans, or timestamps based on the type hints you supply.
  • time normalises timestamps, applies time-zone rules, and can tag records that fall inside business-hour windows.
  • script (Lua) remains the escape hatch for conditional logic, hashing, or functions not covered by the declarative actions; use it sparingly so pipelines stay readable.
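
The sketch below strings a few of these together. The convert syntax matches the pattern library; the rename and remove option keys are assumptions to verify against the action reference:

actions:
  - rename:
      fields:                     # assumed layout: old name -> new name
        usr: user_name
  - convert:
      fields:
        order_total: num          # coerce the string total into a number
  - remove:
      fields:                     # assumed key: drop fields downstream systems do not need
        - debug_blob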

Remember the difference between context placeholders and runtime expansions: {{ }} placeholders resolve when the job is staged or deployed, while ${ } expansions are evaluated per event inside the worker. Use context placeholders for environment-specific constants, and runtime expansions for event data, message payloads, or scheduler metadata.
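
A single add action can mix both kinds; the field names below are arbitrary, and the expansions come from the examples in this guide:

actions:
  - add:
      output-fields:
        env: "{{environment}}"                  # resolved once, when the job is staged or deployed
        received_at: "${time|now_time_iso}"     # evaluated per event on the worker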

Guard the pipeline

Transformations should fail loudly when the data no longer matches expectations:

  • filter drops events that do not meet quality bars (for example, missing required fields or falling outside an allow-listed region).
  • assert stops the job when a condition fails, which is ideal for staging environments where schema drift must block promotion.
  • abort terminates the run immediately with a clear error message when an unrecoverable condition occurs.
  • message emits structured control-plane messages that other jobs can subscribe to, useful for triggering alerts or follow-up pipelines when specific patterns appear.

Pair these safeguards with metrics from your output or the Job Status view so operators spot regressions quickly.
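
For example, the region allow-list mentioned above can be enforced with a single filter; the expression value is illustrative, and the syntax matches the filter in the pattern library below:

actions:
  - filter:
      how:
        expression: "event.region == 'eu-west-1'"   # events that fail the check are dropped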

Test iteratively

Use the Visual Editor’s Run & Trace feature after each significant change. It posts the current job definition to /api/jobs/run, executes it once on a worker, and streams each step’s input and output back to the browser. Inspect the traces to confirm field names, types, and counts before you stage a new version.

When the pipeline passes local checks:

  1. Save and Stage the job so the configuration becomes immutable.
  2. Deploy to a non-production worker and watch Operate > Job status for event throughput and validation errors.
  3. Promote to production once the metrics look healthy, following the deployment workflow in Deploying jobs or CI/CD automation.

Pattern library

Here are compact examples you can adapt:

Convert text logs into structured events

actions:
  - extract:
      input-field: raw_line
      pattern: '(?P<ip>\d+\.\d+\.\d+\.\d+) - (?P<user>\w+) "(?P<verb>\w+) (?P<path>[^"]+)" (?P<status>\d+)'
  - convert:
      fields:
        status: num
  - add:
      output-fields:
        received_at: "${time|now_time_iso}"

Flatten nested JSON and enrich with context

actions:
  - json:
      input-field: payload
  - flatten:
      input-field: event
  - add:
      output-fields:
        dataset: "{{dataset}}"
        env: "{{environment}}"
  - filter:
      how:
        expression: "event.type == 'purchase'"

Guard schema drift in staging

actions:
  - assert:
      condition: "exists(event.order_id)"
      message: "order_id missing from upstream payload"
  - time:
      input-field: event.timestamp
      output-field: '@timestamp'
      input-formats:
        - default_iso
      output-format: default_iso

Collect these snippets in your team runbook so new pipelines start from proven patterns instead of reinventing transformations each time.