Transforming Data
Lyft Data jobs turn raw payloads into analytics-ready events by chaining actions between the input and output. This guide shows how to choose the right transformation tools, keep schemas predictable, and test changes safely.
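As a rough sketch, a job definition chains an input, a list of actions, and an output. Only the actions list below follows the form used verbatim in the pattern library at the end of this guide; the input and output blocks are hypothetical placeholders, so follow the schema your deployment actually uses.

```yaml
# Minimal sketch of a job definition. The input and output blocks are
# hypothetical; the actions list matches the form used in the pattern library.
input:
  kafka:
    topic: raw-payloads          # hypothetical source
actions:
  - json:
      input-field: payload       # parse the raw JSON string into event fields
output:
  s3:
    bucket: analytics-events     # hypothetical sink
```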
Plan the transformation graph
Jobs can run without any actions at all, but most pipelines layer several transformations. Sketch the desired event shape first: which fields must be present, which optional attributes can fall back to defaults, and how the output consumer expects the data to be partitioned. That model determines which actions you need and the order they should run.
Useful questions before you start coding:
- Does the input already emit JSON, or do you need to parse delimited text?
- Are nested arrays acceptable downstream, or should they be flattened into individual events?
- Which fields require type conversions (strings to numbers, epoch timestamps to ISO-8601)?
- What validation rules should block a deployment versus simply dropping the event?
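For instance, sketching the target shape up front might look like the annotated event below; every field name here is illustrative, not a requirement of any action:

```yaml
# Illustrative target event shape (field names are hypothetical).
order_id: "A-1042"                    # required: downstream joins on this key
status: 200                           # required: converted from string to number
"@timestamp": "2024-05-01T12:00:00Z"  # required: normalised to ISO-8601
region: "us-east-1"                   # optional: falls back to a context default when absent
items: []                             # nested array: decide whether to expand into separate events
```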
Parse raw payloads
Choose the parser that matches the incoming format:
- `json` promotes a JSON string field into event fields without writing a script.
- `csv` turns comma-delimited rows into structured fields and can infer numeric types when `autoconvert` is enabled.
- `key-value` handles log lines like `foo=bar level=warn` by letting you specify separators.
- `extract` applies regular expressions with capture groups when the payload follows a predictable pattern but lacks delimiters.
When working with documents that contain arrays or nested objects, follow up with `expand-events` to break the array into multiple events or `flatten` to move nested keys to dotted field names (`user.id`, `user.email`).
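For example, a minimal parsing sketch might combine `key-value` with `expand-events`. The separator option names (`pair-separator`, `kv-separator`) and the `items` field are assumptions, so adapt them to the options your worker actually exposes:

```yaml
actions:
  - key-value:
      input-field: raw_line       # e.g. "user=alice level=warn items=[a,b,c]"
      pair-separator: " "         # hypothetical option: splits pairs from each other
      kv-separator: "="           # hypothetical option: splits each key from its value
  - expand-events:
      input-field: items          # hypothetical array field; emits one event per element
```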
Add and reshape fields
Once the base structure is in place, layer field edits:
- `add` inserts literals, context placeholders (`{{environment}}`), or runtime expansions (`${msg|...}`) without overwriting existing data unless you set `overwrite: true`.
- `copy` and `rename` move fields around the schema without scripting.
- `remove` drops fields that downstream systems do not need, keeping payloads lean.
- `convert` coerces strings into numbers, booleans, or timestamps based on the type hints you supply.
- `time` normalises timestamps, applies time-zone rules, and can tag records that fall inside business-hour windows.
- `script` (Lua) remains the escape hatch for conditional logic, hashing, or functions not covered by the declarative actions; use it sparingly so pipelines stay readable.
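A reshaping sketch might look like the following; the `convert` block mirrors the pattern library, while the `rename` and `remove` option names are assumptions modeled on it:

```yaml
actions:
  - rename:
      input-field: usr_id          # hypothetical upstream field name
      output-field: user.id        # dotted name downstream consumers expect
  - remove:
      fields: [debug_blob]         # hypothetical field nobody reads downstream
  - convert:
      fields:
        status: num                # "200" (string) becomes 200 (number)
```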
Remember the difference between context placeholders and runtime expansions: `{{ }}` placeholders resolve when the job is staged or deployed, while `${ }` expansions evaluate per event inside the worker. Use context placeholders for environment-specific constants, and runtime expansions for event data, message payloads, or scheduler metadata.
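A compact way to see the difference, using the `add` action and the `now_time_iso` expansion that appears in the pattern library:

```yaml
actions:
  - add:
      output-fields:
        env: "{{environment}}"                 # context placeholder: resolved once, at stage/deploy time
        processed_at: "${time|now_time_iso}"   # runtime expansion: evaluated per event inside the worker
```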
Guard the pipeline
Transformations should fail loudly when the data no longer matches expectations:
- `filter` drops events that do not meet quality bars (for example, missing required fields or falling outside an allow-listed region).
- `assert` stops the job when a condition fails, which is ideal for staging environments where schema drift must block promotion.
- `abort` terminates the run immediately with a clear error message when an unrecoverable condition occurs.
- `message` emits structured control-plane messages that other jobs can subscribe to, useful for triggering alerts or follow-up pipelines when specific patterns appear.
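For example, a guard sketch might drop incomplete events and signal another pipeline. The `filter` expression follows the pattern-library syntax; the `message` option names are assumptions:

```yaml
actions:
  - filter:
      how:
        expression: "exists(event.user_id)"    # drop events missing a required field
  - message:
      topic: schema-alerts                     # hypothetical option: channel other jobs subscribe to
      body: "unexpected payload shape in {{environment}}"   # hypothetical option: message content
```

In practice you would gate the message behind its own condition, or run it in a dedicated alerting job, rather than emitting it for every event.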
Pair these safeguards with metrics from your output or the Job Status view so operators spot regressions quickly.
Test iteratively
Use the Visual Editor’s Run & Trace feature after each significant change. It posts the current job definition to /api/jobs/run, executes it once on a worker, and streams each step’s input and output back to the browser. Inspect the traces to confirm field names, types, and counts before you stage a new version.
When the pipeline passes local checks:
- Save and Stage the job so the configuration becomes immutable.
- Deploy to a non-production worker and watch Operate > Job status for event throughput and validation errors.
- Promote to production once the metrics look healthy, following the deployment workflow in Deploying jobs or CI/CD automation.
Pattern library
Here are compact examples you can adapt:
Convert text logs into structured events
```yaml
actions:
  - extract:
      input-field: raw_line
      pattern: '(?P<ip>\d+\.\d+\.\d+\.\d+) - (?P<user>\w+) "(?P<verb>\w+) (?P<path>[^"]+)" (?P<status>\d+)'
  - convert:
      fields:
        status: num
  - add:
      output-fields:
        received_at: "${time|now_time_iso}"
```
Flatten nested JSON and enrich with context
```yaml
actions:
  - json:
      input-field: payload
  - flatten:
      input-field: event
  - add:
      output-fields:
        dataset: "{{dataset}}"
        env: "{{environment}}"
  - filter:
      how:
        expression: "event.type == 'purchase'"
```
Guard schema drift in staging
```yaml
actions:
  - assert:
      condition: "exists(event.order_id)"
      message: "order_id missing from upstream payload"
  - time:
      input-field: event.timestamp
      output-field: '@timestamp'
      input-formats:
        - default_iso
      output-format: default_iso
```
Collect these snippets in your team runbook so new pipelines start from proven patterns instead of reinventing transformations each time.