# Transform Google Analytics 4 Exports
This how-to converts GA4 export files into a predictable event schema you can route to storage, APIs, or downstream jobs.
## What this pipeline should produce
- One event per GA4 `events[]` item.
- A canonical `@timestamp` field.
- Consistent, flattened field names for downstream consumers.
- Optional AI-derived labels (Enterprise) for segmentation.
## 1. Configure the input
- Create a job (for example `ga4-normalized`).
- Select your object-store input (`s3`, `gcs`, `azure-blob`, or `file-store`).
- Set the object prefix (for example `exports/ga4/daily/`) and enable fingerprinting.
- Under response handling, split records from the `events` array.
Tip: keep replay simple by using date-based prefixes and rotating job versions rather than editing production definitions in place.
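The date-based prefixes above make replay a matter of enumerating days. A minimal Python sketch of such a helper (`daily_prefixes` is hypothetical, and the `exports/ga4/daily/` layout is just the example prefix from this section):

```python
from datetime import date, timedelta

def daily_prefixes(start: date, end: date, base: str = "exports/ga4/daily/") -> list:
    """Build one date-based object prefix per day in [start, end], inclusive."""
    days = (end - start).days + 1
    return [f"{base}{start + timedelta(days=i):%Y-%m-%d}/" for i in range(days)]
```

Feeding these prefixes to a replay job version keeps production definitions untouched.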
## 2. Baseline transformation stack
```yaml
actions:
  - json:
      input-field: data
  - expand-events:
      array-field: events
  - flatten:
      input-field: events
      separator: "."
  - rename:
      fields:
        "events.event_params.key": param_key
        "events.event_params.value.string_value": param_value
  - filter:
      how:
        expression: "events.name == 'purchase'"
  - convert:
      fields:
        events.event_timestamp: num
      units:
        - field: events.event_timestamp
          from: microseconds
          to: milliseconds
  - time:
      input-field: events.event_timestamp
      input-formats:
        - epoch_msecs
      output-field: '@timestamp'
      output-format: default_iso
  - add:
      output-fields:
        dataset: "{{dataset}}"
        environment: "{{environment}}"
        source_object: "${msg|message_content.object_name||unknown}"
```

### Why this order

- Parse first (`json`, `expand-events`).
- Shape the schema (`flatten`, `rename`, `filter`, `convert`).
- Add operational metadata last (`time`, `add`), including timestamp normalization.
This ordering keeps traces easier to read and reduces accidental field drift.
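The parse-and-shape steps can be approximated in plain Python to reason about what each stage emits. This is a rough sketch of the semantics, not the pipeline's actual implementation:

```python
import json

def flatten(obj: dict, prefix: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into dotted keys, like the flatten action."""
    out = {}
    for key, value in obj.items():
        full = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, full, sep))
        else:
            out[full] = value
    return out

def expand_events(record: dict) -> list:
    """Emit one flattened record per item in the GA4 events[] array."""
    return [flatten({"events": event}) for event in record.get("events", [])]

raw = '{"events": [{"name": "purchase", "event_timestamp": 1715000000000000}]}'
rows = expand_events(json.loads(raw))
# rows[0] == {"events.name": "purchase", "events.event_timestamp": 1715000000000000}
```

One record in with two array items would yield two rows, which is exactly what the Run & Trace count check later verifies.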
`events.event_timestamp` in GA4 exports is typically in microseconds, so this example normalizes it to milliseconds before using `time` with `epoch_msecs`. If your upstream payload already uses milliseconds, remove the `units` conversion.
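The microseconds-to-milliseconds-to-ISO chain is easy to sanity-check in Python (`ga4_micros_to_iso` is an illustrative helper, not part of the pipeline):

```python
from datetime import datetime, timezone

def ga4_micros_to_iso(event_timestamp_us: int) -> str:
    """GA4 event_timestamp (microseconds) -> ISO-8601 UTC @timestamp."""
    millis = event_timestamp_us // 1000  # microseconds -> milliseconds
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat(timespec="milliseconds")

# e.g. ga4_micros_to_iso(1715000000000000) -> "2024-05-06T12:53:20.000+00:00"
```

Running real samples through a check like this confirms you picked the right unit before deploying.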
## 3. Optional AI enrichment (Enterprise)
Use `infer` when you want automated classification (for example purchase intent, campaign grouping, or anomaly flags) without a separate service.
```yaml
- infer:
    workload:
      llm-completion:
        llm:
          provider: openai-compat
          model: your-model
    input-field: events.name
    response-field: ai_labels
    response-format: json
    prompt:
      schema: '{"type":"object"}'
    timeout-ms: 10000
    on-error: skip
```

Recommended defaults for production: set `rate-limit`, `concurrency`, and `cache`, and validate the `ai_labels` shape in Run & Trace.
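The `response-format: json` plus `on-error: skip` contract amounts to: parse the model output as a JSON object, and drop the enrichment rather than fail the record when parsing breaks. A minimal sketch of that behavior (`parse_ai_labels` is a hypothetical helper, not a product API):

```python
import json

def parse_ai_labels(raw_response: str):
    """Return the model's JSON object, or None to mimic on-error: skip."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # on-error: skip -> record passes through unenriched
    # The prompt schema requires an object, so reject arrays/scalars too.
    return parsed if isinstance(parsed, dict) else None
```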
## 4. Add quality gates
Use filters for recoverable issues and assertions for hard contract failures.
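The filter-versus-assert distinction can be made concrete with two small Python functions (illustrative only; the dotted field names assume the flattened schema from step 2):

```python
def passes_filter(record: dict) -> bool:
    """Recoverable gate: mirrors exists(events.user_pseudo_id).

    Records failing this are simply dropped, not treated as errors."""
    return record.get("events.user_pseudo_id") is not None

def assert_contract(record: dict) -> None:
    """Hard gate: mirrors the required-fields assertion; raises to abort."""
    required = ["events.event_timestamp", "events.user_pseudo_id"]
    missing = [field for field in required if field not in record]
    if missing:
        raise ValueError(f"contract failure, missing fields: {missing}")
```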
```yaml
- filter:
    how:
      expression: "exists(events.user_pseudo_id)"
- assert:
    schema:
      schema-string: '{"type":"object","required":["events.event_timestamp","events.user_pseudo_id"]}'
    behaviour: abort-on-failure
```

## 5. Run & Trace checklist
- Confirm event count after `expand-events` matches GA4 array size.
- Verify `@timestamp` conversion on real samples.
- Check renamed fields used by downstream dashboards.
- For AI steps, verify `ai_labels` remains valid JSON across multiple runs.
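The first checklist item can be scripted against trace output. A hypothetical helper, to be adapted to however your trace run exposes records:

```python
import json

def expanded_count_matches(raw_input: str, expanded_records: list) -> bool:
    """After expand-events, record count should equal len(events[]) in the input."""
    return len(json.loads(raw_input).get("events", [])) == len(expanded_records)
```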
## 6. Stage, deploy, and monitor
- Stage the job after trace validation.
- Deploy to non-production first and monitor throughput/error rates.
- Promote to production via your standard release path (manual or CI/CD automation).
- Add alerts for parse failures, assertion failures, and output delivery errors.