
Google Cloud Storage (GCS)


Lyft Data ships first-class support for reading from and writing to Google Cloud Storage (GCS).

Configure Lyft Data to read from GCS

Add the gcs input to a job. Key fields (names shown exactly as they appear in the job spec):

  • bucket-name – target bucket (required).
  • object-names – list of object names or prefixes. Leave empty to operate on every object surfaced by the selected mode; be cautious with very large buckets.
  • mode – choose list, download, or list-and-download depending on whether you need metadata only, already-identified objects, or listing plus download.
  • ignore-linebreaks – set when an object should be surfaced as a single event instead of newline-delimited events.
  • timestamp-mode – derive timestamps from creation time, last modified time, or processing time to drive downstream filtering.
  • include-regex / exclude-regex and maximum-age – reduce the candidate list before downloading by pattern or by an age window like 2h or 36h45m.
  • fingerprinting and maximum-fingerprint-age – prevent re-processing by persisting object hashes, and control how long fingerprints are kept.
  • credentials – provide a service-account JSON, application default credentials, or a gs:// URL through the available GcsCredentials variants. Values can be inlined or supplied via context interpolation.
  • preprocessors – configure gzip/parquet/base64/extension handlers for content transformation during ingest.

Example: list and download JSON objects

input:
  gcs:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download
    ignore-linebreaks: true
    include-regex:
      - "\\.json(\\.gz)?$"
    maximum-age: 6h
    fingerprinting: true
    timestamp-mode: last-modified
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_reader}
    preprocessors:
      - gzip

This configuration lists objects under exports/daily/, filters to recent JSON or JSON.gz files, downloads each object once, and surfaces the entire payload as a single event.
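
Example: download specific, already-known objects

When upstream tooling already knows the object names, a listing pass is unnecessary and the download mode fetches the named objects directly. The sketch below is illustrative: the bucket and object names are assumptions, and only fields documented above are used.

input:
  gcs:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/2025-09-17-part-0001.json
      - exports/daily/2025-09-17-part-0002.json
    mode: download
    fingerprinting: true
    maximum-fingerprint-age: 24h
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_reader}

Because no listing is performed, include-regex, exclude-regex, and maximum-age play no role here; fingerprinting still prevents the same object from being downloaded twice within the 24h window.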

Configure Lyft Data to write to GCS

Add the gcs output to a job. Key fields:

  • bucket-name – destination bucket (required).
  • object-name – literal or field-derived destination via ObjectDestination. Use name: ... for a fixed path or field: ... to reuse an event field.
  • mode – set to put (default) to upload events or delete to remove objects by name.
  • disable-object-name-guid, guid-prefix, guid-suffix – control the GUID that is automatically added to uploaded object names to avoid collisions. Disabling the GUID requires that your object names are already unique.
  • input-field – choose the field that contains the body to upload. When omitted, the full event is serialized after preprocessors run.
  • batch & retry – tune batching for throughput and configure backoff/max-attempts for robustness.
  • track-schema – enable when writing JSON so __SCHEMA_NUMBER is updated alongside the payload.
  • credentials – reuse the same GcsCredentials forms as the input (service account key, application credentials, or gs:// URLs).
  • preprocessors – run gzip/base64/extension handlers before the payload is written to GCS.

Example: upload processed events with deterministic names

output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      name: processed/2025-09-17/summary.json
    disable-object-name-guid: true
    input-field: payload
    preprocessors:
      - gzip
    track-schema: true
    credentials:
      service-account:
        path: /etc/lyftdata/service-account.json

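Example: upload each event as its own object

When every processed event should land as a separate object, the automatic GUID can stay enabled and be combined with guid-prefix so uploads group under a common path. The sketch below is illustrative: the bucket, prefix, and object name are assumptions, and the exact way the GUID is composed with the prefix and name follows the field descriptions above rather than a verified layout.

output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      name: events.json
    guid-prefix: processed/2025/09/17/
    input-field: payload
    credentials:
      service-account:
        path: /etc/lyftdata/service-account.json

Because disable-object-name-guid is not set, each upload keeps its unique GUID, so repeated events never overwrite one another.
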
Example: delete source objects after successful processing

output:
  gcs:
    bucket-name: analytics-prod
    object-name:
      field: object_name
    mode: delete
    credentials:
      application-credentials:
        key: ${secrets.analytics_gcs_writer}

The delete output expects each incoming event to carry the object name (for example from the GCS input). Deletion runs without generating GUID prefixes.

Recommendations for files and folders

  • keep individual files below 100–150 MB (gzip- or Parquet-compressed) for predictable processing latency.
  • organise exported data with directory-style prefixes such as Y/M/, Y/M/D/, or Y/M/D/H/ (see the layout sketch below).
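
As an illustration of the prefix recommendation, a bucket laid out as Y/M/D/H might look like the following; the object names are hypothetical:

analytics-prod-archive/
  processed/
    2025/09/17/00/part-0001.json.gz
    2025/09/17/00/part-0002.json.gz
    2025/09/17/01/part-0001.json.gz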