
Google Cloud Storage (GCS)


Lyft Data ships first-class support for reading from and writing to Google Cloud Storage (GCS).

Configure Lyft Data to read from GCS

Add the gcs input to a job. Key fields (names shown exactly as they appear in the job spec):

  • bucket-name – target bucket (required).
  • object-names – list of object names or prefixes. Leave empty to operate on every object surfaced by the selected mode; be cautious with very large buckets.
  • mode – choose list, download, or list-and-download depending on whether you need metadata only, already-identified objects, or listing plus download.
  • ignore-linebreaks – set when an object should be surfaced as a single event instead of newline-delimited events.
  • timestamp-mode – derive timestamps from creation time, last modified time, or processing time to drive downstream filtering.
  • include-regex / exclude-regex and maximum-age – reduce the candidate list before downloading by pattern or by an age window like 2h or 36h45m.
  • fingerprinting and maximum-fingerprint-age – prevent re-processing by persisting object hashes, and control how long fingerprints are kept.
  • credentials – provide a service-account JSON, application default credentials, or a gs:// URL through the available GcsCredentials variants. Values can be inlined or supplied via context interpolation.
  • preprocessors – configure gzip/parquet/base64/extension handlers for content transformation during ingest.

Example: list and download JSON objects

input:
  gcs:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download
    ignore-linebreaks: true
    include-regex:
      - "\\.json(\\.gz)?$"
    maximum-age: 6h
    fingerprinting: true
    timestamp-mode: last-modified
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_reader}
    preprocessors:
      - gzip

This configuration lists objects under exports/daily/, filters to recent JSON or JSON.gz files, downloads each object once, and surfaces the entire payload as a single event.
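
Example: download specific, already-known objects

When upstream tooling already knows the object names, a listing pass is unnecessary and the download mode fetches the named objects directly. The sketch below is illustrative: the bucket and object names are assumptions, and only fields documented above are used.

input:
  gcs:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/2025-09-17-part-0001.json
      - exports/daily/2025-09-17-part-0002.json
    mode: download
    fingerprinting: true
    maximum-fingerprint-age: 24h
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_reader}

Because no listing is performed, include-regex, exclude-regex, and maximum-age play no role here; fingerprinting still prevents the same object from being downloaded twice within the 24h window.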

Configure Lyft Data to write to GCS

Add the gcs output to a job. Key fields:

  • bucket-name – destination bucket (required).
  • object-name – literal or field-derived destination via ObjectDestination. Use name: ... for a fixed path or field: ... to reuse an event field.
  • mode – set to put (default) to upload events or delete to remove objects by name.
  • disable-object-name-guid, guid-prefix, guid-suffix – control the GUID that is automatically added to uploaded object names to avoid collisions. Disabling the GUID requires that your object names are already unique.
  • input-field – choose the field that contains the body to upload. When omitted, the full event is serialized after preprocessors run.
  • batch & retry – tune batching for throughput and configure backoff/max-attempts for robustness.
  • track-schema – enable when writing JSON so __SCHEMA_NUMBER is updated alongside the payload.
  • credentials – reuse the same GcsCredentials forms as the input (service account key, application credentials, or gs:// URLs).
  • preprocessors – run gzip/base64/extension handlers before the payload is written to GCS.

Example: upload processed events with deterministic names

output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      name: processed/2025-09-17/summary.json
    disable-object-name-guid: true
    input-field: payload
    preprocessors:
      - gzip
    track-schema: true
    credentials:
      service-account:
        path: /etc/lyftdata/service-account.json

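Example: upload each event as its own object

When every processed event should land as a separate object, the automatic GUID can stay enabled and be combined with guid-prefix so uploads group under a common path. The sketch below is illustrative: the bucket, prefix, and object name are assumptions, and the exact way the GUID is composed with the prefix and name follows the field descriptions above rather than a verified layout.

output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      name: events.json
    guid-prefix: processed/2025/09/17/
    input-field: payload
    credentials:
      service-account:
        path: /etc/lyftdata/service-account.json

Because disable-object-name-guid is not set, each upload keeps its unique GUID, so repeated events never overwrite one another.
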
Example: delete source objects after successful processing

output:
  gcs:
    bucket-name: analytics-prod
    object-name:
      field: object_name
    mode: delete
    credentials:
      application-credentials:
        key: ${secrets.analytics_gcs_writer}

The delete output expects each incoming event to carry the object name (for example from the GCS input). Deletion runs without generating GUID prefixes.

Recommendations for files and folders

  • keep individual files below 100–150 MB (gzip- or Parquet-compressed) for predictable processing latency.
  • organise exported data with directory-style prefixes such as Y/M/, Y/M/D/, or Y/M/D/H/ (see the layout sketch below).
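
As an illustration of the prefix recommendation, a bucket laid out as Y/M/D/H might look like the following; the object names are hypothetical:

analytics-prod-archive/
  processed/
    2025/09/17/00/part-0001.json.gz
    2025/09/17/00/part-0002.json.gz
    2025/09/17/01/part-0001.json.gz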