Google Cloud Storage (GCS)
Lyft Data ships first-class support for reading from and writing to Google Cloud Storage (GCS).
Configure Lyft Data to read from GCS
Add the `gcs` input to a job. Key fields (names shown exactly as they appear in the job spec):
- `bucket-name` – target bucket (required).
- `object-names` – list of object names or prefixes. Leave empty to operate on every object surfaced by the selected mode; be cautious with very large buckets.
- `mode` – choose `list`, `download`, or `list-and-download` depending on whether you need metadata only, already-identified objects, or listing plus download.
- `ignore-linebreaks` – set when an object should be surfaced as a single event instead of newline-delimited events.
- `timestamp-mode` – derive timestamps from creation time, last modified time, or processing time to drive downstream filtering.
- `include-regex`/`exclude-regex` and `maximum-age` – reduce the candidate list before downloading, by pattern or by an age window such as `2h` or `36h45m`.
- `fingerprinting` and `maximum-fingerprint-age` – prevent re-processing by persisting object hashes, and control how long fingerprints are kept.
- `credentials` – provide a service-account JSON, application default credentials, or a `gs://` URL through the available `GcsCredentials` variants. Values can be inlined or supplied via context interpolation.
- `preprocessors` – configure gzip/parquet/base64/extension handlers for content transformation during ingest.
Example: list and download JSON objects
```yaml
input:
  gcs:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download
    ignore-linebreaks: true
    include-regex:
      - "\\.json(\\.gz)?$"
    maximum-age: 6h
    fingerprinting: true
    timestamp-mode: last-modified
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_reader}
    preprocessors:
      - gzip
```

This configuration lists objects under `exports/daily/`, filters to recent `.json` or `.json.gz` files, downloads each object once, and surfaces the entire payload as a single event.
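For pipelines that only need metadata (for example, to hand object names to a separate download job), the `list` mode skips payload transfer entirely. The following is a minimal sketch, not a canonical configuration: the bucket, prefix, and secret reference are placeholders, and the `72h` fingerprint window is an assumed value using the same duration syntax as `maximum-age`.

```yaml
input:
  gcs:
    bucket-name: analytics-prod        # placeholder bucket
    object-names:
      - exports/daily/                 # prefix to enumerate
    mode: list                         # metadata only; objects are not downloaded
    exclude-regex:
      - "\\.tmp$"                      # skip writers' in-flight temp files
    maximum-age: 36h45m
    fingerprinting: true
    maximum-fingerprint-age: 72h       # assumed value; keeps fingerprints for three days
    timestamp-mode: last-modified
    credentials:
      application-credentials:
        key: ${secrets.analytics_gcs_reader}   # placeholder secret reference
```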
Configure Lyft Data to write to GCS
Add the `gcs` output to a job. Key fields:
- `bucket-name` – destination bucket (required).
- `object-name` – literal or field-derived destination via `ObjectDestination`. Use `name: ...` for a fixed path or `field: ...` to reuse an event field.
- `mode` – set to `put` (default) to upload events or `delete` to remove objects by name.
- `disable-object-name-guid`, `guid-prefix`, `guid-suffix` – control the automatic GUID that is prepended to uploads to avoid collisions. Disabling the GUID requires that your object names are already unique.
- `input-field` – choose the field that contains the body to upload. When omitted, the full event is serialized after preprocessors run.
- `batch` & `retry` – tune batching for throughput and configure backoff/max-attempts for robustness.
- `track-schema` – enable when writing JSON so `__SCHEMA_NUMBER` is updated alongside the payload.
- `credentials` – reuse the same `GcsCredentials` forms as the input (service-account key, application credentials, or `gs://` URLs).
- `preprocessors` – run gzip/base64/extension handlers before the payload is written to GCS.
Example: upload processed events with deterministic names
```yaml
output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      name: processed/2025-09-17/summary.json
    disable-object-name-guid: true
    input-field: payload
    preprocessors:
      - gzip
    track-schema: true
    credentials:
      service-account:
        path: /etc/lyftdata/service-account.json
```

Example: delete source objects after successful processing
```yaml
output:
  gcs:
    bucket-name: analytics-prod
    object-name:
      field: object_name
    mode: delete
    credentials:
      application-credentials:
        key: ${secrets.analytics_gcs_writer}
```

The delete output expects each incoming event to carry the object name (for example, from the GCS input). Deletion runs without generating GUID prefixes.
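When events already know where they should land, the field-derived form of `object-name` can be combined with the GUID controls instead of disabling them. A sketch under stated assumptions: `archive_path` is a hypothetical event field, and `run-` is an illustrative `guid-prefix` value rather than a documented default.

```yaml
output:
  gcs:
    bucket-name: analytics-prod-archive
    object-name:
      field: archive_path        # hypothetical event field carrying the destination path
    guid-prefix: "run-"          # illustrative prefix for the auto-generated GUID
    input-field: payload
    preprocessors:
      - gzip
    credentials:
      service-account:
        key: ${secrets.analytics_gcs_writer}   # placeholder secret reference
```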
Recommendations for files and folders
- keep individual files below 100–150 MB (gzip- or Parquet-compressed) for predictable processing latency
- organise exported data with directory-style prefixes such as `Y/M/`, `Y/M/D/`, or `Y/M/D/H/`; a combined sketch follows below
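Putting both recommendations together, a fixed `Y/M/D/` prefix can be written into the output's `object-name` while the default GUID stays enabled so concurrent uploads do not collide. A minimal sketch; the bucket and path are placeholders:

```yaml
output:
  gcs:
    bucket-name: analytics-prod-archive       # placeholder bucket
    object-name:
      name: exports/2025/09/17/summary.json   # Y/M/D/ directory-style prefix
    # GUID not disabled: the automatic prefix keeps concurrent uploads unique
    preprocessors:
      - gzip                                   # compress toward the 100–150 MB guideline
```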