S3 Object Storage (AWS, Minio, Wasabi, Linode, Etc.)

S3

Use this guide to configure Lyft Data jobs that read from or write to S3-compatible object storage (AWS S3, MinIO, Wasabi, DigitalOcean Spaces, etc.). Bring credentials with permission to list, read, and write the target buckets before you start.

Configure Lyft Data to read from S3

Add the s3 input to a job. Key fields (names shown exactly as they appear in the job spec):

bucket-name – target bucket (required).
object-names – object keys or prefixes. Leave empty to scan the entire bucket when the chosen mode allows listing.
region / endpoint – optional region or custom endpoint (for MinIO and other providers).
mode – switch between list-objects, download-objects, and list-and-download-objects depending on whether you only need metadata, specific keys, or listing + download.
ignore-linebreaks – treat the entire object as a single event instead of splitting on newlines. Combine with preprocessors if you need to parse structured data afterwards.
timestamp-mode – derive timestamps from last-modified, object-name patterns, or disable timestamp filtering.
include-regex / exclude-regex / maximum-age – reduce the candidate list by pattern or age (accepts durations like 15m, 6h30m).
fingerprinting / maximum-fingerprint-age – avoid reprocessing objects and control dedupe retention.
access-key, secret-key, security-token, session-token, role-arn – authentication. Provide the combination required by your platform. Values can be literal strings or injected via context variables.
preprocessors – gzip/parquet/base64/extension handlers applied after download.

Example: list and download Parquet objects

input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download-objects
    include-regex:
      - '\\.parquet(\\.gz)?$'
    maximum-age: 12h
    fingerprinting: true
    timestamp-mode: last-modified
    access-key: "{{secrets.s3_access_key}}"
    secret-key: "{{secrets.s3_secret_key}}"
    preprocessors:
      - parquet

This configuration lists objects beneath exports/daily/, filters to recent Parquet files, downloads each object once, and converts rows into JSON events.

Example: download a specific object as a single event

input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - raw/snapshot.json
    mode: download-objects
    ignore-linebreaks: true
    access-key: "{{secrets.s3_access_key}}"
    secret-key: "{{secrets.s3_secret_key}}"
    preprocessors:
      - extension

Configure Lyft Data to write to S3

Add the s3 output to a job. Key fields:

bucket-name – destination bucket (required).
object-name – literal or field-derived destination. Use name: ... for a fixed key or field: ... to reuse an event field.
mode – put (default) uploads objects; delete removes objects by key.
disable-object-name-guid, guid-prefix, guid-suffix – control GUID-based dedupe for writes; disabling requires that generated names are already unique.
input-field – source field to upload; omit to serialize the entire event after preprocessors.
batch & retry – optimize throughput and resiliency.
track-schema – maintain __SCHEMA_NUMBER when writing JSON payloads.
access-key, secret-key, security-token, session-token, role-arn – same authentication options as the input.
preprocessors – gzip/base64/extension handlers executed before upload.

Example: upload processed events with dynamic keys

output:
  s3:
    bucket-name: analytics-prod-archive
    object-name:
      name: processed/${partition}/summary.json
    disable-object-name-guid: true
    input-field: payload
    preprocessors:
      - gzip
    track-schema: true
    access-key: "{{secrets.s3_writer_access_key}}"
    secret-key: "{{secrets.s3_writer_secret_key}}"

Example: delete the original object after processing

output:
  s3:
    bucket-name: analytics-prod
    object-name:
      field: object_name
    mode: delete
    access-key: "{{secrets.s3_writer_access_key}}"
    secret-key: "{{secrets.s3_writer_secret_key}}"

The delete output expects each incoming event to include object_name (for example from the S3 input). Delete operations do not generate GUID prefixes.

Recommendations for buckets and folders

Keep individual objects below 100–150 MB (compressed) for predictable processing latency.
Organise exports using partition-style prefixes such as Y/M/, Y/M/D/, or Y/M/D/H/.

Creating an AWS policy for programmatic access

Grant the job permissions to list and read/write objects. Replace BUCKETNAME with your bucket name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "List",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::BUCKETNAME"
    },
    {
      "Sid": "RW",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::BUCKETNAME/*"
    }
  ]
}