S3 Object Storage (AWS, MinIO, Wasabi, Linode, etc.)
Use this guide to configure Lyft Data jobs that read from or write to S3-compatible object storage (AWS S3, MinIO, Wasabi, DigitalOcean Spaces, etc.). Bring credentials with permission to list, read, and write the target buckets before you start.
Configure Lyft Data to read from S3
Add the s3 input to a job. Key fields (names shown exactly as they appear in the job spec):
- bucket-name – target bucket (required).
- object-names – object keys or prefixes. Leave empty to scan the entire bucket when the chosen mode allows listing.
- region / endpoint – optional region or custom endpoint (for MinIO and other providers); see the sketch after this list.
- mode – switch between list, download, and list-and-download depending on whether you only need metadata, specific keys, or listing plus download.
- ignore-linebreaks – treat the entire object as a single event instead of splitting on newlines. Combine with preprocessors if you need to parse structured data afterwards.
- timestamp-mode – derive timestamps from last-modified, from object-name patterns, or disable timestamp filtering.
- include-regex / exclude-regex / maximum-age – reduce the candidate list by pattern or age (accepts durations like 15m, 6h30m).
- fingerprinting / maximum-fingerprint-age – avoid reprocessing objects and control dedupe retention.
- access-key, secret-key, security-token, session-token, role-arn – authentication. Provide the combination required by your platform. Values can be literal strings or injected via context variables.
- preprocessors – gzip/parquet/base64/extension handlers applied after download.
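For example, here is a minimal sketch of an input pointed at a self-hosted MinIO endpoint that filters out temporary files. The endpoint URL, bucket name, prefix, and secret names are illustrative placeholders, not values from this guide:

```yaml
input:
  s3:
    bucket-name: sensor-archive                # illustrative bucket name
    endpoint: https://minio.internal:9000      # custom endpoint for MinIO or another S3-compatible provider
    region: us-east-1
    object-names:
      - landing/                               # prefix to scan
    mode: list-and-download
    exclude-regex:
      - "\\.tmp$"                              # skip in-flight temporary files
    maximum-age: 6h30m
    fingerprinting: true
    maximum-fingerprint-age: 24h
    access-key: ${secrets.minio_access_key}
    secret-key: ${secrets.minio_secret_key}
```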
Example: list and download Parquet objects
```yaml
input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download
    include-regex:
      - "\\.parquet(\\.gz)?$"
    maximum-age: 12h
    fingerprinting: true
    timestamp-mode: last-modified
    access-key: ${secrets.s3_access_key}
    secret-key: ${secrets.s3_secret_key}
    preprocessors:
      - parquet
```

This configuration lists objects beneath exports/daily/, filters to recent Parquet files, downloads each object once, and converts rows into JSON events.
Example: download a specific object as a single event
```yaml
input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - raw/snapshot.json
    mode: download
    ignore-linebreaks: true
    access-key: ${secrets.s3_access_key}
    secret-key: ${secrets.s3_secret_key}
    preprocessors:
      - extension
```

Configure Lyft Data to write to S3
Add the s3 output to a job. Key fields:
- bucket-name – destination bucket (required).
- object-name – literal or field-derived destination. Use name: ... for a fixed key or field: ... to reuse an event field.
- mode – put (default) uploads objects; delete removes objects by key.
- disable-object-name-guid, guid-prefix, guid-suffix – control GUID-based dedupe for writes; disabling it requires that generated names are already unique (see the sketch after this list).
- input-field – source field to upload; omit to serialize the entire event after preprocessors.
- batch & retry – optimize throughput and resiliency.
- track-schema – maintain __SCHEMA_NUMBER when writing JSON payloads.
- access-key, secret-key, security-token, session-token, role-arn – same authentication options as the input.
- preprocessors – gzip/base64/extension handlers executed before upload.
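As a rough sketch of the default put mode with GUID-based naming left enabled (the bucket, key, and the exact effect of guid-prefix/guid-suffix on the generated name are illustrative assumptions, not confirmed behaviour):

```yaml
output:
  s3:
    bucket-name: events-archive                  # illustrative destination bucket
    object-name:
      name: exports/hourly/events.json           # fixed base key; GUID-based naming stays enabled
    guid-prefix: "batch-"                        # assumed: string placed before the generated GUID
    guid-suffix: ".json"                         # assumed: string placed after the generated GUID
    track-schema: true
    access-key: ${secrets.s3_writer_access_key}
    secret-key: ${secrets.s3_writer_secret_key}
```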
Example: upload processed events with dynamic keys
```yaml
output:
  s3:
    bucket-name: analytics-prod-archive
    object-name:
      name: processed/${{ event.partition }}/summary.json
    disable-object-name-guid: true
    input-field: payload
    preprocessors:
      - gzip
    track-schema: true
    access-key: ${secrets.s3_writer_access_key}
    secret-key: ${secrets.s3_writer_secret_key}
```

Example: delete the original object after processing
```yaml
output:
  s3:
    bucket-name: analytics-prod
    object-name:
      field: object_name
    mode: delete
    access-key: ${secrets.s3_writer_access_key}
    secret-key: ${secrets.s3_writer_secret_key}
```

The delete output expects each incoming event to include object_name (for example, populated by the S3 input). Delete operations do not generate GUID prefixes.
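To tie the two halves together, a sketch of a read-then-delete job is shown below. It assumes the job spec accepts an input and an output side by side, and that the S3 input records each source key in an object_name event field; check both assumptions against your job spec reference:

```yaml
input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - exports/daily/
    mode: list-and-download
    access-key: ${secrets.s3_access_key}
    secret-key: ${secrets.s3_secret_key}
output:
  s3:
    bucket-name: analytics-prod
    object-name:
      field: object_name                         # assumes the input stores the source key in this field
    mode: delete
    access-key: ${secrets.s3_writer_access_key}
    secret-key: ${secrets.s3_writer_secret_key}
```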
Recommendations for buckets and folders
- Keep individual objects below 100–150 MB (compressed) for predictable processing latency.
- Organise exports using partition-style prefixes such as Y/M/, Y/M/D/, or Y/M/D/H/ (see the sketch below).
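A partitioned layout keeps listings cheap because the input can target a single partition prefix. A sketch, assuming a Y/M/D/ layout and an illustrative bucket name:

```yaml
input:
  s3:
    bucket-name: analytics-prod
    object-names:
      - 2024/05/14/                              # Y/M/D/ partition prefix narrows the listing
    mode: list
    include-regex:
      - "\\.json\\.gz$"                          # only compressed JSON exports within the partition
    access-key: ${secrets.s3_access_key}
    secret-key: ${secrets.s3_secret_key}
```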
Creating an AWS policy for programmatic access
Grant the job permissions to list and read/write objects. Replace BUCKETNAME with your bucket name:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "List", "Effect": "Allow", "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::BUCKETNAME" }, { "Sid": "RW", "Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"], "Resource": "arn:aws:s3:::BUCKETNAME/*" } ]}