Reading From Amazon S3

Object-store inputs (S3, GCS, Azure Blob, and FileStore) all share the same capabilities. This tutorial shows how to configure the S3 variant to list objects, download payloads, and avoid reprocessing files.

Prerequisites

  • S3 credentials with read access to the bucket you want to ingest: an access key, a secret key, and (if required) a session token or role ARN.
  • A worker that can reach the S3 endpoint. When using MinIO or another compatible service, provide its custom endpoint URL.
  • Familiarity with the Jobs visual editor.

1. Create the job and connect to S3

  1. Create a new job named s3-import and choose S3 as the input.
  2. Fill in Endpoint (leave blank for AWS), Bucket, Access key, and Secret key. Add a Session token or Role ARN if your environment requires them.
  3. Set Object names to the prefixes you want to read, for example logs/2025/ or exports/. S3 inputs treat these as prefix matches rather than glob patterns.
  4. Choose a Mode:
    • List objects emits one event per object with metadata only.
    • Download objects fetches the payload of each exact object name you specify.
    • List and download combines the two: it lists matching objects, filters them, then downloads each remaining object (the sketch after this list shows the equivalent API calls).
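
The visual editor hides the underlying API calls, but as a rough illustration of what the modes do, a prefix listing and a single-object download map onto boto3 calls like the ones below (the bucket, prefix, endpoint, and key are placeholders, not values from this tutorial):

    import boto3

    # Placeholder connection details; set endpoint_url for MinIO or another
    # S3-compatible service, or omit it entirely for AWS.
    s3 = boto3.client("s3", endpoint_url="https://s3.example.com")
    bucket, prefix = "my-bucket", "logs/2025/"

    # "List objects": emit metadata for every key under the prefix.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"], obj["LastModified"])

    # "Download objects": fetch the payload of one exact key.
    payload = s3.get_object(Bucket=bucket, Key="logs/2025/app-01.jsonl")["Body"].read()

    # "List and download" chains the two: list, filter, then one get_object per key.

Credentials here are resolved the standard boto3 way (environment variables, shared config, or an IAM role); in the job they come from the form fields instead.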

The runtime fingerprints processed objects, so repeat runs skip files that were already seen unless you clear the fingerprint cache.
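
The tutorial does not describe how fingerprints are computed, so treat the following as a mental model only (an assumption, not the runtime's implementation): a persistent set keyed on something like bucket, key, and ETag, consulted before each download.

    import json
    import pathlib

    # Hypothetical local store for illustration; the real runtime manages its
    # own fingerprint cache.
    FINGERPRINT_DB = pathlib.Path("fingerprints.json")

    def load_seen() -> set[str]:
        if FINGERPRINT_DB.exists():
            return set(json.loads(FINGERPRINT_DB.read_text()))
        return set()

    def fingerprint(bucket: str, key: str, etag: str) -> str:
        # The ETag changes when the object's content changes, so a rewritten
        # object would be picked up again.
        return f"{bucket}/{key}@{etag}"

    def mark_seen(seen: set[str], fp: str) -> None:
        seen.add(fp)
        FINGERPRINT_DB.write_text(json.dumps(sorted(seen)))

Deleting such a store is the equivalent of clearing the fingerprint cache: every object under the prefix is treated as new on the next run.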

2. Refine the candidate set

When you list or list-and-download, you can narrow the matches without changing the prefix:

  • Include regex: only keep object names that match one of the regular expressions.
  • Exclude regex: drop any object whose name matches the pattern.
  • Maximum age: ignore objects older than the configured number of seconds.

These filters run after the initial prefix scan, making it easy to pull subsets (for example *.jsonl files inside a daily partition).
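
Put together, the three filters behave roughly like the predicate below (the variable names and the order in which the runtime applies them are assumptions; only the behaviour of each filter comes from the list above):

    import re
    import time
    from datetime import datetime, timezone

    # Example configuration; take these values from your own job.
    include_patterns = [re.compile(r"\.jsonl$")]   # keep *.jsonl objects
    exclude_patterns = [re.compile(r"/tmp/")]      # drop anything under tmp/
    max_age_seconds = 24 * 3600                    # ignore objects older than a day

    def keep(key: str, last_modified: datetime) -> bool:
        if include_patterns and not any(p.search(key) for p in include_patterns):
            return False                 # Include regex: must match at least one
        if any(p.search(key) for p in exclude_patterns):
            return False                 # Exclude regex: any match drops the object
        age = time.time() - last_modified.timestamp()
        return age <= max_age_seconds    # Maximum age: skip objects that are too old

    # keep("logs/2025/10/03/app.jsonl", datetime.now(timezone.utc)) -> True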

3. Handle payload formats

Downloads can be preprocessed automatically:

  • Enable Ignore line breaks to treat the entire object as a single event instead of splitting by newline.
  • Set Preprocessors to gzip, parquet, base64, or extension depending on the file type. The runtime streams gzip transparently and loads Parquet files into JSON events once the download completes.
  • Use Events field when the payload is a JSON array and you want each array entry emitted as a separate event.

Combine these settings with downstream actions (for example json, csv, or enrich) to normalise the data before delivery.
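
As a rough Python equivalent of how the settings above interact (a sketch only: the parameter names are assumptions, and the runtime's real behaviour, such as streaming gzip during the download, is more sophisticated):

    import base64
    import gzip
    import json

    def to_events(raw: bytes, preprocessor: str | None,
                  ignore_line_breaks: bool, split_json_array: bool) -> list:
        # Preprocessors decode the payload before any splitting happens.
        if preprocessor == "gzip":
            raw = gzip.decompress(raw)
        elif preprocessor == "base64":
            raw = base64.b64decode(raw)
        # A "parquet" payload would instead be read with a library such as
        # pyarrow.parquet, emitting one JSON event per row.

        text = raw.decode("utf-8")

        if split_json_array:
            # Events field: the payload is a JSON array; emit one event per entry.
            return list(json.loads(text))
        if ignore_line_breaks:
            # Ignore line breaks: the whole object becomes a single event.
            return [text]
        return text.splitlines()         # default: one event per line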

4. Add actions and an output

Attach any transformations you need, then send the events to a suitable destination. During validation, the Print output makes it easy to inspect the payload; in production you might forward to S3, Splunk HEC, or a worker channel.

5. Test with Run & Trace

Use Run & Trace to execute the job once and confirm the expected objects appear. The UI streams each event and its trace so you can verify metadata fields such as object_name, object_size, or custom headers. Adjust filters and preprocessors until the sample run matches your expectations.
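
The exact trace format depends on the product version, but to give a sense of what to look for, a listed object might surface with metadata along these lines (field names other than object_name and object_size are assumptions):

    # Illustrative shape only; confirm the real field names in Run & Trace.
    event = {
        "object_name": "logs/2025/10/03/app-01.jsonl.gz",
        "object_size": 48213,
        "last_modified": "2025-10-03T06:15:04Z",
    }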

6. Schedule and deploy

Most S3 jobs rely on the Trigger block to poll on a cadence:

  • Choose Interval (for example 15m) to run at a fixed frequency.
  • Choose Cron when you need runs at the top of the hour or another precise schedule (see the examples after this list).
  • Use message triggers to kick off ad-hoc replays from another pipeline.
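
For the Cron option, standard five-field expressions cover the common cases; the examples below assume classic cron notation, so confirm the exact syntax the Trigger block accepts:

    0 * * * *        # top of every hour
    */15 * * * *     # every 15 minutes (close to a 15m interval)
    30 2 * * 1       # 02:30 every Monday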

After testing, save, stage, and deploy the job. Watch Operate > Job status to confirm new objects are processed and previously fingerprinted keys are skipped.

Operational tips

  • Keep the fingerprint database persistent between worker restarts so historic objects are not re-downloaded by accident.
  • Pair the job with Advanced scheduling when coordinating across multiple buckets or regions.
  • Use Dealing with time to stamp ingestion timestamps or convert vendor time formats during processing.
  • Follow the alerting guidance in Operate monitoring to surface repeated download failures or credential issues.