From Sample Job to Production Connector
This guide is for data engineers who have completed the Day 0 quick start and want to wire in real data. It focuses on choosing the right connector, validating transformations, and promoting a job safely.
1. Capture requirements (10 minutes)
- Business goal: what question or downstream system are you serving?
- Data source: protocol (files, object store, HTTP API, database dump), expected frequency, size, and authentication.
- Destination: target format and ingestion expectations (batch vs streaming, retention requirements).
Document these answers; they drive connector selection and batching decisions. One lightweight way to record them is sketched below.
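This can be as simple as a structured record in your team docs. A minimal sketch in Python (every field name here is illustrative, not a tool schema):

```python
# Hypothetical requirements record; all field names are illustrative.
from dataclasses import dataclass

@dataclass
class PipelineRequirements:
    business_goal: str       # question or downstream system served
    source_protocol: str     # e.g. "s3", "http-poll", "database-dump"
    frequency: str           # e.g. "hourly", "continuous"
    volume: str              # e.g. "~500 MB/day"
    auth_method: str         # e.g. "IAM role", "API key"
    destination_format: str  # target format the consumer expects
    delivery_mode: str       # "batch" or "streaming"
    retention: str           # e.g. "90 days"

reqs = PipelineRequirements(
    business_goal="feed clickstream events to the analytics warehouse",
    source_protocol="http-poll",
    frequency="every 5 minutes",
    volume="~500 MB/day",
    auth_method="API key",
    destination_format="gzipped NDJSON",
    delivery_mode="batch",
    retention="90 days",
)
```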
2. Pick the right input (10 minutes)
Use the Build catalog to choose an input:
- Object stores: S3, GCS, Azure Blob, or FileStore for on-prem.
- APIs: HTTP Poll or HTTP Server, depending on whether you pull data from the source or the source pushes it to you (the contrast is sketched after this list).
- Files: the file input for tailing logs or processing directories.
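If the push-versus-pull distinction is new, the contrast in plain Python looks roughly like this (the endpoint, port, and payload shape are made up for illustration):

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Pull model (HTTP Poll input): the pipeline controls the schedule and
# periodically fetches whatever the source has ready.
def pull_once(url: str) -> list[dict]:
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# Push model (HTTP Server input): the source controls the schedule; the
# pipeline exposes an endpoint and waits for deliveries.
class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        records = json.loads(body)  # hand records to the pipeline here
        self.send_response(204)
        self.end_headers()

# HTTPServer(("", 8080), PushHandler).serve_forever() would start listening.
```

HTTP Poll suits sources that only answer requests; HTTP Server suits sources (webhooks, agents) that emit events on their own schedule.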
3. Design transformations (15 minutes)
- Map input fields to the schema you need downstream.
- Identify enrichment needs (lookups, timestamps, context values).
- Select actions: the Actions overview describes field edits, filters, scripts, and enrichers.
- Plan for error handling (discard vs reroute) and decide how outlier records are handled; a generic transformation sketch follows this list.
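As a language-neutral illustration of the mapping, enrichment, and discard-versus-reroute decisions above (this is not the product's action syntax; all names are invented):

```python
# Illustrative transform: field mapping, timestamp enrichment, reroute-on-error.
from datetime import datetime, timezone

FIELD_MAP = {"usr": "user_id", "evt": "event_type", "ts": "source_time"}

def transform(record: dict, dead_letter: list[dict]) -> dict | None:
    try:
        out = {new: record[old] for old, new in FIELD_MAP.items()}
        out["processed_at"] = datetime.now(timezone.utc).isoformat()  # enrichment
        out["pipeline"] = "clickstream-v1"                            # context value
        return out
    except KeyError:
        # Reroute malformed records rather than discarding them silently,
        # so outliers can be inspected later.
        dead_letter.append(record)
        return None

dlq: list[dict] = []
ok = transform({"usr": "42", "evt": "click", "ts": "2024-01-01T00:00:00Z"}, dlq)
bad = transform({"usr": "42"}, dlq)  # missing fields -> routed to dlq
```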
4. Configure the job in the visual editor (30 minutes)
- Start with a copy of the default job or create a new job in the editor.
- Swap the input for the real connector and fill required fields (bucket, keys, URL, credentials).
- Add actions for transformations and enrichment. Use add, convert, filter, or enrich as needed.
- Configure the output (S3, HTTP, FileStore, etc.) with batching if required; see the batching sketch after this list.
- Use Run & Trace with sample data to validate the end-to-end flow. Adjust until the output matches expectations.
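Batching semantics are worth pinning down before the first real run. A common pattern, sketched generically below (the thresholds are examples, not product defaults), is to flush on whichever of batch size or batch age is reached first:

```python
import time

class Batcher:
    """Generic size-or-age batching sketch; thresholds are illustrative."""

    def __init__(self, max_records: int = 500, max_age_s: float = 30.0):
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.buffer: list[dict] = []
        self.opened_at = time.monotonic()

    def add(self, record: dict) -> list[dict] | None:
        """Buffer a record; return a full batch once a flush threshold is hit."""
        self.buffer.append(record)
        too_big = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.opened_at >= self.max_age_s
        if too_big or too_old:
            batch, self.buffer = self.buffer, []
            self.opened_at = time.monotonic()
            return batch  # caller ships this to S3/HTTP/FileStore
        return None
```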
5. Handle secrets and context (10 minutes)
- Use job context values for API keys or environment-dependent settings.
- Document required environment variables and verify they are covered in staging/production.
- Reference the context management guide for merge rules and overrides; a generic illustration of layered overrides follows this list.
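The guide is authoritative on merge order; as a sketch of the general idea (base values, per-environment overrides, then environment variables winning; this precedence is an assumption, not the tool's documented rule):

```python
# Illustrative context merge; key names and precedence are assumptions.
import os

base_context = {"api_url": "https://api.example.com", "batch_size": "500"}
env_overrides = {
    "staging": {"api_url": "https://staging.api.example.com"},
    "production": {},
}

def resolve_context(env: str) -> dict:
    ctx = {**base_context, **env_overrides.get(env, {})}
    # Secrets such as API keys come only from environment variables,
    # never from job files checked into version control.
    if "PIPELINE_API_KEY" in os.environ:
        ctx["api_key"] = os.environ["PIPELINE_API_KEY"]
    return ctx

print(resolve_context("staging"))
```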
6. Stage, test, and promote (20 minutes)
- Stage the job and deploy it to a non-production worker first.
- Validate metrics and logs after a sample run. Look for retry spikes, error counts, or record-shape mismatches; a minimal smoke check is sketched after this list.
- Update runbooks with monitoring requirements (dashboards, alerts). See the Monitoring guide for key metrics.
- When satisfied, deploy to production workers and monitor closely during the first full run.
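What "validate metrics and logs" means concretely depends on your monitoring stack. A minimal smoke check, assuming run metrics can be exported as a dict (all metric names and thresholds here are placeholders):

```python
# Hypothetical post-run smoke check; metric names are placeholders for
# whatever your monitoring stack actually exports.
def check_sample_run(metrics: dict) -> list[str]:
    problems = []
    if metrics.get("error_count", 0) > 0:
        problems.append(f"errors: {metrics['error_count']}")
    if metrics.get("retry_count", 0) > 0.01 * metrics.get("records_out", 1):
        problems.append("retry rate above 1%")
    in_, out = metrics.get("records_in", 0), metrics.get("records_out", 0)
    if in_ != out + metrics.get("records_rerouted", 0):
        problems.append("record counts do not balance (possible shape mismatch)")
    return problems

assert not check_sample_run(
    {"records_in": 1000, "records_out": 990, "records_rerouted": 10,
     "error_count": 0, "retry_count": 3}
)
```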
7. Share status and iterate
- Record the job’s purpose, owners, and SLA in your team docs.
- Schedule periodic reviews with downstream consumers to confirm the pipeline meets their needs.
- Add lessons learned back into the build tutorials so future jobs benefit.
Quick checklist
- Requirements documented (source, destination, schedule, success criteria)
- Input connector selected and tested with sample data
- Actions configured and validated using Run & Trace
- Output batching and delivery confirmed
- Context/environment variables defined
- Job staged, promoted, and monitored in production
Once you’re comfortable building manually, graduate to automation with the CI/CD guide. The actions, batching, and context patterns above form the foundation of every production job.