
Backup and Recovery

A resilient Lyft Data deployment needs regular backups and rehearsed recovery procedures. This runbook captures what to back up, how often to do it, and how to validate restores.

What to back up

Component                                      Why it matters                                  Suggested cadence
---------------------------------------------  ----------------------------------------------  -------------------------------------
Job definitions                                Source of truth for every pipeline              Nightly
Server configuration (server.yaml, env vars)   Controls ports, auth, storage paths             Weekly or whenever changed
Worker configuration                           Required to restore external workers quickly    Weekly
Staging database / metadata                    Tracks deployments, state, and history          Daily snapshots
SSL certificates & API keys                    Needed for secure communication                 Aligned to rotation schedule
Logs / audit trails                            Useful for forensics and compliance             Daily export with 30–90 day retention

Quick export commands

# Export all jobs to a dated directory and compress it
mkdir -p backups
EXPORT_ROOT=backups/jobs-$(date +%Y%m%d)
lyftdata jobs export --dir "$EXPORT_ROOT"
tar -czf "${EXPORT_ROOT}.tar.gz" -C "$(dirname "$EXPORT_ROOT")" "$(basename "$EXPORT_ROOT")"
# Snapshot server configuration
tar -czf backups/server-config-$(date +%Y%m%d).tar.gz \
  /etc/lyftdata/server.yaml \
  /etc/lyftdata/env \
  /etc/lyftdata/certs
# Snapshot built-in SQLite metadata (sqlite3 .backup produces a consistent
# copy even while the server is running; tarring a live database file can
# capture a mid-write, corrupt state)
sqlite3 /var/lib/lyftdata/staging.db ".backup 'backups/staging-db-$(date +%Y%m%d).db'"
gzip -f "backups/staging-db-$(date +%Y%m%d).db"

Store backups in two places: fast local storage for quick restores and offsite object storage for disasters. Encrypt sensitive archives before upload.
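Encryption before upload can be a simple symmetric pass. A minimal sketch using OpenSSL — the key file path and the bucket in the comment are assumptions, not part of Lyft Data:

```shell
#!/usr/bin/env bash
# Sketch: encrypt an archive before offsite upload. The key file location
# (and any bucket name) is an assumption for illustration.
encrypt_archive() {
  local archive="$1" keyfile="$2"
  # AES-256-CBC with PBKDF2 key derivation; writes <archive>.enc alongside
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$archive" -out "$archive.enc" -pass "file:$keyfile"
}

# Upload the encrypted file afterwards with your object-store CLI, e.g.:
#   aws s3 cp "$archive.enc" s3://<your-backup-bucket>/
```

To restore, decrypt with `openssl enc -d` plus the same cipher options and key file before extracting.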

Automate daily configuration backup

#!/usr/bin/env bash
set -euo pipefail
BACKUP_DIR="/var/backups/lyftdata"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
EXPORT_DIR="$BACKUP_DIR/jobs-$DATE"
lyftdata jobs export --dir "$EXPORT_DIR"
tar -czf "$BACKUP_DIR/jobs-$DATE.tar.gz" -C "$BACKUP_DIR" "jobs-$DATE"
rm -rf "$EXPORT_DIR"
cp /etc/lyftdata/server.yaml "$BACKUP_DIR/server-$DATE.yaml"
cp /etc/lyftdata/env "$BACKUP_DIR/env-$DATE"
# sqlite3 .backup gives a consistent snapshot even while the server is running
sqlite3 /var/lib/lyftdata/staging.db ".backup '$BACKUP_DIR/staging-db-$DATE.db'"
gzip -f "$BACKUP_DIR/staging-db-$DATE.db"
find "$BACKUP_DIR" -type f -mtime +30 -delete
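One way to schedule the script above is a crontab entry; the install path and log location here are assumptions for illustration:

```shell
# Daily at 02:30 -- run the backup script and append output to a log
30 2 * * * /usr/local/bin/lyftdata-backup.sh >> /var/log/lyftdata-backup.log 2>&1
```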

Validate backups

Run automated checks to ensure archives are usable:

#!/usr/bin/env bash
set -euo pipefail
BACKUP="${1:?usage: validate-backup.sh <jobs-archive.tar.gz>}"
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT
tar -xzf "$BACKUP" -C "$TEMP_DIR"
EXPORT_DIR=$(find "$TEMP_DIR" -maxdepth 1 -type d -name 'jobs-*' -print -quit)
if [ -z "$EXPORT_DIR" ]; then
  echo "could not locate exported jobs directory" >&2
  exit 1
fi
# Dry-run import confirms the job definitions still load
lyftdata jobs import --dry-run --dir "$EXPORT_DIR"
# Optional: lint server.yaml with your preferred YAML tool before redeploying

Schedule validation weekly; surface failures in monitoring.
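One way to surface failures, sketched as a hypothetical wrapper: run the validation through a function that posts to a monitoring webhook (ALERT_URL is an assumption, not a Lyft Data setting) and logs locally whenever the command exits non-zero.

```shell
#!/usr/bin/env bash
# Hypothetical failure-surfacing wrapper; ALERT_URL is your monitoring webhook.
notify() {
  # Post to the webhook if one is configured; always try local syslog too.
  if [ -n "${ALERT_URL:-}" ]; then
    curl -fsS -X POST -d "$1" "$ALERT_URL" >/dev/null 2>&1 || true
  fi
  logger -t lyftdata-backup "$1" 2>/dev/null || true
}

run_checked() {
  # Run a command; on failure, raise an alert and propagate a non-zero exit.
  if ! "$@"; then
    notify "lyftdata backup check failed: $*"
    return 1
  fi
}

# Example (invoked weekly from cron):
#   run_checked ./validate-backup.sh /var/backups/lyftdata/jobs-<date>.tar.gz
```

The wrapper keeps the validation script itself free of alerting concerns, so the same script works interactively and under cron.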

Recovery steps

  1. Restore the server:
    • Rebuild host or container
    • Copy back server.yaml, env files, certificates
    • Restore the staging database and restart the service
  2. Re-register workers:
    • Install worker binaries
    • Restore worker configuration/API keys
    • Confirm they appear via the Workers page or curl -s http://<server>:3000/api/workers | jq '.[].id'
  3. Redeploy jobs:
    • Extract the latest job archive and run lyftdata jobs import --dir <path> --update
    • Confirm job state matches expectations in the UI
  4. Validate:
    • Run canary jobs or sample pipelines
    • Watch metrics/logs for the first hour
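Steps 1–3 can be stitched into a small restore helper. A sketch, assuming the jobs-archive layout produced by the backup script in this runbook:

```shell
#!/usr/bin/env bash
# Sketch: extract a jobs archive and re-import its definitions.
restore_jobs() {
  local archive="$1" restore_dir export_dir
  restore_dir=$(mktemp -d)
  tar -xzf "$archive" -C "$restore_dir"
  # Archives from this runbook contain a single jobs-<date> directory
  export_dir=$(find "$restore_dir" -maxdepth 1 -type d -name 'jobs-*' -print -quit)
  if [ -z "$export_dir" ]; then
    echo "no jobs-* directory in $archive" >&2
    return 1
  fi
  # --update re-applies definitions over any existing jobs (step 3 above)
  lyftdata jobs import --dir "$export_dir" --update
}
```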

Disaster recovery tips

  • Keep infrastructure-as-code scripts handy to recreate servers and workers in new regions.
  • Document RPO/RTO expectations (for example, at most one hour of data loss and a four-hour recovery window).
  • Test restore procedures quarterly to ensure runbooks stay current.

See also: Monitoring Lyft Data for detection signals and the troubleshooting guide for incident triage.