
Overview

ScrapAI can automatically upload crawl results to S3-compatible object storage for backup, archiving, and sharing across systems. This feature currently works with Airflow workflows.
Current Limitation: S3 upload is only supported in Airflow workflows. Regular ./scrapai crawl commands do not upload to S3 yet.

Supported Providers

Works with any S3-compatible object storage:
  • Hetzner Object Storage (Recommended) - European provider, excellent pricing, free egress
  • AWS S3 - Amazon’s original object storage
  • DigitalOcean Spaces - Simple, developer-friendly
  • Wasabi - Hot cloud storage, cheaper than S3
  • Backblaze B2 - Cost-effective alternative
  • Cloudflare R2 - Zero egress fees
  • MinIO - Self-hosted S3-compatible storage
  • Any other S3-compatible provider

Configuration

Environment Variables

Add S3 credentials to .env:
# S3-Compatible Object Storage
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://fsn1.your-objectstorage.com
S3_BUCKET=scrapai-crawls
S3_ACCESS_KEY (string, required)
S3-compatible storage access key. Get this from your storage provider’s dashboard or API settings.

S3_SECRET_KEY (string, required)
S3-compatible storage secret key. Keep this secure - never commit it to version control.

S3_ENDPOINT (string, required)
S3-compatible storage endpoint URL. Examples:
  • Hetzner: https://fsn1.your-objectstorage.com
  • DigitalOcean: https://nyc3.digitaloceanspaces.com
  • Wasabi: https://s3.us-east-1.wasabisys.com
  • AWS S3: https://s3.us-east-1.amazonaws.com

S3_BUCKET (string, required)
S3 bucket name for storing crawl results. Example: scrapai-crawls. Create the bucket in your storage provider before enabling uploads.
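A quick way to confirm the configuration is complete is to check the four variables programmatically. This is a minimal sketch - the helper name missing_s3_vars is illustrative, not part of ScrapAI:

```python
import os

REQUIRED_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def missing_s3_vars(env=None):
    """Return the required S3 settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

# With only two of the four set, the other two are reported as missing.
print(missing_s3_vars({"S3_ACCESS_KEY": "AK...", "S3_BUCKET": "scrapai-crawls"}))
# ['S3_SECRET_KEY', 'S3_ENDPOINT']
```

An empty list means all four variables are present and uploads can be enabled.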

Provider-Specific Setup

Hetzner Object Storage

Endpoint format: https://fsn1.your-objectstorage.com (regions: fsn1, hel1, nbg1)
Hetzner offers free egress - no charges for downloading your data.

DigitalOcean Spaces

Endpoint format: https://nyc3.digitaloceanspaces.com (regions: nyc3, sfo3, sgp1, ams3)

Wasabi

Endpoint format: https://s3.us-east-1.wasabisys.com (check Wasabi docs for region endpoints)

AWS S3

Endpoint format: https://s3.us-east-1.amazonaws.com
AWS S3 has high egress costs ($0.09/GB). Consider providers with free egress for frequent downloads.

Cloudflare R2

Endpoint format: https://<account_id>.r2.cloudflarestorage.com
Cloudflare R2 has zero egress fees - best for frequent data downloads.

Other Providers

Works with Backblaze B2, MinIO (self-hosted), and any S3-compatible service. Check your provider’s documentation for endpoint URLs.

How It Works

When a spider completes in Airflow, crawl results are saved locally to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl. If S3 is configured, files are automatically uploaded to s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl in the background. Local files are preserved - S3 is for backup/archiving only.
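The layout above can be sketched as a small helper. Note that crawl_paths is illustrative (not ScrapAI's actual code), and the timestamp format is an assumption - the real crawl_TIMESTAMP format may differ:

```python
from datetime import datetime, timezone
from pathlib import Path

def crawl_paths(data_dir: str, project: str, spider: str, bucket: str):
    """Build the documented local path and S3 destination for one crawl run."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")  # assumed format
    local = Path(data_dir) / project / spider / "crawls" / f"crawl_{stamp}.jsonl"
    remote = f"s3://{bucket}/{spider}/{local.name}"
    return local, remote

local, remote = crawl_paths("data", "myproject", "news", "scrapai-crawls")
print(local)   # e.g. data/myproject/news/crawls/crawl_20240101_120000.jsonl
print(remote)  # e.g. s3://scrapai-crawls/news/crawl_20240101_120000.jsonl
```

The S3 key mirrors only the spider name and filename, so files from many projects land under per-spider prefixes in one bucket.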

Verification

Airflow DAGs log S3 status on startup: S3 Upload: ENABLED or DISABLED (credentials not found). If disabled, verify all 4 environment variables are set. Test connection using AWS CLI:
aws s3 ls s3://your-bucket/ --endpoint-url https://your-endpoint.com

Use Cases

  • Backup & Disaster Recovery - Automatic off-site backup with easy restoration
  • Data Sharing - Centralized storage for team access across systems
  • Archiving - Long-term storage cheaper than local SSD/NVMe
  • Multi-Region Deployment - Centralize data from crawlers in multiple regions

Cost Considerations

Storage: ~$0.005-$0.023/GB/month (Hetzner and Backblaze are the cheapest). Egress (downloads): free for Hetzner, Wasabi, and Cloudflare R2; AWS charges $0.09/GB.
For frequent downloads, use Cloudflare R2, Hetzner, or Wasabi for free egress.
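To put the rates in perspective, the monthly bill is just volume times rate. A toy calculation using the approximate figures above (illustrative, not provider quotes):

```python
def monthly_cost(storage_gb, storage_rate, egress_gb=0.0, egress_rate=0.0):
    """Approximate monthly bill in dollars: storage plus egress."""
    return round(storage_gb * storage_rate + egress_gb * egress_rate, 2)

# 500 GB stored, 200 GB downloaded, at a Hetzner-like $0.005/GB with free egress:
print(monthly_cost(500, 0.005, egress_gb=200, egress_rate=0.0))   # 2.5
# The same usage on AWS S3 ($0.023/GB storage, $0.09/GB egress):
print(monthly_cost(500, 0.023, egress_gb=200, egress_rate=0.09))  # 29.5
```

Egress dominates the difference once downloads are frequent, which is why free-egress providers are recommended above.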

Troubleshooting

  • Upload not happening - Verify all 4 S3 variables are set in .env, check Airflow logs, and confirm the bucket exists. Remember: regular ./scrapai crawl does NOT upload to S3 (only Airflow workflows).
  • Authentication error - Test with aws s3 ls --endpoint-url $S3_ENDPOINT. Verify credentials, endpoint URL, and bucket permissions.
  • Files not appearing - Check Airflow logs for upload task status and browse the bucket with aws s3 ls s3://your-bucket/ --endpoint-url $S3_ENDPOINT.
  • Upload too slow - Choose a provider closer to your location or with better bandwidth.
  • Bucket access denied - Verify the bucket exists and check IAM/access policies. For AWS, ensure the user has s3:PutObject, s3:GetObject, and s3:ListBucket permissions.

Manual Upload

For non-Airflow crawls, use AWS CLI to manually upload:
aws s3 cp DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl \
  s3://your-bucket/<spider_name>/ --endpoint-url $S3_ENDPOINT
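If you prefer Python over the AWS CLI, the same upload can be scripted with boto3 (a third-party package: pip install boto3). The helper names upload_crawl and crawl_key are ours, not part of ScrapAI:

```python
import os

def crawl_key(local_path: str, spider_name: str) -> str:
    """S3 key matching the documented layout: <spider_name>/<filename>."""
    return f"{spider_name}/{os.path.basename(local_path)}"

def upload_crawl(local_path: str, spider_name: str) -> str:
    """Upload one crawl file using the S3_* variables from .env; return its key."""
    import boto3  # imported lazily so crawl_key works without boto3 installed

    key = crawl_key(local_path, spider_name)
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT"],
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    )
    s3.upload_file(local_path, os.environ["S3_BUCKET"], key)
    return key
```

To back up several runs, loop over the crawls directory and call the upload helper once per file.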

Best Practices

  1. Create and test bucket before enabling uploads
  2. Use meaningful bucket names like scrapai-prod, scrapai-dev
  3. Enable versioning to protect against overwrites (optional)
  4. Set lifecycle policies to auto-delete old files (optional)
  5. Monitor storage costs and clean up unused data
  6. Secure credentials - never commit .env, rotate keys periodically

When to Use S3 Storage

Use S3 for data backup/archiving, sharing across systems, off-site storage of large crawls, or when using Airflow workflows.