Overview
ScrapAI can automatically upload crawl results to S3-compatible object storage for backup, archiving, and sharing across systems. This feature currently works with Airflow workflows.

Current Limitation: S3 upload is only supported in Airflow workflows. Regular ./scrapai crawl commands do not upload to S3 yet.

Supported Providers
Works with any S3-compatible object storage:

- Hetzner Object Storage ⭐ (Recommended) - European provider, excellent pricing, free egress
- AWS S3 - Amazon’s original object storage
- DigitalOcean Spaces - Simple, developer-friendly
- Wasabi - Hot cloud storage, cheaper than S3
- Backblaze B2 - Cost-effective alternative
- Cloudflare R2 - Zero egress fees
- MinIO - Self-hosted S3-compatible storage
- Any other S3-compatible provider
Configuration
Environment Variables
Add S3 credentials to .env:
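As an illustration, a sketch of the four entries. Only S3_ENDPOINT is confirmed by the CLI examples in this document; the other three variable names are assumptions, so check your ScrapAI configuration for the exact names:

```shell
# Hypothetical .env sketch. S3_ENDPOINT matches the variable used in the
# troubleshooting commands; the other three names are illustrative guesses.
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=your-secret-key
S3_ENDPOINT=https://fsn1.your-objectstorage.com
S3_BUCKET=scrapai-crawls
```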
- Access key - S3-compatible storage access key. Get this from your storage provider's dashboard or API settings.
- Secret key - S3-compatible storage secret key. Keep this secure - never commit to version control.
- Endpoint URL - S3-compatible storage endpoint URL. Examples:
  - Hetzner: https://fsn1.your-objectstorage.com
  - DigitalOcean: https://nyc3.digitaloceanspaces.com
  - Wasabi: https://s3.us-east-1.wasabisys.com
  - AWS S3: https://s3.us-east-1.amazonaws.com
- Bucket name - S3 bucket name for storing crawl results. Example: scrapai-crawls. Create the bucket in your storage provider before enabling uploads.

Provider-Specific Setup
Hetzner Object Storage (Recommended)
Endpoint format: https://fsn1.your-objectstorage.com (regions: fsn1, hel1, nbg1)
DigitalOcean Spaces
Endpoint format: https://nyc3.digitaloceanspaces.com (regions: nyc3, sfo3, sgp1, ams3)
Wasabi
Endpoint format: https://s3.us-east-1.wasabisys.com (check Wasabi docs for region endpoints)
AWS S3
Endpoint format: https://s3.us-east-1.amazonaws.com
Cloudflare R2
Endpoint format: https://<account_id>.r2.cloudflarestorage.com
Other Providers
Works with Backblaze B2, MinIO (self-hosted), and any S3-compatible service. Check your provider's documentation for endpoint URLs.

How It Works
When a spider completes in Airflow, crawl results are saved locally to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl. If S3 is configured, files are automatically uploaded to s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl in the background. Local files are preserved - S3 is for backup/archiving only.
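ScrapAI's actual uploader is not shown here; as an illustration, a minimal sketch of the path-to-key mapping described above using boto3 (function names, and all environment variable names except S3_ENDPOINT, are assumptions):

```python
import os
from pathlib import Path


def s3_key_for(local_path: str) -> str:
    """Derive the S3 object key from a local crawl file path.

    Files live at DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
    and land at s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl.
    """
    p = Path(local_path)
    spider = p.parent.parent.name  # the <spider> directory above crawls/
    return f"{spider}/{p.name}"


def upload_crawl(local_path: str):
    """Upload one crawl file; no-op when any of the four variables is unset."""
    required = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")
    if not all(os.environ.get(v) for v in required):
        return None  # S3 Upload: DISABLED (credentials not found)

    import boto3  # assumed dependency: pip install boto3

    client = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT"],
        aws_access_key_id=os.environ["S3_ACCESS_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    )
    key = s3_key_for(local_path)
    client.upload_file(local_path, os.environ["S3_BUCKET"], key)
    return key
```

The local file is never deleted by this sketch, matching the backup-only behavior described above.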
Verification
Airflow DAGs log S3 status on startup: S3 Upload: ENABLED or DISABLED (credentials not found). If disabled, verify all 4 environment variables are set.
Test connection using AWS CLI:
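For example (note that the AWS CLI reads credentials from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or `aws configure`, not from the .env entries; the bucket name is a placeholder):

```shell
# List buckets through your provider's endpoint
aws s3 ls --endpoint-url "$S3_ENDPOINT"

# List the contents of your crawl bucket
aws s3 ls s3://your-bucket/ --endpoint-url "$S3_ENDPOINT"
```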
Use Cases
- Backup & Disaster Recovery - Automatic off-site backup with easy restoration
- Data Sharing - Centralized storage for team access across systems
- Archiving - Long-term storage cheaper than local SSD/NVMe
- Multi-Region Deployment - Centralize data from crawlers in multiple regions
Cost Considerations
Storage: ~$0.023/GB/month (Hetzner and Backblaze cheapest). Egress (downloads): free for Hetzner, Wasabi, and Cloudflare R2; AWS charges $0.09/GB.

Troubleshooting
Upload not happening: Verify all 4 S3 variables are set in .env, check Airflow logs, and confirm bucket exists. Remember: regular ./scrapai crawl does NOT upload to S3 (only Airflow workflows).
Authentication error: Test with aws s3 ls --endpoint-url $S3_ENDPOINT. Verify credentials, endpoint URL, and bucket permissions.
Files not appearing: Check Airflow logs for upload task status and browse bucket with aws s3 ls s3://your-bucket/ --endpoint-url $S3_ENDPOINT.
Upload too slow: Choose provider closer to your location or with better bandwidth.
Bucket access denied: Verify bucket exists and check IAM/access policies. For AWS, ensure user has s3:PutObject, s3:GetObject, and s3:ListBucket permissions.
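For AWS, a minimal IAM policy granting those three permissions might look like the following (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}
```

Other providers have their own access-policy mechanisms; consult their documentation.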
Manual Upload
For non-Airflow crawls, use the AWS CLI to upload results manually, e.g. aws s3 cp with your endpoint URL.

Best Practices
- Create and test bucket before enabling uploads
- Use meaningful names like scrapai-prod, scrapai-dev
- Enable versioning to protect against overwrites (optional)
- Set lifecycle policies to auto-delete old files (optional)
- Monitor storage costs and clean up unused data
- Secure credentials - never commit .env, rotate keys periodically