Overview
ScrapAI can automatically upload crawl results to S3-compatible object storage for backup, archiving, and sharing across systems.

Current Limitation: S3 upload is currently only supported in Airflow workflows. Regular ./scrapai crawl commands do not upload to S3 yet.

Supported Providers
Works with any S3-compatible object storage:
- Hetzner Object Storage ⭐ (Recommended) - European provider, excellent pricing, free egress
- AWS S3 - Amazon’s original object storage
- DigitalOcean Spaces - Simple, developer-friendly
- Wasabi - Hot cloud storage, cheaper than S3
- Backblaze B2 - Cost-effective alternative
- Cloudflare R2 - Zero egress fees
- MinIO - Self-hosted S3-compatible storage
- Any other S3-compatible provider
Configuration
Environment Variables
Add S3 credentials to .env:
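A minimal .env sketch (all values are placeholders; the four variable names match the ones the Airflow DAGs check on startup):

```
# S3-compatible storage configuration (placeholder values)
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=your-secret-key
S3_ENDPOINT=https://fsn1.your-objectstorage.com
S3_BUCKET=scrapai-crawls
```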
S3_ACCESS_KEY
S3-compatible storage access key. Get this from your storage provider's dashboard or API settings.

S3_SECRET_KEY
S3-compatible storage secret key. Keep this secure - never commit to version control.

S3_ENDPOINT
S3-compatible storage endpoint URL. Examples:
- Hetzner: https://fsn1.your-objectstorage.com
- DigitalOcean: https://nyc3.digitaloceanspaces.com
- Wasabi: https://s3.us-east-1.wasabisys.com
- AWS S3: https://s3.us-east-1.amazonaws.com
S3_BUCKET
S3 bucket name for storing crawl results. Example: scrapai-crawls

Create the bucket in your storage provider before enabling uploads.

Provider-Specific Setup
Hetzner Object Storage (Recommended)
Endpoint format: https://<region>.your-objectstorage.com

Regions:
- fsn1 - Falkenstein, Germany
- hel1 - Helsinki, Finland
- nbg1 - Nuremberg, Germany
DigitalOcean Spaces
Endpoint format: https://<region>.digitaloceanspaces.com

Regions:
- nyc3 - New York
- sfo3 - San Francisco
- sgp1 - Singapore
- ams3 - Amsterdam
Wasabi
Endpoints:
- s3.us-east-1.wasabisys.com - US East (Virginia)
- s3.us-east-2.wasabisys.com - US East (Virginia)
- s3.us-central-1.wasabisys.com - US Central (Texas)
- s3.us-west-1.wasabisys.com - US West (Oregon)
- s3.eu-central-1.wasabisys.com - EU Central (Amsterdam)
- s3.eu-central-2.wasabisys.com - EU Central (Frankfurt)
- s3.ap-northeast-1.wasabisys.com - Asia Pacific (Tokyo)
AWS S3
Endpoint format: https://s3.<region>.amazonaws.com (e.g. https://s3.us-east-1.amazonaws.com).

Backblaze B2
Endpoint format: https://s3.<region>.backblazeb2.com - the region is shown in your B2 bucket details.

Cloudflare R2
Endpoint format: https://<account-id>.r2.cloudflarestorage.com - zero egress fees.

MinIO (Self-Hosted)
Endpoint is your MinIO server URL, e.g. http://localhost:9000 by default.
How It Works
Upload Process
- Spider completes crawl in Airflow workflow
- Crawl results saved to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
- If S3 is configured, an Airflow task uploads the file to the S3 bucket
- Files remain locally even after upload
- S3 path: s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl
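The local-to-S3 path mapping above can be sketched in Python (the helper name is illustrative, not ScrapAI's actual code):

```python
from pathlib import Path

def s3_key_for_crawl(local_path: str, spider_name: str) -> str:
    """Build the S3 object key for a crawl file.

    Mirrors the layout described above: the destination is
    s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl, so the key is
    simply <spider_name>/<filename>.
    """
    filename = Path(local_path).name  # e.g. crawl_20240101T120000.jsonl
    return f"{spider_name}/{filename}"

# A crawl file under DATA_DIR/<project>/<spider>/crawls/
key = s3_key_for_crawl(
    "data/myproject/news_spider/crawls/crawl_20240101T120000.jsonl",
    "news_spider",
)
# key == "news_spider/crawl_20240101T120000.jsonl"
```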
Local Files
All crawl data remains in your local DATA_DIR even after S3 upload. S3 is for backup/archiving, not a replacement for local storage.
Current Implementation
- S3 upload is only supported in Airflow workflows
- Regular ./scrapai crawl commands do not upload to S3 (yet)
- Airflow DAGs automatically detect S3 configuration
- Upload happens in background after spider completes
- Does not block crawling or slow down scraping
Verification
Check Configuration
Airflow DAGs log S3 status on startup:
- S3_ACCESS_KEY
- S3_SECRET_KEY
- S3_ENDPOINT
- S3_BUCKET
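The detection boils down to an all-four-variables check; a minimal sketch (ScrapAI's internal function names may differ):

```python
import os

REQUIRED = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def s3_configured(env=None) -> bool:
    """True only when all four S3 variables are set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(k) for k in REQUIRED)
```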
Test Upload
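One way to exercise the credentials outside Airflow is a short boto3 listing (boto3 is assumed to be installed; this is not part of ScrapAI itself, and it needs live credentials in the environment):

```python
import os

import boto3  # assumed installed: pip install boto3

# Build a client against the configured S3-compatible endpoint.
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    endpoint_url=os.environ["S3_ENDPOINT"],
)

# List up to five objects; an auth or endpoint problem raises here.
resp = s3.list_objects_v2(Bucket=os.environ["S3_BUCKET"], MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```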
Manually test the S3 connection with your provider's CLI or SDK.

Use Cases
Backup & Disaster Recovery
- Automatic off-site backup of all crawl data
- Protect against local disk failure
- Easy restoration from S3 if needed
Data Sharing
- Share crawl results across multiple systems
- Centralized data storage for team access
- Export data from one system, import on another
Archiving
- Long-term storage of historical crawls
- Cheaper than keeping on local SSD/NVMe
- Compliance and audit requirements
Multi-Region Deployment
- Run crawlers in multiple regions
- All data centralized in S3 bucket
- Access from anywhere
Cost Considerations
Storage Costs (per GB/month)
| Provider | Storage Cost | Notes |
|---|---|---|
| Hetzner | ~$0.005 | European provider |
| DigitalOcean Spaces | $0.02 | First 250GB free with droplet |
| Wasabi | $0.0059 | Minimum 1TB |
| AWS S3 | $0.023 | US East region |
| Backblaze B2 | $0.005 | Good value |
| Cloudflare R2 | $0.015 | Zero egress |
Egress Costs (downloading from S3)
| Provider | Egress Cost | Notes |
|---|---|---|
| Hetzner | Free | No egress charges |
| DigitalOcean | Free up to 1TB/month | Included with droplet |
| Wasabi | Free | No egress charges |
| AWS S3 | $0.09/GB | Expensive! |
| Backblaze B2 | $0.01/GB | Reasonable |
| Cloudflare R2 | Free | Zero egress fees |
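As a rough comparison using the table rates (quoted figures, not live prices), consider a month with 500 GB stored and 100 GB downloaded:

```python
def monthly_cost(gb_stored: float, gb_egress: float,
                 storage_per_gb: float, egress_per_gb: float) -> float:
    """Rough monthly bill in USD: storage plus egress."""
    return gb_stored * storage_per_gb + gb_egress * egress_per_gb

aws = monthly_cost(500, 100, 0.023, 0.09)     # 11.50 + 9.00 = 20.50
hetzner = monthly_cost(500, 100, 0.005, 0.0)  # 2.50 + 0.00 = 2.50
print(f"AWS S3: ${aws:.2f}/mo, Hetzner: ${hetzner:.2f}/mo")
```

At this volume the egress line dominates the AWS bill, which is why zero-egress providers come out far cheaper for data you re-download often.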
Troubleshooting
S3 upload not happening
- Check .env has all 4 S3 variables set
- Verify credentials are correct
- Confirm bucket exists and is accessible
- Check Airflow DAG logs for errors
- Remember: regular ./scrapai crawl does NOT upload to S3 (only Airflow workflows)
Authentication error
- Double-check access key and secret key
- Verify endpoint URL is correct for your provider
- Ensure bucket exists in the same region as endpoint
- Check IAM permissions (AWS) or access policies (other providers)
Files not appearing in bucket
- Check bucket name is correct
- Verify upload task completed successfully in Airflow logs
- Browse S3 bucket with web console or CLI
- Check S3 path: s3://<bucket>/<spider_name>/crawl_*.jsonl
Upload too slow
- Check network bandwidth to S3 provider
- Consider provider closer to your location
- Split large crawls into smaller batches
- Use provider with faster upload speeds
Bucket access denied
- Verify bucket exists
- Check bucket permissions/policies
- For AWS S3, verify the IAM user has the required permissions
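For AWS, a minimal IAM policy sketch covering upload, download, and listing (the bucket name scrapai-crawls is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::scrapai-crawls/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::scrapai-crawls"
    }
  ]
}
```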
Manual Upload Workaround
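A manual upload for non-Airflow crawls can be sketched with boto3 (assumed installed; the paths and spider name are illustrative), matching the s3://<bucket>/<spider_name>/ layout that Airflow uses:

```python
import os

import boto3  # assumed installed: pip install boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
    endpoint_url=os.environ["S3_ENDPOINT"],
)

local = "data/myproject/news_spider/crawls/crawl_20240101T120000.jsonl"
# Key matches the Airflow layout: <spider_name>/<filename>
s3.upload_file(local, os.environ["S3_BUCKET"],
               "news_spider/crawl_20240101T120000.jsonl")
```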
For non-Airflow crawls, upload files manually after the crawl completes.

Best Practices
- Create bucket before enabling uploads
  - Verify bucket exists and is accessible
  - Test credentials with manual upload
- Use meaningful bucket names
  - scrapai-crawls for all crawl data
  - scrapai-backup for backups
  - Include environment: scrapai-prod, scrapai-dev
- Enable versioning (optional)
  - Protect against accidental overwrites
  - Keep history of crawl data
- Set lifecycle policies (optional)
  - Auto-delete old files after N days
  - Move to cheaper storage tier after N days
- Monitor storage costs
  - Track bucket size growth
  - Review egress charges
  - Clean up old/unused data
- Secure credentials
  - Never commit .env to git
  - Use IAM roles when possible (AWS)
  - Rotate keys periodically
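Lifecycle rules are configured per provider; for AWS S3 (put-bucket-lifecycle-configuration) a policy sketch with illustrative day counts:

```json
{
  "Rules": [
    {
      "ID": "expire-old-crawls",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```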
When to Use S3 Storage
Recommend S3 setup when:
- User asks about data backup or archiving
- User needs to share crawl data across systems
- User mentions cloud storage (S3, Hetzner, DigitalOcean Spaces, etc.)
- User has large crawls and wants off-site storage
- User is using Airflow workflows
- User asks about data durability/redundancy