Overview

ScrapAI can automatically upload crawl results to S3-compatible object storage for backup, archiving, and sharing across systems. This feature currently works with Airflow workflows.
Current Limitation: S3 upload is only supported in Airflow workflows. Regular ./scrapai crawl commands do not upload to S3 yet.

Supported Providers

Works with any S3-compatible object storage:
  • Hetzner Object Storage (Recommended) - European provider, excellent pricing, free egress
  • AWS S3 - Amazon’s original object storage
  • DigitalOcean Spaces - Simple, developer-friendly
  • Wasabi - Hot cloud storage, cheaper than S3
  • Backblaze B2 - Cost-effective alternative
  • Cloudflare R2 - Zero egress fees
  • MinIO - Self-hosted S3-compatible storage
  • Any other S3-compatible provider

Configuration

Environment Variables

Add S3 credentials to .env:
# S3-Compatible Object Storage
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://fsn1.your-objectstorage.com
S3_BUCKET=scrapai-crawls
S3_ACCESS_KEY (string, required)
S3-compatible storage access key. Get this from your storage provider’s dashboard or API settings.

S3_SECRET_KEY (string, required)
S3-compatible storage secret key. Keep this secure - never commit it to version control.

S3_ENDPOINT (string, required)
S3-compatible storage endpoint URL. Examples:
  • Hetzner: https://fsn1.your-objectstorage.com
  • DigitalOcean: https://nyc3.digitaloceanspaces.com
  • Wasabi: https://s3.us-east-1.wasabisys.com
  • AWS S3: https://s3.us-east-1.amazonaws.com

S3_BUCKET (string, required)
S3 bucket name for storing crawl results. Example: scrapai-crawls. Create the bucket in your storage provider before enabling uploads.

Provider-Specific Setup

Hetzner Object Storage

# Get your endpoint from Hetzner Cloud Console → Object Storage
S3_ACCESS_KEY=your_hetzner_access_key
S3_SECRET_KEY=your_hetzner_secret_key
S3_ENDPOINT=https://fsn1.your-objectstorage.com  # Falkenstein, Germany
S3_BUCKET=scrapai-crawls
Available Regions:
  • fsn1 - Falkenstein, Germany
  • hel1 - Helsinki, Finland
  • nbg1 - Nuremberg, Germany
Hetzner offers free egress - no charges for downloading your data.

DigitalOcean Spaces

S3_ACCESS_KEY=your_spaces_access_key
S3_SECRET_KEY=your_spaces_secret_key
S3_ENDPOINT=https://nyc3.digitaloceanspaces.com  # New York
S3_BUCKET=scrapai-crawls
Available Regions:
  • nyc3 - New York
  • sfo3 - San Francisco
  • sgp1 - Singapore
  • ams3 - Amsterdam

Wasabi

S3_ACCESS_KEY=your_wasabi_access_key
S3_SECRET_KEY=your_wasabi_secret_key
S3_ENDPOINT=https://s3.us-east-1.wasabisys.com   # US East
S3_BUCKET=scrapai-crawls
Available Regions:
  • s3.us-east-1.wasabisys.com - US East (Virginia)
  • s3.us-east-2.wasabisys.com - US East (Virginia)
  • s3.us-central-1.wasabisys.com - US Central (Texas)
  • s3.us-west-1.wasabisys.com - US West (Oregon)
  • s3.eu-central-1.wasabisys.com - EU Central (Amsterdam)
  • s3.eu-central-2.wasabisys.com - EU Central (Frankfurt)
  • s3.ap-northeast-1.wasabisys.com - Asia Pacific (Tokyo)

AWS S3

S3_ACCESS_KEY=your_aws_access_key_id
S3_SECRET_KEY=your_aws_secret_access_key
S3_ENDPOINT=https://s3.us-east-1.amazonaws.com   # US East
S3_BUCKET=scrapai-crawls
AWS S3 has high egress costs ($0.09/GB). Consider providers with free egress for frequent downloads.

Backblaze B2

S3_ACCESS_KEY=your_b2_key_id
S3_SECRET_KEY=your_b2_application_key
S3_ENDPOINT=https://s3.us-west-000.backblazeb2.com
S3_BUCKET=scrapai-crawls
Check your B2 dashboard for the correct endpoint URL.

Cloudflare R2

S3_ACCESS_KEY=your_r2_access_key
S3_SECRET_KEY=your_r2_secret_key
S3_ENDPOINT=https://<account_id>.r2.cloudflarestorage.com
S3_BUCKET=scrapai-crawls
Cloudflare R2 has zero egress fees - best for frequent data downloads.

MinIO (Self-Hosted)

S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_ENDPOINT=http://localhost:9000
S3_BUCKET=scrapai-crawls

How It Works

Upload Process

  1. Spider completes crawl in Airflow workflow
  2. Crawl results saved to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
  3. If S3 configured, Airflow task uploads file to S3 bucket
  4. Files remain locally even after upload
  5. S3 path: s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl
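
The path mapping in steps 2 and 5 can be sketched in shell (the project and spider names below are hypothetical placeholders, not part of ScrapAI):

```shell
# Derive the S3 key from a local crawl path, following the layout above:
#   DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
#   -> s3://<bucket>/<spider_name>/crawl_TIMESTAMP.jsonl
LOCAL="data/myproject/myspider/crawls/crawl_20250101_120000.jsonl"
SPIDER=$(basename "$(dirname "$(dirname "$LOCAL")")")
echo "s3://scrapai-crawls/$SPIDER/$(basename "$LOCAL")"
# → s3://scrapai-crawls/myspider/crawl_20250101_120000.jsonl
```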

Local Files

All crawl data remains in your local DATA_DIR even after S3 upload. S3 is for backup/archiving, not a replacement for local storage.

Current Implementation

  • S3 upload is only supported in Airflow workflows
  • Regular ./scrapai crawl commands do not upload to S3 (yet)
  • Airflow DAGs automatically detect S3 configuration
  • Upload happens in background after spider completes
  • Does not block crawling or slow down scraping

Verification

Check Configuration

Airflow DAGs log S3 status on startup:
S3 Upload: ENABLED
or
S3 Upload: DISABLED (credentials not found)
If disabled, verify all 4 environment variables are set:
  • S3_ACCESS_KEY
  • S3_SECRET_KEY
  • S3_ENDPOINT
  • S3_BUCKET
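
A quick shell check for the four variables above (a sketch, not a ScrapAI command; it only reports which variables are unset in the current shell):

```shell
# Report OK/MISSING for each required S3 variable.
for var in S3_ACCESS_KEY S3_SECRET_KEY S3_ENDPOINT S3_BUCKET; do
  eval "val=\${$var:-}"   # POSIX-safe indirect lookup
  if [ -n "$val" ]; then
    echo "OK: $var"
  else
    echo "MISSING: $var"
  fi
done
```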

Test Upload

Manually test S3 connection:
# Install AWS CLI
pip install awscli

# Test upload
echo "test" > test.txt
aws s3 cp test.txt s3://your-bucket/test.txt \
  --endpoint-url https://your-endpoint.com

# List bucket contents
aws s3 ls s3://your-bucket/ \
  --endpoint-url https://your-endpoint.com

Use Cases

Backup & Disaster Recovery

  • Automatic off-site backup of all crawl data
  • Protect against local disk failure
  • Easy restoration from S3 if needed

Data Sharing

  • Share crawl results across multiple systems
  • Centralized data storage for team access
  • Export data from one system, import on another

Archiving

  • Long-term storage of historical crawls
  • Cheaper than keeping on local SSD/NVMe
  • Compliance and audit requirements

Multi-Region Deployment

  • Run crawlers in multiple regions
  • All data centralized in S3 bucket
  • Access from anywhere

Cost Considerations

Storage Costs (per GB/month)

Provider              Storage Cost   Notes
Hetzner               ~$0.005        European provider
DigitalOcean Spaces   $0.02          First 250GB free with droplet
Wasabi                $0.0059        Minimum 1TB
AWS S3                $0.023         US East region
Backblaze B2          $0.005         Good value
Cloudflare R2         $0.015         Zero egress
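
To make the differences concrete, here is the arithmetic for 1 TB (1000 GB) stored for one month at three of the rates above (a rough sketch; real bills add request fees, egress, and provider minimums):

```shell
# Monthly storage cost = GB stored * per-GB rate.
GB=1000
for entry in "Hetzner 0.005" "Wasabi 0.0059" "AWS-S3 0.023"; do
  set -- $entry
  awk -v name="$1" -v gb="$GB" -v rate="$2" \
    'BEGIN { printf "%s: $%.2f/month\n", name, gb * rate }'
done
# → Hetzner: $5.00/month
# → Wasabi: $5.90/month
# → AWS-S3: $23.00/month
```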

Egress Costs (downloading from S3)

Provider        Egress Cost            Notes
Hetzner         Free                   No egress charges
DigitalOcean    Free up to 1TB/month   Included with droplet
Wasabi          Free                   No egress charges
AWS S3          $0.09/GB               Expensive!
Backblaze B2    $0.01/GB               Reasonable
Cloudflare R2   Free                   Zero egress fees
For frequent data downloads, use providers with free/cheap egress:
  • Cloudflare R2 (zero egress)
  • Hetzner (free egress)
  • Wasabi (free egress)

Troubleshooting

S3 upload not happening

  1. Check .env has all 4 S3 variables set:
    grep S3_ .env
    
  2. Verify credentials are correct:
    aws s3 ls s3://your-bucket/ --endpoint-url https://your-endpoint.com
    
  3. Confirm bucket exists and is accessible
  4. Check Airflow DAG logs for errors
  5. Remember: Regular ./scrapai crawl does NOT upload to S3 (only Airflow workflows)

Authentication error

# Verify credentials
echo $S3_ACCESS_KEY
echo $S3_ENDPOINT

# Test connection
aws s3 ls --endpoint-url $S3_ENDPOINT
  1. Double-check access key and secret key
  2. Verify endpoint URL is correct for your provider
  3. Ensure bucket exists in the same region as endpoint
  4. Check IAM permissions (AWS) or access policies (other providers)

Files not appearing in bucket

  1. Check bucket name is correct:
    echo $S3_BUCKET
    
  2. Verify upload task completed successfully in Airflow logs
  3. Browse S3 bucket with web console or CLI:
    aws s3 ls s3://your-bucket/ --endpoint-url https://your-endpoint.com
    
  4. Check S3 path: s3://<bucket>/<spider_name>/crawl_*.jsonl

Upload too slow

  1. Check network bandwidth to S3 provider
  2. Consider provider closer to your location
  3. Split large crawls into smaller batches
  4. Use provider with faster upload speeds

Bucket access denied

  1. Verify bucket exists:
    aws s3 ls --endpoint-url https://your-endpoint.com
    
  2. Check bucket permissions/policies
  3. For AWS S3, verify IAM user has permissions:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::scrapai-crawls",
            "arn:aws:s3:::scrapai-crawls/*"
          ]
        }
      ]
    }
    

Manual Upload Workaround

For non-Airflow crawls, upload manually after the crawl completes:
# After crawl completes
aws s3 cp DATA_DIR/<project>/<spider>/crawls/crawl_20260223_143022.jsonl \
  s3://your-bucket/<spider_name>/ \
  --endpoint-url https://your-endpoint.com

# Upload entire crawls directory
aws s3 sync DATA_DIR/<project>/<spider>/crawls/ \
  s3://your-bucket/<spider_name>/ \
  --endpoint-url https://your-endpoint.com
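
When you only want the most recent run rather than the whole directory, a small helper can pick it out (a sketch; the project and spider names are hypothetical placeholders):

```shell
# Print the most recently modified crawl file for one spider.
CRAWLS_DIR="data/myproject/myspider/crawls"
LATEST=$(ls -t "$CRAWLS_DIR"/crawl_*.jsonl 2>/dev/null | head -n 1)
echo "Latest crawl: ${LATEST:-none found}"
```

Pass `$LATEST` to the `aws s3 cp` command shown above to upload just that file.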

Best Practices

  1. Create bucket before enabling uploads
    • Verify bucket exists and is accessible
    • Test credentials with manual upload
  2. Use meaningful bucket names
    • scrapai-crawls for all crawl data
    • scrapai-backup for backups
    • Include environment: scrapai-prod, scrapai-dev
  3. Enable versioning (optional)
    • Protect against accidental overwrites
    • Keep history of crawl data
  4. Set lifecycle policies (optional)
    • Auto-delete old files after N days
    • Move to cheaper storage tier after N days
  5. Monitor storage costs
    • Track bucket size growth
    • Review egress charges
    • Clean up old/unused data
  6. Secure credentials
    • Never commit .env to git
    • Use IAM roles when possible (AWS)
    • Rotate keys periodically
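
Item 4 (lifecycle policies) can be expressed as an AWS-style lifecycle configuration. The 90-day value below is an arbitrary example, and not every S3-compatible provider supports this API - check your provider's docs:

```shell
# Write a lifecycle rule that expires crawl files 90 days after upload.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-crawls",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 90 }
    }
  ]
}
EOF
echo "wrote lifecycle.json"
```

Apply it with the AWS CLI: `aws s3api put-bucket-lifecycle-configuration --bucket scrapai-crawls --lifecycle-configuration file://lifecycle.json --endpoint-url https://your-endpoint.com`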

When to Use S3 Storage

Recommend S3 setup when:
  • User asks about data backup or archiving
  • User needs to share crawl data across systems
  • User mentions cloud storage (S3, Hetzner, DigitalOcean Spaces, etc.)
  • User has large crawls and wants off-site storage
  • User is using Airflow workflows
  • User asks about data durability/redundancy