
Overview

ScrapAI CLI uses environment variables for configuration. All settings are stored in a .env file in the project root directory.

Initial Setup

The .env file is automatically created during setup:
./scrapai setup
To manually create or modify your configuration:
  1. Copy the example file:
    cp .env.example .env
    
  2. Edit .env with your preferred settings
  3. Restart any running processes to apply changes
The .env file is gitignored by default. Never commit credentials to version control.

Core Environment Variables

Data Directory

DATA_DIR
string
default:"./data"
Directory where all scraped data, analysis, and artifacts are stored. Example:
DATA_DIR=./data
All crawl results, spider configurations, and project data are organized under this directory:
data/
├── project1/
│   ├── spider1/
│   │   ├── crawls/
│   │   ├── analysis/
│   │   └── spider.json
│   └── spider2/
└── project2/
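The layout above can be composed mechanically from `DATA_DIR`, the project name, and the spider name. A minimal sketch (the `spider_dir` helper is illustrative, not part of the ScrapAI API):

```python
import os
from pathlib import Path

def spider_dir(project: str, spider: str) -> Path:
    """Build the per-spider directory under DATA_DIR (illustrative helper)."""
    data_dir = Path(os.environ.get("DATA_DIR", "./data"))
    return data_dir / project / spider

# Per the tree above, the spider config and crawl output live here:
config_path = spider_dir("project1", "spider1") / "spider.json"
crawls_path = spider_dir("project1", "spider1") / "crawls"
print(config_path)  # e.g. data/project1/spider1/spider.json when DATA_DIR is unset
```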

Database Configuration

DATABASE_URL
string
default:"sqlite:///scrapai.db"
Database connection string. Supports SQLite and PostgreSQL.
SQLite (default):
DATABASE_URL=sqlite:///scrapai.db
PostgreSQL:
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
See Database Configuration for details.
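Connection strings follow the usual URL form, so their parts can be inspected with the standard library. A quick illustration (ScrapAI consumes the URL as-is; this parsing is only for inspection):

```python
from urllib.parse import urlparse

# Break a DATABASE_URL into its components (illustrative only).
url = urlparse("postgresql://user:password@localhost:5432/scrapai")
print(url.scheme)             # postgresql
print(url.hostname)           # localhost
print(url.port)               # 5432
print(url.path.lstrip("/"))   # scrapai  (the database name)
```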

Logging

LOG_LEVEL
string
default:"info"
Logging verbosity level. Options:
  • debug - Detailed debugging information
  • info - General informational messages (recommended)
  • warning - Warning messages only
  • error - Error messages only
Example:
LOG_LEVEL=info
LOG_DIR
string
default:"./logs"
Directory for log files. Example:
LOG_DIR=./logs
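The four levels map directly onto Python's standard `logging` constants. A minimal sketch of how a process might apply `LOG_LEVEL` (illustrative, not ScrapAI's internal loader):

```python
import logging
import os

# Map the documented LOG_LEVEL values onto Python's logging constants.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}

# Fall back to the documented default ("info") when unset or unrecognized.
level_name = os.environ.get("LOG_LEVEL", "info").lower()
logging.basicConfig(level=LEVELS.get(level_name, logging.INFO))
```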

Optional Services

Proxy Configuration

DATACENTER_PROXY_USERNAME
string
Username for datacenter proxy authentication. See Proxy Configuration for complete setup.
DATACENTER_PROXY_PASSWORD
string
Password for datacenter proxy authentication.
DATACENTER_PROXY_HOST
string
Datacenter proxy server hostname. Example: dc.yourproxy.com
DATACENTER_PROXY_PORT
number
Datacenter proxy server port. Example: 10000 (rotating IPs)
RESIDENTIAL_PROXY_USERNAME
string
Username for residential proxy authentication.
RESIDENTIAL_PROXY_PASSWORD
string
Password for residential proxy authentication.
RESIDENTIAL_PROXY_HOST
string
Residential proxy server hostname. Example: residential.yourproxy.com
RESIDENTIAL_PROXY_PORT
number
Residential proxy server port. Example: 7000 (rotating residential IPs)
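Together, the four `*_PROXY_*` variables describe one conventional `user:password@host:port` proxy URL. A sketch of how they might be assembled, assuming a missing variable means "proxy disabled" (the `proxy_url` helper is hypothetical, not part of SmartProxyMiddleware's API):

```python
import os

def proxy_url(prefix: str):
    """Assemble an http proxy URL from the *_PROXY_* variables on this page.

    Illustrative only; ScrapAI's SmartProxyMiddleware reads these
    variables itself. Returns None when any variable is missing.
    """
    user = os.environ.get(f"{prefix}_PROXY_USERNAME")
    password = os.environ.get(f"{prefix}_PROXY_PASSWORD")
    host = os.environ.get(f"{prefix}_PROXY_HOST")
    port = os.environ.get(f"{prefix}_PROXY_PORT")
    if not all([user, password, host, port]):
        return None
    return f"http://{user}:{password}@{host}:{port}"

# Simulate the datacenter variables from this page:
os.environ.update({
    "DATACENTER_PROXY_USERNAME": "your_username",
    "DATACENTER_PROXY_PASSWORD": "your_password",
    "DATACENTER_PROXY_HOST": "dc.yourproxy.com",
    "DATACENTER_PROXY_PORT": "10000",
})
print(proxy_url("DATACENTER"))
# http://your_username:your_password@dc.yourproxy.com:10000
```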

S3 Storage Configuration

S3_ACCESS_KEY
string
S3-compatible storage access key. See S3 Storage Configuration for complete setup.
S3_SECRET_KEY
string
S3-compatible storage secret key.
S3_ENDPOINT
string
S3-compatible storage endpoint URL. Example: https://fsn1.your-objectstorage.com
S3_BUCKET
string
S3 bucket name for storing crawl results. Example: scrapai-crawls
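Per the example file below, uploads are enabled only when these variables are set. A small sketch of that check (the `s3_configured` helper is hypothetical, not a ScrapAI function):

```python
import os

S3_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def s3_configured() -> bool:
    """True when every S3 variable is present, i.e. uploads would be enabled."""
    return all(os.environ.get(v) for v in S3_VARS)

# Simulate a fully populated S3 configuration:
os.environ.update({
    "S3_ACCESS_KEY": "your_access_key_here",
    "S3_SECRET_KEY": "your_secret_key_here",
    "S3_ENDPOINT": "https://fsn1.your-objectstorage.com",
    "S3_BUCKET": "scrapai-crawls",
})
print(s3_configured())  # True
```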

Airflow Configuration

_AIRFLOW_WWW_USER_USERNAME
string
default:"admin"
Airflow web UI admin username (for docker-compose.airflow.yml).
_AIRFLOW_WWW_USER_PASSWORD
string
default:"changeme_REPLACE_THIS"
Airflow web UI admin password.
Change the default password before deploying to production.
AIRFLOW_UID
number
default:"501"
User ID for Airflow processes in Docker.

Environment File Example

Complete .env file with all available options:
# ScrapAI Configuration

# Data directory - where all scraped data, analysis, and artifacts are stored
DATA_DIR=./data

# Database connection string
# Default: SQLite (no setup required, perfect for getting started)
# For production/scale: use PostgreSQL
DATABASE_URL=sqlite:///scrapai.db

# Proxy Configuration (SmartProxyMiddleware)
# Uncomment and fill in your proxy details to enable

# Datacenter Proxy (recommended for most use cases - faster, cheaper)
# DATACENTER_PROXY_USERNAME=your_username
# DATACENTER_PROXY_PASSWORD=your_password
# DATACENTER_PROXY_HOST=dc.yourproxy.com
# DATACENTER_PROXY_PORT=10000

# Residential Proxy (used with --proxy-type residential flag)
# RESIDENTIAL_PROXY_USERNAME=your_username
# RESIDENTIAL_PROXY_PASSWORD=your_password
# RESIDENTIAL_PROXY_HOST=residential.yourproxy.com
# RESIDENTIAL_PROXY_PORT=7000

# Logging
# LOG_LEVEL=info
# LOG_DIR=./logs

# Airflow Configuration (for docker-compose.airflow.yml)
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=changeme_REPLACE_THIS
AIRFLOW_UID=501

# S3-Compatible Object Storage
# If these are set, scraped data will be automatically uploaded to S3 after crawling
# S3_ACCESS_KEY=your_access_key_here
# S3_SECRET_KEY=your_secret_key_here
# S3_ENDPOINT=https://fsn1.your-objectstorage.com
# S3_BUCKET=scrapai-crawls

Loading Configuration

Environment variables are automatically loaded from .env when you run any ScrapAI command:
./scrapai crawl spider_name --project myproject
The configuration is loaded in this order:
  1. .env file in project root
  2. System environment variables (override .env)
  3. Default values (if not set)
You can override any .env setting by exporting an environment variable:
export LOG_LEVEL=debug
./scrapai crawl spider_name --project myproject
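The precedence above can be sketched as a merge of three layers, with exported variables winning (a simplified model; ScrapAI's actual loader may differ in detail):

```python
import os

def load_config(dotenv: dict, defaults: dict) -> dict:
    """Merge defaults < .env file < process environment, mirroring the
    documented order. Illustrative only, not ScrapAI's loader."""
    merged = dict(defaults)
    merged.update(dotenv)              # .env values override defaults
    for key in merged:
        if key in os.environ:          # exported variables override .env
            merged[key] = os.environ[key]
    return merged

dotenv = {"LOG_LEVEL": "info"}
defaults = {"LOG_LEVEL": "info", "DATA_DIR": "./data"}

os.environ.pop("DATA_DIR", None)       # keep the demo deterministic
os.environ["LOG_LEVEL"] = "debug"      # simulate `export LOG_LEVEL=debug`

config = load_config(dotenv, defaults)
print(config["LOG_LEVEL"])  # debug   (exported variable wins)
print(config["DATA_DIR"])   # ./data  (default, nothing overrides it)
```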

Validation

Verify your configuration is loaded correctly:
./scrapai verify
This command checks:
  • Python environment
  • Database connectivity
  • Required directories
  • Optional service configuration

Security Best Practices

Never commit .env to version control. The .env file contains sensitive credentials and should be gitignored.
  1. Use .env.example as a template
    • Commit .env.example with placeholder values
    • Keep actual credentials in .env only
  2. Restrict file permissions
    chmod 600 .env
    
  3. Rotate credentials regularly
    • Change database passwords
    • Regenerate API keys
    • Update proxy credentials
  4. Use different credentials per environment
    • Development: .env
    • Production: .env.production
    • Testing: .env.test

Troubleshooting

Changes not taking effect

  1. Verify .env file exists in project root:
    ls -la .env
    
  2. Check file syntax (no spaces around =):
    # Correct
    DATA_DIR=./data
    
    # Incorrect
    DATA_DIR = ./data
    
  3. Restart any running processes
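The "no spaces around `=`" rule from step 2 can be checked mechanically. An illustrative validator (not a ScrapAI command):

```python
import re

# A non-comment, non-blank .env line must start KEY= with no space before =.
ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")

def check_env_line(line: str) -> bool:
    """Return True when a line is blank, a comment, or a valid KEY=value pair."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return True
    return bool(ASSIGNMENT.match(stripped))

print(check_env_line("DATA_DIR=./data"))    # True
print(check_env_line("DATA_DIR = ./data"))  # False (space before =)
```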

Configuration not found

If ScrapAI can’t find your .env file:
# Create from example
cp .env.example .env

# Run setup
./scrapai setup

Permission denied

If you get permission errors:
# Fix .env permissions
chmod 600 .env

# Fix data directory permissions
chmod 755 data/