Overview

ScrapAI CLI uses environment variables stored in a .env file in the project root. The file is automatically created during ./scrapai setup or can be manually created from .env.example.
The .env file is gitignored by default. Never commit credentials to version control.

Core Environment Variables

Data Directory

DATA_DIR (string, default: "./data")
Directory where all scraped data, analysis, and artifacts are stored.

Database Configuration

DATABASE_URL (string, default: "sqlite:///scrapai.db")
Database connection string. Supports SQLite (default) and PostgreSQL. See Database Configuration for details.
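For PostgreSQL, the connection string follows the standard URL form. A .env entry might look like this (the host, port, and credentials below are illustrative placeholders, not defaults):

```shell
# Illustrative PostgreSQL connection string; substitute your own values
DATABASE_URL=postgresql://scrapai_user:secret@localhost:5432/scrapai
```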

Logging

LOG_LEVEL (string, default: "info")
Logging verbosity: debug, info, warning, or error.

LOG_DIR (string, default: "./logs")
Directory for log files.

Optional Services

Proxy Configuration

DATACENTER_PROXY_USERNAME (string)
Username for datacenter proxy authentication. See Proxy Configuration for complete setup.

DATACENTER_PROXY_PASSWORD (string)
Password for datacenter proxy authentication.

DATACENTER_PROXY_HOST (string)
Datacenter proxy server hostname.

DATACENTER_PROXY_PORT (number)
Datacenter proxy server port.

RESIDENTIAL_PROXY_USERNAME (string)
Username for residential proxy authentication.

RESIDENTIAL_PROXY_PASSWORD (string)
Password for residential proxy authentication.

RESIDENTIAL_PROXY_HOST (string)
Residential proxy server hostname.

RESIDENTIAL_PROXY_PORT (number)
Residential proxy server port.

S3 Storage Configuration

S3_ACCESS_KEY (string)
S3-compatible storage access key. See S3 Storage Configuration for complete setup.

S3_SECRET_KEY (string)
S3-compatible storage secret key.

S3_ENDPOINT (string)
S3-compatible storage endpoint URL.

S3_BUCKET (string)
S3 bucket name for storing crawl results.

Airflow Configuration

_AIRFLOW_WWW_USER_USERNAME (string, default: "admin")
Airflow web UI admin username (for docker-compose.airflow.yml).

_AIRFLOW_WWW_USER_PASSWORD (string, default: "changeme_REPLACE_THIS")
Airflow web UI admin password. Change the default password before deploying to production.

AIRFLOW_UID (number, default: "501")
User ID for Airflow processes in Docker.

Environment File Example

# Core
DATA_DIR=./data
DATABASE_URL=sqlite:///scrapai.db

# Logging
LOG_LEVEL=info
LOG_DIR=./logs

# Airflow (for docker-compose.airflow.yml)
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=changeme_REPLACE_THIS
AIRFLOW_UID=501

# Optional: Proxies (see /configuration/proxies)
# DATACENTER_PROXY_USERNAME=
# DATACENTER_PROXY_PASSWORD=
# DATACENTER_PROXY_HOST=
# DATACENTER_PROXY_PORT=

# Optional: S3 Storage (see /configuration/s3-storage)
# S3_ACCESS_KEY=
# S3_SECRET_KEY=
# S3_ENDPOINT=
# S3_BUCKET=

Loading Configuration

Environment variables are automatically loaded from .env when you run any ScrapAI command. Resolution precedence, highest first: exported environment variables, then values from the .env file, then built-in defaults.
Override .env settings with exported environment variables:
export LOG_LEVEL=debug
./scrapai crawl spider_name --project myproject
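The precedence above can be sketched as follows. This is a minimal illustration assuming a simple KEY=VALUE .env format; ScrapAI's actual loader may differ in details:

```python
import os

def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

# Built-in defaults (subset shown for illustration)
DEFAULTS = {"DATA_DIR": "./data", "LOG_LEVEL": "info", "LOG_DIR": "./logs"}

def resolve(key, dotenv_values):
    """Exported environment variables win over .env, which wins over defaults."""
    if key in os.environ:
        return os.environ[key]
    if key in dotenv_values:
        return dotenv_values[key]
    return DEFAULTS.get(key)

dotenv = parse_dotenv("LOG_LEVEL=warning\n# a comment\nDATA_DIR=./data")
os.environ["LOG_LEVEL"] = "debug"       # exported variable overrides .env
print(resolve("LOG_LEVEL", dotenv))     # exported value wins: debug
print(resolve("DATA_DIR", dotenv))      # from .env: ./data
print(resolve("LOG_DIR", dotenv))       # neither set, falls back to default: ./logs
```

Here `export LOG_LEVEL=debug` takes effect even though the .env file sets `LOG_LEVEL=warning`, matching the override behavior shown above.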

Validation

Verify your configuration:
./scrapai verify

Security Best Practices

Never commit .env to version control. The file contains sensitive credentials and is gitignored by default.
  1. Use .env.example as a template with placeholder values
  2. Restrict permissions: chmod 600 .env
  3. Rotate credentials regularly
  4. Use different credentials per environment (.env, .env.production, .env.test)

Troubleshooting

Changes not taking effect:
  • Verify .env exists in project root
  • Check syntax (no spaces around =)
  • Restart running processes
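The syntax rule above means, for example:

```shell
# Correct: no spaces around "="
LOG_LEVEL=debug

# Incorrect: spaces around "=" will not parse as expected
# LOG_LEVEL = debug
```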
Configuration not found:
cp .env.example .env
./scrapai setup
Permission denied:
chmod 600 .env
chmod 755 data/