## System Requirements

| Requirement | Details |
| --- | --- |
| Python | Version 3.9 or higher |
| Git | For cloning the repository |
| Disk Space | ~500 MB (dependencies + browser) |

ScrapAI uses SQLite by default (no database installation required). For production scale, PostgreSQL is recommended.

Supported platforms:

- Linux (Ubuntu, Debian, CentOS, Fedora, Arch)
- macOS (Intel and Apple Silicon)
- Windows (10/11 via WSL only)
## Installation Steps

### 1. Install Python 3.9+

**Ubuntu/Debian:**

```bash
sudo apt update
sudo apt install python3.9 python3.9-venv python3-pip git
```

**Fedora/CentOS:**

```bash
sudo dnf install python39 python39-pip git
```

**Arch:**

```bash
sudo pacman -S python python-pip git
```

**macOS:** install via Homebrew (`brew install python@3.9 git`) or download from python.org.

Verify the installation with `python3 --version`.

**Windows:** Windows users must use WSL (Windows Subsystem for Linux). Install WSL, then follow the Linux instructions above.
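Beyond eyeballing the version string, you can check the 3.9 minimum programmatically. This one-liner is a sketch; it assumes the `python3` on your PATH is the interpreter ScrapAI will use:

```bash
# Fails with a non-zero exit (and an AssertionError) if Python is older than 3.9
python3 -c 'import sys; assert sys.version_info >= (3, 9), sys.version' && echo "Python version OK"
```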
### 2. Clone the repository

```bash
git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
```

Clone to a location where you have write permissions; avoid system directories like `/usr/local/`.
### 3. Run setup

```bash
./scrapai setup
```

The setup process will:

1. **Create virtual environment**: creates a `.venv` directory with isolated Python packages
2. **Install dependencies**: installs Scrapy, SQLAlchemy, Alembic, newspaper4k, trafilatura, Playwright, and more
3. **Install Playwright Chromium**: downloads the Chromium browser for JavaScript rendering and Cloudflare bypass

    **Linux users:** if Chromium fails to launch later, you may need to install system dependencies:

    ```bash
    sudo .venv/bin/python -m playwright install-deps chromium
    ```

    This requires sudo because it installs system packages (fonts, libraries, etc.).

4. **Create `.env` file**: copies `.env.example` to `.env` with the default SQLite configuration
5. **Initialize database**: runs Alembic migrations to create the database schema
6. **Configure Claude Code permissions**: if using AI agents, sets up permission rules in `.claude/settings.local.json`
Expected output:

```text
🚀 Setting up ScrapAI environment...
📦 Creating virtual environment...
✅ Virtual environment created
📋 Installing requirements...
✅ Requirements installed
🌐 Installing Playwright Chromium browser...
✅ Playwright Chromium installed
📝 Creating .env from .env.example...
✅ .env file created (using SQLite by default)
📁 Checking data directory permissions...
✅ Have permission to write to data directory: ./data
🗄️ Initializing database...
✅ Database initialized with migrations
🔧 Configuring Claude Code permissions...
✅ Claude Code permissions configured
🎉 ScrapAI setup complete!
```
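The data-directory check in that output boils down to a writability test. A minimal sketch of the same idea:

```bash
# Sketch of the permission check: create ./data if missing, then test writability
mkdir -p ./data
if [ -w ./data ]; then
  echo "Have permission to write to data directory: ./data"
else
  echo "No write permission on ./data" >&2
fi
```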
### 4. Verify installation

```bash
./scrapai verify
```

You should see:

```text
🔍 Verifying ScrapAI environment...
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
```

If any checks fail, re-run `./scrapai setup`. If issues persist, see Troubleshooting below.
## Configuration

### Database Configuration

ScrapAI uses SQLite by default (no setup required). For production, you can transfer your existing data to PostgreSQL:

1. **Get PostgreSQL.** Use a managed service (AWS RDS, DigitalOcean, Supabase) or install locally:

    ```bash
    # Linux
    sudo apt install postgresql postgresql-contrib

    # macOS
    brew install postgresql
    ```

2. **Update `.env`:**

    ```ini
    DATABASE_URL=postgresql://user:password@host:5432/scrapai
    ```

3. **Run migrations and transfer:**

    ```bash
    ./scrapai db migrate
    ./scrapai db transfer sqlite:///scrapai.db
    ```
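Before running the transfer, it can help to confirm which backend your `DATABASE_URL` actually points at. A small sketch, using a sample value written to a throwaway file (your real setting lives in `.env`):

```bash
# Write a sample DATABASE_URL to a throwaway file and extract its scheme
printf 'DATABASE_URL=postgresql://user:password@host:5432/scrapai\n' > /tmp/scrapai-env-demo
scheme=$(grep '^DATABASE_URL=' /tmp/scrapai-env-demo | cut -d= -f2- | cut -d: -f1)
echo "backend: $scheme"   # prints "backend: postgresql"
```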
### Proxy Configuration

ScrapAI supports smart proxy escalation. Configure proxies in `.env`:

```ini
# Datacenter proxy (recommended: faster, cheaper)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=your-datacenter-proxy.com
DATACENTER_PROXY_PORT=10000  # Port 10000 = rotating IPs

# Residential proxy (for sites that block datacenter IPs)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=your-residential-proxy.com
RESIDENTIAL_PROXY_PORT=7000  # Port 7000 = rotating residential IPs
```

Proxies are optional. ScrapAI starts with direct connections and escalates to a proxy only when needed (on 403/429 errors). It learns which domains require proxies and remembers them for future crawls.
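The escalation rule can be pictured as a simple status-code branch. The sketch below is illustrative only; `escalation_for` is a hypothetical helper, not part of ScrapAI:

```bash
# Only 403 (blocked) and 429 (rate-limited) responses trigger a proxy retry
escalation_for() {
  case "$1" in
    403|429) echo "retry via proxy" ;;
    *)       echo "stay direct" ;;
  esac
}

escalation_for 200   # prints "stay direct"
escalation_for 403   # prints "retry via proxy"
```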
### S3 Storage Configuration

For automatic uploads to S3-compatible storage (Hetzner, DigitalOcean Spaces, Wasabi, Backblaze, etc.), add your credentials to `.env`:

```ini
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://your-s3-endpoint.com
S3_BUCKET=your-bucket-name
```
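For reference, an upload to such a bucket corresponds to an aws CLI call with a custom `--endpoint-url`. The sketch below only echoes the command as a dry run; the placeholder values mirror the `.env` keys above, and `export.json` is a hypothetical file name:

```bash
# Placeholder values mirroring the .env keys above
S3_ENDPOINT=https://your-s3-endpoint.com
S3_BUCKET=your-bucket-name

# Dry run: print the upload command rather than executing it
echo "aws s3 cp export.json s3://$S3_BUCKET/exports/ --endpoint-url $S3_ENDPOINT"
```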
## Troubleshooting

### Virtual environment creation fails

**Error:** `The virtual environment was not created successfully`

**Solution:**

```bash
# Install the venv module
sudo apt install python3.9-venv  # Ubuntu/Debian

# Or use a different Python version
python3.10 -m venv .venv
```
### Playwright Chromium won't launch (Linux)

**Error:** `Error: browserType.launch: Host system is missing dependencies`

**Solution:** install the system dependencies:

```bash
sudo .venv/bin/python -m playwright install-deps chromium
```

This installs required system packages (fonts, libraries, etc.).
### Permission denied when writing to data directory

**Error:** `PermissionError: [Errno 13] Permission denied: './data'`

**Solution:** change the data directory in `.env`, or fix the permissions:

```bash
sudo chown -R $USER:$USER ./data
chmod -R 755 ./data
```
### Database connection fails (PostgreSQL)

**Error:** `sqlalchemy.exc.OperationalError: could not connect to server`

**Solutions:**

- Verify PostgreSQL is running:

    ```bash
    sudo systemctl status postgresql
    ```

- Check your `DATABASE_URL` in `.env`
- Test the connection directly:

    ```bash
    psql -U scrapai_user -d scrapai -h localhost
    ```

- Check the PostgreSQL logs:

    ```bash
    sudo tail -f /var/log/postgresql/postgresql-*.log
    ```
### Command not found: ./scrapai (Linux/macOS)

**Error:** `bash: ./scrapai: No such file or directory`

**Solution:** make the script executable:

```bash
chmod +x scrapai
./scrapai verify
```
### Python version too old

**Error:** `ERROR: This package requires Python 3.9 or higher`

**Solution:** install a newer Python version:

```bash
# Ubuntu/Debian
sudo apt install python3.10 python3.10-venv

# macOS
brew install python@3.10

# Then recreate the venv
rm -rf .venv
python3.10 -m venv .venv
./scrapai setup
```
## Upgrading

To upgrade to the latest version:

```bash
git pull origin main
./scrapai setup       # Re-run setup to install new dependencies
./scrapai db migrate  # Apply any new database migrations
```

Always back up your database before upgrading:

```bash
# SQLite
cp scrapai.db scrapai.db.backup

# PostgreSQL
pg_dump -U scrapai_user scrapai > scrapai_backup.sql
```
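If you upgrade often, a timestamped backup name keeps earlier backups from being overwritten. A sketch for the SQLite case (the `touch` creates a stand-in file here; your real `scrapai.db` already exists):

```bash
touch scrapai.db                                   # stand-in for an existing database
cp scrapai.db "scrapai.db.backup-$(date +%Y%m%d)"  # one backup per day, named by date
ls scrapai.db.backup-*
```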
## Uninstallation

To completely remove ScrapAI:

```bash
cd scrapai-cli

# Remove virtual environment
rm -rf .venv

# Remove database (if using SQLite)
rm scrapai.db

# Remove data directory
rm -rf data/

# Remove the repository
cd ..
rm -rf scrapai-cli/
```
## Next Steps

- **Quick Start**: build your first scraper in 5 minutes
- **CLI Reference**: complete command reference
- **Configuration**: advanced configuration options
- **GitHub Repository**: view the source code and report issues