## System Requirements

| Requirement | Details |
|---|---|
| Python | Version 3.9 or higher |
| Git | For cloning the repository |
| Disk space | ~500 MB (dependencies + browser) |
ScrapAI uses SQLite by default (no database installation required). For production scale, PostgreSQL is recommended.
Supported platforms:

- Linux (Ubuntu, Debian, CentOS, Fedora, Arch)
- macOS (Intel and Apple Silicon)
- Windows (10/11; WSL recommended, native PowerShell also supported)
## Installation Steps
### 1. Install Python 3.9+

**Ubuntu/Debian:**

```bash
sudo apt update
sudo apt install python3.9 python3.9-venv python3-pip git
```

**Fedora/CentOS:**

```bash
sudo dnf install python39 python39-pip git
```

**Arch:**

```bash
sudo pacman -S python python-pip git
```

**macOS:** install via Homebrew (or download from python.org):

```bash
brew install python@3.9 git
```

Verify the installation with `python3 --version` and `git --version`.

**Windows:** WSL (Windows Subsystem for Linux) is recommended for the best experience. Follow the Linux instructions after setting up WSL.
For native Windows:

1. Download Python 3.9+ from python.org
2. During installation, check "Add Python to PATH"
3. Install Git from git-scm.com
4. Verify in PowerShell:

```powershell
python --version
git --version
```
### 2. Clone the repository

```bash
git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
```

Clone to a location with write permissions. Avoid system directories like `/usr/local/`.
### 3. Run setup

```bash
./scrapai setup
```

The setup process will:
1. **Create virtual environment**: creates a `.venv` directory with isolated Python packages
2. **Install dependencies**: installs Scrapy, SQLAlchemy, Alembic, newspaper4k, trafilatura, Playwright, and more
3. **Install Playwright Chromium**: downloads the Chromium browser for JavaScript rendering and Cloudflare bypass
4. **Create `.env` file**: copies `.env.example` to `.env` with the default SQLite configuration
5. **Initialize database**: runs Alembic migrations to create the database schema
6. **Configure Claude Code permissions**: if using AI agents, sets up permission rules in `.claude/settings.local.json`

**Linux users:** if Chromium fails to launch later, you may need to install system dependencies:

```bash
sudo .venv/bin/python -m playwright install-deps chromium
```

This requires sudo because it installs system packages (fonts, libraries, etc.).
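For reference, the setup steps roughly correspond to the commands below. This is a hypothetical breakdown, assuming a standard `requirements.txt` and Alembic layout; the actual setup script may differ in details.

```python
import sys

def setup_commands(venv_dir=".venv"):
    """Return commands roughly equivalent to the setup steps, in order."""
    py = f"{venv_dir}/bin/python"
    return [
        [sys.executable, "-m", "venv", venv_dir],                # 1. create venv
        [py, "-m", "pip", "install", "-r", "requirements.txt"],  # 2. dependencies
        [py, "-m", "playwright", "install", "chromium"],         # 3. Chromium
        ["cp", ".env.example", ".env"],                          # 4. .env template
        [py, "-m", "alembic", "upgrade", "head"],                # 5. migrations
    ]

# To actually run them:
# import subprocess
# for cmd in setup_commands():
#     subprocess.run(cmd, check=True)
```

Running `./scrapai setup` is still the supported path; the list above only illustrates what each step does under the hood.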
You should see output like this:

```
🚀 Setting up ScrapAI environment...
📦 Creating virtual environment...
✅ Virtual environment created
📋 Installing requirements...
✅ Requirements installed
🌐 Installing Playwright Chromium browser...
✅ Playwright Chromium installed
📝 Creating .env from .env.example...
✅ .env file created (using SQLite by default)
📁 Checking data directory permissions...
✅ Have permission to write to data directory: ./data
🗄️ Initializing database...
✅ Database initialized with migrations
🔧 Configuring Claude Code permissions...
✅ Claude Code permissions configured
🎉 ScrapAI setup complete!
```
### 4. Verify installation

```bash
./scrapai verify
```

You should see:

```
🔍 Verifying ScrapAI environment...
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
```

If any checks fail, re-run `./scrapai setup`. If issues persist, see Troubleshooting below.
## Configuration

### Database Configuration

By default, ScrapAI uses SQLite (a file-based database, no setup required):

```bash
DATABASE_URL=sqlite:///scrapai.db
```
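The URL's scheme determines which database driver is used. As an illustration (not part of ScrapAI itself), the standard library can split a `DATABASE_URL` into its parts:

```python
from urllib.parse import urlsplit

def describe_database_url(url):
    """Split a SQLAlchemy-style DATABASE_URL into its parts."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,             # "sqlite" or "postgresql"
        "host": parts.hostname,              # None for file-based SQLite
        "database": parts.path.lstrip("/"),  # file name or database name
    }

print(describe_database_url("sqlite:///scrapai.db"))
# {'dialect': 'sqlite', 'host': None, 'database': 'scrapai.db'}
```

This can be handy for sanity-checking a URL before pointing ScrapAI at it.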
For production or larger scale, use PostgreSQL:

**1. Install PostgreSQL**

Ubuntu/Debian:

```bash
sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
```

macOS:

```bash
brew install postgresql
brew services start postgresql
```

Windows: download from postgresql.org.
**2. Create database and user**

In a psql session (e.g. `sudo -u postgres psql`):

```sql
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'your_secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;
\q
```
**3. Update .env**

Edit `.env` and replace the SQLite URL:

```bash
DATABASE_URL=postgresql://scrapai_user:your_secure_password@localhost:5432/scrapai
```
**4. Transfer existing data (optional)**

If you have data in SQLite:

```bash
./scrapai db transfer sqlite:///scrapai.db
```
### Proxy Configuration

ScrapAI supports smart proxy escalation. Configure proxies in `.env`:

```bash
# Datacenter proxy (recommended: faster, cheaper)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=your-datacenter-proxy.com
DATACENTER_PROXY_PORT=10000  # Port 10000 = rotating IPs

# Residential proxy (for sites that block datacenter IPs)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=your-residential-proxy.com
RESIDENTIAL_PROXY_PORT=7000  # Port 7000 = rotating residential IPs
```
Proxies are optional. ScrapAI starts with direct connections and escalates to a proxy only when needed (on 403/429 errors). It learns which domains require proxies and remembers them for future crawls.
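The escalation behavior can be sketched as follows. The tier names and the per-domain memory here are illustrative, not ScrapAI's actual internals:

```python
# Illustrative sketch of smart proxy escalation:
# direct -> datacenter -> residential, remembered per domain.
TIERS = ["direct", "datacenter", "residential"]

class ProxyEscalator:
    def __init__(self):
        # Remembered minimum working tier per domain
        self.domain_tier = {}

    def tier_for(self, domain):
        """Start each crawl at the tier this domain is known to need."""
        return TIERS[self.domain_tier.get(domain, 0)]

    def record_response(self, domain, status):
        """Escalate on block responses (403/429) and remember the result."""
        level = self.domain_tier.get(domain, 0)
        if status in (403, 429) and level < len(TIERS) - 1:
            level += 1
            self.domain_tier[domain] = level
        return TIERS[level]
```

In this sketch, a 403 bumps a domain from direct to the datacenter proxy, a further block bumps it to residential, and the learned tier persists for later crawls of the same domain.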
### S3 Storage Configuration

For automatic uploads to S3-compatible storage (Hetzner, DigitalOcean Spaces, Wasabi, Backblaze, etc.):

```bash
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://your-s3-endpoint.com
S3_BUCKET=your-bucket-name
```
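All four variables are needed for uploads to work. A small stdlib helper (illustrative, not part of ScrapAI) can report which ones are still missing from the environment:

```python
import os

REQUIRED_S3_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def missing_s3_vars(env=None):
    """Return which required S3 settings are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]
```

An empty result means the S3 configuration is at least complete, though not necessarily valid for your provider.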
## Troubleshooting

### Virtual environment creation fails

**Error:** `The virtual environment was not created successfully`

**Solution:**

```bash
# Install the venv module
sudo apt install python3.9-venv  # Ubuntu/Debian

# Or use a different Python version
python3.10 -m venv .venv
```
### Playwright Chromium won't launch (Linux)

**Error:** `Error: browserType.launch: Host system is missing dependencies`

**Solution:** install the system dependencies:

```bash
sudo .venv/bin/python -m playwright install-deps chromium
```

This installs required system packages (fonts, libraries, etc.).
### Permission denied when writing to data directory

**Error:** `PermissionError: [Errno 13] Permission denied: './data'`

**Solution:** change the data directory in `.env`, or fix permissions:

```bash
sudo chown -R $USER:$USER ./data
chmod -R 755 ./data
```
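To confirm the fix worked, one reliable check is to actually create a file in the directory rather than trust `os.access`, which can be misleading under ACLs or in containers. A small illustrative helper:

```python
import tempfile

def can_write(directory):
    """Check write access by actually creating a temporary file there."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

Run it against `./data` from the same Python that ScrapAI uses (`.venv/bin/python`).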
### Database connection fails (PostgreSQL)

**Error:** `sqlalchemy.exc.OperationalError: could not connect to server`

**Solutions:**

1. Verify PostgreSQL is running: `sudo systemctl status postgresql`
2. Check your `DATABASE_URL` in `.env`
3. Test the connection: `psql -U scrapai_user -d scrapai -h localhost`
4. Check PostgreSQL logs: `sudo tail -f /var/log/postgresql/postgresql-*.log`
### Command not found: ./scrapai (Linux/macOS)

**Error:** `bash: ./scrapai: No such file or directory`

**Solution:** make the script executable:

```bash
chmod +x scrapai
./scrapai verify
```
### Python version too old

**Error:** `ERROR: This package requires Python 3.9 or higher`

**Solution:** install a newer Python version:

```bash
# Ubuntu/Debian
sudo apt install python3.10 python3.10-venv

# macOS
brew install python@3.10

# Then recreate the venv
rm -rf .venv
python3.10 -m venv .venv
./scrapai setup
```
### Windows: Scripts are disabled

**Error:** `cannot be loaded because running scripts is disabled`

**Solution:** enable script execution in PowerShell:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
## Upgrading

To upgrade to the latest version:

```bash
git pull origin main
./scrapai setup       # Re-run setup to install new dependencies
./scrapai db migrate  # Apply any new database migrations
```

Always back up your database before upgrading:

```bash
# SQLite
cp scrapai.db scrapai.db.backup

# PostgreSQL
pg_dump -U scrapai_user scrapai > scrapai_backup.sql
```
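For SQLite, Python's built-in `sqlite3` module can also take a consistent online backup, which is safer than `cp` if a crawl might still be writing to the database:

```python
import sqlite3

def backup_sqlite(src_path, dest_path):
    """Copy a SQLite database safely, even while it is in use."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # consistent snapshot via SQLite's backup API
    finally:
        src.close()
        dest.close()

# e.g. backup_sqlite("scrapai.db", "scrapai.db.backup")
```

Unlike a plain file copy, the backup API waits out in-progress transactions instead of capturing a half-written file.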
## Uninstallation

To completely remove ScrapAI:

```bash
cd scrapai-cli

# Remove virtual environment
rm -rf .venv

# Remove database (if using SQLite)
rm scrapai.db

# Remove data directory
rm -rf data/

# Remove the repository
cd ..
rm -rf scrapai-cli/
```
## Next Steps