
System Requirements

  • Python: version 3.9 or higher
  • Git: for cloning the repository
  • Disk space: ~500 MB (dependencies + browser)
ScrapAI uses SQLite by default (no database installation required). For production scale, PostgreSQL is recommended.

Supported Platforms

  • Linux (Ubuntu, Debian, CentOS, Fedora, Arch)
  • macOS (Intel and Apple Silicon)
  • Windows (10/11 with WSL recommended, native PowerShell also supported)

Installation Steps

1. Install Python 3.9+

# Ubuntu/Debian
sudo apt update
sudo apt install python3.9 python3.9-venv python3-pip git

# Fedora/CentOS
sudo dnf install python39 python39-pip git

# Arch
sudo pacman -S python python-pip git

Verify installation:
python3 --version
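If you want to check the version requirement programmatically rather than reading the output, a convenience one-liner (not part of ScrapAI) is:

```shell
# Exit status reflects whether the default python3 is at least 3.9
python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 9) else 1)' \
  && echo "Python version OK" || echo "Python 3.9+ required"
```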
2. Clone the repository

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
Clone to a location with write permissions. Avoid system directories like /usr/local/.
3. Run setup

./scrapai setup
The setup process will:
1. Create virtual environment: creates a .venv directory with isolated Python packages
2. Install dependencies: installs Scrapy, SQLAlchemy, Alembic, newspaper4k, trafilatura, Playwright, and more
3. Install Playwright Chromium: downloads the Chromium browser for JavaScript rendering and Cloudflare bypass

Linux users: if Chromium fails to launch later, you may need to install system dependencies:
sudo .venv/bin/python -m playwright install-deps chromium
This requires sudo because it installs system packages (fonts, libraries, etc.).

4. Create .env file: copies .env.example to .env with the default SQLite configuration
5. Initialize database: runs Alembic migrations to create the database schema
6. Configure Claude Code permissions: if using AI agents, sets up permission rules in .claude/settings.local.json
🚀 Setting up ScrapAI environment...
📦 Creating virtual environment...
✅ Virtual environment created
📋 Installing requirements...
✅ Requirements installed
🌐 Installing Playwright Chromium browser...
✅ Playwright Chromium installed
📝 Creating .env from .env.example...
✅ .env file created (using SQLite by default)
📁 Checking data directory permissions...
✅ Have permission to write to data directory: ./data
🗄️  Initializing database...
✅ Database initialized with migrations
🔧 Configuring Claude Code permissions...
✅ Claude Code permissions configured
🎉 ScrapAI setup complete!
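For reference, the steps above correspond roughly to the following manual commands. This is a sketch only; `./scrapai setup` remains the supported path, and the commented lines assume the repository's requirements.txt and .env.example, so only the venv step is shown live (against a demo path):

```shell
# 1. Create the virtual environment (demo path; setup uses ./.venv)
python3 -m venv /tmp/scrapai-demo-venv
test -x /tmp/scrapai-demo-venv/bin/python && echo "venv ready"

# 2-6. The remaining steps, run inside the repository:
# .venv/bin/pip install -r requirements.txt           # dependencies
# .venv/bin/python -m playwright install chromium     # Chromium browser
# cp .env.example .env                                # default config
# .venv/bin/alembic upgrade head                      # database migrations
```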
4. Verify installation

./scrapai verify
You should see:
🔍 Verifying ScrapAI environment...

✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized

🎉 Environment is ready!
If any checks fail, re-run ./scrapai setup. If issues persist, see Troubleshooting below.

Configuration

Database Configuration

By default, ScrapAI uses SQLite (file-based database, no setup required):
DATABASE_URL=sqlite:///scrapai.db
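One detail worth knowing about this URL form: in SQLAlchemy's SQLite URLs, the path after `sqlite:///` is relative to the working directory, and an absolute path needs a fourth slash. The absolute path below is only an illustration:

```shell
# Three slashes = relative path, four slashes = absolute path
echo "sqlite:///scrapai.db"
echo "sqlite:////srv/scrapai/scrapai.db"
```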
For production or larger scale, use PostgreSQL:
1. Install PostgreSQL

sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
2. Create database and user

sudo -u postgres psql
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'your_secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;
\q
3. Update .env

Edit .env and replace the SQLite URL:
DATABASE_URL=postgresql://scrapai_user:your_secure_password@localhost:5432/scrapai
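If your password contains URL-special characters (@, :, /, #), percent-encode it before placing it in DATABASE_URL, or the URL will be mis-parsed. One way to do this with Python's standard library (the password `p@ss:word` here is a made-up example):

```shell
# Percent-encode a password for safe use inside DATABASE_URL
python3 -c 'import urllib.parse; print(urllib.parse.quote("p@ss:word", safe=""))'
# prints p%40ss%3Aword
```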
4. Run migrations

./scrapai db migrate
5. Transfer existing data (optional)

If you have data in SQLite:
./scrapai db transfer sqlite:///scrapai.db

Proxy Configuration

ScrapAI supports smart proxy escalation. Configure proxies in .env:
# Datacenter Proxy (recommended - faster, cheaper)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=your-datacenter-proxy.com
DATACENTER_PROXY_PORT=10000  # Port 10000 = rotating IPs

# Residential Proxy (for sites that block datacenter IPs)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=your-residential-proxy.com
RESIDENTIAL_PROXY_PORT=7000  # Port 7000 = rotating residential IPs
Proxies are optional. ScrapAI starts with direct connections and escalates to proxies only when a site blocks it (403/429 errors). It learns which domains require proxies and remembers them for future crawls.
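To sanity-check proxy credentials outside ScrapAI, you can assemble the proxy URL from the same values and send a request through it. The curl line is commented out because it needs real credentials; httpbin.org/ip simply echoes the IP your request arrived from:

```shell
# Placeholder values matching the .env keys above
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=your-datacenter-proxy.com
DATACENTER_PROXY_PORT=10000

PROXY_URL="http://${DATACENTER_PROXY_USERNAME}:${DATACENTER_PROXY_PASSWORD}@${DATACENTER_PROXY_HOST}:${DATACENTER_PROXY_PORT}"
echo "$PROXY_URL"
# curl -x "$PROXY_URL" https://httpbin.org/ip   # should report a proxy IP, not yours
```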

S3 Storage Configuration

For automatic uploads to S3-compatible storage (Hetzner, DigitalOcean Spaces, Wasabi, Backblaze, etc.):
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://your-s3-endpoint.com
S3_BUCKET=your-bucket-name
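A malformed S3_ENDPOINT is easy to catch before ScrapAI tries to use it. The check below uses only Python's standard library; the endpoint value is the placeholder from above, and the commented listing command assumes the AWS CLI is installed:

```shell
S3_ENDPOINT=https://your-s3-endpoint.com
python3 - "$S3_ENDPOINT" <<'EOF'
import sys
from urllib.parse import urlparse

u = urlparse(sys.argv[1])
# A usable endpoint needs an https scheme and a host component
print("endpoint ok" if u.scheme == "https" and u.netloc else "endpoint malformed")
EOF
# aws s3 ls "s3://your-bucket-name" --endpoint-url "$S3_ENDPOINT"
```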

Troubleshooting

Error: The virtual environment was not created successfully
Solution:
# Install venv module
sudo apt install python3.9-venv  # Ubuntu/Debian

# Or use a different Python version
python3.10 -m venv .venv
Error: browserType.launch: Host system is missing dependencies
Solution: Install system dependencies:
sudo .venv/bin/python -m playwright install-deps chromium
This installs required system packages (fonts, libraries, etc.).
Error: PermissionError: [Errno 13] Permission denied: './data'
Solution: Change the data directory in .env:
DATA_DIR=~/scrapai-data
Or fix permissions:
sudo chown -R $USER:$USER ./data
chmod -R 755 ./data
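After either fix, you can confirm the directory is actually writable before re-running setup (a convenience check; the path below is a demo stand-in for your data directory):

```shell
# Create the directory if needed, then test write permission on it
mkdir -p /tmp/scrapai-data-demo
[ -w /tmp/scrapai-data-demo ] && echo "data directory writable"
```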
Error: sqlalchemy.exc.OperationalError: could not connect to server
Solutions:
  1. Verify PostgreSQL is running:
    sudo systemctl status postgresql
    
  2. Check your DATABASE_URL in .env
  3. Test connection:
    psql -U scrapai_user -d scrapai -h localhost
    
  4. Check PostgreSQL logs:
    sudo tail -f /var/log/postgresql/postgresql-*.log
    
Error: bash: ./scrapai: No such file or directory
Solution: Make the script executable:
chmod +x scrapai
./scrapai verify
Error: This package requires Python 3.9 or higher
Solution: Install a newer Python version:
# Ubuntu/Debian
sudo apt install python3.10 python3.10-venv

# macOS
brew install python@3.10

# Then recreate the venv
rm -rf .venv
python3.10 -m venv .venv
./scrapai setup
Error: cannot be loaded because running scripts is disabled
Solution: Enable script execution in PowerShell:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Upgrading

To upgrade to the latest version:
git pull origin main
./scrapai setup  # Re-run setup to install new dependencies
./scrapai db migrate  # Apply any new database migrations
Always backup your database before upgrading:
# SQLite
cp scrapai.db scrapai.db.backup

# PostgreSQL
pg_dump -U scrapai_user scrapai > scrapai_backup.sql
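Note that cp is only safe while nothing is writing to the SQLite file. To back up while ScrapAI may be running, SQLite's online backup API copies the database consistently even mid-write. A sketch using Python's standard library (demo paths, not your real database):

```shell
# Use a fresh temp path so the demo is repeatable
DB=$(mktemp -u /tmp/scrapai-demo-XXXXXX.db)
python3 - "$DB" <<'EOF'
import sqlite3, sys

db = sys.argv[1]
src = sqlite3.connect(db)
src.execute("CREATE TABLE pages (url TEXT)")
src.execute("INSERT INTO pages VALUES ('https://example.com')")
src.commit()

# Copy the live database page-by-page into the backup file
with sqlite3.connect(db + ".backup") as dst:
    src.backup(dst)

# The backup contains the committed row
print(sqlite3.connect(db + ".backup").execute("SELECT count(*) FROM pages").fetchone()[0])
EOF
```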

Uninstallation

To completely remove ScrapAI:
cd scrapai-cli

# Remove virtual environment
rm -rf .venv

# Remove database (if using SQLite)
rm scrapai.db

# Remove data directory
rm -rf data/

# Remove the repository
cd ..
rm -rf scrapai-cli/

Next Steps