## System Requirements

| Requirement | Details |
|---|---|
| Python | Version 3.9 or higher |
| Git | For cloning the repository |
| Disk space | ~500 MB (dependencies + browser) |
ScrapAI uses SQLite by default (no database installation required). For production scale, PostgreSQL is recommended.
Supported platforms:

- Linux (Ubuntu, Debian, CentOS, Fedora, Arch)
- macOS (Intel and Apple Silicon)
- Windows (10/11; WSL recommended, native PowerShell also supported)
## Installation Steps
### 1. Install Python 3.9+

**Ubuntu/Debian:**

```bash
sudo apt update
sudo apt install python3.9 python3.9-venv python3-pip git
```

**Fedora/CentOS:**

```bash
sudo dnf install python39 python39-pip git
```

**Arch:**

```bash
sudo pacman -S python python-pip git
```

**macOS:** install via Homebrew (or download from python.org):

```bash
brew install python@3.9 git
```

Verify the installation with `python3 --version` and `git --version`.

**Windows:** WSL (Windows Subsystem for Linux) is recommended for the best experience. Follow the Linux instructions after setting up WSL.
For native Windows:

1. Download Python 3.9+ from python.org
2. During installation, check "Add Python to PATH"
3. Install Git from git-scm.com
4. Verify in PowerShell:

```powershell
python --version
git --version
```
### 2. Clone the repository

```bash
git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
```

Clone to a location with write permissions. Avoid system directories like `/usr/local/`.
### 3. Run setup

```bash
./scrapai setup
```

The setup process will:
1. **Create virtual environment**: creates a `.venv` directory with isolated Python packages
2. **Install dependencies**: installs Scrapy, SQLAlchemy, Alembic, newspaper4k, trafilatura, Playwright, and more
3. **Install Playwright Chromium**: downloads the Chromium browser for JavaScript rendering and Cloudflare bypass
4. **Create `.env` file**: copies `.env.example` to `.env` with the default SQLite configuration
5. **Initialize database**: runs Alembic migrations to create the database schema
6. **Configure Claude Code permissions**: if using AI agents, sets up permission rules in `.claude/settings.local.json`

**Linux users:** if Chromium fails to launch later, you may need to install system dependencies:

```bash
sudo .venv/bin/python -m playwright install-deps chromium
```

This requires sudo because it installs system packages (fonts, libraries, etc.).
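For reference, the setup steps roughly correspond to the commands below. This is a hypothetical breakdown, assuming a standard `requirements.txt` and Alembic layout; the actual setup script may differ in details.

```python
import sys

def setup_commands(venv_dir=".venv"):
    """Return commands roughly equivalent to the setup steps, in order."""
    py = f"{venv_dir}/bin/python"
    return [
        [sys.executable, "-m", "venv", venv_dir],                # 1. create venv
        [py, "-m", "pip", "install", "-r", "requirements.txt"],  # 2. dependencies
        [py, "-m", "playwright", "install", "chromium"],         # 3. Chromium
        ["cp", ".env.example", ".env"],                          # 4. .env template
        [py, "-m", "alembic", "upgrade", "head"],                # 5. migrations
    ]

# To actually run them:
# import subprocess
# for cmd in setup_commands():
#     subprocess.run(cmd, check=True)
```

Running `./scrapai setup` is still the supported path; the list above only illustrates what each step does under the hood.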
You should see output like this:

```
🚀 Setting up ScrapAI environment...
📦 Creating virtual environment...
✅ Virtual environment created
📋 Installing requirements...
✅ Requirements installed
🌐 Installing Playwright Chromium browser...
✅ Playwright Chromium installed
📝 Creating .env from .env.example...
✅ .env file created (using SQLite by default)
📁 Checking data directory permissions...
✅ Have permission to write to data directory: ./data
🗄️ Initializing database...
✅ Database initialized with migrations
🔧 Configuring Claude Code permissions...
✅ Claude Code permissions configured
🎉 ScrapAI setup complete!
```
### 4. Verify installation

```bash
./scrapai verify
```

You should see:

```
🔍 Verifying ScrapAI environment...
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
```

If any checks fail, re-run `./scrapai setup`. If issues persist, see Troubleshooting below.
## Configuration

### Database Configuration

By default, ScrapAI uses SQLite (a file-based database, no setup required):

```bash
DATABASE_URL=sqlite:///scrapai.db
```
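The URL's scheme determines which database driver is used. As an illustration (not part of ScrapAI itself), the standard library can split a `DATABASE_URL` into its parts:

```python
from urllib.parse import urlsplit

def describe_database_url(url):
    """Split a SQLAlchemy-style DATABASE_URL into its parts."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,             # "sqlite" or "postgresql"
        "host": parts.hostname,              # None for file-based SQLite
        "database": parts.path.lstrip("/"),  # file name or database name
    }

print(describe_database_url("sqlite:///scrapai.db"))
# {'dialect': 'sqlite', 'host': None, 'database': 'scrapai.db'}
```

This can be handy for sanity-checking a URL before pointing ScrapAI at it.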
For production or larger scale, use PostgreSQL:

**1. Install PostgreSQL**

Ubuntu/Debian:

```bash
sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
```

macOS:

```bash
brew install postgresql
brew services start postgresql
```

Windows: download from postgresql.org.
**2. Create database and user**

In a psql session (e.g. `sudo -u postgres psql`):

```sql
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'your_secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;
\q
```
**3. Update .env**

Edit `.env` and replace the SQLite URL:

```bash
DATABASE_URL=postgresql://scrapai_user:your_secure_password@localhost:5432/scrapai
```
**4. Transfer existing data (optional)**

If you have data in SQLite:

```bash
./scrapai db transfer sqlite:///scrapai.db
```
### Proxy Configuration

ScrapAI supports smart proxy escalation. Configure proxies in `.env`:

```bash
# Datacenter proxy (recommended: faster, cheaper)
DATACENTER_PROXY_USERNAME=your_username
DATACENTER_PROXY_PASSWORD=your_password
DATACENTER_PROXY_HOST=your-datacenter-proxy.com
DATACENTER_PROXY_PORT=10000  # Port 10000 = rotating IPs

# Residential proxy (for sites that block datacenter IPs)
RESIDENTIAL_PROXY_USERNAME=your_username
RESIDENTIAL_PROXY_PASSWORD=your_password
RESIDENTIAL_PROXY_HOST=your-residential-proxy.com
RESIDENTIAL_PROXY_PORT=7000  # Port 7000 = rotating residential IPs
```
Proxies are optional. ScrapAI starts with direct connections and escalates to a proxy only when needed (on 403/429 errors). It learns which domains require proxies and remembers them for future crawls.
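The escalation behavior can be sketched as follows. The tier names and the per-domain memory here are illustrative, not ScrapAI's actual internals:

```python
# Illustrative sketch of smart proxy escalation:
# direct -> datacenter -> residential, remembered per domain.
TIERS = ["direct", "datacenter", "residential"]

class ProxyEscalator:
    def __init__(self):
        # Remembered minimum working tier per domain
        self.domain_tier = {}

    def tier_for(self, domain):
        """Start each crawl at the tier this domain is known to need."""
        return TIERS[self.domain_tier.get(domain, 0)]

    def record_response(self, domain, status):
        """Escalate on block responses (403/429) and remember the result."""
        level = self.domain_tier.get(domain, 0)
        if status in (403, 429) and level < len(TIERS) - 1:
            level += 1
            self.domain_tier[domain] = level
        return TIERS[level]
```

In this sketch, a 403 bumps a domain from direct to the datacenter proxy, a further block bumps it to residential, and the learned tier persists for later crawls of the same domain.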
### S3 Storage Configuration

For automatic uploads to S3-compatible storage (Hetzner, DigitalOcean Spaces, Wasabi, Backblaze, etc.):

```bash
S3_ACCESS_KEY=your_access_key_here
S3_SECRET_KEY=your_secret_key_here
S3_ENDPOINT=https://your-s3-endpoint.com
S3_BUCKET=your-bucket-name
```
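All four variables are needed for uploads to work. A small stdlib helper (illustrative, not part of ScrapAI) can report which ones are still missing from the environment:

```python
import os

REQUIRED_S3_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def missing_s3_vars(env=None):
    """Return which required S3 settings are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]
```

An empty result means the S3 configuration is at least complete, though not necessarily valid for your provider.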
## Troubleshooting

### Virtual environment creation fails

**Error:** `The virtual environment was not created successfully`

**Solution:**

```bash
# Install the venv module
sudo apt install python3.9-venv  # Ubuntu/Debian

# Or use a different Python version
python3.10 -m venv .venv
```
### Playwright Chromium won't launch (Linux)

**Error:** `Error: browserType.launch: Host system is missing dependencies`

**Solution:** install the system dependencies:

```bash
sudo .venv/bin/python -m playwright install-deps chromium
```

This installs required system packages (fonts, libraries, etc.).
### Permission denied when writing to data directory

**Error:** `PermissionError: [Errno 13] Permission denied: './data'`

**Solution:** change the data directory in `.env`, or fix permissions:

```bash
sudo chown -R $USER:$USER ./data
chmod -R 755 ./data
```
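To confirm the fix worked, one reliable check is to actually create a file in the directory rather than trust `os.access`, which can be misleading under ACLs or in containers. A small illustrative helper:

```python
import tempfile

def can_write(directory):
    """Check write access by actually creating a temporary file there."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

Run it against `./data` from the same Python that ScrapAI uses (`.venv/bin/python`).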
### Database connection fails (PostgreSQL)

**Error:** `sqlalchemy.exc.OperationalError: could not connect to server`

**Solutions:**

1. Verify PostgreSQL is running: `sudo systemctl status postgresql`
2. Check your `DATABASE_URL` in `.env`
3. Test the connection: `psql -U scrapai_user -d scrapai -h localhost`
4. Check PostgreSQL logs: `sudo tail -f /var/log/postgresql/postgresql-*.log`
### Command not found: ./scrapai (Linux/macOS)

**Error:** `bash: ./scrapai: No such file or directory`

**Solution:** make the script executable:

```bash
chmod +x scrapai
./scrapai verify
```
### Python version too old

**Error:** `ERROR: This package requires Python 3.9 or higher`

**Solution:** install a newer Python version:

```bash
# Ubuntu/Debian
sudo apt install python3.10 python3.10-venv

# macOS
brew install python@3.10

# Then recreate the venv
rm -rf .venv
python3.10 -m venv .venv
./scrapai setup
```
### Windows: Scripts are disabled

**Error:** `cannot be loaded because running scripts is disabled`

**Solution:** enable script execution in PowerShell:

```powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
## Upgrading

To upgrade to the latest version:

```bash
git pull origin main
./scrapai setup       # Re-run setup to install new dependencies
./scrapai db migrate  # Apply any new database migrations
```

Always back up your database before upgrading:

```bash
# SQLite
cp scrapai.db scrapai.db.backup

# PostgreSQL
pg_dump -U scrapai_user scrapai > scrapai_backup.sql
```
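For SQLite, Python's built-in `sqlite3` module can also take a consistent online backup, which is safer than `cp` if a crawl might still be writing to the database:

```python
import sqlite3

def backup_sqlite(src_path, dest_path):
    """Copy a SQLite database safely, even while it is in use."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # consistent snapshot via SQLite's backup API
    finally:
        src.close()
        dest.close()

# e.g. backup_sqlite("scrapai.db", "scrapai.db.backup")
```

Unlike a plain file copy, the backup API waits out in-progress transactions instead of capturing a half-written file.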
## Uninstallation

To completely remove ScrapAI:

```bash
cd scrapai-cli

# Remove virtual environment
rm -rf .venv

# Remove database (if using SQLite)
rm scrapai.db

# Remove data directory
rm -rf data/

# Remove the repository
cd ..
rm -rf scrapai-cli/
```
## Next Steps