Overview

ScrapAI stores crawl metadata, spider configurations, and queue data in a relational database.

Database Options

SQLite (Default)

Best for: development, small to medium projects (< 1M items), single-user setups.
Pros: zero configuration, file-based, built into Python.
Cons: limited concurrency (single writer), not suitable for distributed crawling.

PostgreSQL

Best for: production, large projects (1M+ items), multi-user environments, distributed crawling.
Pros: excellent concurrency, scales to billions of rows, advanced indexing.
Requires: PostgreSQL 12+ installed and running.

SQLite Configuration

Default Setup

SQLite is configured by default. Run ./scrapai setup to create scrapai.db in the project root.

Custom Path

Update .env to use a different location:
DATABASE_URL=sqlite:///scrapai.db                    # Relative
DATABASE_URL=sqlite:////absolute/path/to/scrapai.db  # Absolute
Use three slashes /// for relative paths and four slashes //// for absolute paths.
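The slash convention is easy to check by stripping the sqlite:/// prefix. A minimal standard-library sketch (SQLAlchemy applies the same rule when it parses the URL):

```python
def sqlite_url_to_path(url: str) -> str:
    """Return the filesystem path encoded in a sqlite:// URL.

    sqlite:///file.db      -> file.db        (relative)
    sqlite:////tmp/file.db -> /tmp/file.db   (absolute)
    """
    prefix = "sqlite:///"
    if not url.startswith(prefix):
        raise ValueError(f"not a sqlite URL: {url!r}")
    # Everything after the third slash is the path, so a fourth
    # slash leaves a leading "/" and the path becomes absolute.
    return url[len(prefix):]

print(sqlite_url_to_path("sqlite:///scrapai.db"))          # scrapai.db
print(sqlite_url_to_path("sqlite:////var/db/scrapai.db"))  # /var/db/scrapai.db
```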

Optimization

ScrapAI automatically applies optimized SQLite settings (WAL mode, 64MB cache) in core/db.py.
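The settings applied in core/db.py are along these lines. A sketch using the standard sqlite3 module; the exact PRAGMA values ScrapAI sets may differ:

```python
import sqlite3

def open_optimized(path: str) -> sqlite3.Connection:
    """Open a SQLite database with WAL journaling and a larger cache.

    Illustrative: mirrors the kind of settings applied in core/db.py,
    not necessarily the exact ones.
    """
    con = sqlite3.connect(path)
    # WAL mode allows concurrent readers while one writer is active.
    con.execute("PRAGMA journal_mode=WAL")
    # A negative cache_size is in KiB: -64000 is roughly a 64MB page cache.
    con.execute("PRAGMA cache_size=-64000")
    return con
```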

PostgreSQL Configuration

Installation

# Install PostgreSQL
brew install postgresql@15

# Start service
brew services start postgresql@15

Create Database

createdb scrapai
Or, from psql, create the database with a dedicated user:
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;

Configure Connection

Update .env with your connection string:
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
Format: postgresql://[user]:[password]@[host]:[port]/[database]
Append ?sslmode=require to the URL for SSL connections.
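The parts of the connection string can be sanity-checked with a standard URL parser (a sketch; ScrapAI itself hands the URL to SQLAlchemy):

```python
from urllib.parse import urlsplit

# Example URL using the user created above.
url = urlsplit("postgresql://scrapai_user:secure_password@localhost:5432/scrapai")
print(url.username)  # scrapai_user
print(url.password)  # secure_password
print(url.hostname)  # localhost
print(url.port)      # 5432
print(url.path)      # /scrapai  (leading slash + database name)
```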

Run Migrations

Initialize the database schema:
./scrapai db migrate

Migrating from SQLite to PostgreSQL

Backup your data first: cp scrapai.db scrapai.db.backup

Steps

  1. Install and configure PostgreSQL
  2. Update .env with PostgreSQL URL
  3. Run migrations: ./scrapai db migrate
  4. Transfer data: ./scrapai db transfer sqlite:///scrapai.db
  5. Verify: ./scrapai verify
For large databases, use --skip-items to transfer only configs and metadata.
The transfer command reads from the source URL (argument) and writes to the current DATABASE_URL in .env.
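Conceptually, the transfer copies each table row by row from the source connection into the target. The sketch below illustrates the idea with two SQLite connections and the standard library only (the real command targets PostgreSQL and handles the full schema):

```python
import sqlite3

def copy_table(src: sqlite3.Connection, dst: sqlite3.Connection, table: str) -> int:
    """Copy every row of `table` from src to dst; returns the row count.

    Illustrative only: assumes the table already exists in the target,
    i.e. migrations have been run there first (step 3 above).
    """
    rows = src.execute(f"SELECT * FROM {table}").fetchall()
    if rows:
        placeholders = ",".join("?" * len(rows[0]))
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        dst.commit()
    return len(rows)
```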

Database Maintenance

Backup

cp scrapai.db scrapai.db.backup
cp scrapai.db scrapai.db.$(date +%Y%m%d_%H%M%S)
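Note that plain cp can capture an inconsistent snapshot if the database is being written, since WAL mode keeps recent changes in scrapai.db-wal. SQLite's backup API copies a consistent snapshot even while the database is in use:

```python
import sqlite3

def backup_db(src_path: str, dst_path: str) -> None:
    """Take a consistent snapshot of a (possibly live) SQLite database."""
    with sqlite3.connect(src_path) as src, sqlite3.connect(dst_path) as dst:
        # Copies every page, including un-checkpointed WAL data.
        src.backup(dst)

backup_db("scrapai.db", "scrapai.db.backup")
```

The sqlite3 CLI offers the same thing as a dot-command: sqlite3 scrapai.db ".backup scrapai.db.backup".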

Restore

cp scrapai.db.backup scrapai.db

Optimize

sqlite3 scrapai.db "VACUUM; ANALYZE;"
VACUUM rebuilds the database file to reclaim free space; ANALYZE refreshes the statistics used by the query planner.

Database Schema

ScrapAI uses SQLAlchemy ORM with Alembic migrations; the schema is defined in core/models.py.
Key tables: spiders, projects, crawls, items, queue, analysis.

View Schema

sqlite3 scrapai.db ".schema"
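The same information is available programmatically from SQLite's sqlite_master catalog, which can be handy in scripts:

```python
import sqlite3

con = sqlite3.connect("scrapai.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # e.g. ['analysis', 'crawls', 'items', ...]
```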

Performance Tuning

SQLite

ScrapAI automatically applies optimal settings. For extreme performance, increase cache size in core/db.py.

PostgreSQL

Edit postgresql.conf to tune memory settings:
shared_buffers = 256MB          # 25% of RAM
effective_cache_size = 1GB      # 50-75% of RAM
work_mem = 16MB
random_page_cost = 1.1          # For SSD
Restart PostgreSQL afterwards: brew services restart postgresql@15 (Homebrew) or sudo systemctl restart postgresql (Linux).

Troubleshooting

Connection Failed

SQLite: Check that the file exists and is readable; recreate with rm scrapai.db* && ./scrapai setup.
PostgreSQL: Test the connection with psql -U user -d scrapai; check the service status and verify DATABASE_URL.

Database Locked (SQLite)

lsof scrapai.db                    # Find processes holding the file
pkill -f scrapai                   # Kill stuck processes
rm scrapai.db-wal scrapai.db-shm   # Clear leftover WAL files (only after all processes have exited)

Migration Failed

Check version with ./scrapai db version. Reset with rm scrapai.db && ./scrapai setup or run alembic upgrade head.

Transfer Failed

Verify source is readable and target is accessible. Try with --skip-items.

Security

PostgreSQL Security

  1. Use strong passwords
  2. Restrict network access in pg_hba.conf
  3. Enable SSL: Add ?sslmode=require to DATABASE_URL
  4. Set up automated backups