Overview

ScrapAI CLI stores all crawl metadata, spider configurations, and queue information in a relational database. You can use SQLite for development or PostgreSQL for production.

Database Options

SQLite (Default)

Recommended for:
  • Getting started
  • Development
  • Small to medium projects (< 1M items)
  • Single-user usage
Advantages:
  • No setup required
  • Zero configuration
  • Fast for small datasets
  • File-based (easy backup)
  • Built into Python
Limitations:
  • Limited concurrency
  • Single writer at a time
  • Performance degrades with large datasets
  • Not suitable for distributed systems

PostgreSQL

Recommended for:
  • Production deployments
  • Large projects (1M+ items)
  • Multi-user environments
  • Distributed crawling
  • High concurrency needs
Advantages:
  • Excellent concurrency
  • Scales to billions of rows
  • Advanced indexing
  • ACID compliance
  • Industry standard
Requirements:
  • PostgreSQL 12+ installed
  • Database server running
  • User credentials

SQLite Configuration

Default Setup

SQLite is configured by default. No action required:
./scrapai setup
This creates scrapai.db in the project root.

Custom SQLite Path

To use a different database file, update .env:
# Relative path
DATABASE_URL=sqlite:///scrapai.db

# Absolute path
DATABASE_URL=sqlite:////absolute/path/to/scrapai.db

# Custom location
DATABASE_URL=sqlite:///./data/scrapai.db
Use three slashes /// for relative paths and four slashes //// for absolute paths.
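The slash-count rule can be confusing, so here is a small illustration. `sqlite_path` is a hypothetical helper written for this sketch (it is not part of ScrapAI); it just strips the `sqlite:///` prefix to show which file path each URL resolves to:

```python
def sqlite_path(url: str) -> str:
    """Return the file path portion of a sqlite:// URL (sketch only)."""
    prefix = "sqlite:///"
    if not url.startswith(prefix):
        raise ValueError(f"not a sqlite URL: {url}")
    # Three slashes leave a relative path behind; a fourth slash is
    # simply the leading "/" of an absolute path.
    return url[len(prefix):]

print(sqlite_path("sqlite:///scrapai.db"))                    # relative path
print(sqlite_path("sqlite:////absolute/path/to/scrapai.db"))  # absolute path
```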

SQLite Optimization

ScrapAI automatically configures SQLite with optimized settings:
PRAGMA journal_mode=WAL      -- Write-Ahead Logging (better concurrency)
PRAGMA synchronous=NORMAL    -- Balance safety and performance
PRAGMA cache_size=-64000     -- 64MB cache (negative = size in KiB)
These settings are applied automatically in core/db.py:17.
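If you want to experiment with these settings outside of ScrapAI, the same PRAGMAs can be applied with Python's standard-library sqlite3 module. This is a standalone sketch of the setup described above, not the actual code in core/db.py:

```python
import os
import sqlite3
import tempfile

# Standalone sketch of the PRAGMA setup described above (WAL mode
# requires a file-backed database, so we create a temporary one).
db_file = os.path.join(tempfile.mkdtemp(), "scrapai.db")
conn = sqlite3.connect(db_file)

conn.execute("PRAGMA journal_mode=WAL")     # better read concurrency
conn.execute("PRAGMA synchronous=NORMAL")   # safe with WAL, faster than FULL
conn.execute("PRAGMA cache_size=-64000")    # negative value = size in KiB (64MB)

mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # wal
conn.close()
```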

PostgreSQL Configuration

Installation

# Install PostgreSQL (macOS / Homebrew; use your distribution's
# package manager on Linux, e.g. apt install postgresql)
brew install postgresql@15

# Start service
brew services start postgresql@15

Create Database

# Create database
createdb scrapai

# Or with custom user
psql -U postgres
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;

Configure Connection

Update .env with PostgreSQL connection string:
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
Connection String Format:
postgresql://[user]:[password]@[host]:[port]/[database]
Examples:
# Local PostgreSQL
DATABASE_URL=postgresql://scrapai_user:secure_password@localhost:5432/scrapai

# Remote PostgreSQL
DATABASE_URL=postgresql://user:pass@db.example.com:5432/scrapai

# PostgreSQL with SSL
DATABASE_URL=postgresql://user:pass@db.example.com:5432/scrapai?sslmode=require

# Cloud providers (RDS, Cloud SQL, etc.)
DATABASE_URL=postgresql://admin:pass@rds-instance.region.rds.amazonaws.com:5432/scrapai
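Passwords containing characters like `@`, `:`, or `/` will break a hand-written connection string. One way to avoid this (a sketch using only the Python standard library; the credential values below are examples) is to percent-encode each component before assembling the URL:

```python
from urllib.parse import quote_plus

# Example credentials only -- a password with URL-reserved characters.
user = "scrapai_user"
password = "p@ss:word/1"
host, port, db = "localhost", 5432, "scrapai"

# Percent-encode user and password so reserved characters are escaped.
url = f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"
print(url)
```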

Run Migrations

After configuring PostgreSQL, initialize the database schema:
./scrapai db migrate
This creates all required tables and indexes.

Migrating from SQLite to PostgreSQL

Backup your data before migrating! Create a backup of your SQLite database:
cp scrapai.db scrapai.db.backup

Migration Steps

  1. Install and configure PostgreSQL (see above)
  2. Update .env with PostgreSQL URL:
    # Before
    DATABASE_URL=sqlite:///scrapai.db
    
    # After
    DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
    
  3. Run migrations to create schema:
    ./scrapai db migrate
    
  4. Transfer data from SQLite:
    ./scrapai db transfer sqlite:///scrapai.db
    
  5. Verify transfer:
    ./scrapai verify
    
Large databases: Skip scraped items to speed up transfer:
./scrapai db transfer sqlite:///scrapai.db --skip-items
This transfers spider configs and metadata but not crawled data.

Transfer Process

The db transfer command:
  1. Reads from source database (first argument)
  2. Writes to target database (current DATABASE_URL in .env)
  3. Transfers all tables and data
  4. Preserves relationships and indexes
Important: Update DATABASE_URL in .env BEFORE running transfer. The command reads from the source URL you provide and writes to whatever DATABASE_URL is currently set.
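The read-source / write-target pattern behind db transfer can be sketched as follows. Both sides use SQLite here so the example runs anywhere without a PostgreSQL server; the real command writes to whatever DATABASE_URL points at:

```python
import sqlite3

# Source and target databases (in-memory stand-ins for this sketch).
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

# Seed the source with a tiny spiders table.
src.execute("CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO spiders (name) VALUES (?)",
                [("news",), ("blogs",)])
src.commit()

# 1. Create the schema on the target, 2. read rows from the source,
# 3. write them to the target, preserving primary keys.
dst.execute("CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT)")
rows = src.execute("SELECT id, name FROM spiders").fetchall()
dst.executemany("INSERT INTO spiders (id, name) VALUES (?, ?)", rows)
dst.commit()

count = dst.execute("SELECT COUNT(*) FROM spiders").fetchone()[0]
print(count)  # 2
```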

Database Maintenance

Backup

# Simple file copy
cp scrapai.db scrapai.db.backup

# With timestamp
cp scrapai.db scrapai.db.$(date +%Y%m%d_%H%M%S)

# SQLite backup command
sqlite3 scrapai.db ".backup scrapai.db.backup"

Restore

# Stop ScrapAI
# Replace database file
cp scrapai.db.backup scrapai.db

# Or restore from backup
sqlite3 scrapai.db ".restore scrapai.db.backup"

Vacuum and Optimize

# Optimize database
sqlite3 scrapai.db "VACUUM;"

# Analyze for query optimization
sqlite3 scrapai.db "ANALYZE;"

Database Schema

ScrapAI uses SQLAlchemy ORM with Alembic migrations. Schema is defined in:
  • core/models.py - Table definitions
  • alembic/versions/ - Migration scripts

Key Tables

  • spiders - Spider configurations and rules
  • projects - Project metadata
  • crawls - Crawl execution history
  • items - Scraped articles/items
  • queue - Crawl queue for batch processing
  • analysis - Analysis results and patterns

Viewing Schema

# List tables
sqlite3 scrapai.db ".tables"

# Show schema
sqlite3 scrapai.db ".schema"

# Describe table
sqlite3 scrapai.db ".schema spiders"

Performance Tuning

SQLite Tuning

ScrapAI applies sensible SQLite defaults automatically; for heavier workloads you can raise the cache size:
-- In core/db.py, increase the cache size:
PRAGMA cache_size=-128000    -- 128MB cache (default: 64MB)

PostgreSQL Tuning

Edit PostgreSQL configuration (postgresql.conf):
# Memory settings (adjust based on available RAM)
shared_buffers = 256MB          # 25% of RAM
effective_cache_size = 1GB      # 50-75% of RAM
work_mem = 16MB                 # Per-operation memory
maintenance_work_mem = 128MB    # For VACUUM, CREATE INDEX

# Connection settings
max_connections = 100

# Query planning
random_page_cost = 1.1          # For SSD storage
Restart PostgreSQL after changes:
# Linux (systemd)
sudo systemctl restart postgresql

# macOS (Homebrew)
brew services restart postgresql@15

Troubleshooting

Database connection failed

SQLite:
# Check file exists
ls -la scrapai.db

# Check permissions
chmod 644 scrapai.db

# Recreate database
rm scrapai.db scrapai.db-*
./scrapai setup
PostgreSQL:
# Test connection
psql -U user -d scrapai

# Check service status
sudo systemctl status postgresql

# Verify connection string
echo $DATABASE_URL

Database locked (SQLite)

# Find processes using database
lsof scrapai.db

# Kill stuck processes
pkill -f scrapai

# Clear WAL files
rm scrapai.db-wal scrapai.db-shm
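"Database is locked" errors happen when a second connection competes for SQLite's single write lock. The sketch below reproduces the error with the standard library and shows that the lock clears once the writer commits; raising the connection timeout (instead of the 0 used here) makes waiters retry rather than fail immediately:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (x)")
writer.execute("BEGIN IMMEDIATE")        # take the write lock
writer.execute("INSERT INTO t VALUES (1)")

# A second connection with no timeout fails immediately.
reader = sqlite3.connect(path, timeout=0)
locked = False
try:
    reader.execute("BEGIN IMMEDIATE")    # competes for the same lock
except sqlite3.OperationalError:
    locked = True
print(locked)  # True

writer.commit()                          # release the lock
reader.execute("BEGIN IMMEDIATE")        # now succeeds
reader.rollback()
```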

Migration failed

# Check current version
./scrapai db version

# Reset migrations (WARNING: loses data)
rm scrapai.db
./scrapai setup

# Or fix manually
alembic upgrade head

Transfer failed

# Check source database is readable
sqlite3 scrapai.db "SELECT COUNT(*) FROM spiders;"

# Verify target PostgreSQL is accessible
psql $DATABASE_URL -c "SELECT 1;"

# Try without items
./scrapai db transfer sqlite:///scrapai.db --skip-items

Security

PostgreSQL Security

  1. Use strong passwords:
    ALTER USER scrapai_user WITH PASSWORD 'very_secure_random_password';
    
  2. Restrict network access (pg_hba.conf):
    # Local connections only
    host    scrapai    scrapai_user    127.0.0.1/32    md5
    
  3. Enable SSL:
    DATABASE_URL=postgresql://user:pass@host:5432/scrapai?sslmode=require
    
  4. Regular backups:
    # Automated daily backups
    0 2 * * * pg_dump scrapai | gzip > /backups/scrapai_$(date +\%Y\%m\%d).sql.gz