Overview
ScrapAI CLI stores all crawl metadata, spider configurations, and queue information in a relational database. You can use SQLite for development or PostgreSQL for production.
Database Options
SQLite (Default)
Recommended for:
- Getting started
- Development
- Small to medium projects (< 1M items)
- Single-user usage
Advantages:
- No setup required
- Zero configuration
- Fast for small datasets
- File-based (easy backup)
- Built into Python
Limitations:
- Limited concurrency
- Single writer at a time
- Performance degrades with large datasets
- Not suitable for distributed systems
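The single-writer limitation is easy to observe with Python's built-in sqlite3 module: while one connection holds the write lock, a second writer fails with "database is locked". A minimal sketch (file name and timeout are arbitrary, not ScrapAI code):

```python
import os
import sqlite3
import tempfile

# Demonstrate SQLite's single-writer limitation: two connections to the
# same file; the second writer times out while the first holds the lock.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer1 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
writer1.execute("CREATE TABLE items (x INTEGER)")
writer1.execute("BEGIN IMMEDIATE")  # acquire the write lock and hold it

writer2 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
try:
    writer2.execute("BEGIN IMMEDIATE")  # second writer cannot get the lock
    locked = False
except sqlite3.OperationalError:
    locked = True

print(locked)  # True: only one writer at a time
```

WAL mode (which ScrapAI enables, see below) lets readers proceed alongside a writer, but there is still only one writer at a time.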
PostgreSQL
Recommended for:
- Production deployments
- Large projects (1M+ items)
- Multi-user environments
- Distributed crawling
- High concurrency needs
Advantages:
- Excellent concurrency
- Scales to billions of rows
- Advanced indexing
- ACID compliance
- Industry standard
Requirements:
- PostgreSQL 12+ installed
- Database server running
- User credentials
SQLite Configuration
Default Setup
SQLite is configured by default; no action is required. ScrapAI creates scrapai.db in the project root.
Custom SQLite Path
To use a different database file, update .env:
# Relative path
DATABASE_URL=sqlite:///scrapai.db
# Absolute path
DATABASE_URL=sqlite:////absolute/path/to/scrapai.db
# Custom location
DATABASE_URL=sqlite:///./data/scrapai.db
Use three slashes /// for relative paths and four slashes //// for absolute paths.
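The slash rule follows from how SQLAlchemy-style URLs are parsed: everything after sqlite:/// is the file path, so an absolute path (which itself starts with /) needs a fourth slash. A quick illustration in pure Python (string handling only, not ScrapAI code):

```python
PREFIX = "sqlite:///"

def sqlite_path(url: str) -> str:
    # The path is whatever follows "sqlite:///"; an absolute path begins
    # with "/", which is why absolute URLs carry four slashes in total.
    if not url.startswith(PREFIX):
        raise ValueError(f"not a sqlite URL: {url}")
    return url[len(PREFIX):]

print(sqlite_path("sqlite:///scrapai.db"))                    # scrapai.db (relative)
print(sqlite_path("sqlite:////absolute/path/to/scrapai.db"))  # /absolute/path/to/scrapai.db
```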
SQLite Optimization
ScrapAI automatically configures SQLite with optimized settings:
PRAGMA journal_mode=WAL # Write-Ahead Logging (better concurrency)
PRAGMA synchronous=NORMAL # Balance safety and performance
PRAGMA cache_size=-64000 # 64MB cache
These settings are applied automatically in core/db.py:17.
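For reference, the same PRAGMAs can be applied from Python's sqlite3 module. This is a standalone sketch of the idea, not the actual core/db.py code:

```python
import os
import sqlite3
import tempfile

def tune_connection(conn: sqlite3.Connection) -> None:
    # Mirror the settings listed above (sketch; see core/db.py for the real code).
    conn.execute("PRAGMA journal_mode=WAL")    # Write-Ahead Logging
    conn.execute("PRAGMA synchronous=NORMAL")  # balance safety and performance
    conn.execute("PRAGMA cache_size=-64000")   # 64MB cache (negative = KiB)

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
tune_connection(conn)
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # wal
```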
PostgreSQL Configuration
Installation
macOS
# Install PostgreSQL
brew install postgresql@15
# Start service
brew services start postgresql@15
Ubuntu/Debian
# Install PostgreSQL
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
# Start service
sudo systemctl start postgresql
sudo systemctl enable postgresql
Docker
# Run PostgreSQL container
docker run -d \
--name scrapai-postgres \
-e POSTGRES_PASSWORD=yourpassword \
-e POSTGRES_DB=scrapai \
-p 5432:5432 \
postgres:15
Create Database
# Create database
createdb scrapai
# Or with custom user
psql -U postgres
CREATE DATABASE scrapai;
CREATE USER scrapai_user WITH PASSWORD 'secure_password';
GRANT ALL PRIVILEGES ON DATABASE scrapai TO scrapai_user;
Update .env with the PostgreSQL connection string:
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
Connection String Format:
postgresql://[user]:[password]@[host]:[port]/[database]
Examples:
# Local PostgreSQL
DATABASE_URL=postgresql://scrapai_user:secure_password@localhost:5432/scrapai
# Remote PostgreSQL
DATABASE_URL=postgresql://user:pass@db.example.com:5432/scrapai
# PostgreSQL with SSL
DATABASE_URL=postgresql://user:pass@db.example.com:5432/scrapai?sslmode=require
# Cloud providers (RDS, Cloud SQL, etc.)
DATABASE_URL=postgresql://admin:pass@rds-instance.region.rds.amazonaws.com:5432/scrapai
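If you assemble the connection string programmatically, percent-encode the credentials so special characters cannot break parsing. A hedged sketch using only the standard library (the helper name and parameters are illustrative):

```python
from urllib.parse import quote

def pg_url(user: str, password: str, host: str, database: str,
           port: int = 5432, sslmode: str = "") -> str:
    # Percent-encode user and password so characters like "@", ":" or "/"
    # in a password cannot corrupt the URL structure.
    url = (f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
           f"@{host}:{port}/{database}")
    if sslmode:
        url += f"?sslmode={sslmode}"
    return url

print(pg_url("scrapai_user", "p@ss:word", "localhost", "scrapai"))
# postgresql://scrapai_user:p%40ss%3Aword@localhost:5432/scrapai
```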
Run Migrations
After configuring PostgreSQL, run the migrations to initialize the schema. This creates all required tables and indexes.
Migrating from SQLite to PostgreSQL
Backup your data before migrating! Create a backup of your SQLite database:
cp scrapai.db scrapai.db.backup
Migration Steps
1. Install and configure PostgreSQL (see above)
2. Update .env with the PostgreSQL URL:
# Before
DATABASE_URL=sqlite:///scrapai.db
# After
DATABASE_URL=postgresql://user:password@localhost:5432/scrapai
3. Run migrations to create the schema
4. Transfer data from SQLite:
./scrapai db transfer sqlite:///scrapai.db
5. Verify the transfer
Large databases: skip scraped items to speed up the transfer:
./scrapai db transfer sqlite:///scrapai.db --skip-items
This transfers spider configs and metadata but not crawled data.
Transfer Process
The db transfer command:
- Reads from the source database (first argument)
- Writes to the target database (the current DATABASE_URL in .env)
- Transfers all tables and data
- Preserves relationships and indexes
Important: Update DATABASE_URL in .env BEFORE running transfer. The command reads from the source URL you provide and writes to whatever DATABASE_URL is currently set.
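The actual transfer is handled by the CLI, but the underlying idea can be sketched as a table-by-table copy between two connections. This is illustrative only; the table and column names here are arbitrary, not ScrapAI's real schema:

```python
import sqlite3

def copy_table(src: sqlite3.Connection, dst: sqlite3.Connection, table: str) -> int:
    # Read every row from the source table and bulk-insert into the target.
    cur = src.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    placeholders = ", ".join("?" * len(cols))
    dst.executemany(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})", rows
    )
    dst.commit()
    return len(rows)

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO spiders VALUES (?, ?)", [(1, "news"), (2, "blog")])
dst.execute("CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT)")

copied = copy_table(src, dst, "spiders")
print(copied)  # 2
```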
Database Maintenance
Backup
SQLite:
# Simple file copy
cp scrapai.db scrapai.db.backup
# With timestamp
cp scrapai.db scrapai.db.$(date +%Y%m%d_%H%M%S)
# SQLite backup command
sqlite3 scrapai.db ".backup scrapai.db.backup"
PostgreSQL:
# Dump database
pg_dump scrapai > scrapai_backup.sql
# With timestamp
pg_dump scrapai > scrapai_$(date +%Y%m%d_%H%M%S).sql
# Compressed backup
pg_dump scrapai | gzip > scrapai_backup.sql.gz
# Custom format (faster restore)
pg_dump -Fc scrapai > scrapai_backup.dump
Restore
SQLite:
# Stop ScrapAI
# Replace database file
cp scrapai.db.backup scrapai.db
# Or restore from backup
sqlite3 scrapai.db ".restore scrapai.db.backup"
PostgreSQL:
# From SQL dump
psql scrapai < scrapai_backup.sql
# From compressed backup
gunzip -c scrapai_backup.sql.gz | psql scrapai
# From custom format
pg_restore -d scrapai scrapai_backup.dump
Vacuum and Optimize
SQLite:
# Optimize database
sqlite3 scrapai.db "VACUUM;"
# Analyze for query optimization
sqlite3 scrapai.db "ANALYZE;"
PostgreSQL:
# Vacuum (reclaim space)
psql scrapai -c "VACUUM;"
# Full vacuum (locks tables)
psql scrapai -c "VACUUM FULL;"
# Analyze for query optimization
psql scrapai -c "ANALYZE;"
# Combined
psql scrapai -c "VACUUM ANALYZE;"
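What VACUUM buys you is easy to see with SQLite: after deleting rows, the file keeps its size until VACUUM rewrites it. A small self-contained demonstration:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vacuum_demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE items (payload TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [("x" * 1000,)] * 1000)
conn.commit()

conn.execute("DELETE FROM items")  # freed pages stay inside the file...
conn.commit()
size_before = os.path.getsize(path)

conn.execute("VACUUM")             # ...until VACUUM rewrites the database
size_after = os.path.getsize(path)

print(size_after < size_before)  # True: space was reclaimed
```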
Database Schema
ScrapAI uses SQLAlchemy ORM with Alembic migrations. Schema is defined in:
- core/models.py - Table definitions
- alembic/versions/ - Migration scripts
Key Tables
- spiders - Spider configurations and rules
- projects - Project metadata
- crawls - Crawl execution history
- items - Scraped articles/items
- queue - Crawl queue for batch processing
- analysis - Analysis results and patterns
Viewing Schema
SQLite:
# List tables
sqlite3 scrapai.db ".tables"
# Show schema
sqlite3 scrapai.db ".schema"
# Describe table
sqlite3 scrapai.db ".schema spiders"
PostgreSQL:
# List tables
psql scrapai -c "\dt"
# Describe table
psql scrapai -c "\d spiders"
# Full schema
pg_dump scrapai --schema-only
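The same information is available programmatically; for SQLite, the sqlite_master table holds the schema. Illustrated here against a throwaway table, not ScrapAI's real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (id INTEGER PRIMARY KEY, name TEXT)")

# Equivalent of the ".tables" command above.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['spiders']

# Equivalent of ".schema spiders": the original CREATE TABLE statement.
ddl = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'spiders'").fetchone()[0]
print(ddl)
```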
SQLite Tuning
ScrapAI applies optimized SQLite settings automatically, but for heavier workloads you can raise the cache size:
# In core/db.py, increase cache size:
PRAGMA cache_size=-128000 # 128MB cache (default: 64MB)
PostgreSQL Tuning
Edit PostgreSQL configuration (postgresql.conf):
# Memory settings (adjust based on available RAM)
shared_buffers = 256MB # 25% of RAM
effective_cache_size = 1GB # 50-75% of RAM
work_mem = 16MB # Per-operation memory
maintenance_work_mem = 128MB # For VACUUM, CREATE INDEX
# Connection settings
max_connections = 100
# Query planning
random_page_cost = 1.1 # For SSD storage
Restart PostgreSQL after changes:
sudo systemctl restart postgresql
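The rules of thumb above can be turned into a quick calculator. These percentages are the guidelines from this section, not values ScrapAI sets for you:

```python
def pg_memory_settings(total_ram_mb: int) -> dict:
    # Apply the rule-of-thumb percentages from the section above.
    return {
        "shared_buffers": f"{total_ram_mb // 4}MB",            # ~25% of RAM
        "effective_cache_size": f"{total_ram_mb * 3 // 4}MB",  # ~75% of RAM
        "work_mem": "16MB",                                    # per-operation
        "maintenance_work_mem": "128MB",                       # VACUUM, CREATE INDEX
    }

print(pg_memory_settings(1024))
# {'shared_buffers': '256MB', 'effective_cache_size': '768MB', ...}
```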
Troubleshooting
Database connection failed
SQLite:
# Check file exists
ls -la scrapai.db
# Check permissions
chmod 644 scrapai.db
# Recreate database
rm scrapai.db scrapai.db-*
./scrapai setup
PostgreSQL:
# Test connection
psql -U user -d scrapai
# Check service status
sudo systemctl status postgresql
# Verify connection string
echo $DATABASE_URL
Database locked (SQLite)
# Find processes using database
lsof scrapai.db
# Kill stuck processes
pkill -f scrapai
# Clear stale WAL files (only after every process has closed the database;
# deleting them while the database is in use can lose committed data)
rm scrapai.db-wal scrapai.db-shm
Migration failed
# Check current version
./scrapai db version
# Reset migrations (WARNING: loses data)
rm scrapai.db
./scrapai setup
# Or fix manually
alembic upgrade head
Transfer failed
# Check source database is readable
sqlite3 scrapai.db "SELECT COUNT(*) FROM spiders;"
# Verify target PostgreSQL is accessible
psql $DATABASE_URL -c "SELECT 1;"
# Try without items
./scrapai db transfer sqlite:///scrapai.db --skip-items
Security
PostgreSQL Security
1. Use strong passwords:
ALTER USER scrapai_user WITH PASSWORD 'very_secure_random_password';
2. Restrict network access (pg_hba.conf):
# Local connections only
host scrapai scrapai_user 127.0.0.1/32 md5
3. Enable SSL:
DATABASE_URL=postgresql://user:pass@host:5432/scrapai?sslmode=require
4. Regular backups:
# Automated daily backups (crontab entry)
0 2 * * * pg_dump scrapai | gzip > /backups/scrapai_$(date +\%Y\%m\%d).sql.gz