Architecture
ScrapAI uses a project-based organization model:
- Projects: Logical groupings of spiders (e.g., news, ecommerce, research)
- Spiders: JSON configurations stored in the database
- Queue: Database-backed queue for batch processing
- Data: Test mode saves to database, production mode exports to JSONL files
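To make the idea of a database-backed queue concrete, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names, and status values are assumptions chosen for illustration and are not ScrapAI's actual schema. The guarded UPDATE ensures that two workers cannot claim the same item.

```python
import sqlite3


def create_queue(conn):
    """Create a hypothetical queue table (schema is an assumption)."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " id INTEGER PRIMARY KEY,"
            " project TEXT NOT NULL,"
            " url TEXT NOT NULL,"
            " status TEXT NOT NULL DEFAULT 'pending')"
        )


def add_url(conn, project, url):
    """Enqueue one URL for a project."""
    with conn:
        conn.execute(
            "INSERT INTO queue (project, url) VALUES (?, ?)", (project, url)
        )


def claim_next(conn, project):
    """Claim the oldest pending item, or return None if the queue is empty."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM queue"
            " WHERE project = ? AND status = 'pending'"
            " ORDER BY id LIMIT 1",
            (project,),
        ).fetchone()
        if row is None:
            return None
        with conn:
            # The status check in WHERE makes the claim safe: if another
            # worker took this row first, rowcount is 0 and we retry.
            cur = conn.execute(
                "UPDATE queue SET status = 'processing'"
                " WHERE id = ? AND status = 'pending'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row
```

Each call to claim_next returns an (id, url) tuple, so a batch worker can loop until it receives None.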
Entry Point
The scrapai script automatically activates the virtual environment and delegates to the CLI.
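As a rough illustration of what a wrapper of this kind does, the sketch below locates the interpreter inside a virtual environment and re-runs a CLI module with it. The .venv directory and the cli module name are hypothetical and are not taken from the ScrapAI source. Only the bin/ versus Scripts/ layout is standard venv behavior.

```python
import subprocess
import sys
from pathlib import Path


def venv_python(project_root, platform=sys.platform):
    """Return the interpreter path inside <project_root>/.venv (assumed location)."""
    venv = Path(project_root) / ".venv"
    if platform.startswith("win"):
        # Windows virtual environments keep executables under Scripts\
        return venv / "Scripts" / "python.exe"
    return venv / "bin" / "python"


def build_command(project_root, cli_args, platform=sys.platform):
    """Build the argv that runs the CLI with the venv interpreter."""
    # "cli" is a hypothetical module name used only for this sketch.
    return [str(venv_python(project_root, platform)), "-m", "cli", *cli_args]


def delegate(project_root, cli_args):
    """Run the CLI inside the virtual environment and return its exit code."""
    return subprocess.call(build_command(project_root, cli_args))
```

Running the venv interpreter directly has the same effect as activating the environment first, which is why a wrapper can hide the activation step.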
All commands run through the scrapai wrapper, which handles virtual environment activation automatically.
Command Categories
Setup & Verification
Install dependencies, configure environment, verify setup
Spider Management
List, import, delete, and manage spider configurations
Crawling
Run spiders in test or production mode with checkpoint support
Queue Management
Add URLs, bulk import, process items in parallel batches
Data Operations
View scraped items, export to CSV/JSON/Parquet
Inspection
Analyze websites for scraper development
Database
Migrations, queries, statistics, data transfer
Projects
List and manage project configurations
Global Conventions
Project Names
Most commands require a --project flag to specify the project context.
Output Modes
Test Mode (with --limit):
- Saves scraped items to database
- Limited number of items
- Use the show command to view results
- No HTML content stored
Production Mode (without --limit):
- Exports to timestamped JSONL files in data/<project>/<spider>/crawls/
- Includes full HTML content
- Enables checkpoint pause/resume
- Database writes disabled for performance
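The rule that separates the two modes can be summarized in a few lines. This is a sketch of the behavior described above, not ScrapAI's implementation: the OutputMode class and select_output_mode function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputMode:
    """Hypothetical summary of what a crawl does with its results."""
    name: str
    save_to_database: bool
    export_jsonl: bool
    store_html: bool
    checkpoints: bool
    limit: Optional[int]


def select_output_mode(limit: Optional[int] = None) -> OutputMode:
    """Pick test mode when a limit is given, production mode otherwise."""
    if limit is not None:
        if limit <= 0:
            raise ValueError("limit must be a positive integer")
        # Test mode: items go to the database, no HTML, no checkpoints.
        return OutputMode("test", True, False, False, False, limit)
    # Production mode: JSONL export with full HTML and checkpoint support.
    return OutputMode("production", False, True, True, True, None)
```

Under this reading, passing a limit is the only switch: there is no separate mode flag to keep in sync with it.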
File Paths
All data is stored under the DATA_DIR configured in .env (default: ./data).
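The following sketch shows how a production export path could be derived from DATA_DIR. It assumes the .env values have already been loaded into the process environment, and the timestamp format in the file name is an assumption: the documentation states only that the files are timestamped JSONL files under data/<project>/<spider>/crawls/.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def data_dir(env=None):
    """Return the data root, falling back to the documented default ./data."""
    env = os.environ if env is None else env
    return Path(env.get("DATA_DIR", "./data"))


def crawl_export_path(project, spider, env=None, now=None):
    """Build <DATA_DIR>/<project>/<spider>/crawls/<timestamp>.jsonl."""
    now = now or datetime.now(timezone.utc)
    # Assumed file name format; the real naming scheme may differ.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return data_dir(env) / project / spider / "crawls" / f"{stamp}.jsonl"


if __name__ == "__main__":
    # "news" and "example_spider" are placeholder names.
    print(crawl_export_path("news", "example_spider"))
```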
Common Workflows
Quick Test
Test a spider on 5-10 URLs to verify extraction.
Production Crawl
Run a full crawl with checkpoint support.
Batch Processing
Add multiple websites to the queue and process them.
Export Data
Export scraped data in various formats.
Platform Support
- Linux: Full support including headless Cloudflare bypass with xvfb
- macOS: Full support
- Windows: Full support (use scrapai.bat or scrapai directly)
Database Support
- SQLite: Default, zero configuration
- PostgreSQL: Production deployments, atomic queue operations
The database is selected by setting DATABASE_URL in .env and running migrations.
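As a small illustration of how a single DATABASE_URL setting can select the backend, the sketch below inspects the URL scheme. It assumes SQLAlchemy-style URLs and a hypothetical default of sqlite:///scrapai.db. Neither the URL format nor the default file name is confirmed by this page.

```python
import os
from urllib.parse import urlsplit

# Hypothetical default; the real default database file may be named differently.
DEFAULT_DATABASE_URL = "sqlite:///scrapai.db"


def database_backend(env=None):
    """Return 'sqlite' or 'postgresql' based on the DATABASE_URL scheme."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", DEFAULT_DATABASE_URL)
    # Drop an optional driver suffix such as "+psycopg2".
    scheme = urlsplit(url).scheme.split("+")[0]
    if scheme == "sqlite":
        return "sqlite"
    if scheme in ("postgresql", "postgres"):
        return "postgresql"
    raise ValueError(f"unsupported database scheme: {scheme!r}")
```

A check like this lets the rest of the code enable backend-specific behavior, such as row locking for queue claims on PostgreSQL.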