CLI Overview - scrapai

The scrapai CLI provides a command-line interface for building, managing, and running AI-powered web scrapers. Spiders are stored as JSON configurations in the database.

Architecture

scrapai uses a project-based organization model:

Projects: Logical groupings of spiders (e.g., news, ecommerce, research)
Spiders: JSON configurations stored in the database
Queue: Database-backed queue for batch processing
Data: Test mode saves to the database; production mode exports to JSONL files

Entry Point

The scrapai script automatically activates the virtual environment and delegates to the CLI:

# Linux/macOS
./scrapai <command> [options]

# Windows
scrapai <command> [options]

Command Categories

Setup & Verification

Install dependencies, configure environment, verify setup

Spider Management

List, import, delete, and manage spider configurations

Crawling

Run spiders in test or production mode with checkpoint support

Queue Management

Add URLs, bulk import, process items in parallel batches

Data Operations

View scraped items, export to CSV/JSON/Parquet

Inspection

Analyze websites for scraper development

Database

Migrations, queries, statistics, data transfer

Projects

List and manage project configurations

Utility & Diagnostic Commands

Standalone commands for authentication, browser management, selector discovery, and crawl monitoring.

session

Capture and manage login sessions; scrapai never types your password. Subcommands: login, check, list, remove.

browser

Manage the persistent browser service (a warm browser reused across inspect/screenshot calls). Subcommands: start, stop, restart, status, shot.

analyze

Analyze a local HTML file for CSS selector discovery (--test, --find, --find-text).

try

Run newspaper and trafilatura against a local HTML file and compare their output.

extract-urls

Extract all URLs from an HTML file (--file, optional --output).

health

Test all spiders in a project and generate a report for broken ones.

crawl-status

Show each crawl’s run state and how much it has downloaded (requires Pueue for detached crawls).

Global Conventions

Project Names

Most commands take a --project flag that sets the project context:

./scrapai spiders list --project news
./scrapai crawl bbc_co_uk --project news

If --project is omitted, the project named default is used.
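The fallback to the default project can be sketched with a small shell wrapper. This is only an illustration: scrapai_cmd and the SCRAPAI_PROJECT environment variable are hypothetical names that do not come from scrapai, and the function prints the command line instead of running it.

```shell
# Hypothetical wrapper, not part of scrapai. SCRAPAI_PROJECT is an assumed
# environment variable used only to illustrate the fallback to "default".
scrapai_cmd() {
  # Print the command line rather than executing it, so the sketch is safe to try.
  printf './scrapai %s --project %s\n' "$*" "${SCRAPAI_PROJECT:-default}"
}

scrapai_cmd spiders list
```

With SCRAPAI_PROJECT unset, the call prints a command that targets the default project; exporting SCRAPAI_PROJECT=news would switch every wrapped call to the news project.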

Output Modes

Test Mode (with --limit):

Saves scraped items to the database
Collects only a limited number of items
Use the show command to view results
Does not store HTML content

Production Mode (no limit):

Exports to timestamped JSONL files in data/<project>/<spider>/crawls/
Includes full HTML content
Enables checkpoint pause/resume
Database writes disabled for performance
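Because a JSONL file holds one JSON object per line, a production export can be inspected with standard shell tools. The sketch below assumes the exported files use a .jsonl extension, which the text above does not state; latest_crawl_file and count_items are hypothetical helper names.

```shell
# Assumption: exported files end in .jsonl. Both helpers are illustrative only.
latest_crawl_file() {
  # $1 = crawls directory; prints the most recently modified *.jsonl file, if any
  ls -t "$1"/*.jsonl 2>/dev/null | head -n 1
}

count_items() {
  # $1 = JSONL file; one JSON object per line, so the line count is the item count
  wc -l < "$1" | tr -d ' '
}
```

For example, count_items "$(latest_crawl_file data/myproject/myspider/crawls)" would report how many items the newest export contains, provided at least one export exists in that directory.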

File Paths

All data is stored under the DATA_DIR configured in .env (default: ./data):

data/
├── <project>/
│   ├── <spider>/
│   │   ├── crawls/         # Production JSONL exports
│   │   ├── exports/        # Manual exports (CSV/JSON/Parquet)
│   │   └── checkpoint/     # Pause/resume state
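The layout above can be turned into a small path helper for scripts that post-process crawl output. This is a sketch under one assumption: DATA_DIR is read from the shell environment here, whereas the CLI reads it from .env. The name data_path is hypothetical.

```shell
# Assumption: DATA_DIR is exported in the current shell; falls back to ./data.
data_path() {
  # $1 = project, $2 = spider, $3 = crawls | exports | checkpoint
  printf '%s/%s/%s/%s\n' "${DATA_DIR:-./data}" "$1" "$2" "$3"
}

data_path news bbc_co_uk crawls
```

When DATA_DIR is unset, the call prints ./data/news/bbc_co_uk/crawls.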

Common Workflows

Quick Test

Test a spider on 5-10 URLs to verify extraction:

./scrapai crawl myspider --project myproject --limit 5
./scrapai show myspider --project myproject

Production Crawl

Run a full crawl with checkpoint support:

./scrapai crawl myspider --project myproject
# Press Ctrl+C to pause
# Run same command to resume
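Before rerunning the command, a script may want to know whether a paused crawl is waiting to be resumed. The sketch below assumes that a non-empty checkpoint directory indicates saved pause/resume state; the contents of that directory are not documented here, so treat this as a heuristic. has_checkpoint is a hypothetical helper.

```shell
# Assumption: a non-empty checkpoint/ directory means a paused crawl exists.
has_checkpoint() {
  # $1 = checkpoint directory; succeeds only if it exists and is non-empty
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

ckpt="${DATA_DIR:-./data}/myproject/myspider/checkpoint"
if has_checkpoint "$ckpt"; then
  echo "checkpoint found: the next crawl run should resume"
else
  echo "no checkpoint: the next crawl run starts fresh"
fi
```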

Batch Processing

Add multiple websites to the queue and process them:

./scrapai queue bulk urls.csv --project myproject
./scrapai queue list --project myproject
./scrapai queue next --project myproject  # Claim next item

Export Data

Export scraped data in various formats:

./scrapai export myspider --project myproject --format csv
./scrapai export myspider --project myproject --format parquet

Platform Support

Linux: Full support including headless Cloudflare bypass with xvfb
macOS: Full support
Windows: Full support (use scrapai.bat or scrapai directly)

Database Support

SQLite: Default, zero configuration
PostgreSQL: Production deployments, atomic queue operations
