
Prerequisites

Before you begin, ensure you have:
  • Python 3.9 or higher
  • Git
  • Terminal access
ScrapAI works on Linux, macOS, and Windows. The setup process is nearly identical on every platform; the few platform-specific notes are called out below.

Installation

1. Clone the repository

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
2. Run setup

./scrapai setup
This command:
  • Creates a virtual environment (.venv)
  • Installs all Python dependencies
  • Installs Playwright Chromium browser
  • Initializes SQLite database
  • Creates .env configuration file
  • Configures Claude Code permissions (if using AI agents)
On Windows, use scrapai setup instead of ./scrapai setup.
On Linux, if Chromium fails to launch, install system dependencies:
sudo .venv/bin/python -m playwright install-deps chromium
3. Verify installation

./scrapai verify
You should see:
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
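If verify reports a failure, the same basic checks can be reproduced by hand. A minimal sketch in Python, assuming the `.venv` and `.env` names from the setup output above (the database filename is a hypothetical placeholder, not ScrapAI's actual path):

```python
import os
import tempfile

def check_environment(root: str) -> dict:
    """Reproduce the basic environment checks for a project directory.
    .venv and .env come from the setup output; the database
    filename below is a hypothetical example."""
    return {
        "venv": os.path.isdir(os.path.join(root, ".venv")),
        "env_file": os.path.isfile(os.path.join(root, ".env")),
        "database": os.path.isfile(os.path.join(root, "scrapai.db")),  # hypothetical name
    }

# Against an empty directory, every check reports False.
with tempfile.TemporaryDirectory() as tmp:
    print(check_environment(tmp))
```

Any `False` value points at the setup step that did not complete.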

Your First Scraper

Let’s import and run a pre-built spider for BBC News.
1. Import the spider

ScrapAI includes example spiders in the templates/ directory. Let’s import the BBC News spider:
./scrapai spiders import templates/news/bbc_co_uk/analysis/final_spider.json --project news
This imports a spider configuration that knows how to:
  • Find BBC news article URLs
  • Extract titles, content, authors, and publish dates
  • Handle multiple BBC sections (news, sport, food, etc.)
2. Run a test crawl

Run the spider in test mode (limits to 5 items):
./scrapai crawl bbc_co_uk --project news --limit 5
Test mode (--limit) stores data in the database for inspection. Production mode (no limit) exports to timestamped JSONL files.
You’ll see Scrapy crawling in action:
2026-02-28 14:30:12 [scrapy.core.engine] INFO: Spider opened
2026-02-28 14:30:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bbc.co.uk/news/articles/...>
{'title': 'Breaking news story title', 'content': '...', 'author': 'BBC News', ...}
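Production runs export JSONL, which is simply one JSON object per line, so the output is easy to post-process with the standard library. A small sketch (the file path in the comment is illustrative, not a real export name):

```python
import json

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL export file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. titles = [item["title"] for item in read_jsonl("data/exports/bbc_co_uk.jsonl")]
```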
3. View the results

Inspect the scraped data:
./scrapai show bbc_co_uk --project news
This displays all scraped items in a formatted table with titles, URLs, and timestamps.
4. Export the data

Export to your preferred format:
./scrapai export bbc_co_uk --project news --format csv
Exports are saved to the data/ directory with timestamps.
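If you need a format the CLI doesn't provide, exports are plain files and convert easily. A sketch turning a JSONL export into CSV (file names are illustrative):

```python
import csv
import json

def jsonl_to_csv(src, dst):
    """Convert a JSONL export to CSV, using the first item's keys
    as the header row. Returns the number of rows written."""
    with open(src, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]
    if not items:
        return 0
    with open(dst, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=list(items[0]))
        writer.writeheader()
        writer.writerows(items)
    return len(items)
```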

Understanding the Spider Config

Let’s look at what the BBC spider configuration contains:
{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}
  • name: Unique identifier for the spider
  • allowed_domains: Only crawl URLs from these domains
  • start_urls: Where the spider begins crawling
  • rules: URL patterns to follow or extract data from
    • allow: Regex patterns to match URLs
    • deny: Regex patterns to exclude URLs
    • callback: What to do with matched URLs (parse_article extracts article content)
  • settings: Spider-specific configuration
    • EXTRACTOR_ORDER: Try newspaper4k first, fall back to trafilatura
    • DOWNLOAD_DELAY: Wait 1 second between requests (be polite)
    • CONCURRENT_REQUESTS: Crawl up to 16 pages simultaneously
    • ROBOTSTXT_OBEY: Respect the site’s robots.txt
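The allow/deny behavior can be illustrated with plain regular expressions. This is a simplified sketch of the matching logic, not Scrapy's actual LinkExtractor implementation:

```python
import re

def url_matches(path, allow, deny=()):
    """A URL is followed if it matches at least one allow pattern
    and no deny pattern (simplified allow/deny semantics)."""
    if not any(re.search(p, path) for p in allow):
        return False
    return not any(re.search(p, path) for p in deny)

print(url_matches("/news/articles/c51x2", ["/news/articles/.*"]))             # True
print(url_matches("/news/articles/c51x2#comments",
                  ["/news/articles/.*"], ["/news/articles/.*#comments"]))      # False: denied
print(url_matches("/sport/football/articles/abc", ["/sport/.*/articles/.*"]))  # True
```

Deny patterns win over allow patterns, which is why the BBC config can accept all article URLs while still excluding the `#comments` fragments.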

Explore More Examples

ScrapAI includes several ready-to-use spider templates:

E-Commerce

./scrapai spiders import templates/ecommerce/amazon_co_uk_mac_accessories/analysis/final_spider.json --project shop
Scrapes product listings with prices, ratings, and descriptions

Forums

./scrapai spiders import templates/forums/news_ycombinator_com/analysis/final_spider.json --project forums
Extracts discussion threads, authors, and timestamps

Cloudflare-Protected

./scrapai spiders import templates/cloudflare/thefga_org/analysis/final_spider.json --project research
Demonstrates Cloudflare bypass with cookie caching

Real Estate

./scrapai spiders import templates/spider-realestate.json --project housing
Property listings with custom field extractors

Using with AI Agents

ScrapAI is designed to work with AI coding agents like Claude Code. Instead of manually writing JSON configs, you describe what you want in plain English:
claude
You: "Add https://techcrunch.com to my news project"
Agent: [Analyzes site, generates rules, tests extraction, deploys spider]

You: "Crawl all spiders in my news project and export to CSV"
Agent: [Executes crawls, exports data]
The ./scrapai setup command automatically configures Claude Code permissions to prevent the agent from modifying framework code: it can only write JSON configs and run CLI commands.

Production Crawling

For production crawls without limits:
./scrapai crawl bbc_co_uk --project news
This enables:
  • Checkpoint pause/resume: Press Ctrl+C to pause, re-run to resume
  • JSONL export: Data automatically exported to data/exports/ with timestamps
  • Incremental crawling: Skip already-scraped URLs on subsequent runs
Production crawls can run for hours or days. Use --limit for testing first.
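Incremental crawling can be pictured as a persisted set of already-scraped URLs that each run consults before fetching. A hedged sketch of the idea only (ScrapAI's real checkpointing is internal; the JSON file here is a stand-in for illustration):

```python
import json
import os

class SeenUrls:
    """Persist the set of crawled URLs so repeat runs can skip them.
    A simplified sketch, not ScrapAI's actual implementation."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen = set(json.load(f))

    def should_fetch(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.seen), f)
```

On a resumed run, the reloaded set makes every previously fetched URL a no-op, which is what lets a long crawl pick up where it left off.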

Next Steps