Prerequisites

Before you begin, ensure you have:
  • Python 3.9 or higher
  • Git
  • Terminal access
ScrapAI works on Linux, macOS, and Windows. The setup process is nearly identical across platforms; the few platform-specific notes are called out below.
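If you are not sure whether your interpreter meets the Python 3.9 requirement, a quick one-off check (plain Python, nothing ScrapAI-specific) is:

```python
import sys

# ScrapAI requires Python 3.9 or higher; fail early if the interpreter is older.
current = sys.version_info[:2]
if current < (3, 9):
    raise SystemExit(f"Python 3.9+ required, found {sys.version.split()[0]}")
print(f"Python {current[0]}.{current[1]} detected: OK")
```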

Installation

1. Clone the repository

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli

2. Run setup

./scrapai setup
This sets up your environment, installs dependencies, and initializes the database.
On Windows, use scrapai setup instead of ./scrapai setup.
On Linux, if Chromium fails to launch, install system dependencies:
sudo .venv/bin/python -m playwright install-deps chromium

3. Verify installation

./scrapai verify
You should see:
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!

Your First Scraper

Let’s import and run a pre-built spider for BBC News.
1. Import the spider

ScrapAI includes example spiders in the templates/ directory. Let’s import the BBC News spider:
./scrapai spiders import templates/news/bbc_co_uk/analysis/final_spider.json --project news

2. Run a test crawl

Run the spider in test mode (limits to 5 items):
./scrapai crawl bbc_co_uk --project news --limit 5
Test mode (--limit) stores data in the database for inspection. Production mode (no limit) exports to timestamped JSONL files.
You’ll see Scrapy crawling in action:
2026-02-28 14:30:12 [scrapy.core.engine] INFO: Spider opened
2026-02-28 14:30:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bbc.co.uk/news/articles/...>
{'title': 'Breaking news story title', 'content': '...', 'author': 'BBC News', ...}
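Production exports use JSONL, with one JSON object per line. As a minimal sketch, here is how such output can be parsed in plain Python; the two records are made up to mirror the fields in the log above, and a real export's schema depends on the spider:

```python
import json

# Made-up records that mirror the fields in the log output above;
# a real export's schema depends on the spider.
sample_jsonl = '{"title": "Story one", "author": "BBC News"}\n{"title": "Story two", "author": "BBC News"}\n'

def load_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

items = load_jsonl(sample_jsonl)
print(len(items), items[0]["title"])  # → 2 Story one
```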

3. View the results

Inspect the scraped data:
./scrapai show bbc_co_uk --project news

4. Export the data

Export to your preferred format:
./scrapai export bbc_co_uk --project news --format csv
Exports are saved to the data/ directory with timestamps.
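CSV exports are easy to post-process with the standard library. A small sketch with illustrative rows only; the actual column names come from your spider's field schema:

```python
import csv
import io

# Illustrative rows only; real columns depend on the spider's field schema.
sample_csv = "title,author\nStory one,BBC News\nStory two,BBC News\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]["title"])  # → Story one
```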

Explore More Examples

ScrapAI includes several ready-to-use spider templates:

E-Commerce

./scrapai spiders import templates/ecommerce/amazon_co_uk_mac_accessories/analysis/final_spider.json --project shop
Scrapes product listings with prices, ratings, and descriptions

Forums

./scrapai spiders import templates/forums/news_ycombinator_com/analysis/final_spider.json --project forums
Extracts discussion threads, authors, and timestamps

Cloudflare-Protected

./scrapai spiders import templates/cloudflare/thefga_org/analysis/final_spider.json --project research
Demonstrates Cloudflare bypass with cookie caching

Real Estate

./scrapai spiders import templates/spider-realestate.json --project housing
Property listings with custom field extractors
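To import several templates in one pass, a small helper can enumerate every spider JSON under templates/ and build the matching CLI command. This is a sketch based on the directory layout shown above; it assumes you run it from the repository root, and the project name is a placeholder:

```python
from pathlib import Path

def import_commands(template_root, project):
    """Build one `spiders import` command per JSON file under template_root."""
    return [
        f"./scrapai spiders import {path.as_posix()} --project {project}"
        for path in sorted(Path(template_root).rglob("*.json"))
    ]

# Only runs when executed from the repository root, where templates/ exists.
if Path("templates").is_dir():
    for cmd in import_commands("templates", "demo"):
        print(cmd)
```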

Using with AI Agents

ScrapAI is designed to work with AI coding agents like Claude Code. Instead of manually writing JSON configs, you describe what you want in plain English:
claude
You: "Add https://techcrunch.com to my news project"
Agent: [Analyzes site, generates rules, tests extraction, deploys spider]

You: "Crawl all spiders in my news project and export to CSV"
Agent: [Executes crawls, exports data]
The ./scrapai setup command automatically configures Claude Code permissions to prevent the agent from modifying framework code; the agent can only write JSON configs and run CLI commands.

Production Crawling

For production crawls without limits:
./scrapai crawl bbc_co_uk --project news
Production crawls support checkpoint pause/resume, automatic JSONL export, and incremental crawling.
Production crawls can run for hours or days. Use --limit for testing first.
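Because production exports can grow large over hours or days of crawling, it helps to stream them record by record rather than load a whole file into memory. A sketch in plain Python; the data/ path and filename in the usage comment are assumptions based on the export notes above:

```python
import json

def stream_jsonl(path):
    """Yield one record at a time from a JSONL export, never loading it all."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Example usage (hypothetical filename):
# for item in stream_jsonl("data/bbc_co_uk_20260228.jsonl"):
#     print(item["title"])
```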

Next Steps

Installation Guide

Detailed installation instructions for all platforms

CLI Reference

Complete command reference

Configuration

Configure proxies, databases, and S3 storage