
Prerequisites

Before you begin, ensure you have:
  • Python 3.9 or higher
  • Git
  • Terminal access
ScrapAI works on Linux, macOS, and Windows. The setup process is nearly identical on every platform; the few platform-specific notes are called out below.

Installation

1. Clone the repository

git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
2. Run setup

./scrapai setup
This command:
  • Creates a virtual environment (.venv)
  • Installs all Python dependencies
  • Installs Playwright Chromium browser
  • Initializes SQLite database
  • Creates .env configuration file
  • Configures Claude Code permissions (if using AI agents)
On Windows, use scrapai setup instead of ./scrapai setup.
On Linux, if Chromium fails to launch, install system dependencies:
sudo .venv/bin/python -m playwright install-deps chromium
3. Verify installation

./scrapai verify
You should see:
✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
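If verify reports a failure, the same basic checks can be reproduced by hand. A minimal sketch in Python, assuming the `.venv` and `.env` names from the setup output above (the database filename is a hypothetical placeholder, not ScrapAI's actual path):

```python
import os
import tempfile

def check_environment(root: str) -> dict:
    """Reproduce the basic environment checks for a project directory.
    .venv and .env come from the setup output; the database
    filename below is a hypothetical example."""
    return {
        "venv": os.path.isdir(os.path.join(root, ".venv")),
        "env_file": os.path.isfile(os.path.join(root, ".env")),
        "database": os.path.isfile(os.path.join(root, "scrapai.db")),  # hypothetical name
    }

# Against an empty directory, every check reports False.
with tempfile.TemporaryDirectory() as tmp:
    print(check_environment(tmp))
```

Any `False` value points at the setup step that did not complete.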

Your First Scraper

Let’s import and run a pre-built spider for BBC News.
1. Import the spider

ScrapAI includes example spiders in the templates/ directory. Let’s import the BBC News spider:
./scrapai spiders import templates/news/bbc_co_uk/analysis/final_spider.json --project news
This imports a spider configuration that knows how to:
  • Find BBC news article URLs
  • Extract titles, content, authors, and publish dates
  • Handle multiple BBC sections (news, sport, food, etc.)
2. Run a test crawl

Run the spider in test mode (limits to 5 items):
./scrapai crawl bbc_co_uk --project news --limit 5
Test mode (--limit) stores data in the database for inspection. Production mode (no limit) exports to timestamped JSONL files.
You’ll see Scrapy crawling in action:
2026-02-28 14:30:12 [scrapy.core.engine] INFO: Spider opened
2026-02-28 14:30:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bbc.co.uk/news/articles/...>
{'title': 'Breaking news story title', 'content': '...', 'author': 'BBC News', ...}
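Production runs export JSONL, which is simply one JSON object per line, so the output is easy to post-process with the standard library. A small sketch (the file path in the comment is illustrative, not a real export name):

```python
import json

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL export file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. titles = [item["title"] for item in read_jsonl("data/exports/bbc_co_uk.jsonl")]
```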
3. View the results

Inspect the scraped data:
./scrapai show bbc_co_uk --project news
This displays all scraped items in a formatted table with titles, URLs, and timestamps.
4. Export the data

Export to your preferred format:
./scrapai export bbc_co_uk --project news --format csv
Exports are saved to the data/ directory with timestamps.
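If you need a format the CLI doesn't provide, exports are plain files and convert easily. A sketch turning a JSONL export into CSV (file names are illustrative):

```python
import csv
import json

def jsonl_to_csv(src, dst):
    """Convert a JSONL export to CSV, using the first item's keys
    as the header row. Returns the number of rows written."""
    with open(src, encoding="utf-8") as f:
        items = [json.loads(line) for line in f if line.strip()]
    if not items:
        return 0
    with open(dst, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=list(items[0]))
        writer.writeheader()
        writer.writerows(items)
    return len(items)
```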

Understanding the Spider Config

Let’s look at what the BBC spider configuration contains:
{
  "name": "bbc_co_uk",
  "source_url": "https://bbc.co.uk/",
  "allowed_domains": ["bbc.co.uk", "www.bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/"],
  "rules": [
    {
      "allow": ["/news/articles/.*"],
      "deny": ["/news/articles/.*#comments"],
      "callback": "parse_article"
    },
    {
      "allow": ["/sport/.*/articles/.*"],
      "callback": "parse_article"
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 16,
    "ROBOTSTXT_OBEY": true
  }
}
  • name: Unique identifier for the spider
  • allowed_domains: Only crawl URLs from these domains
  • start_urls: Where the spider begins crawling
  • rules: URL patterns to follow or extract data from
    • allow: Regex patterns to match URLs
    • deny: Regex patterns to exclude URLs
    • callback: What to do with matched URLs (parse_article extracts article content)
  • settings: Spider-specific configuration
    • EXTRACTOR_ORDER: Try newspaper4k first, fall back to trafilatura
    • DOWNLOAD_DELAY: Wait 1 second between requests (be polite)
    • CONCURRENT_REQUESTS: Crawl up to 16 pages simultaneously
    • ROBOTSTXT_OBEY: Respect the site’s robots.txt
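The allow/deny behavior can be illustrated with plain regular expressions. This is a simplified sketch of the matching logic, not Scrapy's actual LinkExtractor implementation:

```python
import re

def url_matches(path, allow, deny=()):
    """A URL is followed if it matches at least one allow pattern
    and no deny pattern (simplified allow/deny semantics)."""
    if not any(re.search(p, path) for p in allow):
        return False
    return not any(re.search(p, path) for p in deny)

print(url_matches("/news/articles/c51x2", ["/news/articles/.*"]))             # True
print(url_matches("/news/articles/c51x2#comments",
                  ["/news/articles/.*"], ["/news/articles/.*#comments"]))      # False: denied
print(url_matches("/sport/football/articles/abc", ["/sport/.*/articles/.*"]))  # True
```

Deny patterns win over allow patterns, which is why the BBC config can accept all article URLs while still excluding the `#comments` fragments.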

Explore More Examples

ScrapAI includes several ready-to-use spider templates:

E-Commerce

./scrapai spiders import templates/ecommerce/amazon_co_uk_mac_accessories/analysis/final_spider.json --project shop
Scrapes product listings with prices, ratings, and descriptions

Forums

./scrapai spiders import templates/forums/news_ycombinator_com/analysis/final_spider.json --project forums
Extracts discussion threads, authors, and timestamps

Cloudflare-Protected

./scrapai spiders import templates/cloudflare/thefga_org/analysis/final_spider.json --project research
Demonstrates Cloudflare bypass with cookie caching

Real Estate

./scrapai spiders import templates/spider-realestate.json --project housing
Property listings with custom field extractors

Using with AI Agents

ScrapAI is designed to work with AI coding agents like Claude Code. Instead of manually writing JSON configs, you describe what you want in plain English:
claude
You: "Add https://techcrunch.com to my news project"
Agent: [Analyzes site, generates rules, tests extraction, deploys spider]

You: "Crawl all spiders in my news project and export to CSV"
Agent: [Executes crawls, exports data]
The ./scrapai setup command automatically configures Claude Code permissions to prevent the agent from modifying framework code: it can only write JSON configs and run CLI commands.

Production Crawling

For production crawls without limits:
./scrapai crawl bbc_co_uk --project news
This enables:
  • Checkpoint pause/resume: Press Ctrl+C to pause, re-run to resume
  • JSONL export: Data automatically exported to data/exports/ with timestamps
  • Incremental crawling: Skip already-scraped URLs on subsequent runs
Production crawls can run for hours or days. Use --limit for testing first.
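Incremental crawling can be pictured as a persisted set of already-scraped URLs that each run consults before fetching. A hedged sketch of the idea only (ScrapAI's real checkpointing is internal; the JSON file here is a stand-in for illustration):

```python
import json
import os

class SeenUrls:
    """Persist the set of crawled URLs so repeat runs can skip them.
    A simplified sketch, not ScrapAI's actual implementation."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen = set(json.load(f))

    def should_fetch(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.seen), f)
```

On a resumed run, the reloaded set makes every previously fetched URL a no-op, which is what lets a long crawl pick up where it left off.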

Next Steps