Prerequisites
Before you begin, ensure you have:
Python 3.9 or higher
Git
Terminal access
ScrapAI works on Linux, macOS, and Windows. The setup process is identical across all platforms.
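The Python floor above can be confirmed before installing anything. A minimal standard-library sketch (not part of ScrapAI) that checks the running interpreter:

```python
import sys

def meets_requirement(min_version=(3, 9)):
    """Return True if the running interpreter satisfies the minimum version."""
    return sys.version_info[:2] >= min_version

if __name__ == "__main__":
    if meets_requirement():
        print("Python version OK:", sys.version.split()[0])
    else:
        sys.exit("Python 3.9 or higher is required")
```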
Installation
Clone the repository
git clone https://github.com/discourselab/scrapai-cli.git
cd scrapai-cli
Run setup
From the repository root, run: ./scrapai setup
This sets up your environment, installs dependencies, and initializes the database. On Windows, use scrapai setup instead of ./scrapai setup.
On Linux, if Chromium fails to launch, install system dependencies: sudo .venv/bin/python -m playwright install-deps chromium
Verify installation
You should see: ✅ Virtual environment exists
✅ Core dependencies installed
✅ Database initialized
🎉 Environment is ready!
Your First Scraper
Let’s import and run a pre-built spider for BBC News.
Import the spider
ScrapAI includes example spiders in the templates/ directory. Let’s import the BBC News spider: ./scrapai spiders import templates/news/bbc_co_uk/analysis/final_spider.json --project news
Run a test crawl
Run the spider in test mode (limits to 5 items): ./scrapai crawl bbc_co_uk --project news --limit 5
Test mode (--limit) stores data in the database for inspection. Production mode (no limit) exports to timestamped JSONL files.
You’ll see Scrapy crawling in action: 2026-02-28 14:30:12 [scrapy.core.engine] INFO: Spider opened
2026-02-28 14:30:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bbc.co.uk/news/articles/...>
{'title': 'Breaking news story title', 'content': '...', 'author': 'BBC News', ...}
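Each scraped item in the log above is a plain dict, so a quick completeness check needs nothing beyond the standard library. A sketch, assuming the field names shown in the sample output (title, content, author):

```python
REQUIRED_FIELDS = {"title", "content", "author"}

def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the set of required keys that are absent or empty in an item."""
    return {key for key in required if not item.get(key)}

sample = {"title": "Breaking news story title", "content": "...", "author": "BBC News"}
print(missing_fields(sample))  # an empty set means every required field is present
```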
View the results
Inspect the scraped data: ./scrapai show bbc_co_uk --project news
Export the data
Export to your preferred format: ./scrapai export bbc_co_uk --project news --format csv
Exports are saved to the data/ directory with timestamps.
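A CSV export can be inspected with the csv module alone. This sketch assumes the export has a header row; the file name is hypothetical (real exports are timestamped by ScrapAI):

```python
import csv

def read_csv_rows(path, limit=5):
    """Return up to `limit` rows of a CSV export as dicts keyed by the header."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        return [row for _, row in zip(range(limit), reader)]

if __name__ == "__main__":
    # Hypothetical path; ScrapAI writes timestamped files under data/.
    for row in read_csv_rows("data/bbc_co_uk_export.csv"):
        print(row)
```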
Explore More Examples
ScrapAI includes several ready-to-use spider templates:
E-Commerce ./scrapai spiders import templates/ecommerce/amazon_co_uk_mac_accessories/analysis/final_spider.json --project shop
Scrapes product listings with prices, ratings, and descriptions
Forums ./scrapai spiders import templates/forums/news_ycombinator_com/analysis/final_spider.json --project forums
Extracts discussion threads, authors, and timestamps
Cloudflare-Protected ./scrapai spiders import templates/cloudflare/thefga_org/analysis/final_spider.json --project research
Demonstrates Cloudflare bypass with cookie caching
Real Estate ./scrapai spiders import templates/spider-realestate.json --project housing
Property listings with custom field extractors
Using with AI Agents
ScrapAI is designed to work with AI coding agents like Claude Code. Instead of manually writing JSON configs, you describe what you want in plain English:
You: "Add https://techcrunch.com to my news project"
Agent: [Analyzes site, generates rules, tests extraction, deploys spider]
You: "Crawl all spiders in my news project and export to CSV"
Agent: [Executes crawls, exports data]
The ./scrapai setup command automatically configures Claude Code permissions to prevent the agent from modifying framework code: it can only write JSON configs and run CLI commands.
Production Crawling
For production crawls without limits:
./scrapai crawl bbc_co_uk --project news
Production crawls support checkpoint pause/resume, automatic JSONL export, and incremental crawling.
Production crawls can run for hours or days. Use --limit for testing first.
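The automatic JSONL export mentioned above is easy to post-process with the standard library, since each line is one JSON object. A minimal sketch (the file name is hypothetical; real exports are timestamped):

```python
import json

def load_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    items = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items

if __name__ == "__main__":
    # Hypothetical path; ScrapAI writes timestamped files under data/.
    items = load_jsonl("data/bbc_co_uk_export.jsonl")
    print(f"{len(items)} items")
```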
Next Steps
Installation Guide Detailed installation instructions for all platforms
CLI Reference Complete command reference
Configuration Configure proxies, databases, and S3 storage