Prerequisites
Before you begin, ensure you have:

- Python 3.9 or higher
- Git
- Terminal access
ScrapAI works on Linux, macOS, and Windows. The setup process is identical across all platforms.
Installation
Run setup
- Creates a virtual environment (.venv)
- Installs all Python dependencies
- Installs Playwright Chromium browser
- Initializes SQLite database
- Creates a .env configuration file
- Configures Claude Code permissions (if using AI agents)
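If you want to see roughly what the setup script automates, a minimal manual equivalent looks like the following (a sketch only; the actual script handles more, and the dependency and browser installs are shown as comments because they assume a requirements.txt and a network connection):

```shell
# Create and activate a virtual environment (what the setup script automates)
python3 -m venv .venv
. .venv/bin/activate

# The setup script would then install dependencies and the Playwright
# Chromium build, roughly equivalent to:
#   pip install -r requirements.txt
#   python -m playwright install chromium
```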
Your First Scraper
Let’s import and run a pre-built spider for BBC News.

Import the spider

ScrapAI includes example spiders in the templates/ directory. Let’s import the BBC News spider.

This imports a spider configuration that knows how to:

- Find BBC news article URLs
- Extract titles, content, authors, and publish dates
- Handle multiple BBC sections (news, sport, food, etc.)
Run a test crawl
Run the spider in test mode (limits to 5 items). You’ll see Scrapy crawling in action.
View the results
Inspect the scraped data. This displays all scraped items in a formatted table with titles, URLs, and timestamps.
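The idea behind that table view can be sketched in a few lines of Python. This is an illustration only: the `items` table name, its columns, and the `demo.db` path are assumptions for the example, not ScrapAI's actual schema.

```python
import sqlite3

def show_items(db_path: str) -> None:
    """Print scraped items as a simple fixed-width table.

    Assumes an `items` table with title, url, and scraped_at columns;
    the real ScrapAI schema may differ.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT title, url, scraped_at FROM items ORDER BY scraped_at DESC"
    ).fetchall()
    con.close()
    print(f"{'Title':<40} {'URL':<50} {'Scraped at':<20}")
    for title, url, scraped_at in rows:
        print(f"{title[:38]:<40} {url[:48]:<50} {scraped_at:<20}")

# Demo with a throwaway database and one sample row
con = sqlite3.connect("demo.db")
con.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT, scraped_at TEXT)")
con.execute(
    "INSERT INTO items VALUES (?, ?, ?)",
    ("Example headline", "https://www.bbc.com/news/example", "2024-01-01T12:00:00"),
)
con.commit()
con.close()
show_items("demo.db")
```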
Understanding the Spider Config
Let’s look at what the BBC spider configuration contains.

Understanding the fields
- name: Unique identifier for the spider
- allowed_domains: Only crawl URLs from these domains
- start_urls: Where the spider begins crawling
- rules: URL patterns to follow or extract data from
  - allow: Regex patterns to match URLs
  - deny: Regex patterns to exclude URLs
  - callback: What to do with matched URLs (parse_article extracts article content)
- settings: Spider-specific configuration
  - EXTRACTOR_ORDER: Try newspaper4k first, fall back to trafilatura
  - DOWNLOAD_DELAY: Wait 1 second between requests (be polite)
  - CONCURRENT_REQUESTS: Crawl up to 16 pages simultaneously
  - ROBOTSTXT_OBEY: Respect the site’s robots.txt
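Putting the fields above together, a config of this shape might look like the following. The field names mirror the list above, but the concrete values, URL patterns, and the exact JSON layout ScrapAI expects are illustrative assumptions:

```python
import json
import re

# Illustrative spider config mirroring the fields described above.
# Values and patterns are examples, not ScrapAI's actual BBC template.
bbc_config = {
    "name": "bbc_news",
    "allowed_domains": ["bbc.com", "bbc.co.uk"],
    "start_urls": ["https://www.bbc.com/news"],
    "rules": [
        {
            "allow": [r"/news/articles/"],   # follow article URLs
            "deny": [r"/news/live/"],        # skip live-blog pages
            "callback": "parse_article",     # extract article content
        }
    ],
    "settings": {
        "EXTRACTOR_ORDER": ["newspaper4k", "trafilatura"],
        "DOWNLOAD_DELAY": 1,                 # seconds between requests
        "CONCURRENT_REQUESTS": 16,
        "ROBOTSTXT_OBEY": True,
    },
}

# The allow/deny values are ordinary regex patterns:
pattern = re.compile(bbc_config["rules"][0]["allow"][0])
print(bool(pattern.search("https://www.bbc.com/news/articles/abc123")))  # True
print(json.dumps(bbc_config["settings"], indent=2))
```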
Explore More Examples
ScrapAI includes several ready-to-use spider templates:

- E-Commerce
- Forums
- Cloudflare-Protected
- Real Estate
Using with AI Agents
ScrapAI is designed to work with AI coding agents like Claude Code. Instead of manually writing JSON configs, you describe what you want in plain English.

Production Crawling
For production crawls without limits:

- Checkpoint pause/resume: Press Ctrl+C to pause, re-run to resume
- JSONL export: Data automatically exported to data/exports/ with timestamps
- Incremental crawling: Skip already-scraped URLs on subsequent runs
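Incremental crawling boils down to persisting the set of already-scraped URLs and checking it on later runs. A minimal sketch of that idea, using SQLite for persistence (the `SeenUrls` class, `crawl_state.db` path, and `seen` table are hypothetical names for illustration, not ScrapAI's internals):

```python
import sqlite3

class SeenUrls:
    """Persist crawled URLs so later runs can skip them.

    A sketch of the idea behind incremental crawling; ScrapAI's actual
    implementation and schema may differ.
    """

    def __init__(self, db_path: str):
        self.con = sqlite3.connect(db_path)
        self.con.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def is_new(self, url: str) -> bool:
        cur = self.con.execute("SELECT 1 FROM seen WHERE url = ?", (url,))
        return cur.fetchone() is None

    def mark(self, url: str) -> None:
        self.con.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        self.con.commit()

seen = SeenUrls("crawl_state.db")
urls = [
    "https://www.bbc.com/news/a",
    "https://www.bbc.com/news/a",  # duplicate within this run
    "https://www.bbc.com/news/b",
]
to_crawl = [url for url in urls if seen.is_new(url) and (seen.mark(url) or True)]
print(to_crawl)  # duplicates and previously seen URLs are skipped
```

Because the state lives in a file, a second run of the same script (simulating resuming a crawl) would find both URLs already marked and crawl nothing.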