AI-Powered Web Scraping at Scale
ScrapAI transforms web scraping from a code-heavy engineering task into a conversational workflow. Describe what you want to scrape in plain English, and an AI agent analyzes the site, writes extraction rules, and deploys a production-ready scraper, all in minutes.

Built by DiscourseLab and used in production across 500+ websites.
Why ScrapAI?
AI Once, Deterministic Forever
Use AI at build time to analyze sites and write extraction rules. Then run those rules with Scrapy—no AI in the loop, no per-page costs. The cost is per website, not per page.
Self-Hosted, No Vendor Lock-In
You clone the repo, you own everything. No SaaS, no subscription, no per-page billing. Your scrapers are JSON configs in a database. Export them, share them, move them between projects.
Database-First Management
Spiders are rows in a database, not Python files on disk. Need to change DOWNLOAD_DELAY across your whole fleet? One SQL query instead of editing 100 files.

Production-Ready from Day One
Cloudflare bypass with cookie caching, smart proxy escalation, checkpoint pause/resume, incremental crawling, and targeted extraction for articles, products, jobs, and more.
Who This Is For
Good fit
- Teams that need to scrape many websites and don’t want to write individual scrapers
- Non-technical users who can describe what they want in plain English
- Organizations where scraping is a means to an end, not the core competency
- Anyone building datasets from public web content (news, research, documentation)
How It Works
ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
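To make the idea concrete, here is a minimal sketch of what a stored spider config might look like. The field names (`name`, `start_urls`, `settings`, `extract`) and selectors are illustrative assumptions, not ScrapAI's actual schema:

```python
import json

# Hypothetical spider config as it might sit in a database row.
# Field names here are assumptions for illustration only.
config_json = """
{
  "name": "example-news",
  "start_urls": ["https://example.com/news"],
  "settings": {"DOWNLOAD_DELAY": 1.0},
  "extract": {
    "title": "h1::text",
    "author": ".byline::text",
    "date": "time::attr(datetime)"
  }
}
"""

# A generic spider can load this at runtime instead of importing a Python file.
config = json.loads(config_json)
print(config["name"])                         # example-news
print(config["settings"]["DOWNLOAD_DELAY"])   # 1.0
```

Because every spider is just data like this, changing fleet-wide behavior means updating rows, not editing code.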
Key Features
Cloudflare Bypass
Solves the challenge once, extracts session cookies, then switches to fast HTTP requests. On a 1,000-page crawl: 8 minutes vs 2+ hours.
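The pattern is: pay the slow browser cost once, then reuse the session cookies for fast HTTP requests. A toy sketch of the cookie-caching side (the `solve_challenge` stand-in replaces the real nodriver step):

```python
import time

# domain -> (cookies, expiry timestamp)
_cookie_cache = {}

def solve_challenge(domain):
    # Stand-in for the slow browser-automation step that solves the challenge.
    return {"cf_clearance": "token-for-" + domain}

def get_cookies(domain, ttl=1800):
    cached = _cookie_cache.get(domain)
    if cached and cached[1] > time.time():
        return cached[0]                      # fast path: reuse cached session
    cookies = solve_challenge(domain)         # slow path: runs once per domain
    _cookie_cache[domain] = (cookies, time.time() + ttl)
    return cookies

first = get_cookies("example.com")
second = get_cookies("example.com")           # served from the cache
```

Only the first request per domain pays the browser cost; everything after runs at plain-HTTP speed until the cookies expire.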
Smart Proxy Escalation
Starts with direct connections. If a site blocks you (403/429), retries through a datacenter proxy and remembers that domain for next time.
Checkpoint Pause/Resume
Press Ctrl+C to pause a long crawl, run the same command to resume. Built on Scrapy’s native JOBDIR. No progress lost.
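Since this is Scrapy's documented JOBDIR mechanism, the plain-Scrapy equivalent (with a hypothetical spider name) looks like:

```shell
# First run: crawl state is persisted under crawls/example-1
scrapy crawl example -s JOBDIR=crawls/example-1
# Ctrl+C pauses gracefully; re-run the same command to resume where it left off
scrapy crawl example -s JOBDIR=crawls/example-1
```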
Incremental Crawling
DeltaFetch skips already-scraped URLs, reducing bandwidth by 80-90% on routine re-crawls.
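The core idea is a persisted set of request fingerprints checked before each fetch. A toy illustration (DeltaFetch persists the set to disk; an in-memory set stands in here):

```python
import hashlib

seen = set()  # DeltaFetch persists fingerprints across runs; a set stands in

def should_fetch(url):
    fp = hashlib.sha1(url.encode()).hexdigest()
    if fp in seen:
        return False          # already scraped on a previous run: skip it
    seen.add(fp)
    return True

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
fetched = [u for u in urls if should_fetch(u)]   # the repeat is dropped
```

On a re-crawl, only new or changed URLs cost bandwidth; everything already seen is filtered out before the request is made.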
Targeted Extraction
Articles get clean structured fields (title, content, author, date). Non-article content (products, jobs, listings) gets custom callbacks with field-level selectors.
Queue & Batch Processing
Bulk-add hundreds of URLs into a database-backed queue with priorities, status tracking, and retry on failure. Process them in parallel batches.
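A compact sketch of the queue mechanics: priorities, per-URL status, and re-queue on failure. ScrapAI backs this with a database table; a heap and dicts stand in here, and `process` fakes one transient failure to show the retry path:

```python
import heapq

queue = []        # (priority, url) heap; lower number = higher priority
status = {}       # url -> "pending" | "done" | "failed"
retries = {}      # url -> failure count
MAX_RETRIES = 3

def enqueue(url, priority=10):
    heapq.heappush(queue, (priority, url))
    status[url] = "pending"

def process(url):
    # Stand-in for the actual crawl; /flaky fails once to exercise the retry.
    if url.endswith("/flaky") and retries.get(url, 0) == 0:
        raise RuntimeError("transient error")
    return "ok"

def drain():
    while queue:
        _, url = heapq.heappop(queue)
        try:
            process(url)
            status[url] = "done"
        except RuntimeError:
            retries[url] = retries.get(url, 0) + 1
            if retries[url] < MAX_RETRIES:
                enqueue(url)              # re-queue for another attempt
            else:
                status[url] = "failed"

enqueue("https://example.com/a", priority=1)
enqueue("https://example.com/flaky")
drain()
```

The flaky URL fails once, is re-queued, and succeeds on the second attempt; a URL that keeps failing is marked `failed` after MAX_RETRIES and never blocks the rest of the batch.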
What’s Under the Hood
ScrapAI is glue. These projects do the heavy lifting:
- Scrapy for crawling. Everything runs through Scrapy; we just load configs from a database instead of Python files.
- newspaper4k and trafilatura for article extraction (title, content, author, date).
- nodriver for Cloudflare bypass via browser automation.
- Playwright for JavaScript rendering.
- SQLAlchemy and Alembic for the database layer and migrations.
Get Started
Installation
Install ScrapAI CLI on Linux, macOS, or Windows
Quick Start
Build your first scraper in 5 minutes
CLI Reference
Complete command reference
View on GitHub
Star the repository if you find it useful