ScrapAI CLI

AI-Powered Web Scraping at Scale

ScrapAI transforms web scraping from a code-heavy engineering task into a conversational workflow. Describe what you want to scrape in plain English, and an AI agent analyzes the site, writes extraction rules, and deploys a production-ready scraper—all in minutes.
You: "Add https://bbc.co.uk to my news project"
Minutes later you have a tested, production-ready scraper stored in a database. No Python, no CSS selectors, no Scrapy knowledge. The AI agent analyzes the site, writes extraction rules, verifies quality, and saves a reusable config. Run it tomorrow or next year. Same command, no AI costs.
Built by DiscourseLab and used in production across 500+ websites.

Why ScrapAI?

AI Once, Deterministic Forever

Use AI at build time to analyze sites and write extraction rules, then run those rules with plain Scrapy—no AI in the loop. The cost is per website, not per page.

Self-Hosted, No Vendor Lock-In

You clone the repo, you own everything. No SaaS, no subscription, no per-page billing. Your scrapers are JSON configs in a database. Export them, share them, move them between projects.

Database-First Management

Spiders are rows in a database, not Python files on disk. Need to change DOWNLOAD_DELAY across your whole fleet? One SQL query instead of editing 100 files.
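To make the fleet-update claim concrete, here is a minimal sketch of the idea using SQLite's built-in `json_set` function. The `spiders` table and its columns are assumptions for illustration; ScrapAI's real schema may differ.

```python
import json
import sqlite3

# Assumed schema for illustration: one row per spider, config stored as JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (name TEXT PRIMARY KEY, config TEXT)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?)",
    [
        ("bbc_co_uk", json.dumps({"settings": {"DOWNLOAD_DELAY": 2}})),
        ("example_com", json.dumps({"settings": {"DOWNLOAD_DELAY": 5}})),
    ],
)

# One UPDATE changes DOWNLOAD_DELAY for every spider in the fleet.
conn.execute(
    "UPDATE spiders SET config = json_set(config, '$.settings.DOWNLOAD_DELAY', 10)"
)

for name, config in conn.execute("SELECT name, config FROM spiders"):
    print(name, json.loads(config)["settings"]["DOWNLOAD_DELAY"])
```

With Python spider files, the same change would mean editing and redeploying every file; here it is a single statement against the config column.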

Production-Ready from Day One

Cloudflare bypass with cookie caching, smart proxy escalation, checkpoint pause/resume, incremental crawling, and targeted extraction for articles, products, jobs, and more.

Who This Is For

  • Teams that need to scrape many websites and don’t want to write individual scrapers
  • Non-technical users who can describe what they want in plain English
  • Organizations where scraping is a means to an end, not the core competency
  • Anyone building datasets from public web content (news, research, documentation)

Who This Is Not For

  • Single-site scraping where you want fine-grained control (use Scrapling or crawl4ai instead)
  • Sites with hard CAPTCHAs (ScrapAI handles Cloudflare challenges, not Capsolver-level CAPTCHAs)
  • Login-required or paywall content (not supported yet)

How It Works

ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                               (forever)
Why JSON configs instead of AI-generated Python? An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage. An agent that writes JSON configs produces data, not code.
Here’s what an AI-generated spider config looks like:
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
Adding a new website means adding a new row to the database.
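The dispatch idea behind a single generic spider can be sketched in a few lines: load the JSON rules, match each URL against the `allow` patterns, and route it to the named callback. This is an illustration of the concept, not ScrapAI's actual DatabaseSpider code (which builds real Scrapy CrawlSpider rules).

```python
import json
import re

# Rules taken from the example config above.
CONFIG = json.loads("""
{
  "rules": [
    {"allow": ["/news/articles/[^/]+$"], "callback": "parse_article", "follow": false},
    {"allow": ["/news/?$"], "follow": true}
  ]
}
""")

def match_rule(url, config):
    """Return the first rule whose allow pattern matches the URL."""
    for rule in config["rules"]:
        if any(re.search(pattern, url) for pattern in rule["allow"]):
            return rule
    return None

rule = match_rule("https://www.bbc.co.uk/news/articles/c1234abcd", CONFIG)
print(rule["callback"])  # parse_article
```

Article URLs hit the extraction callback; section pages match the second rule and are only followed for more links.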

Key Features

Cloudflare Bypass

Solves the challenge once, extracts session cookies, then switches to fast HTTP requests. On a 1,000-page crawl: 8 minutes vs 2+ hours.
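The speedup comes from paying for the slow browser-based solve once per domain and caching the resulting session cookies. A minimal sketch of that caching pattern, with a stand-in for the real nodriver solve (function names and TTL are illustrative assumptions, not ScrapAI's API):

```python
import time

_cookie_cache = {}  # domain -> (cookies, expires_at)

def solve_challenge(domain):
    """Stand-in for a slow nodriver browser run that passes the challenge."""
    return {"cf_clearance": f"token-for-{domain}"}

def get_cookies(domain, ttl=1800):
    cached = _cookie_cache.get(domain)
    if cached and cached[1] > time.time():
        return cached[0]                    # fast path: reuse cached session
    cookies = solve_challenge(domain)       # slow path: solve once
    _cookie_cache[domain] = (cookies, time.time() + ttl)
    return cookies
```

Every request after the first rides on the cached cookies over plain HTTP, which is why a 1,000-page crawl does not need 1,000 browser sessions.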

Smart Proxy Escalation

Starts with direct connections. If a site blocks you (403/429), retries through a datacenter proxy and remembers that domain for next time.
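The escalation logic reduces to a small state machine: try direct, and on a block response remember the domain so future requests go straight through the proxy. A sketch under assumed names (the real middleware is more involved):

```python
BLOCK_CODES = {403, 429}
_needs_proxy = set()  # domains we've learned must be proxied

def choose_proxy(domain):
    """Direct connection by default; proxy for domains that blocked us before."""
    return "http://datacenter-proxy:8080" if domain in _needs_proxy else None

def on_response(domain, status):
    """On a block, remember the domain and signal a proxied retry."""
    if status in BLOCK_CODES:
        _needs_proxy.add(domain)
        return "retry-with-proxy"
    return "ok"
```

The proxy URL here is a placeholder; the point is the per-domain memory, which avoids burning proxy bandwidth on sites that never block.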

Checkpoint Pause/Resume

Press Ctrl+C to pause a long crawl, run the same command to resume. Built on Scrapy’s native JOBDIR. No progress lost.
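Since this is Scrapy's native job persistence, the underlying mechanism looks like this with plain Scrapy (spider name from the example config above):

```shell
# State (queue, seen requests) is written to the JOBDIR.
scrapy crawl bbc_co_uk -s JOBDIR=crawls/bbc_co_uk-1
# ... press Ctrl+C once and wait for the graceful shutdown ...
scrapy crawl bbc_co_uk -s JOBDIR=crawls/bbc_co_uk-1   # same command resumes
```

Pressing Ctrl+C twice forces an immediate, unclean stop, so a single Ctrl+C is the safe way to pause.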

Incremental Crawling

DeltaFetch skips already-scraped URLs, reducing bandwidth by 80-90% on routine re-crawls.
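For reference, this is how the standalone scrapy-deltafetch middleware is typically enabled in Scrapy settings; ScrapAI wires this up for you, and its exact setting names may differ:

```python
# Typical scrapy-deltafetch configuration in a Scrapy settings module.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True   # skip requests whose pages already yielded items
DELTAFETCH_RESET = False    # set True to forget history and recrawl everything
```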

Targeted Extraction

Articles get clean structured fields (title, content, author, date). Non-article content (products, jobs, listings) gets custom callbacks with field-level selectors.
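A custom callback rule for non-article content might extend the JSON config shown earlier along these lines. This is a hypothetical shape: the `fields` key, selector syntax, and field names are illustrative, not ScrapAI's documented schema.

```json
{
  "allow": ["/products/[^/]+$"],
  "callback": "parse_custom",
  "fields": {
    "title": "h1.product-name::text",
    "price": "span.price::text",
    "sku": "div.meta::attr(data-sku)"
  }
}
```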

Queue & Batch Processing

Bulk-add hundreds of URLs into a database-backed queue with priorities, status tracking, and retry on failure. Process them in parallel batches.
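The queue semantics (priorities, status tracking, bounded retries) can be sketched with an in-memory SQLite table. The schema and function names are assumptions for illustration, not ScrapAI's real tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    url TEXT PRIMARY KEY,
    priority INTEGER DEFAULT 0,
    status TEXT DEFAULT 'pending',   -- pending | done | failed
    attempts INTEGER DEFAULT 0
)""")

def enqueue(urls, priority=0):
    db.executemany(
        "INSERT OR IGNORE INTO queue (url, priority) VALUES (?, ?)",
        [(u, priority) for u in urls],
    )

def next_batch(size=10):
    """Highest-priority pending URLs first."""
    return [row[0] for row in db.execute(
        "SELECT url FROM queue WHERE status='pending' "
        "ORDER BY priority DESC LIMIT ?", (size,))]

def mark(url, ok, max_attempts=3):
    """Mark done, or requeue as pending until attempts run out."""
    if ok:
        db.execute("UPDATE queue SET status='done' WHERE url=?", (url,))
        return
    db.execute("UPDATE queue SET attempts = attempts + 1 WHERE url=?", (url,))
    (attempts,) = db.execute(
        "SELECT attempts FROM queue WHERE url=?", (url,)).fetchone()
    db.execute("UPDATE queue SET status=? WHERE url=?",
               ("failed" if attempts >= max_attempts else "pending", url))

enqueue(["https://a.example", "https://b.example"], priority=1)
enqueue(["https://c.example"], priority=5)
print(next_batch(2))  # c.example comes first: highest priority
```

Failed URLs drop back to `pending` until their attempts are exhausted, so transient errors retry automatically while persistent failures surface in the `failed` status.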

What’s Under the Hood

ScrapAI is glue. These projects do the heavy lifting:
  • Scrapy for crawling. Everything runs through Scrapy; we just load configs from a database instead of Python files.
  • newspaper4k and trafilatura for article extraction (title, content, author, date).
  • nodriver for Cloudflare bypass via browser automation.
  • Playwright for JavaScript rendering.
  • SQLAlchemy and Alembic for the database layer and migrations.

Get Started

View on GitHub

Star the repository if you find it useful