ScrapAI CLI

AI-Powered Web Scraping at Scale

ScrapAI transforms web scraping from a code-heavy engineering task into a conversational workflow. Describe what you want to scrape in plain English, and an AI agent analyzes the site, writes extraction rules, and deploys a production-ready scraper—all in minutes.
You: "Add https://bbc.co.uk to my news project"
Minutes later you have a tested, production-ready scraper stored in a database. No Python, no CSS selectors, no Scrapy knowledge. The AI agent analyzes the site, writes extraction rules, verifies quality, and saves a reusable config. Run it tomorrow or next year. Same command, no AI costs.
Built by DiscourseLab and used in production across 500+ websites.

Why ScrapAI?

AI Once, Deterministic Forever

Use AI at build time to analyze sites and write extraction rules, then run those rules with plain Scrapy—no AI in the loop. The cost is per website, not per page.

Self-Hosted, No Vendor Lock-In

You clone the repo, you own everything. No SaaS, no subscription, no per-page billing. Your scrapers are JSON configs in a database. Export them, share them, move them between projects.

Database-First Management

Spiders are rows in a database, not Python files on disk. Need to change DOWNLOAD_DELAY across your whole fleet? One SQL query instead of editing 100 files.
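To make the fleet-update claim concrete, here is a minimal sketch of the idea using SQLite's built-in `json_set` function. The `spiders` table and its columns are assumptions for illustration; ScrapAI's real schema may differ.

```python
import json
import sqlite3

# Assumed schema for illustration: one row per spider, config stored as JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spiders (name TEXT PRIMARY KEY, config TEXT)")
conn.executemany(
    "INSERT INTO spiders VALUES (?, ?)",
    [
        ("bbc_co_uk", json.dumps({"settings": {"DOWNLOAD_DELAY": 2}})),
        ("example_com", json.dumps({"settings": {"DOWNLOAD_DELAY": 5}})),
    ],
)

# One UPDATE changes DOWNLOAD_DELAY for every spider in the fleet.
conn.execute(
    "UPDATE spiders SET config = json_set(config, '$.settings.DOWNLOAD_DELAY', 10)"
)

for name, config in conn.execute("SELECT name, config FROM spiders"):
    print(name, json.loads(config)["settings"]["DOWNLOAD_DELAY"])
```

With Python spider files, the same change would mean editing and redeploying every file; here it is a single statement against the config column.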

Production-Ready from Day One

Cloudflare bypass with cookie caching, smart proxy escalation, checkpoint pause/resume, incremental crawling, and targeted extraction for articles, products, jobs, and more.

Who This Is For

  • Teams that need to scrape many websites and don’t want to write individual scrapers
  • Non-technical users who can describe what they want in plain English
  • Organizations where scraping is a means to an end, not the core competency
  • Anyone building datasets from public web content (news, research, documentation)

Who This Is Not For

  • Single-site scraping where you want fine-grained control (use Scrapling or crawl4ai instead)
  • Sites with hard CAPTCHAs (ScrapAI handles Cloudflare challenges, not Capsolver-level CAPTCHAs)
  • Login-required or paywall content (not supported yet)

How It Works

ScrapAI is an orchestration layer on top of Scrapy. Instead of writing a Python spider file per website, an AI agent generates a JSON config and stores it in a database. A single generic spider (DatabaseSpider) loads any config at runtime.
You (plain English) → AI Agent → JSON config → Database → Scrapy crawl
                       (once)                               (forever)
Why JSON configs instead of AI-generated Python? An agent that writes and executes Python has the same power as an unsupervised developer. If it hallucinates, gets prompt-injected by a malicious page, or loses context, it can do real damage. An agent that writes JSON configs produces data, not code.
Here’s what an AI-generated spider config looks like:
{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/news/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
Adding a new website means adding a new row to the database.
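The dispatch idea behind a single generic spider can be sketched in a few lines: load the JSON rules, match each URL against the `allow` patterns, and route it to the named callback. This is an illustration of the concept, not ScrapAI's actual DatabaseSpider code (which builds real Scrapy CrawlSpider rules).

```python
import json
import re

# Rules taken from the example config above.
CONFIG = json.loads("""
{
  "rules": [
    {"allow": ["/news/articles/[^/]+$"], "callback": "parse_article", "follow": false},
    {"allow": ["/news/?$"], "follow": true}
  ]
}
""")

def match_rule(url, config):
    """Return the first rule whose allow pattern matches the URL."""
    for rule in config["rules"]:
        if any(re.search(pattern, url) for pattern in rule["allow"]):
            return rule
    return None

rule = match_rule("https://www.bbc.co.uk/news/articles/c1234abcd", CONFIG)
print(rule["callback"])  # parse_article
```

Article URLs hit the extraction callback; section pages match the second rule and are only followed for more links.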

Key Features

Cloudflare Bypass

Solves the challenge once, extracts session cookies, then switches to fast HTTP requests. On a 1,000-page crawl: 8 minutes vs 2+ hours.
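The speedup comes from paying for the slow browser-based solve once per domain and caching the resulting session cookies. A minimal sketch of that caching pattern, with a stand-in for the real nodriver solve (function names and TTL are illustrative assumptions, not ScrapAI's API):

```python
import time

_cookie_cache = {}  # domain -> (cookies, expires_at)

def solve_challenge(domain):
    """Stand-in for a slow nodriver browser run that passes the challenge."""
    return {"cf_clearance": f"token-for-{domain}"}

def get_cookies(domain, ttl=1800):
    cached = _cookie_cache.get(domain)
    if cached and cached[1] > time.time():
        return cached[0]                    # fast path: reuse cached session
    cookies = solve_challenge(domain)       # slow path: solve once
    _cookie_cache[domain] = (cookies, time.time() + ttl)
    return cookies
```

Every request after the first rides on the cached cookies over plain HTTP, which is why a 1,000-page crawl does not need 1,000 browser sessions.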

Smart Proxy Escalation

Starts with direct connections. If a site blocks you (403/429), retries through a datacenter proxy and remembers that domain for next time.
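The escalation logic reduces to a small state machine: try direct, and on a block response remember the domain so future requests go straight through the proxy. A sketch under assumed names (the real middleware is more involved):

```python
BLOCK_CODES = {403, 429}
_needs_proxy = set()  # domains we've learned must be proxied

def choose_proxy(domain):
    """Direct connection by default; proxy for domains that blocked us before."""
    return "http://datacenter-proxy:8080" if domain in _needs_proxy else None

def on_response(domain, status):
    """On a block, remember the domain and signal a proxied retry."""
    if status in BLOCK_CODES:
        _needs_proxy.add(domain)
        return "retry-with-proxy"
    return "ok"
```

The proxy URL here is a placeholder; the point is the per-domain memory, which avoids burning proxy bandwidth on sites that never block.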

Checkpoint Pause/Resume

Press Ctrl+C to pause a long crawl, run the same command to resume. Built on Scrapy’s native JOBDIR. No progress lost.
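Since this is Scrapy's native job persistence, the underlying mechanism looks like this with plain Scrapy (spider name from the example config above):

```shell
# State (queue, seen requests) is written to the JOBDIR.
scrapy crawl bbc_co_uk -s JOBDIR=crawls/bbc_co_uk-1
# ... press Ctrl+C once and wait for the graceful shutdown ...
scrapy crawl bbc_co_uk -s JOBDIR=crawls/bbc_co_uk-1   # same command resumes
```

Pressing Ctrl+C twice forces an immediate, unclean stop, so a single Ctrl+C is the safe way to pause.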

Incremental Crawling

DeltaFetch skips already-scraped URLs, reducing bandwidth by 80-90% on routine re-crawls.
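For reference, this is how the standalone scrapy-deltafetch middleware is typically enabled in Scrapy settings; ScrapAI wires this up for you, and its exact setting names may differ:

```python
# Typical scrapy-deltafetch configuration in a Scrapy settings module.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True   # skip requests whose pages already yielded items
DELTAFETCH_RESET = False    # set True to forget history and recrawl everything
```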

Targeted Extraction

Articles get clean structured fields (title, content, author, date). Non-article content (products, jobs, listings) gets custom callbacks with field-level selectors.
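A custom callback rule for non-article content might extend the JSON config shown earlier along these lines. This is a hypothetical shape: the `fields` key, selector syntax, and field names are illustrative, not ScrapAI's documented schema.

```json
{
  "allow": ["/products/[^/]+$"],
  "callback": "parse_custom",
  "fields": {
    "title": "h1.product-name::text",
    "price": "span.price::text",
    "sku": "div.meta::attr(data-sku)"
  }
}
```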

Queue & Batch Processing

Bulk-add hundreds of URLs into a database-backed queue with priorities, status tracking, and retry on failure. Process them in parallel batches.
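The queue semantics (priorities, status tracking, bounded retries) can be sketched with an in-memory SQLite table. The schema and function names are assumptions for illustration, not ScrapAI's real tables:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    url TEXT PRIMARY KEY,
    priority INTEGER DEFAULT 0,
    status TEXT DEFAULT 'pending',   -- pending | done | failed
    attempts INTEGER DEFAULT 0
)""")

def enqueue(urls, priority=0):
    db.executemany(
        "INSERT OR IGNORE INTO queue (url, priority) VALUES (?, ?)",
        [(u, priority) for u in urls],
    )

def next_batch(size=10):
    """Highest-priority pending URLs first."""
    return [row[0] for row in db.execute(
        "SELECT url FROM queue WHERE status='pending' "
        "ORDER BY priority DESC LIMIT ?", (size,))]

def mark(url, ok, max_attempts=3):
    """Mark done, or requeue as pending until attempts run out."""
    if ok:
        db.execute("UPDATE queue SET status='done' WHERE url=?", (url,))
        return
    db.execute("UPDATE queue SET attempts = attempts + 1 WHERE url=?", (url,))
    (attempts,) = db.execute(
        "SELECT attempts FROM queue WHERE url=?", (url,)).fetchone()
    db.execute("UPDATE queue SET status=? WHERE url=?",
               ("failed" if attempts >= max_attempts else "pending", url))

enqueue(["https://a.example", "https://b.example"], priority=1)
enqueue(["https://c.example"], priority=5)
print(next_batch(2))  # c.example comes first: highest priority
```

Failed URLs drop back to `pending` until their attempts are exhausted, so transient errors retry automatically while persistent failures surface in the `failed` status.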

What’s Under the Hood

ScrapAI is glue. These projects do the heavy lifting:
  • Scrapy for crawling. Everything runs through Scrapy; we just load configs from a database instead of Python files.
  • newspaper4k and trafilatura for article extraction (title, content, author, date).
  • nodriver for Cloudflare bypass via browser automation.
  • Playwright for JavaScript rendering.
  • SQLAlchemy and Alembic for the database layer and migrations.

Get Started

View on GitHub

Star the repository if you find it useful