
Vision

AI agents write spider configs. Humans share them. Stop rebuilding scrapers everyone needs.

ScrapAI is infrastructure for reliable, reusable web scraping. Like npm for JavaScript or PyPI for Python, but for web scrapers.

Current State (v0.x)

Database-First Management

Write once, use forever with persistent spider storage

CloakBrowser Integration

Cloudflare bypass and JavaScript rendering

Incremental Crawling

DeltaFetch - only scrape new content

Smart Proxy Middleware

Auto-escalation when blocked

Checkpoint Resume

Production-grade pause and resume

S3 Cloud Storage

Automatic cloud backup

Multiple Extractors

Newspaper, trafilatura, and custom strategies

Named Callbacks

Custom field extraction

Queue System

Batch processing support

Cross-Platform

Linux, macOS, Windows via WSL

Test coverage: 26% overall (critical modules: 70-100%)

Phase 1A: Minimal REST API

Timeline

Q2 2026 - Months 1-2

Goal

Enable programmatic access and single-article scraping immediately

Core API Endpoints

Single Article Scraping

The killer feature - Extract single articles without full crawls
POST /api/scrape
{
  "url": "https://nytimes.com/2026/03/02/article",
  "spider": "nytimes"
}

The spider field selects the saved CSS selectors from the database; the response returns title, content, author, and date in ~2 seconds.
Use cases: RSS feed integration, real-time monitoring, AI agents testing configs
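As a client-side sketch, the call above could be issued from Python with only the standard library; the `X-API-Key` header name and local base URL are assumptions, not part of the documented API:

```python
import json
import urllib.request

def build_scrape_request(base_url: str, api_key: str,
                         url: str, spider: str) -> urllib.request.Request:
    """Construct the POST /api/scrape request with an API-key header."""
    payload = json.dumps({"url": url, "spider": spider}).encode()
    return urllib.request.Request(
        f"{base_url}/api/scrape",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # assumed header name
        },
        method="POST",
    )

# urllib.request.urlopen(req) would then return the JSON article
# with title, content, author, and date fields.
req = build_scrape_request("http://localhost:8000", "demo-key",
                           "https://nytimes.com/2026/03/02/article", "nytimes")
```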

Additional Endpoints

  • Trigger full crawl programmatically
  • Get crawl results for a specific spider
  • List all available spiders
  • Check crawl progress in real-time

Technical Stack

  • FastAPI - Async Python framework
  • API Key Auth - Secure authentication
  • Rate Limiting - Prevent abuse
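As one sketch of how the rate-limiting piece could work, a classic token bucket, typically wired in as a per-key FastAPI dependency; the rate and burst numbers here are illustrative, not committed limits:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per API key: the first `capacity` calls pass, later calls
# are throttled until tokens refill at `rate` per second.
bucket = TokenBucket(rate=1.0, capacity=5)
```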
Why first: Enables AI agents (OpenClaw, Claude Code, etc.) to integrate immediately. Single-article API validates core value prop before building marketplace infrastructure.

Phase 1B: Spider Library

Timeline

Q2 2026 - Months 2-3

Goal

Make spiders shareable and reusable - multiply API value

Spider Marketplace

Problem: Every developer and AI agent rebuilds scrapers for the same sites (NYT, BBC, Amazon, etc.)

Features

Spider Registry

Browse, search, and download configs from the community

Template Gallery

Pre-built templates for news, e-commerce, jobs, forums, government

Quality Indicators

Downloads, success rate, last updated, community ratings

Easy Import

One-command installation from registry

Versioning

Track changes, rollback when sites update

Community Driven

Collaborate on maintaining spider configs
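A versioned registry entry could look roughly like this; every field name and selector below is illustrative, not a committed schema:

```yaml
# Hypothetical registry entry (illustrative fields only)
name: nytimes
version: 2.1.0            # bumped when the site layout changes
license: MIT
maintainers:
  - community
selectors:                # saved CSS selectors used by the API
  title: "h1.headline"
  content: "section.article-body p"
  author: "span.byline"
  date: "time[datetime]"
```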

Quick Start

# Import a spider from the registry
./scrapai spiders import --from-registry nytimes

# Publish your own spider
./scrapai spiders publish my-spider --registry community

Benefits

  • Save days of development time
  • Production-tested configs
  • Community maintenance
  • Focus on data, not selectors
  • Skip spider building entirely
  • Instant data access - load config, start scraping
  • API + library = scrape hundreds of sites programmatically
Initial collection: 50+ spiders for top news sites, e-commerce, and job boards

Spider Publishing

Publish your spiders with ./scrapai spiders publish <spider> --registry community. Include documentation, example URLs, and choose a license (MIT, Apache, CC0).
Why after the API: the API creates demand, and the library multiplies its value. Early adopters use the API with their own spiders; the library then makes the API 10x more useful.

Phase 2: Quality & Advanced API

Timeline

Q3 2026

Goal

Make it reliable and production-ready

Data Quality & Validation

Problem: Scraped data might be incomplete or wrong

Features

Schema Validation

Require fields: title, content, date

Quality Scoring

Completeness and content length checks

Anomaly Detection

Detect site changes and broken selectors

Auto-Alerts

Email, Slack, webhook notifications

Validation Reports

Per-crawl quality metrics

Early Detection

Catch breakage before it impacts production
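The schema validation and quality scoring above could take roughly this shape; the required fields come from this roadmap, while the length threshold and scoring weights are assumptions:

```python
REQUIRED_FIELDS = ("title", "content", "date")
MIN_CONTENT_CHARS = 200  # illustrative threshold

def validate(item: dict) -> list[str]:
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not item.get(f)]

def quality_score(item: dict) -> float:
    """Score 0-1 from field completeness plus a content-length check."""
    completeness = 1 - len(validate(item)) / len(REQUIRED_FIELDS)
    length_ok = 1.0 if len(item.get("content") or "") >= MIN_CONTENT_CHARS else 0.0
    return 0.7 * completeness + 0.3 * length_ok

# An anomaly detector could then alert when a crawl's average score
# drops sharply, which usually means a selector broke.
```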

Advanced API Features

Beyond the Phase 1A basics:

  • Real-time notifications for crawl completion and spider failures
  • Live crawl progress updates
  • Scrape multiple URLs in one request
  • Interactive API documentation
  • Python and JavaScript SDKs
  • OAuth, team management, usage analytics

Why: Phase 1A validates the core API; Phase 2 adds production features based on real usage.

Get Involved

Priority Drivers

1. Community Feedback: what you actually need drives development
2. Production Pain Points: what breaks in real usage gets fixed first
3. Ecosystem Trends: AI agents and new anti-bot systems shape features

Version Timeline

v0.1.0 - Pre-1.0 alpha/beta
Breaking changes expected until v1.0. We’ll provide migration guides.

Maintained by DiscourseLab

Last updated: March 2026