
Vision

AI agents write spider configs. Humans share them. Stop rebuilding scrapers everyone needs.

ScrapAI is infrastructure for reliable, reusable web scraping. Like npm for JavaScript or PyPI for Python, but for web scrapers.

Current State (v0.x)

Database-First Management

Write once, use forever with persistent spider storage

CloakBrowser Integration

Cloudflare bypass and JavaScript rendering

Incremental Crawling

DeltaFetch - only scrape new content

Smart Proxy Middleware

Auto-escalation when blocked

Checkpoint Resume

Production-grade pause and resume

S3 Cloud Storage

Automatic cloud backup

Multiple Extractors

Newspaper, trafilatura, and custom strategies

Named Callbacks

Custom field extraction

Queue System

Batch processing support

Cross-Platform

Linux, macOS, Windows via WSL

Test coverage: 26% overall (critical modules: 70-100%)

Phase 1A: Minimal REST API

Timeline

Q2 2026 - Months 1-2

Goal

Enable programmatic access and single-article scraping immediately

Core API Endpoints

Single Article Scraping

The killer feature - Extract single articles without full crawls
POST /api/scrape
{
  "url": "https://nytimes.com/2026/03/02/article",
  "spider": "nytimes"
}

The spider field selects the saved CSS selectors from the database; the response returns title, content, author, and date in ~2 seconds.
Use cases: RSS feed integration, real-time monitoring, AI agents testing configs
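As a client-side sketch, the call above could be issued from Python with only the standard library; the `X-API-Key` header name and local base URL are assumptions, not part of the documented API:

```python
import json
import urllib.request

def build_scrape_request(base_url: str, api_key: str,
                         url: str, spider: str) -> urllib.request.Request:
    """Construct the POST /api/scrape request with an API-key header."""
    payload = json.dumps({"url": url, "spider": spider}).encode()
    return urllib.request.Request(
        f"{base_url}/api/scrape",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # assumed header name
        },
        method="POST",
    )

# urllib.request.urlopen(req) would then return the JSON article
# with title, content, author, and date fields.
req = build_scrape_request("http://localhost:8000", "demo-key",
                           "https://nytimes.com/2026/03/02/article", "nytimes")
```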

Additional Endpoints

  • Trigger full crawl programmatically
  • Get crawl results for a specific spider
  • List all available spiders
  • Check crawl progress in real-time

Technical Stack

  • FastAPI - Async Python framework
  • API Key Auth - Secure authentication
  • Rate Limiting - Prevent abuse
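As one sketch of how the rate-limiting piece could work, a classic token bucket, typically wired in as a per-key FastAPI dependency; the rate and burst numbers here are illustrative, not committed limits:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per API key: the first `capacity` calls pass, later calls
# are throttled until tokens refill at `rate` per second.
bucket = TokenBucket(rate=1.0, capacity=5)
```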
Why first: Enables AI agents (OpenClaw, Claude Code, etc.) to integrate immediately. Single-article API validates core value prop before building marketplace infrastructure.

Phase 1B: Spider Library

Timeline

Q2 2026 - Months 2-3

Goal

Make spiders shareable and reusable - multiply API value

Spider Marketplace

Problem: Every developer and AI agent rebuilds scrapers for the same sites (NYT, BBC, Amazon, etc.)

Features

Spider Registry

Browse, search, and download configs from the community

Template Gallery

Pre-built templates for news, e-commerce, jobs, forums, government

Quality Indicators

Downloads, success rate, last updated, community ratings

Easy Import

One-command installation from registry

Versioning

Track changes, rollback when sites update

Community Driven

Collaborate on maintaining spider configs
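A versioned registry entry could look roughly like this; every field name and selector below is illustrative, not a committed schema:

```yaml
# Hypothetical registry entry (illustrative fields only)
name: nytimes
version: 2.1.0            # bumped when the site layout changes
license: MIT
maintainers:
  - community
selectors:                # saved CSS selectors used by the API
  title: "h1.headline"
  content: "section.article-body p"
  author: "span.byline"
  date: "time[datetime]"
```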

Quick Start

# Import a spider from the registry
./scrapai spiders import --from-registry nytimes

# Publish your own spider
./scrapai spiders publish my-spider --registry community

Benefits

  • Save days of development time
  • Production-tested configs
  • Community maintenance
  • Focus on data, not selectors
  • Skip spider building entirely
  • Instant data access - load config, start scraping
  • API + library = scrape hundreds of sites programmatically
Initial collection: 50+ spiders for top news sites, e-commerce, and job boards

Spider Publishing

Publish your spiders with ./scrapai spiders publish <spider> --registry community. Include documentation, example URLs, and choose a license (MIT, Apache, CC0).
Why after the API: the API creates demand, and the library multiplies its value. Early adopters use the API with their own spiders; the library then makes the API 10x more useful.

Phase 2: Quality & Advanced API

Timeline

Q3 2026

Goal

Make it reliable and production-ready

Data Quality & Validation

Problem: Scraped data might be incomplete or wrong

Features

Schema Validation

Require fields: title, content, date

Quality Scoring

Completeness and content length checks

Anomaly Detection

Detect site changes and broken selectors

Auto-Alerts

Email, Slack, webhook notifications

Validation Reports

Per-crawl quality metrics

Early Detection

Catch breakage before it impacts production
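The schema validation and quality scoring above could take roughly this shape; the required fields come from this roadmap, while the length threshold and scoring weights are assumptions:

```python
REQUIRED_FIELDS = ("title", "content", "date")
MIN_CONTENT_CHARS = 200  # illustrative threshold

def validate(item: dict) -> list[str]:
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not item.get(f)]

def quality_score(item: dict) -> float:
    """Score 0-1 from field completeness plus a content-length check."""
    completeness = 1 - len(validate(item)) / len(REQUIRED_FIELDS)
    length_ok = 1.0 if len(item.get("content") or "") >= MIN_CONTENT_CHARS else 0.0
    return 0.7 * completeness + 0.3 * length_ok

# An anomaly detector could then alert when a crawl's average score
# drops sharply, which usually means a selector broke.
```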

Advanced API Features

Beyond the Phase 1A basics:

  • Real-time notifications for crawl completion and spider failures
  • Live crawl progress updates
  • Scrape multiple URLs in one request
  • Interactive API documentation
  • Python and JavaScript SDKs
  • OAuth, team management, usage analytics

Why: Phase 1A validates the core API; Phase 2 adds production features based on real usage.

Get Involved

Priority Drivers

1. Community Feedback: what you actually need drives development
2. Production Pain Points: what breaks in real usage gets fixed first
3. Ecosystem Trends: AI agents and new anti-bot systems shape features

Version Timeline

v0.1.0 - Pre-1.0 alpha/beta
Breaking changes expected until v1.0. We’ll provide migration guides.

Maintained by DiscourseLab

Last updated: March 2026