Vision
AI agents write spider configs. Humans share them. Stop rebuilding scrapers everyone needs. ScrapAI is infrastructure for reliable, reusable web scraping: like npm for JavaScript or PyPI for Python, but for web scrapers.
Current State (v0.x)
Database-First Management
Write once, use forever with persistent spider storage
CloakBrowser Integration
Cloudflare bypass and JavaScript rendering
Incremental Crawling
DeltaFetch - only scrape new content
Smart Proxy Middleware
Auto-escalation when blocked
Checkpoint Resume
Production-grade pause and resume
S3 Cloud Storage
Automatic cloud backup
Multiple Extractors
Newspaper, trafilatura, and custom strategies
Named Callbacks
Custom field extraction
Queue System
Batch processing support
Cross-Platform
Linux, macOS, Windows via WSL
Test coverage: 26% overall (critical modules: 70-100%)
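As a sketch of the DeltaFetch idea above (illustrative only, not ScrapAI's actual implementation): keep a persistent set of request fingerprints and skip any URL that was already scraped on a previous run.

```python
import hashlib


class DeltaFetch:
    """Minimal sketch of DeltaFetch-style incremental crawling.

    In a real crawler the `seen` set would be persisted to disk or a
    database between runs; here it lives in memory for illustration.
    """

    def __init__(self):
        self.seen = set()  # fingerprints of previously scraped requests

    @staticmethod
    def fingerprint(url):
        # Stable fingerprint for a request; real implementations also
        # hash method, body, and normalized query parameters.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def should_fetch(self, url):
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False  # already scraped on an earlier run: skip it
        self.seen.add(fp)
        return True


df = DeltaFetch()
df.should_fetch("https://example.com/article-1")  # first run: fetch
df.should_fetch("https://example.com/article-1")  # later run: skipped
```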
Phase 1A: Minimal REST API
Core API Endpoints
Single Article Scraping
The killer feature: extract single articles without running a full crawl. Use cases: RSS feed integration, real-time monitoring, AI agents testing configs.
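To make the request/response shape concrete, here is a hedged sketch; the endpoint name /api/scrape, the field names, and the extractor value are assumptions, not the final ScrapAI API.

```python
# Hypothetical payload and response handling for a single-article endpoint.

def build_scrape_request(url, extractor="trafilatura"):
    """Payload an RSS integration or AI agent might POST to /api/scrape."""
    return {"url": url, "extractor": extractor, "render_js": False}


def parse_article(response):
    """Pull the fields most consumers need out of a JSON response body."""
    return {
        "title": response.get("title"),
        "content": response.get("content"),
        "published": response.get("date"),
    }


payload = build_scrape_request("https://example.com/news/story")
article = parse_article({"title": "Hello", "content": "...", "date": "2026-03-01"})
```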
Additional Endpoints
POST /api/crawl
Trigger full crawl programmatically
GET /api/results/{spider}
Get crawl results for a specific spider
GET /api/spiders
List all available spiders
GET /api/crawls/{id}/status
Check crawl progress in real-time
Technical Stack
- FastAPI - Async Python framework
- API Key Auth - Secure authentication
- Rate Limiting - Prevent abuse
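Rate limiting is commonly implemented as a token bucket per API key; a minimal sketch follows (illustrative, not the actual middleware):

```python
import time


class TokenBucket:
    """Per-API-key rate limiter sketch: allow `rate` requests per
    second with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request may proceed
        return False      # over the limit: respond 429


bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s, burst of 10
bucket.allow()
```

In a FastAPI app this check would live in a dependency keyed on the caller's API key, returning HTTP 429 when `allow()` is False.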
Why first: Enables AI agents (OpenClaw, Claude Code, etc.) to integrate immediately. The single-article API validates the core value proposition before we build marketplace infrastructure.
Phase 1B: Spider Library
Spider Marketplace
- Problem: Every developer/AI agent rebuilds scrapers for the same sites (NYT, BBC, Amazon, etc.)
- Solution: A community registry of shared, versioned spider configs
Features
Spider Registry
Browse, search, and download configs from the community
Template Gallery
Pre-built templates for news, e-commerce, jobs, forums, government
Quality Indicators
Downloads, success rate, last updated, community ratings
Easy Import
One-command installation from registry
Versioning
Track changes, rollback when sites update
Community Driven
Collaborate on maintaining spider configs
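A registry entry might look like the following. The schema is hypothetical, but it captures the quality indicators listed above:

```json
{
  "name": "bbc-news",
  "version": "1.4.2",
  "license": "MIT",
  "extractor": "trafilatura",
  "allowed_domains": ["bbc.com", "bbc.co.uk"],
  "stats": {
    "downloads": 1823,
    "success_rate": 0.97,
    "last_updated": "2026-02-14"
  }
}
```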
Quick Start
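A possible flow, sketched with hypothetical command names (only spiders publish is confirmed elsewhere in this document):

```shell
./scrapai spiders search bbc          # find a community config
./scrapai spiders install bbc-news    # one-command import from the registry
./scrapai crawl bbc-news              # run it locally
```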
Benefits
For Developers
- Save days of development time
- Production-tested configs
- Community maintenance
- Focus on data, not selectors
For AI Agents
- Skip spider building entirely
- Instant data access - load config, start scraping
- API + library = scrape hundreds of sites programmatically
Initial collection: 50+ spiders for top news sites, e-commerce, and job boards
Spider Publishing
Publish your spiders with ./scrapai spiders publish <spider> --registry community. Include documentation, example URLs, and choose a license (MIT, Apache, CC0).
Why after API: the API creates demand, and the library multiplies its value. Early adopters use the API with their own spiders; the library then makes the API 10x more useful.
Phase 2: Quality & Advanced API
Data Quality & Validation
Features
Schema Validation
Require fields: title, content, date
Quality Scoring
Completeness and content length checks
Anomaly Detection
Detect site changes and broken selectors
Auto-Alerts
Email, Slack, webhook notifications
Validation Reports
Per-crawl quality metrics
Early Detection
Catch breakage before it impacts production
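Schema validation and quality scoring can be sketched as one check. The field names, weights, and thresholds below are assumptions, not the planned implementation:

```python
REQUIRED_FIELDS = ("title", "content", "date")


def quality_score(item, min_content_len=200):
    """Return (score, problems) for a scraped item.

    Score combines completeness (required fields present) with a
    content-length check; problems lists the missing fields.
    """
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    completeness = 1 - len(problems) / len(REQUIRED_FIELDS)
    length_ok = len(item.get("content") or "") >= min_content_len
    score = 0.7 * completeness + 0.3 * (1.0 if length_ok else 0.0)
    return round(score, 2), problems


score, problems = quality_score(
    {"title": "Hi", "content": "x" * 300, "date": "2026-01-01"}
)
```

A per-crawl average of these scores gives the validation report, and a sudden drop across many items is exactly the anomaly signal that would trigger an auto-alert.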
Advanced API Features
Beyond the Phase 1A basics:
Webhooks
Real-time notifications for crawl completion and spider failures
WebSockets
Live crawl progress updates
Batch Operations
Scrape multiple URLs in one request
OpenAPI/Swagger
Interactive API documentation
Client Libraries
Python and JavaScript SDKs
Advanced Auth
OAuth, team management, usage analytics
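Webhook deliveries are typically signed so receivers can verify their origin; a stdlib sketch (the payload shape and event names are assumptions):

```python
import hashlib
import hmac
import json


def sign_webhook(secret: bytes, payload: dict):
    """Serialize a webhook payload and compute an HMAC-SHA256 signature
    that is sent alongside the body (e.g. in a signature header)."""
    body = json.dumps(payload, sort_keys=True)
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return body, sig


def verify_webhook(secret: bytes, body: str, sig: str) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)


body, sig = sign_webhook(b"s3cret", {"event": "crawl.completed", "spider": "bbc-news"})
verify_webhook(b"s3cret", body, sig)  # receiver accepts the notification
```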
Why: Phase 1A validates the core API; Phase 2 adds production features based on real usage.
Get Involved
Discussions
Join the community conversation
Feature Requests
Suggest new features
Spider Contributions
Coming in Phase 1B (Spider Marketplace)
GitHub
Star the project and contribute
Priority Drivers
Version Timeline
- Current: v0.1.0 - Pre-1.0 alpha/beta
- v0.5.0
- v1.0.0
- v2.0.0
Maintained by DiscourseLab
Last updated: March 2026