Why spiders live in the database, not in Python files
ScrapAI stores spiders as database rows, not Python files. This architectural choice enables management patterns that file-based scrapers cannot support.
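As a mental model, the storage can be sketched as two tables. The column set below is hypothetical, inferred from the queries on this page; the actual ScrapAI schema may differ.

```sql
-- Hypothetical schema, inferred from the queries on this page.
CREATE TABLE spiders (
    id              SERIAL PRIMARY KEY,
    name            TEXT NOT NULL,
    project         TEXT NOT NULL,
    active          BOOLEAN NOT NULL DEFAULT true,
    allowed_domains JSONB,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Per-spider settings stored as key/value rows,
-- e.g. key = 'DOWNLOAD_DELAY', value = '2'.
CREATE TABLE spider_settings (
    spider_id INTEGER REFERENCES spiders(id),
    key       TEXT NOT NULL,
    value     TEXT NOT NULL,
    PRIMARY KEY (spider_id, key)
);
```

Everything that follows is ordinary SQL against these rows.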
```sql
SELECT name, active, created_at, updated_at
FROM spiders
WHERE project = 'news'
ORDER BY created_at DESC;
```
Output:

```
Name            Active  Created              Updated
─────────────────────────────────────────────────────────────
bbc_co_uk       Yes     2026-01-15 10:23:11  2026-02-20 14:55:32
cnn_com         Yes     2026-01-15 11:10:45  2026-01-15 11:10:45
guardian_co_uk  No      2026-01-14 09:12:33  2026-02-18 16:22:10
```
No filesystem traversal, no parsing Python files, no guessing.
```sql
-- Increase delay for all news spiders
UPDATE spider_settings
SET value = '2'
WHERE key = 'DOWNLOAD_DELAY'
  AND spider_id IN (SELECT id FROM spiders WHERE project = 'news');
```
With files: write a script to parse and edit 100 Python files, and hope the regex doesn't break syntax.
With the database: one SQL query. Done.
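Because spiders are rows, a bulk edit can also be wrapped in a transaction and checked before it takes effect. This is a sketch, assuming a PostgreSQL backend with the `spider_settings` table used above:

```sql
BEGIN;

UPDATE spider_settings
SET value = '2'
WHERE key = 'DOWNLOAD_DELAY'
  AND spider_id IN (SELECT id FROM spiders WHERE project = 'news');

-- Inspect the result while the transaction is open;
-- issue ROLLBACK instead of COMMIT if the count looks wrong.
SELECT COUNT(*) FROM spider_settings
WHERE key = 'DOWNLOAD_DELAY' AND value = '2';

COMMIT;
```

There is no file-based equivalent of "try the edit, inspect it, then decide".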
```sql
-- Spiders created in the last 30 days
SELECT name, created_at FROM spiders
WHERE created_at > NOW() - INTERVAL '30 days';

-- Spiders not updated in 90 days (stale?)
SELECT name, updated_at FROM spiders
WHERE updated_at < NOW() - INTERVAL '90 days';
```
Project Queries
```sql
-- Count spiders per project
SELECT project, COUNT(*) AS spider_count
FROM spiders
GROUP BY project
ORDER BY spider_count DESC;

-- Output:
-- news: 45 spiders
-- ecommerce: 23 spiders
-- forums: 12 spiders
```
Activity Queries
```sql
-- Active vs inactive spiders
SELECT active, COUNT(*) AS count
FROM spiders
GROUP BY active;

-- Output:
-- Active: 78
-- Inactive: 22
```
```bash
# List all spiders
./scrapai db query "SELECT name, project, active FROM spiders"

# Count items per spider
./scrapai db query "SELECT spider_id, COUNT(*) FROM scraped_items GROUP BY spider_id"

# Find spiders with Cloudflare enabled
./scrapai db query "SELECT s.name FROM spiders s JOIN spider_settings ss ON s.id = ss.spider_id WHERE ss.key = 'CLOUDFLARE_ENABLED' AND ss.value = 'true'"
```
UPDATE and DELETE queries require confirmation via the `--confirm` flag, which takes the expected row count, to prevent accidental data loss.
```bash
./scrapai db query "DELETE FROM spiders WHERE name = 'test_spider'" --confirm 1
```
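A cautious pattern, sketched with the same CLI, is to count the matching rows first and feed that number to `--confirm` (this assumes `--confirm` aborts when the actual row count differs from the one given, as the confirmation requirement above suggests):

```bash
# Count the rows the DELETE would touch...
./scrapai db query "SELECT COUNT(*) FROM spiders WHERE name = 'test_spider'"

# ...then pass that count to --confirm, so a WHERE clause that
# unexpectedly matches more rows does not silently delete them.
./scrapai db query "DELETE FROM spiders WHERE name = 'test_spider'" --confirm 1
```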
```bash
# Get active spiders
./scrapai db query "SELECT name FROM spiders WHERE active = true AND project = 'news'" > active_spiders.txt

# Test each spider (5 sample URLs)
while read spider; do
  echo "Testing $spider..."
  ./scrapai crawl "$spider" --project news --limit 5
done < active_spiders.txt

# Check for extraction failures: a LEFT JOIN is needed here, because
# grouping scraped_items alone can never produce a zero-count group --
# spiders with no recent items simply would not appear.
./scrapai db query "SELECT s.name FROM spiders s LEFT JOIN scraped_items si ON si.spider_id = s.id AND si.scraped_at > NOW() - INTERVAL '1 day' WHERE s.active = true GROUP BY s.name HAVING COUNT(si.id) = 0"
```
```sql
-- Increase delay for aggressive sites
UPDATE spider_settings
SET value = '3'
WHERE key = 'DOWNLOAD_DELAY'
  AND spider_id IN (
    SELECT id FROM spiders
    WHERE allowed_domains::text LIKE '%amazon%'
       OR allowed_domains::text LIKE '%ebay%'
  );
```