The health command tests all spiders, detects failures (extraction vs crawling), and generates a markdown report for automated fixing.

Quick Start

# Test all spiders in project
./scrapai health --project news

# Custom options
./scrapai health --project news --limit 10 --min-content-length 100
Output:
✅ nytimes       5 items, extraction OK
❌ bbc           5 items, EXTRACTION BROKEN
❌ techcrunch    0 items, CRAWLING BROKEN

Report saved to: data/news/health/20260302/report.md
Workflow: Detect (5 min) → Fix (5 min) → Verify (2 min) = 12 min vs 45 min manual

How It Works

1. Runs test crawl: crawls each spider with --limit 5
2. Checks crawling: pass if 3+ items are found; fewer means crawling is broken
3. Checks extraction: pass if content is ≥ 50 chars; shorter means extraction is broken
4. Generates report: writes a markdown file with failure details and sample output
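The pass/fail rules above can be sketched as a small classifier. This is illustrative only; the `classify` helper, its parameters, and the "any item too short" interpretation are assumptions, not scrapai's actual implementation:

```python
# Sketch of the health-check decision rules (hypothetical helper, not scrapai code).
def classify(items: list, min_items: int = 3, min_content_length: int = 50) -> str:
    """Classify one spider's test crawl; items holds the extracted content strings."""
    if len(items) < min_items:
        return "CRAWLING BROKEN"    # too few items found
    if any(len(content) < min_content_length for content in items):
        return "EXTRACTION BROKEN"  # items found, but content too short
    return "OK"

print(classify([]))                        # no items crawled at all
print(classify(["x" * 500] * 5))           # healthy spider
print(classify(["", "short", "x" * 500]))  # items found but fields come back empty
```

Whether one short item or an average below the threshold triggers the failure is not specified in the docs; the sketch uses the strictest reading (any item too short fails).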

Two Failure Modes

Extraction broken
Symptoms: Finds articles but extracts empty or incomplete fields
Cause: CSS selectors changed (e.g., .article-content → .article-body)
Report shows: Items found, content too short, test URL, sample output
Fix: Update extraction selectors

Crawling broken
Symptoms: Fewer than 3 items found (often 0); the spider never reaches article pages
Cause: Navigation or link structure changed, requests blocked, or the site requires JS rendering
Fix: Update crawl rules, or use --browser if JS rendering is needed

Fixing Broken Spiders

Example: BBC Spider Breaks

1. Health check detects the failure:
$ ./scrapai health --project news
❌ bbc  5 items, EXTRACTION BROKEN

2. Ask the agent to fix it:
Read data/news/health/20260302/report.md and fix the broken bbc spider.
The agent analyzes the site, finds the selector changed from article[data-component="text-block"] to [data-component="article-body"] p, and updates the config.

3. Verify the fix:
$ ./scrapai health --project news
✅ bbc  5 items, extraction OK

Automated Testing

Cron Setup

# Monthly testing (recommended)
0 2 1 * * cd /path/to/scrapai-cli && ./scrapai health --project news

# Weekly for critical spiders
0 4 * * 1 cd /path/to/scrapai-cli && ./scrapai health --project critical
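To keep a run history for trend analysis, cron output can be appended to a log file. The logs/health.log path here is an assumption, not a scrapai convention:

```shell
# Monthly run with output logged for trend analysis (log path is an assumption)
0 2 1 * * cd /path/to/scrapai-cli && ./scrapai health --project news >> logs/health.log 2>&1
```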

Notifications

#!/bin/bash
# Email the newest health report when any spider fails (non-zero exit code)
if ! ./scrapai health --project "$1"; then
  report=$(find "data/$1/health" -name report.md | sort | tail -1)
  mail -s "ScrapAI failures in $1" team@example.com < "$report"
fi
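Because reports land in data/&lt;project&gt;/health/&lt;date&gt;/report.md and the date directories are YYYYMMDD, the newest report can be found by a simple lexicographic sort. A minimal sketch; the `latest_report` helper is not part of scrapai:

```python
from pathlib import Path
from typing import Optional

def latest_report(project: str, root: str = "data") -> Optional[Path]:
    """Return the newest health report for a project.

    Relies on the YYYYMMDD date directories sorting lexicographically,
    which matches chronological order for that format.
    """
    reports = sorted(Path(root, project, "health").glob("*/report.md"))
    return reports[-1] if reports else None
```

A sort is needed because `find`/`glob` order is filesystem-dependent; without it, "last result" is not guaranteed to be the newest report.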

Command Options

Option                  Description                     Default
--project               Project name (required)         -
--limit                 Items to test per spider        5
--min-content-length    Min chars to pass extraction    50
--report                Custom report path              data/<project>/health/<date>/report.md
Exit codes: 0 = all passed, 1 = failures detected

Best Practices

  • Monthly testing for most spiders
  • Weekly testing for critical sources
  • Adjust thresholds per content type (--min-content-length)
  • Keep reports for trend analysis
  • Batch fixes when multiple spiders break

Troubleshooting

All spiders failing: check for network, database, or rate-limiting issues. Test a single spider with verbose logging:
./scrapai crawl spider --project news --limit 1 --scrapy-args '-L DEBUG'
Intermittent failures: caused by A/B testing, geo-restrictions, or rate limiting. Run the tests multiple times, or use --browser if JS rendering is needed.
Spider can't be fixed: the site fundamentally changed (static → JS-rendered, paywall, anti-scraping). Try browser mode, manual inspection, or reconsider the source's viability.

Economics

Time savings at scale:
Spiders    Manual          Agent-Assisted    Saved
10         30 hrs/year     3 hrs/year        27 hrs
50         150 hrs/year    15 hrs/year       135 hrs
100        300 hrs/year    30 hrs/year       270 hrs
Cost (100 spiders): Manual = $33,300/year vs Agent-assisted = $9,500/year (including $800 in tokens). Assumes 4 breaks/spider/year, 45 min per manual fix, 10 min per agent-assisted fix, $100/hr developer cost.
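The Manual column follows directly from the stated assumptions (4 breaks per spider per year, 45 minutes per manual fix); a quick arithmetic check:

```python
# Verify the Manual column from the stated assumptions.
BREAKS_PER_SPIDER = 4       # breaks per spider per year
MANUAL_FIX_HOURS = 45 / 60  # 45 minutes per manual fix

for spiders in (10, 50, 100):
    manual_hours = spiders * BREAKS_PER_SPIDER * MANUAL_FIX_HOURS
    print(f"{spiders} spiders: {manual_hours:.0f} hrs/year")  # 30, 150, 300
```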