The health command tests all spiders, detects failures (extraction vs crawling), and generates a markdown report for automated fixing.

Quick Start

# Test all spiders in project
./scrapai health --project news

# Custom options
./scrapai health --project news --limit 10 --min-content-length 100
Output:
✅ nytimes       5 items, extraction OK
❌ bbc           5 items, EXTRACTION BROKEN
❌ techcrunch    0 items, CRAWLING BROKEN

Report saved to: data/news/health/20260302/report.md
Workflow: Detect (5 min) → Fix (5 min) → Verify (2 min) = 12 min vs 45 min manual

How It Works

1. Runs test crawl: crawls each spider with --limit 5
2. Checks crawling: pass if 3+ items are found; fewer means crawling is broken
3. Checks extraction: pass if content is ≥ 50 chars; shorter means extraction is broken
4. Generates report: writes a markdown file with failure details and sample output
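The pass/fail rules above can be sketched as a small classifier. This is illustrative only; the `classify` helper, its parameters, and the "any item too short" interpretation are assumptions, not scrapai's actual implementation:

```python
# Sketch of the health-check decision rules (hypothetical helper, not scrapai code).
def classify(items: list, min_items: int = 3, min_content_length: int = 50) -> str:
    """Classify one spider's test crawl; items holds the extracted content strings."""
    if len(items) < min_items:
        return "CRAWLING BROKEN"    # too few items found
    if any(len(content) < min_content_length for content in items):
        return "EXTRACTION BROKEN"  # items found, but content too short
    return "OK"

print(classify([]))                        # no items crawled at all
print(classify(["x" * 500] * 5))           # healthy spider
print(classify(["", "short", "x" * 500]))  # items found but fields come back empty
```

Whether one short item or an average below the threshold triggers the failure is not specified in the docs; the sketch uses the strictest reading (any item too short fails).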

Two Failure Modes

Extraction broken
Symptoms: Finds articles but extracts empty or incomplete fields
Cause: CSS selectors changed (e.g., .article-content → .article-body)
Report shows: Items found, content too short, test URL, sample output
Fix: Update extraction selectors

Crawling broken
Symptoms: Fewer than 3 items found (often 0); the spider never reaches article pages
Cause: Navigation or link structure changed, requests blocked, or the site requires JS rendering
Fix: Update crawl rules, or use --browser if JS rendering is needed

Fixing Broken Spiders

Example: BBC Spider Breaks

1. Health check detects the failure:
$ ./scrapai health --project news
❌ bbc  5 items, EXTRACTION BROKEN

2. Ask the agent to fix it:
Read data/news/health/20260302/report.md and fix the broken bbc spider.
The agent analyzes the site, finds the selector changed from article[data-component="text-block"] to [data-component="article-body"] p, and updates the config.

3. Verify the fix:
$ ./scrapai health --project news
✅ bbc  5 items, extraction OK

Automated Testing

Cron Setup

# Monthly testing (recommended)
0 2 1 * * cd /path/to/scrapai-cli && ./scrapai health --project news

# Weekly for critical spiders
0 4 * * 1 cd /path/to/scrapai-cli && ./scrapai health --project critical
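To keep a run history for trend analysis, cron output can be appended to a log file. The logs/health.log path here is an assumption, not a scrapai convention:

```shell
# Monthly run with output logged for trend analysis (log path is an assumption)
0 2 1 * * cd /path/to/scrapai-cli && ./scrapai health --project news >> logs/health.log 2>&1
```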

Notifications

#!/bin/bash
# Email the newest health report when any spider fails (non-zero exit code)
if ! ./scrapai health --project "$1"; then
  report=$(find "data/$1/health" -name report.md | sort | tail -1)
  mail -s "ScrapAI failures in $1" team@example.com < "$report"
fi
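Because reports land in data/&lt;project&gt;/health/&lt;date&gt;/report.md and the date directories are YYYYMMDD, the newest report can be found by a simple lexicographic sort. A minimal sketch; the `latest_report` helper is not part of scrapai:

```python
from pathlib import Path
from typing import Optional

def latest_report(project: str, root: str = "data") -> Optional[Path]:
    """Return the newest health report for a project.

    Relies on the YYYYMMDD date directories sorting lexicographically,
    which matches chronological order for that format.
    """
    reports = sorted(Path(root, project, "health").glob("*/report.md"))
    return reports[-1] if reports else None
```

A sort is needed because `find`/`glob` order is filesystem-dependent; without it, "last result" is not guaranteed to be the newest report.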

Command Options

Option                  Description                     Default
--project               Project name (required)         -
--limit                 Items to test per spider        5
--min-content-length    Min chars to pass extraction    50
--report                Custom report path              data/<project>/health/<date>/report.md
Exit codes: 0 = all passed, 1 = failures detected

Best Practices

  • Monthly testing for most spiders
  • Weekly testing for critical sources
  • Adjust thresholds per content type (--min-content-length)
  • Keep reports for trend analysis
  • Batch fixes when multiple spiders break

Troubleshooting

All spiders failing: check for network, database, or rate-limiting issues. Test a single spider with verbose logging:
./scrapai crawl spider --project news --limit 1 --scrapy-args '-L DEBUG'
Intermittent failures: caused by A/B testing, geo-restrictions, or rate limiting. Run the tests multiple times, or use --browser if JS rendering is needed.
Spider can't be fixed: the site fundamentally changed (static → JS-rendered, paywall, anti-scraping). Try browser mode, manual inspection, or reconsider the source's viability.

Economics

Time savings at scale:
Spiders    Manual          Agent-Assisted    Saved
10         30 hrs/year     3 hrs/year        27 hrs
50         150 hrs/year    15 hrs/year       135 hrs
100        300 hrs/year    30 hrs/year       270 hrs
Cost (100 spiders): Manual = $33,300/year vs Agent-assisted = $9,500/year (including $800 in tokens). Assumes 4 breaks/spider/year, 45 min per manual fix, 10 min per agent-assisted fix, $100/hr developer cost.
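The Manual column follows directly from the stated assumptions (4 breaks per spider per year, 45 minutes per manual fix); a quick arithmetic check:

```python
# Verify the Manual column from the stated assumptions.
BREAKS_PER_SPIDER = 4       # breaks per spider per year
MANUAL_FIX_HOURS = 45 / 60  # 45 minutes per manual fix

for spiders in (10, 50, 100):
    manual_hours = spiders * BREAKS_PER_SPIDER * MANUAL_FIX_HOURS
    print(f"{spiders} spiders: {manual_hours:.0f} hrs/year")  # 30, 150, 300
```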