The health command tests all spiders, classifies each failure as an extraction or a crawling problem, and generates a markdown report for automated fixing.
Quick Start
How It Works
Two Failure Modes
- Extraction Broken
- Crawling Broken
Symptoms: Finds articles but extracts empty or incomplete fields
Cause: CSS selectors changed (e.g., .article-content → .article-body)
Report shows: Items found, content too short, test URL, sample output
Fix: Update extraction selectors

Fixing Broken Spiders
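Which fix applies depends on which mode failed. The sketch below shows one way the two modes could be told apart. It is an illustration, not the health command's actual logic: the `Item` type, the `classify` function, and the rule that a single too-short item marks extraction as broken are all assumptions. Only the 50-character default comes from the --min-content-length option.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical shape of one scraped article."""
    url: str
    content: str


def classify(items: List[Item], min_content_length: int = 50) -> str:
    """Classify a spider test run (assumed rules, for illustration only)."""
    # No items at all: the spider never reached any articles.
    if not items:
        return "crawling_broken"
    # Items were found, but at least one has content below the threshold.
    if any(len(item.content.strip()) < min_content_length for item in items):
        return "extraction_broken"
    return "healthy"


if __name__ == "__main__":
    print(classify([]))
    print(classify([Item("https://example.com/a", "")]))
    print(classify([Item("https://example.com/a", "x" * 200)]))
```

A crawling failure is checked first because an empty result set says nothing about the selectors.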
Example: BBC Spider Breaks
1. Health check detects failure:
2. Agent changes the selector from article[data-component="text-block"] to [data-component="article-body"] p, and updates config.
3. Verify:
Automated Testing
Cron Setup
Notifications
Command Options
| Option | Description | Default |
|---|---|---|
| --project | Project name (required) | - |
| --limit | Items to test per spider | 5 |
| --min-content-length | Min chars to pass extraction | 50 |
| --report | Custom report path | data/<project>/health/<date>/report.md |
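To show how these options and defaults fit together, here is a small sketch that mirrors the table with Python's argparse. It is not the tool's real parser. The function names `build_parser` and `resolve_report_path` are hypothetical, and the ISO `YYYY-MM-DD` format used for `<date>` is an assumption; the table does not say how the date is written.

```python
import argparse
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the options table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="health")
    parser.add_argument("--project", required=True, help="Project name")
    parser.add_argument("--limit", type=int, default=5,
                        help="Items to test per spider")
    parser.add_argument("--min-content-length", type=int, default=50,
                        help="Min chars to pass extraction")
    parser.add_argument("--report", default=None, help="Custom report path")
    return parser


def resolve_report_path(project: str, report: Optional[str], today: date) -> str:
    """Return the custom path if given, else data/<project>/health/<date>/report.md."""
    if report is not None:
        return report
    # Assumption: <date> is rendered as an ISO date such as 2024-01-15.
    return str(PurePosixPath("data") / project / "health" / today.isoformat() / "report.md")


if __name__ == "__main__":
    args = build_parser().parse_args(["--project", "news"])
    print(args.limit, args.min_content_length)
    print(resolve_report_path(args.project, args.report, date.today()))
```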
Best Practices
- Monthly testing for most spiders
- Weekly testing for critical sources
- Adjust thresholds per content type (--min-content-length)
- Keep reports for trend analysis
- Batch fixes when multiple spiders break
Troubleshooting
All Tests Failing
Check for network, database, or rate-limiting issues. Test a single spider with verbose logging:
Intermittent Failures
Caused by A/B testing, geo-restrictions, or rate limiting. Run tests multiple times, or use --browser if JS rendering is needed.
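Running a test several times can be sketched as follows. This is a generic illustration rather than part of the health command: `run_repeated` and the `check` callable are hypothetical names, where `check` stands in for whatever runs one spider test and returns True on success.

```python
from typing import Callable


def run_repeated(check: Callable[[], bool], attempts: int = 3) -> str:
    """Run a spider check several times and summarize the outcome."""
    results = [check() for _ in range(attempts)]
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    # Mixed results point to A/B testing, geo-restrictions, or rate limiting.
    return "intermittent"


if __name__ == "__main__":
    outcomes = iter([True, False, True])
    print(run_repeated(lambda: next(outcomes)))
```

A spider reported as "intermittent" is usually worth re-testing later before any selectors are changed.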
Agent Can't Fix
The site has fundamentally changed (static → JS-rendered, paywall, anti-scraping). Try browser mode, inspect the site manually, or reconsider whether the source is still viable.
Economics
Time savings at scale:

| Spiders | Manual | Agent-Assisted | Saved |
|---|---|---|---|
| 10 | 30 hrs/year | 3 hrs/year | 27 hrs |
| 50 | 150 hrs/year | 15 hrs/year | 135 hrs |
| 100 | 300 hrs/year | 30 hrs/year | 270 hrs |
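The table follows from two per-spider rates implied by its own rows: 3 hours per spider per year for manual maintenance (30 hrs for 10 spiders), with the agent-assisted figure at one tenth of that. The sketch below reproduces the arithmetic; the constant and function names are illustrative, and the rates are read off the table rather than measured independently.

```python
MANUAL_HOURS_PER_SPIDER = 3   # from the table: 30 hrs/year for 10 spiders
ASSISTED_FRACTION = 10        # from the table: agent-assisted is 1/10 of manual


def yearly_hours(spiders: int) -> tuple:
    """Return (manual, agent_assisted, saved) hours per year."""
    manual = spiders * MANUAL_HOURS_PER_SPIDER
    assisted = manual / ASSISTED_FRACTION
    return manual, assisted, manual - assisted


if __name__ == "__main__":
    for n in (10, 50, 100):
        manual, assisted, saved = yearly_hours(n)
        print(f"{n} spiders: {manual:.0f} manual, {assisted:.0f} assisted, {saved:.0f} saved")
```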