## How It Works

DeltaFetch stores a content hash for each page it scrapes. On subsequent crawls, pages whose hash is unchanged are skipped, so only new and updated pages are processed.

Efficiency gains:
- Reduces bandwidth usage
- Faster crawl times
- Lower server load
- Cost savings on large-scale crawls
## Configuration

### Basic Setup

Enable DeltaFetch in your spider configuration (`spider.json`).
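A minimal sketch of what this could look like. The exact schema is not shown here, so `name`, `start_urls`, and a Scrapy-style `settings` block with `DELTAFETCH_ENABLED` (the scrapy-deltafetch convention) are assumptions, not confirmed keys:

```json
{
  "name": "my_spider",
  "start_urls": ["https://example.com"],
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```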
### Custom Storage Location

By default, hashes are stored in `.scrapy/deltafetch/<spider_name>/`. You can customize this in `spider.json`.
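A sketch of a custom storage path, again assuming Scrapy-style settings; `DELTAFETCH_DIR` follows the scrapy-deltafetch convention and is an assumption, not a key confirmed by this guide:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": "/data/deltafetch/my_spider"
  }
}
```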
### Reset Options

Via configuration (recommended for testing), set the reset flag in `spider.json`.
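`DELTAFETCH_RESET: true` is the setting named in this guide; the surrounding structure is an assumed sketch:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_RESET": true
  }
}
```

Set it back to `false` (or remove it) after the forced re-crawl, otherwise every run starts from scratch.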
## Complete Examples

### News Site with Daily Updates

A spider for a news site that is crawled once a day, configured in `news_spider.json`.
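A hedged sketch of what that file might contain (key names are assumptions, as in Basic Setup). Run it on a daily schedule and DeltaFetch skips articles that have not changed since the previous run:

```json
{
  "name": "news_spider",
  "start_urls": ["https://news.example.com"],
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```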
### Blog with Weekly Updates

A spider for a blog that is crawled once a week, configured in `blog_spider.json`.
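A similar sketch for the weekly crawl, with a dedicated storage directory so its hashes stay separate from other spiders (key names are assumptions as above):

```json
{
  "name": "blog_spider",
  "start_urls": ["https://blog.example.com"],
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": "/data/deltafetch/blog_spider"
  }
}
```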
## Combining with Other Features

### DeltaFetch + Cloudflare Bypass

Use DeltaFetch together with Cloudflare bypass in `spider.json`.
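A sketch of combining both features; `cloudflare_bypass` is a hypothetical option name used only for illustration (see the Cloudflare Bypass guide for the actual option), and the DeltaFetch keys are assumptions as above:

```json
{
  "start_urls": ["https://protected.example.com"],
  "cloudflare_bypass": true,
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```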
### DeltaFetch + Sitemap

Use DeltaFetch with sitemap-based discovery in `spider.json`.
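A sketch of seeding the crawl from a sitemap; `sitemap_url` is a hypothetical option name for illustration (see the Sitemap guide for the actual option), and the DeltaFetch keys are assumptions as above:

```json
{
  "sitemap_url": "https://example.com/sitemap.xml",
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```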
### DeltaFetch + Proxy

Use DeltaFetch together with a proxy in `spider.json`.
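A sketch of routing the crawl through a proxy; `proxy` is a hypothetical option name for illustration, and the DeltaFetch keys are assumptions as above:

```json
{
  "start_urls": ["https://example.com"],
  "proxy": "http://user:pass@proxy.example.com:8080",
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```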
## Monitoring

### Check Log Output
Look for DeltaFetch debug messages in the crawl log.

### Check Storage

Inspect the hash database directory (`.scrapy/deltafetch/<spider_name>/`) to confirm that entries are being written.

### Statistics

The scrape stats show how many items were skipped.

## Troubleshooting
### Not Skipping Any Pages
- Check whether this is the first crawl: the first crawl never skips anything, because there is nothing to compare against.
- Run the crawl again to see the skipping behavior.
### Skipping Pages That Should Be Re-Crawled
Force a re-crawl by deleting the hash database or by setting `DELTAFETCH_RESET: true` (see Reset Options above).
### Pages Changed But Not Detected
Possible causes:

- Content hash unchanged
  - Minor changes (timestamps, ads) may not affect the core content hash
  - DeltaFetch compares the content body, not dynamic elements
- Cache issues
  - Clear the hash database and re-crawl
- Spider extracts different content
  - Check that the selectors are targeting the correct content
### Storage Growing Too Large

Check the size of the hash database directory (`.scrapy/deltafetch/`).

- Many unique pages crawled (this is normal)
- Consider periodic cleanup for old sites
## Limitations
- First crawl is always full - no prior hashes to compare
- Hash database is local - not synced across machines
- Content-based detection - minor metadata changes (timestamps, ads) may not trigger re-crawl
## Use Cases

Ideal for:

- News sites - Daily updates with thousands of articles (95-98% reduction)
- Product catalogs - Skip unchanged prices, only scrape new products and updates
- Job boards - Focus on new postings, skip filled positions
- Documentation sites - Detect updated pages, efficient monitoring
## Best Practices

**Enable for recurring crawls.** DeltaFetch is most useful for spiders that run repeatedly (daily, weekly, or monthly); it is not needed for one-time crawls.
## Performance Metrics

Example: a news site with 5000 articles.

| Crawl | Articles Scraped | Articles Skipped | Time Saved |
|---|---|---|---|
| First | 5000 | 0 | - |
| Day 2 | 50 | 4950 | 99% |
| Day 3 | 75 | 4925 | 98.5% |
| Week 2 | 200 | 4800 | 96% |
- Daily updates: 95-99% reduction in pages processed
- Weekly updates: 90-95% reduction
- Monthly updates: 80-90% reduction
## Related Guides

- Checkpoint Resume - Pause and resume long crawls
- Queue Processing - Batch process multiple websites
- Cloudflare Bypass - Handle protected sites