DeltaFetch skips pages that haven’t changed since the last crawl. First crawl scrapes everything; subsequent crawls only process new or modified pages.
How It Works

- **First crawl**: scrapes all pages and stores content hashes
- **Subsequent crawls**: compares page hashes before processing
- **Skip unchanged**: pages with matching hashes are skipped
- **Process changed**: only new or modified pages are scraped
Efficiency gains:
- Reduces bandwidth usage
- Faster crawl times
- Lower server load
- Cost savings on large-scale crawls
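Conceptually, the skip decision is a lookup keyed by URL. The sketch below is a simplified illustration, not the actual scrapy_deltafetch implementation (which persists its keys in a local database under `.scrapy/deltafetch/`):

```python
import hashlib

def content_hash(body: bytes) -> str:
    """Hash the response body so identical content yields identical keys."""
    return hashlib.sha1(body).hexdigest()

def should_skip(url: str, body: bytes, seen: dict) -> bool:
    """Return True when the page's hash matches the one stored last crawl."""
    h = content_hash(body)
    if seen.get(url) == h:
        return True      # unchanged: skip processing
    seen[url] = h        # new or modified: store the fresh hash
    return False

seen = {}
# First crawl: nothing stored, everything is processed.
assert should_skip("https://example.com/a", b"<html>v1</html>", seen) is False
# Second crawl, unchanged page: skipped.
assert should_skip("https://example.com/a", b"<html>v1</html>", seen) is True
# Page modified: processed again.
assert should_skip("https://example.com/a", b"<html>v2</html>", seen) is False
```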
Configuration
Basic Setup
```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```
That’s it! DeltaFetch is now enabled for your spider.
Custom Storage Location
By default, hashes are stored in `.scrapy/deltafetch/<spider_name>/`. You can customize this:
```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": ".scrapy/deltafetch/my_spider"
  }
}
```
Reset (For Testing)
Setting `DELTAFETCH_RESET: true` clears all stored hashes and re-crawls everything once. Remove this setting after the reset crawl.
```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_RESET": true
  }
}
```
Reset Options
Delete All Hash Storage

```shell
rm -rf .scrapy/deltafetch/
```

Removes all DeltaFetch data for all spiders.

Delete Specific Spider's Data

```shell
rm -rf .scrapy/deltafetch/<spider_name>/
```

Removes DeltaFetch data for one spider only.
One-Time Reset via Config
Set `DELTAFETCH_RESET: true` in the spider settings. This re-crawls everything once; remove the setting afterwards.
Complete Example
News Site with Daily Updates
```json
{
  "name": "dailynews",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/news"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/news/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DOWNLOAD_DELAY": 1
  }
}
```
First crawl (Monday):

```shell
./scrapai crawl dailynews --project news
```

```
Scraped 500 articles (first crawl - all new)
Stored 500 content hashes
```

Second crawl (Tuesday):

```shell
./scrapai crawl dailynews --project news
```

```
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/article/old-1
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/article/old-2
...
Scraped 50 articles (only new/modified)
Skipped 450 unchanged articles
```
Blog with Weekly Updates
```json
{
  "name": "techblog",
  "allowed_domains": ["techblog.com"],
  "start_urls": ["https://techblog.com/posts"],
  "rules": [
    {
      "allow": ["/post/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": ".scrapy/deltafetch/techblog"
  }
}
```
Weekly cron job:
```shell
# crontab entry
0 2 * * 1 cd /path/to/scrapai && ./scrapai crawl techblog --project blogs
```
Only new posts from the past week are scraped.
Combining with Other Features
DeltaFetch + Cloudflare Bypass
```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid"
  }
}
```
Skip unchanged pages while handling Cloudflare protection.
DeltaFetch + Sitemap
```json
{
  "settings": {
    "USE_SITEMAP": true,
    "DELTAFETCH_ENABLED": true
  }
}
```
Crawl sitemap URLs but skip unchanged pages.
DeltaFetch + Proxy
```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```

```shell
./scrapai crawl myspider --project proj --proxy-type datacenter
```
Combine incremental crawling with smart proxy usage.
Monitoring
Check Log Output
Look for DeltaFetch debug messages:
```
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/page1
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/page2
```
These indicate pages being skipped.
Check Storage
```shell
# View size of hash database
ls -lh .scrapy/deltafetch/
# Example output:
# drwxr-xr-x  3 user  staff    96B Feb 24 10:30 myspider/

# View spider-specific storage
ls -lh .scrapy/deltafetch/myspider/
# Example output:
# -rw-r--r--  1 user  staff   128K Feb 24 10:30 hashes.db
```
Statistics
Scrape stats show skipped items:
```
'deltafetch/skipped': 450,
'item_scraped_count': 50,
```
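As a quick sanity check, the skip rate can be derived from these two counters (counter names taken from the stats output above):

```python
# Derive the skip rate from the crawl stats shown above.
stats = {
    "deltafetch/skipped": 450,
    "item_scraped_count": 50,
}
total = stats["deltafetch/skipped"] + stats["item_scraped_count"]
skip_rate = stats["deltafetch/skipped"] / total
print(f"Skipped {skip_rate:.0%} of {total} pages")  # Skipped 90% of 500 pages
```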
Troubleshooting
Not Skipping Any Pages
- Verify the setting is enabled: check that `DELTAFETCH_ENABLED: true` is in the spider settings.
- Check whether this is the first crawl: the first crawl never skips anything (there is nothing to compare against). Run the crawl again to see skipping behavior.
- Verify the storage directory exists:

  ```shell
  ls -la .scrapy/deltafetch/
  ```

  If the directory is empty, the first crawl hasn't completed yet.
- Check that the hash database has data:

  ```shell
  ls -lh .scrapy/deltafetch/<spider_name>/
  ```

  The file should have a non-zero size.
Skipping Pages That Should Be Re-Crawled
If you need to force a re-crawl of all pages, delete the hash database.

Option 1: Delete the hash database

```shell
rm -rf .scrapy/deltafetch/<spider_name>/
```

Option 2: Use `DELTAFETCH_RESET`

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_RESET": true
  }
}
```

Run the crawl once, then remove the `DELTAFETCH_RESET` setting.
Pages Changed But Not Detected
Possible causes:

- Content hash unchanged: minor changes (timestamps, ads) may not affect the core content hash; DeltaFetch compares the content body, not dynamic elements.
- Cache issues: clear the hash database and re-crawl.
- Spider extracts different content: check that selectors are targeting the correct content.
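To see why a timestamp change may not trigger a re-crawl, consider hashing only the extracted core content. This is an illustrative sketch, not the middleware's actual normalization; the `<time>` element stripping is a hypothetical example of a volatile fragment:

```python
import hashlib
import re

def normalized_hash(html: str) -> str:
    """Hash only the core content: strip volatile elements before hashing.
    (Illustrative only; the real normalization may differ.)"""
    # Drop a hypothetical timestamp element that changes on every render.
    core = re.sub(r"<time[^>]*>.*?</time>", "", html, flags=re.S)
    core = re.sub(r"\s+", " ", core).strip()
    return hashlib.sha1(core.encode()).hexdigest()

v1 = "<article><time>2025-02-24 10:30</time><p>Same body</p></article>"
v2 = "<article><time>2025-02-25 09:00</time><p>Same body</p></article>"
# Timestamps differ, but the core content hashes match: no re-crawl.
assert normalized_hash(v1) == normalized_hash(v2)
```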
Storage Growing Too Large
Check size:

```shell
du -sh .scrapy/deltafetch/<spider_name>/
```

A large database usually just means many unique pages have been crawled (normal); consider periodic cleanup for sites you no longer crawl.

Cleanup old data:

```shell
# Full reset
rm -rf .scrapy/deltafetch/<spider_name>/

# Or move to archive
mv .scrapy/deltafetch/<spider_name>/ .scrapy/deltafetch/<spider_name>.old/
```
Limitations
Important limitations:

- First crawl is always full: there are no prior hashes to compare against, so all pages are scraped.
- Detects content changes only: new pages are always crawled; only changes to existing pages are detected.
- Hash database is local: it is not synced across machines; each machine maintains separate hash storage.
- Content-based detection: minor metadata changes (timestamps) may not trigger a re-crawl; detection focuses on the main content body.
Use Cases
News Sites

Scenario: daily news site with thousands of articles

Benefits:
- First crawl: 5000 articles
- Daily crawls: only 50-100 new articles
- 95-98% reduction in pages processed

Setup:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```

Product Catalogs

Scenario: e-commerce site with price updates

Benefits:
- Skip products with unchanged prices
- Only scrape new products and price changes
- Reduce server load

Setup:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DOWNLOAD_DELAY": 2
  }
}
```

Job Boards

Scenario: job listing site with frequent updates

Benefits:
- Skip old/filled positions
- Focus on new job postings
- Faster monitoring

Setup:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
```

Documentation Sites

Scenario: API documentation with occasional updates

Benefits:
- Only re-scrape updated pages
- Detect documentation changes
- Efficient monitoring

Setup:

```json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": ".scrapy/deltafetch/docs"
  }
}
```
Best Practices
- Enable for recurring crawls: DeltaFetch is most useful for spiders that run repeatedly (daily, weekly, monthly).
- Not needed for one-time crawls: if you only crawl a site once, DeltaFetch provides no benefit.
- Combine with scheduled crawls: use with cron jobs or scheduled tasks:

  ```shell
  # crontab entry - daily at 2 AM
  0 2 * * * cd /path/to/scrapai && ./scrapai crawl myspider --project proj
  ```

- Monitor storage growth: periodically check and clean old hash databases.
- Test reset behavior: use `DELTAFETCH_RESET: true` to test full re-crawl behavior.
Example: News site with 5000 articles
| Crawl | Articles Scraped | Articles Skipped | Time Saved |
|---|---|---|---|
| First | 5000 | 0 | - |
| Day 2 | 50 | 4950 | 99% |
| Day 3 | 75 | 4925 | 98.5% |
| Week 2 | 200 | 4800 | 96% |
Typical efficiency gains:
- Daily updates: 95-99% reduction in pages processed
- Weekly updates: 90-95% reduction
- Monthly updates: 80-90% reduction
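The "Time Saved" figures in the table above are simply the skipped pages as a fraction of the full 5000-page crawl:

```python
# Reproduce the "Time Saved" column: skipped pages / total pages.
total = 5000
for crawl, scraped in [("Day 2", 50), ("Day 3", 75), ("Week 2", 200)]:
    skipped = total - scraped
    print(f"{crawl}: {skipped / total:.1%} saved")
# Day 2: 99.0% saved
# Day 3: 98.5% saved
# Week 2: 96.0% saved
```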