DeltaFetch skips pages that haven’t changed since the last crawl. First crawl scrapes everything; subsequent crawls only process new or modified pages.

How It Works

1. First crawl: scrapes all pages and stores content hashes.
2. Subsequent crawls: compares page hashes before processing.
3. Skip unchanged: pages with matching hashes are skipped.
4. Process changed: only new or modified pages are scraped.

Efficiency gains:
  • Reduces bandwidth usage
  • Faster crawl times
  • Lower server load
  • Cost savings on large-scale crawls
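
The cycle above can be sketched in a few lines of Python. This is a conceptual illustration of the store-and-compare logic, not the actual scrapy-deltafetch code; the `should_process` helper and the `dbm` file standing in for the hash storage are assumptions for the sketch.

```python
import dbm
import hashlib
import os
import tempfile

def should_process(url: str, body: bytes, db) -> bool:
    """Return True when the page is new or its content changed."""
    digest = hashlib.sha256(body).hexdigest().encode()
    key = url.encode()
    if db.get(key) == digest:
        return False      # hash matches the stored one: skip
    db[key] = digest      # new or modified page: record and process
    return True

path = os.path.join(tempfile.mkdtemp(), "hashes")
with dbm.open(path, "c") as db:
    print(should_process("https://example.com/a", b"v1", db))  # True  (first crawl)
    print(should_process("https://example.com/a", b"v1", db))  # False (unchanged)
    print(should_process("https://example.com/a", b"v2", db))  # True  (modified)
```

The same three-step pattern (hash, compare, store) is what makes repeat crawls cheap: only the hash lookup runs for unchanged pages.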

Configuration

Basic Setup

spider.json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
That’s it! DeltaFetch is now enabled for your spider.

Custom Storage Location

By default, hashes are stored in .scrapy/deltafetch/<spider_name>/. You can customize this:
spider.json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": ".scrapy/deltafetch/my_spider"
  }
}

Reset (For Testing)

Setting DELTAFETCH_RESET: true clears all stored hashes and re-crawls everything once. Remove this setting after the reset crawl.
spider.json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_RESET": true
  }
}

Reset Options

Delete All Hash Storage

rm -rf .scrapy/deltafetch/
Removes all DeltaFetch data for all spiders.

Delete Specific Spider’s Data

rm -rf .scrapy/deltafetch/<spider_name>/
Removes DeltaFetch data for one spider only.

One-Time Reset via Config

Set DELTAFETCH_RESET: true in spider settings. This re-crawls everything once; remove the setting after that crawl.

Complete Example

News Site with Daily Updates

news_spider.json
{
  "name": "dailynews",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/news"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    },
    {
      "allow": ["/news/"],
      "callback": null,
      "follow": true,
      "priority": 50
    }
  ],
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DOWNLOAD_DELAY": 1
  }
}
First crawl (Monday):
./scrapai crawl dailynews --project news
Scraped 500 articles (first crawl - all new)
Stored 500 content hashes
Second crawl (Tuesday):
./scrapai crawl dailynews --project news
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/article/old-1
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/article/old-2
...
Scraped 50 articles (only new/modified)
Skipped 450 unchanged articles

Blog with Weekly Updates

blog_spider.json
{
  "name": "techblog",
  "allowed_domains": ["techblog.com"],
  "start_urls": ["https://techblog.com/posts"],
  "rules": [
    {
      "allow": ["/post/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_DIR": ".scrapy/deltafetch/techblog"
  }
}
Weekly cron job:
# crontab entry
0 2 * * 1 cd /path/to/scrapai && ./scrapai crawl techblog --project blogs
Only new posts from the past week are scraped.

Combining with Other Features

DeltaFetch + Cloudflare Bypass

spider.json
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid"
  }
}
Skip unchanged pages while handling Cloudflare protection.

DeltaFetch + Sitemap

spider.json
{
  "settings": {
    "USE_SITEMAP": true,
    "DELTAFETCH_ENABLED": true
  }
}
Crawl sitemap URLs but skip unchanged pages.

DeltaFetch + Proxy

spider.json
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
./scrapai crawl myspider --project proj --proxy-type datacenter
Combine incremental crawling with smart proxy usage.

Monitoring

Check Log Output

Look for DeltaFetch debug messages:
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/page1
[scrapy_deltafetch] DEBUG: Ignoring already fetched: https://example.com/page2
These indicate pages being skipped.

Check Storage

# View size of hash database
ls -lh .scrapy/deltafetch/

# Example output:
drwxr-xr-x  3 user  staff    96B Feb 24 10:30 myspider/
# View spider-specific storage
ls -lh .scrapy/deltafetch/myspider/

# Example output:
-rw-r--r--  1 user  staff   128K Feb 24 10:30 hashes.db

Statistics

Scrape stats show skipped items:
'deltafetch/skipped': 450,
'item_scraped_count': 50,
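
From those two counters you can compute the skip rate directly (the dict literal simply mirrors the stats shown above):

```python
# Compute the skip rate from the two stat counters shown above.
stats = {"deltafetch/skipped": 450, "item_scraped_count": 50}
total = stats["deltafetch/skipped"] + stats["item_scraped_count"]
skip_rate = stats["deltafetch/skipped"] / total
print(f"Skipped {skip_rate:.0%} of {total} known pages")  # → Skipped 90% of 500 known pages
```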

Troubleshooting

Not Skipping Any Pages

1. Verify the setting is enabled
   Check DELTAFETCH_ENABLED: true in spider settings.
2. Check whether this is the first crawl
   The first crawl never skips (nothing to compare against). Run the crawl again to see skipping behavior.
3. Verify the storage directory exists
   ls -la .scrapy/deltafetch/
   If the directory is empty, the first crawl hasn't completed yet.
4. Check that the hash database has data
   ls -lh .scrapy/deltafetch/<spider_name>/
   The file should have a non-zero size.

Skipping Pages That Should Be Re-Crawled

If you need to force a re-crawl of all pages, clear the stored hashes in one of two ways.
Option 1: Delete hash database
rm -rf .scrapy/deltafetch/<spider_name>/
Option 2: Use DELTAFETCH_RESET
{
  "settings": {
    "DELTAFETCH_ENABLED": true,
    "DELTAFETCH_RESET": true
  }
}
Run crawl once, then remove DELTAFETCH_RESET setting.

Pages Changed But Not Detected

Possible causes:
  1. Content hash unchanged
    • Minor changes (timestamps, ads) may not affect core content hash
    • DeltaFetch compares content body, not dynamic elements
  2. Cache issues
    • Clear hash database and re-crawl
  3. Spider extracts different content
    • Check if selectors are targeting correct content
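
The first cause cuts both ways: normalizing away dynamic fragments is exactly what keeps cosmetic edits from triggering re-crawls, but it also means timestamp-only changes go undetected. A hypothetical sketch of that idea (the `content_hash` helper and its regex are illustrative, not DeltaFetch internals):

```python
import hashlib
import re

def content_hash(html: str) -> str:
    # Hypothetical normalization: strip date/time fragments before
    # hashing so cosmetic changes (timestamps, counters) don't force
    # a re-crawl. The regex is illustrative; real sites need their own.
    normalized = re.sub(r"\d{4}-\d{2}-\d{2}[^<]*", "", html)
    return hashlib.sha256(normalized.encode()).hexdigest()

a = content_hash("<p>Story body</p><span>2024-02-24 10:30</span>")
b = content_hash("<p>Story body</p><span>2024-02-25 09:00</span>")
c = content_hash("<p>Updated body</p><span>2024-02-25 09:00</span>")
assert a == b      # timestamp-only edit: hash unchanged, page skipped
assert a != c      # body edit: hash differs, page re-crawled
```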

Storage Growing Too Large

Check size:
du -sh .scrapy/deltafetch/<spider_name>/
Large storage indicates:
  • Many unique pages crawled (normal)
  • Consider periodic cleanup for old sites
Cleanup old data:
# Full reset
rm -rf .scrapy/deltafetch/<spider_name>/

# Or move to archive
mv .scrapy/deltafetch/<spider_name>/ .scrapy/deltafetch/<spider_name>.old/

Limitations

Important limitations:
  1. First crawl is always full
    • No prior hashes to compare against
    • All pages are scraped
  2. Detects content changes only
    • New pages are always crawled
    • Only existing page changes are detected
  3. Hash database is local
    • Not synced across machines
    • Each machine maintains separate hash storage
  4. Content-based detection
    • Minor metadata changes (timestamps) may not trigger re-crawl
    • Focuses on main content body

Use Cases

Scenario: Daily news site with 1000s of articles

Benefits:
  • First crawl: 5000 articles
  • Daily crawls: Only 50-100 new articles
  • 95-98% reduction in pages processed
Setup:
{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}

Best Practices

1. Enable for recurring crawls
   DeltaFetch is most useful for spiders that run repeatedly (daily, weekly, monthly).
2. Not needed for one-time crawls
   If you only crawl a site once, DeltaFetch provides no benefit.
3. Combine with scheduled crawls
   Use with cron jobs or scheduled tasks:
   # crontab entry - daily at 2 AM
   0 2 * * * cd /path/to/scrapai && ./scrapai crawl myspider --project proj
4. Monitor storage growth
   Periodically check and clean old hash databases.
5. Test reset behavior
   Use DELTAFETCH_RESET: true to test full re-crawl behavior.

Performance Metrics

Example: News site with 5000 articles
| Crawl  | Articles Scraped | Articles Skipped | Time Saved |
|--------|------------------|------------------|------------|
| First  | 5000             | 0                | -          |
| Day 2  | 50               | 4950             | 99%        |
| Day 3  | 75               | 4925             | 98.5%      |
| Week 2 | 200              | 4800             | 96%        |
Typical efficiency gains:
  • Daily updates: 95-99% reduction in pages processed
  • Weekly updates: 90-95% reduction
  • Monthly updates: 80-90% reduction