Pause long-running crawls and resume them later without losing progress
ScrapAI automatically enables checkpoint support for production crawls, allowing you to pause long-running crawls and resume them later without losing progress.
Received interrupt signal, shutting down...
Checkpoint saved: 150 items scraped, 237 URLs pending
Run same command to resume from checkpoint
3. Check checkpoint exists:
ls -la ./data/news/techcrunch/checkpoint/
Output:
drwxr-xr-x 5 user staff  160 Feb 24 10:30 .
drwxr-xr-x 7 user staff  224 Feb 24 10:15 ..
-rw-r--r-- 1 user staff 4096 Feb 24 10:30 requests.queue
-rw-r--r-- 1 user staff 8192 Feb 24 10:30 dupefilter.db
-rw-r--r-- 1 user staff  512 Feb 24 10:30 spider.state
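This layout mirrors Scrapy's job-directory persistence, where spider.state is a pickled dict of anything the spider stored between runs. Assuming ScrapAI keeps that format (an assumption, not documented here), a small helper can peek inside a checkpoint:

```python
import pickle


def load_spider_state(path: str) -> dict:
    """Read a spider.state file, assuming Scrapy's format (a pickled dict)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Example (hypothetical path from the listing above):
# state = load_spider_state("./data/news/techcrunch/checkpoint/spider.state")
# print(state)
```

This is only a debugging aid; the crawler itself reads and writes these files for you.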
Request callbacks must be spider methods (a Scrapy limitation):
# ✅ Works (spider method)
Request(url, callback=self.parse_article)

# ❌ Won't work (external function)
Request(url, callback=some_external_function)
✅ ScrapAI spiders are already compatible: our database spiders use spider methods (self.parse), so checkpoints work out of the box.
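The reason behind this limitation: when a checkpoint is saved, the callback is persisted by method name and re-bound to a fresh spider instance on resume. A toy illustration (plain Python, not the real scrapy.Spider):

```python
class ArticleSpider:
    """Toy stand-in for a spider class, used only to illustrate re-binding."""

    def parse_article(self, response):
        return {"url": response}


# A bound method can be persisted as a name and looked up again later:
saved_name = ArticleSpider().parse_article.__name__   # what gets stored
restored = getattr(ArticleSpider(), saved_name)       # what resume reconstructs

# An external function has no attribute on the spider class, so the
# equivalent getattr() lookup on resume would fail -- hence the rule above.
```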
Other limitations:
Cookie expiration: If you wait too long to resume (days/weeks), cookies may expire and requests may fail. Resume within a reasonable timeframe (hours/days, not weeks).
Multiple runs: Each spider should have only one checkpoint at a time. Don’t run the same spider concurrently while a checkpoint exists.
Proxy type changes: If you change --proxy-type when resuming, the checkpoint is automatically cleared (see below).
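To guard against the cookie-expiration case above, you can check how stale a checkpoint is before resuming. A minimal sketch (the path and 7-day threshold are illustrative choices, not ScrapAI defaults):

```python
import os
import time


def checkpoint_age_days(path: str) -> float:
    """Age of a checkpoint file or directory in days, from its mtime."""
    return (time.time() - os.path.getmtime(path)) / 86400


# Warn before resuming a stale checkpoint:
# if checkpoint_age_days("./data/news/techcrunch/checkpoint") > 7:
#     print("Checkpoint may be stale; session cookies could have expired")
```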
✅ Long-running crawls (hours/days): resume if interrupted
✅ Unstable connections: resume after network failures
✅ System maintenance: pause before server restart, resume after
✅ Resource management: pause during high-load periods, resume later
❌ Short test crawls (minutes): not needed, checkpoints disabled
❌ Quick prototyping: use --limit flag, no checkpoints