ScrapAI automatically enables checkpoint support for production crawls, allowing you to pause long-running crawls and resume them later without losing progress.

How It Works

1. Automatic for production crawls - Checkpointing is enabled automatically when running production crawls (no --limit flag).
2. Press Ctrl+C to pause - A checkpoint is saved automatically when the crawl is interrupted.
3. Run the same command to resume - The crawl automatically detects the checkpoint and resumes from where you left off.
4. Automatic cleanup on success - The checkpoint is deleted automatically on successful completion.

Test crawls (with --limit) do not use checkpoints since they’re short-running.

What Gets Saved

Scrapy’s JOBDIR feature saves:
  1. Pending requests - All URLs waiting to be crawled
  2. Duplicates filter - URLs already visited (prevents re-crawling)
  3. Spider state - Any custom state stored in spider.state dict
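To make the three pieces of saved state concrete, here is a minimal stand-in for what a JOBDIR-style checkpoint persists across a pause/resume cycle. This is an illustrative sketch only, not ScrapAI's or Scrapy's actual code: real Scrapy serializes Request objects and a dupefilter fingerprint set, while plain strings stand in for both here.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative stand-in for the three things a checkpoint saves:
checkpoint = {
    "pending": ["https://example.com/page/3", "https://example.com/page/4"],
    "seen": {"https://example.com/", "https://example.com/page/2"},
    "spider_state": {"items_scraped": 150},
}

jobdir = Path(tempfile.mkdtemp()) / "checkpoint"
jobdir.mkdir()

# Pause: pickle the state to disk (Scrapy also pickles under the hood)
(jobdir / "state.pickle").write_bytes(pickle.dumps(checkpoint))

# Resume: load it back and continue where we left off
restored = pickle.loads((jobdir / "state.pickle").read_bytes())
assert restored["spider_state"]["items_scraped"] == 150
```

On resume, the pending URLs are re-queued, the seen set keeps the dupefilter from re-crawling finished pages, and the spider state dict comes back exactly as it was saved.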

Usage

Production Crawl with Checkpoint

# Start production crawl (checkpoint auto-enabled)
./scrapai crawl myspider --project myproject
Console output:
💾 Checkpoint enabled: ./data/myproject/myspider/checkpoint
Press Ctrl+C to pause, run same command to resume
Pause the crawl:
# Press Ctrl+C
^C
Resume later:
# Run same command
./scrapai crawl myspider --project myproject
# Automatically detects checkpoint and resumes

Test Crawl (No Checkpoint)

# Test mode - no checkpoint needed (short run)
./scrapai crawl myspider --project myproject --limit 10
Console output:
🧪 Test mode: Saving to database (limit: 10 items)

Checkpoint Storage

Checkpoints are stored in your DATA_DIR:
DATA_DIR/<project>/<spider>/checkpoint/
Example directory structure:
./data/myproject/myspider/
├── analysis/        # Phase 1-3 files
├── crawls/          # Production outputs
├── exports/         # Database exports
└── checkpoint/      # Checkpoint state (auto-cleaned on success)
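The checkpoint path follows directly from the layout above. A hypothetical helper (the function name is an assumption, not ScrapAI's actual API) would build it like this:

```python
from pathlib import Path

# Hypothetical helper mirroring the DATA_DIR/<project>/<spider>/checkpoint layout
def checkpoint_dir(data_dir: str, project: str, spider: str) -> Path:
    """Build the checkpoint directory path for a given spider."""
    return Path(data_dir) / project / spider / "checkpoint"

print(checkpoint_dir("./data", "myproject", "myspider"))
# prints: data/myproject/myspider/checkpoint (on POSIX; "./" is normalized away)
```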

Automatic Cleanup

Automatic cleanup on successful completion:
  • When spider completes successfully (no Ctrl+C), checkpoint directory is automatically deleted
  • Saves disk space
  • Only failed/interrupted crawls keep checkpoints
Manual cleanup:
# If you want to discard a checkpoint and start fresh
rm -rf ./data/myproject/myspider/checkpoint/

Complete Example

1. Start production crawl

./scrapai crawl techcrunch --project news
Output:
💾 Checkpoint enabled: ./data/news/techcrunch/checkpoint
Press Ctrl+C to pause, run same command to resume

Crawling: https://techcrunch.com/
Scraped 50 items...
Scraped 100 items...
2. Pause crawl (Ctrl+C)

^C
Output:
Received interrupt signal, shutting down...
Checkpoint saved: 150 items scraped, 237 URLs pending
Run same command to resume from checkpoint
3. Check checkpoint exists

ls -la ./data/news/techcrunch/checkpoint/
Output:
drwxr-xr-x  5 user  staff   160 Feb 24 10:30 .
drwxr-xr-x  7 user  staff   224 Feb 24 10:15 ..
-rw-r--r--  1 user  staff  4096 Feb 24 10:30 requests.queue
-rw-r--r--  1 user  staff  8192 Feb 24 10:30 dupefilter.db
-rw-r--r--  1 user  staff   512 Feb 24 10:30 spider.state
4. Resume crawl

./scrapai crawl techcrunch --project news
Output:
♻️  Resuming from checkpoint: 150 items already scraped, 237 URLs pending

Continuing crawl...
Scraped 160 items...
Scraped 200 items...
Crawl completed successfully!

🧹 Checkpoint cleaned up (crawl completed)

Limitations

Request callbacks must be spider methods (Scrapy limitation):
# ✅ Works (spider method)
Request(url, callback=self.parse_article)

# ❌ Won't work (external function)
Request(url, callback=some_external_function)
ScrapAI spiders are already compatible: our database spiders use spider methods (self.parse), so checkpoints work out of the box.
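The reason for this limitation can be sketched in a few lines. This is an assumption about internals that mirrors how Scrapy's request serialization works: callbacks are persisted by method name and re-bound via getattr on resume, so only methods that live on the spider survive a restart.

```python
class MySpider:
    def parse_article(self, response):
        return {"url": response}

def serialize_callback(spider, callback):
    """Persist a callback the way a JOBDIR-style checkpoint does: by name."""
    name = getattr(callback, "__name__", None)
    if name and getattr(spider, name, None) == callback:
        return name  # spider method: the name can be resolved after resume
    raise ValueError("callback must be a method on the spider")

spider = MySpider()
print(serialize_callback(spider, spider.parse_article))  # prints: parse_article

def some_external_function(response):
    return response

# serialize_callback(spider, some_external_function)  # raises ValueError:
# the name cannot be looked up on the spider after a restart
```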
Other limitations:
  • Cookie expiration: If you wait too long to resume (days/weeks), cookies may expire and requests may fail. Resume within a reasonable timeframe (hours/days, not weeks).
  • Multiple runs: Each spider should have only one checkpoint at a time. Don’t run the same spider concurrently while a checkpoint exists.
  • Proxy type changes: If you change --proxy-type when resuming, the checkpoint is automatically cleared (see below).

Proxy Type Changes (Expert-in-the-Loop)

If you change --proxy-type when resuming, the checkpoint is automatically cleared and crawl starts fresh.
Example scenario:
1. Start crawl with auto mode

./scrapai crawl myspider --project proj
Uses datacenter proxies (auto mode default)
2. Datacenter fails, get expert prompt

⚠️  EXPERT-IN-THE-LOOP: Datacenter proxy failed
🏠 To use residential proxy, run:
  ./scrapai crawl myspider --project proj --proxy-type residential
Press Ctrl+C to pause
3. Resume with residential proxy

./scrapai crawl myspider --project proj --proxy-type residential
Output:
⚠️  Proxy type changed: auto → residential
🗑️  Clearing checkpoint to ensure all URLs retried with residential proxy
♻️  Starting fresh crawl
Why checkpoint is cleared:
  • Ensures blocked URLs are retried with new proxy type
  • Prevents Scrapy’s dupefilter from skipping already-seen failed URLs
  • Simpler and safer than complex retry logic
  • The user explicitly chose the more expensive residential proxy and accepts a comprehensive re-crawl
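One hypothetical way to implement this check (a sketch under assumed internals, not ScrapAI's actual code) is to record the proxy type alongside the checkpoint and wipe the checkpoint whenever the flag changes, so every URL is retried with the new proxy type:

```python
import json
import shutil
import tempfile
from pathlib import Path

def prepare_checkpoint(jobdir: Path, proxy_type: str) -> bool:
    """Return True if an existing, compatible checkpoint can be resumed."""
    meta_file = jobdir / "meta.json"
    if meta_file.exists():
        previous = json.loads(meta_file.read_text())["proxy_type"]
        if previous == proxy_type:
            return True  # same proxy type: safe to resume
        shutil.rmtree(jobdir)  # proxy type changed: start fresh
    jobdir.mkdir(parents=True, exist_ok=True)
    meta_file.write_text(json.dumps({"proxy_type": proxy_type}))
    return False

jobdir = Path(tempfile.mkdtemp()) / "checkpoint"
assert prepare_checkpoint(jobdir, "auto") is False         # first run
assert prepare_checkpoint(jobdir, "auto") is True          # resume, same type
assert prepare_checkpoint(jobdir, "residential") is False  # changed: cleared
```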

When Checkpoints Are Useful

  • Long-running crawls (hours/days): resume if interrupted
  • Unstable connections: resume after network failures
  • System maintenance: pause before a server restart, resume after
  • Resource management: pause during high-load periods, resume later

Technical Details

Built on Scrapy’s JOBDIR:
  • Uses Scrapy’s native pause/resume feature (not custom implementation)
  • Checkpoint files are pickle-serialized Scrapy objects
  • Atomic writes prevent checkpoint corruption
  • Compatible with all Scrapy spiders
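The atomic-write guarantee mentioned above is a standard pattern worth seeing once: write to a temporary file in the same directory, then rename it into place. Since rename is atomic on POSIX, a crash mid-write never leaves a half-written checkpoint behind. A minimal sketch (illustrative only, not ScrapAI's actual code):

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: Path, data: bytes) -> None:
    """Write data to path so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the swap
        os.replace(tmp, path)     # atomic rename into place
    except BaseException:
        os.unlink(tmp)            # clean up the temp file on failure
        raise

target = Path(tempfile.mkdtemp()) / "spider.state"
atomic_write(target, b"state-bytes")
assert target.read_bytes() == b"state-bytes"
```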
Directory per spider:
  • Each spider gets its own checkpoint directory
  • Prevents conflicts between spiders
  • Clean separation of state
Smart cleanup:
  • Exit code 0 (success) → cleanup checkpoint
  • Exit code != 0 (error/Ctrl+C) → keep checkpoint for resume
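The exit-code rule above can be sketched as follows. This is an assumption about internals rather than ScrapAI's actual code: keep the checkpoint unless the crawl exited cleanly with code 0.

```python
import shutil
import tempfile
from pathlib import Path

def finalize_checkpoint(jobdir: Path, exit_code: int) -> None:
    """Delete the checkpoint on success; keep it for resume otherwise."""
    if exit_code == 0 and jobdir.exists():
        shutil.rmtree(jobdir)  # success: reclaim disk space
    # non-zero exit (error or Ctrl+C): leave the checkpoint in place

jobdir = Path(tempfile.mkdtemp()) / "checkpoint"
jobdir.mkdir()

finalize_checkpoint(jobdir, exit_code=130)  # interrupted: checkpoint kept
assert jobdir.exists()

finalize_checkpoint(jobdir, exit_code=0)    # success: checkpoint removed
assert not jobdir.exists()
```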

Troubleshooting

Checkpoint Not Resuming

1. Check if checkpoint exists

ls -la ./data/myproject/myspider/checkpoint/
If directory doesn’t exist:
  • Checkpoint was cleaned up (successful completion)
  • Or never created (test mode with --limit)
2. Verify same command

Must use exact same command to resume:
# Original
./scrapai crawl myspider --project proj --proxy-type datacenter

# Resume (same command)
./scrapai crawl myspider --project proj --proxy-type datacenter
3. Check for proxy type change

Changing --proxy-type clears the checkpoint automatically.

Start Fresh (Discard Checkpoint)

# Delete checkpoint directory
rm -rf ./data/myproject/myspider/checkpoint/

# Run crawl again
./scrapai crawl myspider --project myproject

Checkpoint from Old Spider Version

If you have updated spider rules/selectors significantly, the old checkpoint may be incompatible.
Solution:
# Delete checkpoint and start fresh
rm -rf ./data/myproject/myspider/checkpoint/
./scrapai crawl myspider --project myproject

Checkpoint Files Too Large

Check size:
du -sh ./data/myproject/myspider/checkpoint/
A large checkpoint usually indicates many pending URLs, which is normal for large crawls. To keep checkpoint size down:
  • Crawl in smaller batches
  • Use incremental crawling (DeltaFetch)