Every spider goes through 4 phases, executed sequentially and completely. The agent never skips steps—each phase builds on the previous one.
1

Phase 1: Analysis & Section Documentation

Understand site structure, discover all content sections, document URL patterns
2

Phase 2: Rule Generation & Extraction Testing

Create URL matching rules, choose extraction strategy, test on sample pages
3

Phase 3: Prepare Spider Configuration

Create test and final spider JSON files with all rules and settings
4

Phase 4: Execution & Verification

Test extraction quality on sample articles, import final spider for production
Only mark queue items complete when ALL phases pass. If any phase fails, run ./scrapai queue fail <id> -m "reason".

Phase 1: Analysis & Section Documentation

Goal: Understand site structure, discover all content sections, document URL patterns.

For Non-Sitemap URLs

1

Inspect homepage

./scrapai inspect https://site.com/ --project proj
Fetches the homepage HTML and saves it to data/proj/spider/analysis/page.html
2

Extract all URLs

./scrapai extract-urls --file data/proj/spider/analysis/page.html --output data/proj/spider/analysis/all_urls.txt
Extracts every link from the homepage for categorization
3

Read and categorize URLs

Review all_urls.txt and categorize:
  • Content pages: Articles, blog posts, news items
  • Navigation pages: Category pages, section indexes
  • Utility pages: About, contact, search, account
4

Drill into sections

Inspect one section at a time (inspector overwrites page.html):
./scrapai inspect https://site.com/blog/some-article --project proj
./scrapai analyze data/proj/spider/analysis/page.html
Document findings in sections.md
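A minimal sections.md might look like the sketch below. The section names, URL patterns, and example paths are illustrative placeholders, not a required format; the point is that each section gets a documented pattern plus at least three example URLs for Phase 2 testing.

```markdown
# Sections: example.com

## Blog
- URL pattern: /blog/<slug>
- Examples: /blog/first-post, /blog/second-post, /blog/third-post
- Include: yes

## News
- URL pattern: /news/<year>/<slug>
- Examples: /news/2024/launch, /news/2024/update, /news/2023/recap
- Include: yes

## Excluded
- /about, /contact, /search (utility pages per exclusion policy)
```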

Exclusion Policy

ONLY exclude:
  • About, contact, donate, account, legal, search pages
  • PDFs and non-HTML files
Everything else: explore and include. When uncertain, include it. User instructions always override defaults.

For Sitemap URLs

If the URL points to an XML sitemap (e.g., https://site.com/sitemap.xml):
  1. Inspect the sitemap to understand structure
  2. Identify URL patterns for content pages
  3. Use USE_SITEMAP: true in spider config
  4. See sitemap documentation for details
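A sitemap spider config might look like the following sketch, combining the USE_SITEMAP setting with the Phase 3 config format. The URLs are placeholders, and placing the sitemap URL in start_urls is an assumption; check the sitemap documentation for the exact shape.

```json
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/sitemap.xml"],
  "settings": {
    "USE_SITEMAP": true,
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
```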

Phase 1 Complete When:

  • sections.md exists in data/<project>/<spider>/analysis/
  • ALL content section types identified (blog, news, reports, etc.)
  • URL pattern documented for EACH section type
  • Example URLs listed (minimum 3 per section) for Phase 2 testing
  • Exclusions documented

Phase 2: Rule Generation & Extraction Testing

Goal: Create URL matching rules, choose extraction strategy (generic extractors, custom selectors, or callbacks).

Decision Point: What Type of Content?

Articles/Blog Posts

Use parse_article with generic extractors (newspaper, trafilatura)

Products/Jobs/Listings

Use named callbacks with custom fields

For Article Content (title/content/author/date)

1

Create rules from sections.md

Use the URL patterns documented in Phase 1 to create rules for each section.
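For example, if sections.md documented a blog and a news section, the resulting rules might look like this sketch (the patterns are placeholders; substitute the patterns you actually documented in Phase 1):

```json
{
  "rules": [
    {"allow": ["/blog/.*"], "callback": "parse_article", "follow": false},
    {"allow": ["/news/.*"], "callback": "parse_article", "follow": false},
    {"allow": ["/blog/?$", "/news/?$"], "follow": true}
  ]
}
```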
2

Test generic extractors

Inspect an article page and analyze its structure:
# Default: lightweight HTTP (works for most sites)
./scrapai inspect https://website.com/article-url --project proj

# Use --browser if the site needs JavaScript or is protected
./scrapai inspect https://website.com/article-url --project proj --browser

./scrapai analyze data/proj/spider/analysis/page.html
If it has clean <article> tags / semantic HTML → generic extractors work.
3

If generic extractors fail

Discover custom CSS selectors using ./scrapai analyze:
./scrapai analyze data/proj/spider/analysis/page.html
./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --find "price"
See extractor documentation for selector discovery.
4

Consolidate into final_spider.json

Create the complete spider config with all rules and settings.

For Non-Article Content (products, jobs, etc.)

1

Analyze a sample page

./scrapai analyze data/proj/spider/analysis/page.html
2

Identify all fields to extract

  • E-commerce: name, price, rating, availability, images
  • Jobs: title, company, salary, location, description
  • Real estate: address, price, bedrooms, square footage, features
3

Discover CSS selectors for each field

./scrapai analyze data/proj/spider/analysis/page.html --test "h1.product-name::text"
./scrapai analyze data/proj/spider/analysis/page.html --find "price"
4

Create callback config

Build the callback config with all fields + processors:
{
  "callbacks": {
    "parse_product": {
      "extract": {
        "name": {"css": "h1.title::text"},
        "price": {
          "css": "span.price::text",
          "processors": [
            {"type": "strip"},
            {"type": "regex", "pattern": "\\$([\\d.]+)"},
            {"type": "cast", "to": "float"}
          ]
        },
        "features": {"css": "li.feature::text", "get_all": true}
      }
    }
  }
}
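The processor chain for price above runs strip, regex, and cast in order. A hypothetical Python sketch of that pipeline's semantics (not the actual scrapai implementation):

```python
import re


def run_processors(value: str) -> float:
    """Illustrates the strip -> regex -> cast processor chain from the config."""
    value = value.strip()                       # {"type": "strip"}
    match = re.search(r"\$([\d.]+)", value)     # {"type": "regex", "pattern": "\\$([\\d.]+)"}
    value = match.group(1)
    return float(value)                         # {"type": "cast", "to": "float"}


print(run_processors("  $19.99  "))  # → 19.99
```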
5

Test on multiple example pages

Verify selectors work across 2-3 different items to ensure consistency.
6

Consolidate into final_spider.json

Create the complete spider config with callbacks section.
See callbacks documentation for syntax and examples.
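A rules block wiring up the parse_product callback might look like this sketch (the URL patterns are placeholders; use the patterns documented in Phase 1):

```json
{
  "rules": [
    {"allow": ["/products/.*"], "callback": "parse_product", "follow": false},
    {"allow": ["/category/.*"], "follow": true}
  ]
}
```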

Phase 2 Complete When:

  • final_spider.json created with all URL matching rules
  • Extractor strategy chosen:
    • Generic extractors: EXTRACTOR_ORDER configured
    • Custom selectors: CUSTOM_SELECTORS for title, content, author, date
    • Named callbacks: callbacks dict with custom field extraction
  • All settings documented (Cloudflare, Playwright, etc. if needed)

Phase 3: Prepare Spider Configuration

Goal: Create test and final spider JSON files with all rules and settings.

Test Spider Config

Create test_spider.json with:
  • 5 sample article URLs (not full start_urls)
  • follow: false on all rules (no crawling, just extraction testing)
  • Same extractor settings as final config
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": [
    "https://example.com/article-1",
    "https://example.com/article-2",
    "https://example.com/article-3",
    "https://example.com/article-4",
    "https://example.com/article-5"
  ],
  "rules": [
    {
      "allow": ["/.*"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}

Final Spider Config

Create final_spider.json with:
  • Full start_urls (homepage, section pages)
  • All URL matching rules with proper follow settings
  • Complete extractor/callback configuration
  • All spider settings (delays, concurrency, etc.)
Include source_url when processing from queue:
{
  "name": "spider_name",
  "source_url": "https://original-queue-url.com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/"],
  "rules": [
    {
      "allow": ["/blog/.*"],
      "callback": "parse_article",
      "follow": false
    },
    {
      "allow": ["/blog/?$"],
      "follow": true
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2
  }
}
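The two /blog patterns above do different jobs: "/blog/.*" matches article pages (extract, don't follow), while "/blog/?$" matches only the bare section index (follow, no extraction). Assuming allow patterns are applied as unanchored regex searches against the URL (as in Scrapy's LinkExtractor), a quick check:

```python
import re

article = re.compile(r"/blog/.*")   # article rule: callback, follow: false
index = re.compile(r"/blog/?$")     # section index rule: follow: true

# An article URL matches the article pattern
print(bool(article.search("https://example.com/blog/my-post")))  # True
# The bare section index matches the index pattern...
print(bool(index.search("https://example.com/blog/")))           # True
# ...but article URLs do not, because of the $ anchor
print(bool(index.search("https://example.com/blog/my-post")))    # False
```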
Do NOT import yet. Importing happens in Phase 4 after validation.

Phase 3 Complete When:

  • test_spider.json created with 5 article URLs, follow: false
  • final_spider.json created with all start_urls, rules, and settings
  • source_url included in config (if processing from queue)

Phase 4: Execution & Verification

Goal: Test extraction quality on sample articles, then import final spider for production.

Step 4A: Test Extraction (5 Articles)

1

Import test spider

./scrapai spiders import test_spider.json --project proj
2

Run test crawl

./scrapai crawl spider_name --limit 5 --project proj
Crawls exactly 5 URLs and saves results to database.
3

Verify output

./scrapai show spider_name --limit 5 --project proj
Check that all fields are extracted correctly:
  • Title present and accurate
  • Content complete (not truncated)
  • Author extracted (if available)
  • Date parsed correctly
4

Fix if needed

If extraction is bad:
  • Review selectors in Phase 2
  • Update test_spider.json
  • Re-import and re-test
Only proceed when extraction is good.

Step 4B: Import Final Spider

1

Import final spider

./scrapai spiders import final_spider.json --project proj
Using the same spider name auto-updates the existing config.
2

Spider is ready for production

The spider is now in the database and ready for full crawls. Do NOT run production crawls yourself as they can take hours or days.

Production Crawls (User Runs)

CRITICAL: NEVER run crawl without the --limit flag yourself. Production crawls can take hours or days depending on site size. You MUST NOT run them directly.
Testing (agent runs this):
./scrapai crawl <name> --project <name> --limit 5
Production (user runs this):
./scrapai crawl <name> --project <name>
Production crawls:
  • Export to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
  • Enable checkpoint (Ctrl+C to pause, resume with same command)
  • Can take hours or days for large sites
If user asks to run a full/production crawl:
  1. Explain: “Full crawls can take hours/days. I can’t run this for you as it would block our session.”
  2. Provide the exact command:
    ./scrapai crawl <spider_name> --project <project_name>
    
  3. Tell them:
    • Crawl output will be exported to DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl
    • Checkpoint is enabled - press Ctrl+C to pause, run same command to resume

Phase 4 Complete When:

  • Test crawl completed with --limit 5
  • show output verified: title, content, author, date extracted correctly
  • Final spider imported to database
  • Spider ready for production (user will run full crawl)

Settings Reference

Generic Extractors (Default)

{
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
Use when articles have clean semantic HTML (<article>, <time>, etc.).

Custom Selectors

{
  "settings": {
    "EXTRACTOR_ORDER": ["custom", "newspaper", "trafilatura"],
    "CUSTOM_SELECTORS": {
      "title": "h1.article-title",
      "content": "div.article-body",
      "author": "span.author-name",
      "date": "time.published-date"
    }
  }
}
Use when generic extractors fail to find the correct fields.

JavaScript-Rendered Sites

{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright", "custom"],
    "CUSTOM_SELECTORS": {
      "title": "h1.title",
      "content": "div.content"
    },
    "PLAYWRIGHT_WAIT_SELECTOR": ".article-content",
    "PLAYWRIGHT_DELAY": 5
  }
}
Use when content is loaded by JavaScript after page load.

Cloudflare Bypass

Test WITHOUT --browser first. Only enable if inspector fails with 403/503 or “Checking your browser”.
Hybrid mode (default, 20-100x faster):
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "hybrid",
    "CLOUDFLARE_COOKIE_REFRESH_THRESHOLD": 600,
    "CF_MAX_RETRIES": 5,
    "CF_RETRY_INTERVAL": 1,
    "CF_POST_DELAY": 5
  }
}
Browser-only mode (legacy, slow — only if hybrid fails):
{
  "settings": {
    "CLOUDFLARE_ENABLED": true,
    "CLOUDFLARE_STRATEGY": "browser_only",
    "CONCURRENT_REQUESTS": 1
  }
}
See Cloudflare documentation for details.

Sitemap Spider

{
  "settings": {
    "USE_SITEMAP": true,
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
  }
}
See sitemap documentation for details.

DeltaFetch (Incremental Crawling)

{
  "settings": {
    "DELTAFETCH_ENABLED": true
  }
}
Skips already-scraped URLs, reducing bandwidth by 80-90% on re-crawls. See DeltaFetch documentation for details.

Infinite Scroll

{
  "settings": {
    "INFINITE_SCROLL": true,
    "MAX_SCROLLS": 5,
    "SCROLL_DELAY": 1.0
  }
}
Use for sites that load content dynamically as you scroll.

Parallel Queue Processing

When processing multiple websites from the queue, the agent can work in parallel:
1

Max 5 websites in parallel

Batch larger queues (e.g., 12 websites → 5+5+2)
2

Phases within each website are sequential

Each website goes through Phase 1→2→3→4 in order, but multiple websites can be at different phases simultaneously
3

Report progress per batch

Update user after each batch completes. Report failures immediately.
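The batching rule above (max 5 websites in parallel) can be sketched as a simple chunking function; this is an illustration of the batching arithmetic, not part of the scrapai CLI:

```python
def batches(items, size=5):
    """Split a queue into batches of at most `size` items (e.g. 12 → 5+5+2)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print([len(b) for b in batches(list(range(12)))])  # [5, 5, 2]
```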
Task agent prompt template:
Process website from queue:
Queue Item ID: <id> | URL: <url> | Project: <project> | Instructions: <custom_instruction>

Complete Phases 1-4 per CLAUDE.md.
On success: run `queue complete <id>`.
On failure: run `queue fail <id> -m "reason"`.

Report back: status, spider name, queue item ID, summary.
See queue documentation for details.

Common Pitfalls

Never skip phases. Each phase builds on the previous one. If you skip Phase 1, you won’t have URL patterns for Phase 2. If you skip Phase 2, you won’t have extraction rules for Phase 3. Always complete 1→2→3→4 sequentially.
Run commands ONE AT A TIME. Never chain with &&. Read the output before proceeding to the next command.
Bad:
./scrapai inspect https://site.com --project proj && ./scrapai analyze data/proj/spider/analysis/page.html
Good:
./scrapai inspect https://site.com --project proj
# Read output, check for errors
./scrapai analyze data/proj/spider/analysis/page.html
ALWAYS use --project <name> on ALL spider, queue, crawl, show, and export commands. Without it, the command will fail or use the wrong project.
NEVER run crawl without the --limit flag. Production crawls can take hours or days. You MUST NOT run them directly. Always use --limit 5 for testing.
NEVER use Read/Grep on HTML files. Always use ./scrapai analyze. The analyzer provides structured output, selector testing, and field discovery. Raw HTML is hard to parse and easy to misinterpret.

Next Steps