Every spider goes through 4 phases, executed sequentially and completely. The agent never skips steps—each phase builds on the previous one.
1. Analysis & Section Documentation: understand site structure, discover all content sections, document URL patterns
2. Rule Generation & Extraction Testing: create URL matching rules, choose extraction strategy, test on sample pages
3. Prepare Spider Configuration: create test and final spider JSON files with all rules and settings
4. Execution & Verification: test extraction quality on sample articles, import final spider for production

Only mark queue items complete when ALL phases pass. If any phase fails, run: ./scrapai queue fail <id> -m "reason".

Phase 1: Analysis & Section Documentation

Goal: Understand site structure, discover all content sections, document URL patterns.

For Non-Sitemap URLs

./scrapai inspect https://site.com/ --project proj
./scrapai extract-urls --file data/proj/spider/analysis/page.html --output data/proj/spider/analysis/all_urls.txt
Review all_urls.txt, categorize URLs (content/navigation/utility), drill into sections with inspect + analyze, and document findings in sections.md.
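The categorization pass can be sketched as a small script. This is illustrative only: the regex patterns below are hypothetical examples, and real patterns must be derived from the site you actually inspect.

```python
import re

# Hypothetical categorization of URLs from all_urls.txt.
# EXCLUDE follows the exclusion policy (utility pages, PDFs);
# CONTENT patterns are placeholders for the sections you discover.
EXCLUDE = re.compile(r"/(about|contact|donate|account|legal|search)\b|\.pdf$")
CONTENT = re.compile(r"/(blog|news|reports)/")

def categorize(url: str) -> str:
    if EXCLUDE.search(url):
        return "utility"      # excluded per the exclusion policy
    if CONTENT.search(url):
        return "content"      # candidate section for sections.md
    return "navigation"       # landing/category pages to drill into

for url in [
    "https://site.com/blog/post-1",
    "https://site.com/about",
    "https://site.com/topics/",
]:
    print(url, "->", categorize(url))
```

When uncertain, the policy above still applies: anything not matched by the exclusion list stays in scope for exploration.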

Exclusion Policy

ONLY exclude:
  • About, contact, donate, account, legal, search pages
  • PDFs and non-HTML files
Everything else: explore and include. When uncertain, include it. User instructions always override defaults.

For Sitemap URLs

If the URL points to an XML sitemap (e.g., https://site.com/sitemap.xml):
  1. Inspect the sitemap to understand structure
  2. Identify URL patterns for content pages
  3. Use USE_SITEMAP: true in spider config
  4. See sitemap documentation for details
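A sitemap-based config can be sketched as follows. Only USE_SITEMAP is taken from this guide; the surrounding JSON shape (name, start_urls, settings) is an assumption — see the sitemap documentation for the authoritative schema.

```python
import json

# Hedged sketch of a sitemap-driven spider config.
# USE_SITEMAP comes from this guide; other keys are assumed.
config = {
    "name": "site_sitemap_spider",
    "start_urls": ["https://site.com/sitemap.xml"],
    "settings": {"USE_SITEMAP": True},
}
print(json.dumps(config, indent=2))
```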

Phase 1 Complete When:

  • sections.md exists in data/<project>/<spider>/analysis/
  • ALL content section types identified (blog, news, reports, etc.)
  • URL pattern documented for EACH section type
  • Example URLs listed (minimum 3 per section) for Phase 2 testing
  • Exclusions documented

Phase 2: Rule Generation & Extraction Testing

Goal: Create URL matching rules, choose extraction strategy (generic extractors, custom selectors, or callbacks).

Decision Point: What Type of Content?

  • Articles/blog posts: use parse_article with generic extractors (newspaper, trafilatura)
  • Products/jobs/listings: use named callbacks with custom fields

For Article Content (title/content/author/date)

1. Create rules from sections.md

Use the URL patterns documented in Phase 1 to create rules for each section.
2. Test generic extractors

Inspect an article page and analyze its structure:
# Default: lightweight HTTP (works for most sites)
./scrapai inspect https://website.com/article-url --project proj

# Use --browser if site needs JavaScript or is protected
./scrapai inspect https://website.com/article-url --project proj --browser

./scrapai analyze data/proj/spider/analysis/page.html
If the page has clean <article> tags or semantic HTML, generic extractors will work.
3. If generic extractors fail

Discover custom CSS selectors using ./scrapai analyze:
./scrapai analyze data/proj/spider/analysis/page.html
./scrapai analyze data/proj/spider/analysis/page.html --test "h1.article-title"
./scrapai analyze data/proj/spider/analysis/page.html --find "price"
See extractor documentation for selector discovery.
4. Consolidate into final_spider.json

Create the complete spider config with all rules and settings.
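The consolidation step can be sketched like this. The rule patterns and selectors are hypothetical, and the exact JSON schema (key names like "rules", "pattern", "follow") is an assumption based on the concepts in this guide — only parse_article, EXTRACTOR_ORDER, and CUSTOM_SELECTORS are named by it.

```python
import json

# Hedged sketch of a final_spider.json for article content.
# One rule per section documented in sections.md (Phase 1).
final_spider = {
    "name": "proj_news",
    "start_urls": ["https://site.com/news/"],
    "rules": [
        {"pattern": r"/news/\d{4}/", "callback": "parse_article", "follow": True},
        {"pattern": r"/blog/", "callback": "parse_article", "follow": True},
    ],
    "settings": {
        # Generic extractors first; custom selectors as fallback.
        "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
        "CUSTOM_SELECTORS": {
            "title": "h1.article-title::text",  # hypothetical selector
            "content": "div.article-body",      # hypothetical selector
        },
    },
}
with open("final_spider.json", "w") as f:
    json.dump(final_spider, f, indent=2)
```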

For Non-Article Content (products, jobs, etc.)

1. Analyze a sample page

./scrapai analyze data/proj/spider/analysis/page.html
2. Identify all fields to extract

  • E-commerce: name, price, rating, availability, images
  • Jobs: title, company, salary, location, description
  • Real estate: address, price, bedrooms, square footage, features
3. Discover CSS selectors for each field

./scrapai analyze data/proj/spider/analysis/page.html --test "h1.product-name::text"
./scrapai analyze data/proj/spider/analysis/page.html --find "price"
4. Create callback config

Build the callback config with all fields: a CSS selector per field plus processors for cleaning and type casting. See callbacks documentation.
5. Test on multiple example pages

Verify selectors work across 2-3 different items to ensure consistency.
6. Consolidate into final_spider.json

Create the complete spider config with callbacks section.
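A callbacks section for product pages might look like the sketch below. The "callbacks" dict and parse_product callback are named by this guide; the field/selector/processor key names and the processor names ("strip", "to_float") are assumptions — the real schema and available processors are in the callbacks documentation.

```python
import json

# Hedged sketch of a spider config with a named callback for products.
spider = {
    "name": "shop_products",
    "start_urls": ["https://shop.com/catalog/"],
    "rules": [
        {"pattern": r"/product/", "callback": "parse_product", "follow": False},
        {"pattern": r"/catalog/", "follow": True},  # listing pages: crawl only
    ],
    "callbacks": {
        "parse_product": {
            "fields": {
                "name": {"selector": "h1.product-name::text"},
                "price": {
                    "selector": "span.price::text",
                    "processors": ["strip", "to_float"],  # hypothetical processors
                },
                "availability": {"selector": ".stock-status::text"},
            }
        }
    },
}
print(json.dumps(spider, indent=2))
```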

Phase 2 Complete When:

  • final_spider.json created with all URL matching rules
  • Extractor strategy chosen:
    • Generic extractors: EXTRACTOR_ORDER configured
    • Custom selectors: CUSTOM_SELECTORS for title, content, author, date
    • Named callbacks: callbacks dict with custom field extraction
  • All settings documented (Cloudflare, Playwright, etc. if needed)

Phase 3: Prepare Spider Configuration

Goal: Create test and final spider JSON files with all rules and settings.

Test spider config (test_spider.json): 5 sample article URLs, follow: false on all rules (no crawling), same extractor settings as final.

Final spider config (final_spider.json): full start_urls, all URL matching rules with proper follow settings, complete extractor/callback configuration, all spider settings. Include source_url when processing from the queue.
Do NOT import yet. Importing happens in Phase 4 after validation.

Phase 3 Complete When:

  • test_spider.json created with 5 article URLs, follow: false
  • final_spider.json created with all start_urls, rules, and settings
  • source_url included in config (if processing from queue)
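The relationship between the two configs can be sketched as deriving the test spider from the final one: same extractor settings, but exactly 5 sample URLs and follow: false everywhere. Schema key names and sample URLs here are assumptions.

```python
import json

# Hedged sketch: final config, plus a derived test config for Phase 4A.
final_spider = {
    "name": "proj_news",
    "start_urls": ["https://site.com/news/"],
    "rules": [{"pattern": r"/news/", "callback": "parse_article", "follow": True}],
    "settings": {"EXTRACTOR_ORDER": ["newspaper", "trafilatura"]},
    "source_url": "https://site.com/",  # include when processing from the queue
}

sample_articles = [f"https://site.com/news/article-{i}" for i in range(1, 6)]
test_spider = {
    **final_spider,
    "start_urls": sample_articles,  # exactly 5 sample article URLs
    "rules": [{**r, "follow": False} for r in final_spider["rules"]],  # no crawling
}

json.dump(test_spider, open("test_spider.json", "w"), indent=2)
json.dump(final_spider, open("final_spider.json", "w"), indent=2)
```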

Phase 4: Execution & Verification

Goal: Test extraction quality on sample articles, then import final spider for production.

Step 4A: Test Extraction (5 Articles)

1. Import test spider

./scrapai spiders import test_spider.json --project proj
2. Run test crawl

./scrapai crawl spider_name --limit 5 --project proj
Crawls exactly 5 URLs and saves results to database.
3. Verify output

./scrapai show spider_name --limit 5 --project proj
Check that all fields are extracted correctly:
  • Title present and accurate
  • Content complete (not truncated)
  • Author extracted (if available)
  • Date parsed correctly
4. Fix if needed

If extraction is bad:
  • Review selectors in Phase 2
  • Update test_spider.json
  • Re-import and re-test
Only proceed when extraction is good.
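The verification checklist can be expressed as a small quality gate over scraped items. The field names (title, content, author, date) follow this guide; the item shape and the truncation threshold are assumptions.

```python
# Hedged sketch of the Phase 4A quality check on a scraped item.
def verify_item(item: dict) -> list[str]:
    problems = []
    if not item.get("title"):
        problems.append("missing title")
    content = item.get("content") or ""
    if len(content) < 200:              # heuristic truncation threshold
        problems.append("content short or truncated")
    if not item.get("date"):
        problems.append("date not parsed")
    # Author is optional ("if available"), so it is not flagged here.
    return problems

item = {"title": "Example", "content": "x" * 500, "author": None, "date": "2024-01-01"}
print(verify_item(item))
```

Only a fully empty problem list for all 5 test articles should count as "extraction is good".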

Step 4B: Import Final Spider

1. Import final spider

./scrapai spiders import final_spider.json --project proj
Using the same spider name auto-updates the existing config.
2. Spider is ready for production

The spider is now in the database and ready for full crawls. Do NOT run production crawls yourself, as they can take hours or days.

Production Crawls (User Runs)

Agent always uses --limit 5 for testing. User runs production crawls without --limit. Production crawls can take hours/days.
If the user asks for a full crawl: explain that it can take hours or days, provide the command ./scrapai crawl <spider_name> --project <project_name>, and mention checkpoint support (Ctrl+C to pause/resume).

Phase 4 Complete When:

  • Test crawl completed with --limit 5
  • show output verified: title, content, author, date extracted correctly
  • Final spider imported to database
  • Spider ready for production (user will run full crawl)

Settings Reference

  • Generic extractors: EXTRACTOR_ORDER: ["newspaper", "trafilatura"] - clean semantic HTML
  • Custom selectors: CUSTOM_SELECTORS with CSS selectors - when generic extractors fail
  • JavaScript-rendered: EXTRACTOR_ORDER: ["playwright", "custom"] with wait selectors
  • Cloudflare: CLOUDFLARE_ENABLED: true (test without --browser first) - see Cloudflare docs
  • Sitemap: USE_SITEMAP: true - see sitemap docs
  • DeltaFetch: DELTAFETCH_ENABLED: true - skip already-scraped URLs (80-90% bandwidth reduction)
  • Infinite scroll: INFINITE_SCROLL: true, MAX_SCROLLS: 5
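For a JavaScript-heavy, Cloudflare-protected site with a sitemap, several of these settings might be combined as below. The individual keys come from this guide, but whether they can all be combined in one config is an assumption — verify against the respective docs.

```python
# Illustrative combination of the settings listed above (assumed compatible).
settings = {
    "USE_SITEMAP": True,
    "CLOUDFLARE_ENABLED": True,                  # test without --browser first
    "EXTRACTOR_ORDER": ["playwright", "custom"], # JavaScript-rendered content
    "DELTAFETCH_ENABLED": True,                  # skip already-scraped URLs
    "INFINITE_SCROLL": True,
    "MAX_SCROLLS": 5,
}
print(settings)
```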

Parallel Queue Processing

Max 5 websites in parallel (e.g., 12 websites → batches of 5+5+2). Each website goes through Phase 1→2→3→4 sequentially, but multiple websites can be at different phases. Report progress per batch. See queue documentation.
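The batching rule above can be sketched in a few lines: split the queued websites into chunks of at most 5, so 12 websites become batches of 5, 5, and 2.

```python
# Sketch of the max-5-parallel batching rule for queue processing.
def batches(sites: list[str], size: int = 5) -> list[list[str]]:
    return [sites[i:i + size] for i in range(0, len(sites), size)]

sites = [f"site-{n}.com" for n in range(12)]
print([len(b) for b in batches(sites)])  # → [5, 5, 2]
```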

Next Steps

  • CLI Reference: complete CLI command documentation
  • Extractors Guide: extraction strategies and selector discovery
  • Custom Callbacks: custom field extraction for non-article content
  • Cloudflare Bypass: bypass Cloudflare protection with cookie caching