1. **Phase 1: Analysis & Section Documentation**: understand site structure, discover all content sections, document URL patterns
2. **Phase 2: Rule Generation & Extraction Testing**: create URL matching rules, choose extraction strategy, test on sample pages
3. **Phase 3: Prepare Spider Configuration**: create test and final spider JSON files with all rules and settings
4. **Phase 4: Execution & Verification**: test extraction quality on sample articles, then import the final spider for production
If a queued site cannot be processed, mark it as failed: `./scrapai queue fail <id> -m "reason"`
## Phase 1: Analysis & Section Documentation

Goal: Understand site structure, discover all content sections, document URL patterns.

### For Non-Sitemap URLs

Collect URLs into `all_urls.txt`, categorize them (content/navigation/utility), drill into sections with `inspect` + `analyze`, and document the results in `sections.md`.
### Exclusion Policy

ONLY exclude:

- About, contact, donate, account, legal, search pages
- PDFs and non-HTML files
### For Sitemap URLs

If the URL points to an XML sitemap (e.g., https://site.com/sitemap.xml):

- Inspect the sitemap to understand its structure
- Identify URL patterns for content pages
- Set `USE_SITEMAP: true` in the spider config
- See sitemap documentation for details
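As a rough sketch, a sitemap-driven spider config might contain a fragment like the one below. Only the `USE_SITEMAP` setting is documented here; the exact JSON shape and the idea of placing the sitemap URL in `start_urls` are assumptions for illustration.

```json
{
  "start_urls": ["https://site.com/sitemap.xml"],
  "USE_SITEMAP": true
}
```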
### Phase 1 Complete When:

- `sections.md` exists in `data/<project>/<spider>/analysis/`
- ALL content section types identified (blog, news, reports, etc.)
- URL pattern documented for EACH section type
- Example URLs listed (minimum 3 per section) for Phase 2 testing
- Exclusions documented
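As an illustration, a minimal `sections.md` satisfying the checklist above might look like this. The layout is a suggestion, not a required format; the site, patterns, and URLs are hypothetical.

```markdown
# Sections: site.com

## Blog
- Pattern: /blog/<slug>
- Examples: /blog/post-1, /blog/post-2, /blog/post-3

## News
- Pattern: /news/<year>/<slug>
- Examples: /news/2024/a, /news/2024/b, /news/2023/c

## Exclusions
- /about, /contact, /donate, PDFs
```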
## Phase 2: Rule Generation & Extraction Testing

Goal: Create URL matching rules, choose an extraction strategy (generic extractors, custom selectors, or callbacks), and test on sample pages.

### Decision Point: What Type of Content?

- **Articles/Blog Posts**: use `parse_article` with generic extractors (newspaper, trafilatura)
- **Products/Jobs/Listings**: use named callbacks with custom fields
### For Article Content (title/content/author/date)

1. **Create rules from sections.md.** Use the URL patterns documented in Phase 1 to create rules for each section.
2. **Test generic extractors.** Inspect an article page and analyze its structure. If it has clean `<article>` tags / semantic HTML, generic extractors work.
3. **If generic extractors fail**, discover custom CSS selectors using `./scrapai analyze`. See extractor documentation for selector discovery.

### For Non-Article Content (products, jobs, etc.)
1. **Identify all fields to extract.** For e-commerce: name, price, rating, availability, images. For jobs: title, company, salary, location, description. For real estate: address, price, bedrooms, square footage, features.
2. **Create the callback config.** Build a callback config with all fields and processors (CSS selectors, plus processors for cleaning/casting). See callbacks documentation.
3. **Test on multiple example pages.** Verify selectors work across 2-3 different items to ensure consistency.
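A callback config combining fields, CSS selectors, and processors might be sketched as follows. The `callbacks` key is named in this document, but the nested schema, the callback name `parse_product`, the selectors, and the processor names (`strip`, `to_float`) are hypothetical; see the callbacks documentation for the real format.

```json
{
  "callbacks": {
    "parse_product": {
      "fields": {
        "name": {"css": "h1.product-title::text"},
        "price": {"css": "span.price::text", "processors": ["strip", "to_float"]},
        "availability": {"css": ".stock::text", "processors": ["strip"]}
      }
    }
  }
}
```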
### Phase 2 Complete When:

- `final_spider.json` created with all URL matching rules
- Extractor strategy chosen:
  - Generic extractors: `EXTRACTOR_ORDER` configured
  - Custom selectors: `CUSTOM_SELECTORS` for title, content, author, date
  - Named callbacks: `callbacks` dict with custom field extraction
- All settings documented (Cloudflare, Playwright, etc. if needed)
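Putting the rules and extractor strategy together, a `final_spider.json` fragment for an article site might look roughly like this. `EXTRACTOR_ORDER`, `parse_article`, and `follow` appear elsewhere in this document; the rule schema (`allow` patterns, `callback` key) is an assumption modeled on common crawler configs.

```json
{
  "rules": [
    {"allow": "/blog/.+", "callback": "parse_article", "follow": true},
    {"allow": "/news/\\d{4}/.+", "callback": "parse_article", "follow": true}
  ],
  "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
}
```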
## Phase 3: Prepare Spider Configuration

Goal: Create test and final spider JSON files with all rules and settings.

**Test spider config (`test_spider.json`):** 5 sample article URLs, `follow: false` on all rules (no crawling), same extractor settings as final.

**Final spider config (`final_spider.json`):** full `start_urls`, all URL matching rules with proper follow settings, complete extractor/callback configuration, all spider settings. Include `source_url` when processing from queue.
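A `test_spider.json` following the recipe above might be sketched like this: five sample article URLs, `follow: false` so nothing is crawled beyond them, and the same extractor settings as the final config. The URLs and rule schema are hypothetical.

```json
{
  "start_urls": [
    "https://site.com/blog/post-1",
    "https://site.com/blog/post-2",
    "https://site.com/blog/post-3",
    "https://site.com/news/2024/a",
    "https://site.com/news/2024/b"
  ],
  "rules": [
    {"allow": "/blog/.+", "callback": "parse_article", "follow": false},
    {"allow": "/news/.+", "callback": "parse_article", "follow": false}
  ],
  "EXTRACTOR_ORDER": ["newspaper", "trafilatura"]
}
```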
### Phase 3 Complete When:

- `test_spider.json` created with 5 article URLs, `follow: false`
- `final_spider.json` created with all start_urls, rules, and settings
- `source_url` included in config (if processing from queue)
## Phase 4: Execution & Verification

Goal: Test extraction quality on sample articles, then import the final spider for production.

### Step 4A: Test Extraction (5 Articles)

Verify the output:

- Title present and accurate
- Content complete (not truncated)
- Author extracted (if available)
- Date parsed correctly
### Step 4B: Import Final Spider

### Production Crawls (User Runs)

If the user asks for a full crawl: explain that it can take hours/days, provide the command `./scrapai crawl <spider_name> --project <project_name>`, and mention checkpoint support (Ctrl+C to pause/resume).
### Phase 4 Complete When:

- Test crawl completed with `--limit 5`
- `show` output verified: title, content, author, date extracted correctly
- Final spider imported to database
- Spider ready for production (user will run the full crawl)
## Settings Reference

- Generic extractors: `EXTRACTOR_ORDER: ["newspaper", "trafilatura"]` for clean semantic HTML
- Custom selectors: `CUSTOM_SELECTORS` with CSS selectors, when generic extractors fail
- JavaScript-rendered: `EXTRACTOR_ORDER: ["playwright", "custom"]` with wait selectors
- Cloudflare: `CLOUDFLARE_ENABLED: true` (test without `--browser` first); see Cloudflare docs
- Sitemap: `USE_SITEMAP: true`; see sitemap docs
- DeltaFetch: `DELTAFETCH_ENABLED: true` to skip already-scraped URLs (80-90% bandwidth reduction)
- Infinite scroll: `INFINITE_SCROLL: true`, `MAX_SCROLLS: 5`
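Several of these settings can be combined in one spider config. The setting names below come from the reference above; placing them as top-level JSON keys is an assumption about the config schema.

```json
{
  "CLOUDFLARE_ENABLED": true,
  "DELTAFETCH_ENABLED": true,
  "INFINITE_SCROLL": true,
  "MAX_SCROLLS": 5
}
```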
## Parallel Queue Processing

Process at most 5 websites in parallel (e.g., 12 websites → batches of 5+5+2). Each website goes through Phases 1→2→3→4 sequentially, but multiple websites can be at different phases at once. Report progress per batch. See queue documentation.

## Next Steps
- **CLI Reference**: complete CLI command documentation
- **Extractors Guide**: extraction strategies and selector discovery
- **Custom Callbacks**: custom field extraction for non-article content
- **Cloudflare Bypass**: bypass Cloudflare protection with cookie caching