Phase 1: Analysis & Section Documentation
Understand site structure, discover all content sections, document URL patterns
Phase 2: Rule Generation & Extraction Testing
Create URL matching rules, choose extraction strategy, test on sample pages
Phase 3: Prepare Spider Configuration
Create test and final spider JSON files with all rules and settings
If a site cannot be processed, mark it failed in the queue: `./scrapai queue fail <id> -m "reason"`.
Phase 1: Analysis & Section Documentation
Goal: Understand site structure, discover all content sections, document URL patterns.

For Non-Sitemap URLs
Read and categorize URLs
Review `all_urls.txt` and categorize:
- Content pages: Articles, blog posts, news items
- Navigation pages: Category pages, section indexes
- Utility pages: About, contact, search, account
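The categorization pass can be sketched as a quick pattern scan over the URL list. The patterns below are illustrative assumptions for a typical site, not rules shipped with the tool:

```python
import re

# Illustrative buckets; real patterns come from inspecting the actual site.
CATEGORIES = {
    "content": re.compile(r"/(blog|news|articles?)/.+"),
    "navigation": re.compile(r"/(category|section|tag)/"),
    "utility": re.compile(r"/(about|contact|search|account)"),
}

def categorize(url: str) -> str:
    """Return the first matching category for a URL, or 'uncategorized'."""
    for name, pattern in CATEGORIES.items():
        if pattern.search(url):
            return name
    return "uncategorized"

assert categorize("https://site.com/blog/my-post") == "content"
assert categorize("https://site.com/category/tech") == "navigation"
```

In practice you would run each line of `all_urls.txt` through such a function and review the `uncategorized` bucket by hand.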
Exclusion Policy
ONLY exclude:
- About, contact, donate, account, legal, search pages
- PDFs and non-HTML files
For Sitemap URLs
If the URL points to an XML sitemap (e.g., https://site.com/sitemap.xml):
- Inspect the sitemap to understand structure
- Identify URL patterns for content pages
- Set `USE_SITEMAP: true` in the spider config
- See sitemap documentation for details
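A minimal sketch of a sitemap-driven config, assuming the sitemap URL goes in `start_urls` (only `USE_SITEMAP` is confirmed by this guide; see the sitemap documentation for the exact schema):

```json
{
  "start_urls": ["https://site.com/sitemap.xml"],
  "USE_SITEMAP": true
}
```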
Phase 1 Complete When:
- `sections.md` exists in `data/<project>/<spider>/analysis/`
- ALL content section types identified (blog, news, reports, etc.)
- URL pattern documented for EACH section type
- Example URLs listed (minimum 3 per section) for Phase 2 testing
- Exclusions documented
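A sketch of what `sections.md` might contain (the layout is an assumption; any format that captures section type, URL pattern, and at least 3 example URLs satisfies the checklist):

```markdown
## Blog
- Pattern: /blog/<slug>
- Examples:
  - https://site.com/blog/first-post
  - https://site.com/blog/second-post
  - https://site.com/blog/third-post

## Exclusions
- /about, /contact, /search
- *.pdf and other non-HTML files
```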
Phase 2: Rule Generation & Extraction Testing
Goal: Create URL matching rules, choose extraction strategy (generic extractors, custom selectors, or callbacks).

Decision Point: What Type of Content?
Articles/Blog Posts
Use `parse_article` with generic extractors (newspaper, trafilatura).

Products/Jobs/Listings
Use named callbacks with custom fields.
For Article Content (title/content/author/date)
Create rules from sections.md
Use the URL patterns documented in Phase 1 to create rules for each section.
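For example, a rule derived from a documented `/blog/<slug>` pattern might look like this (the key names are assumptions based on the fields this guide mentions — a URL pattern, a callback, and a `follow` flag):

```json
{
  "rules": [
    {
      "pattern": "/blog/.+",
      "callback": "parse_article",
      "follow": true
    }
  ]
}
```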
Test generic extractors
Inspect an article page and analyze its structure with `./scrapai analyze`. If it has clean `<article>` tags / semantic HTML, generic extractors work.

If generic extractors fail
Discover custom CSS selectors using `./scrapai analyze`. See extractor documentation for selector discovery.

For Non-Article Content (products, jobs, etc.)
Identify all fields to extract
- For e-commerce: name, price, rating, availability, images
- For jobs: title, company, salary, location, description
- For real estate: address, price, bedrooms, square footage, features
Test on multiple example pages
Verify selectors work across 2-3 different items to ensure consistency.
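The consistency check can be sketched in plain Python. This is a minimal illustration using stdlib `ElementTree` on well-formed hypothetical samples — in practice you would use `./scrapai analyze` or a real HTML parser, since live pages are rarely valid XML:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets standing in for 2-3 saved item pages.
SAMPLES = [
    "<html><body><article><h1>Widget A</h1><span class='price'>9.99</span></article></body></html>",
    "<html><body><article><h1>Widget B</h1><span class='price'>19.50</span></article></body></html>",
]

def extract(page: str) -> dict:
    """Apply the candidate selectors to one page."""
    root = ET.fromstring(page)
    title = root.find(".//article/h1")
    price = root.find(".//span[@class='price']")
    return {
        "title": title.text if title is not None else None,
        "price": price.text if price is not None else None,
    }

# The same selectors must yield every field on every sample.
results = [extract(p) for p in SAMPLES]
assert all(all(v is not None for v in r.values()) for r in results)
```

If any field comes back `None` on one sample but not another, the selector is too page-specific and needs to be generalized.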
Phase 2 Complete When:
- `final_spider.json` created with all URL matching rules
- Extractor strategy chosen:
  - Generic extractors: `EXTRACTOR_ORDER` configured
  - Custom selectors: `CUSTOM_SELECTORS` for title, content, author, date
  - Named callbacks: `callbacks` dict with custom field extraction
- All settings documented (Cloudflare, Playwright, etc. if needed)
Phase 3: Prepare Spider Configuration
Goal: Create test and final spider JSON files with all rules and settings.

Test Spider Config
Create `test_spider.json` with:
- 5 sample article URLs (not full start_urls)
- `follow: false` on all rules (no crawling, just extraction testing)
- Same extractor settings as final config
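A sketch of `test_spider.json` under those constraints (the exact schema is an assumption; the point is a handful of sample URLs plus `follow: false`):

```json
{
  "start_urls": [
    "https://site.com/blog/post-1",
    "https://site.com/blog/post-2",
    "https://site.com/blog/post-3",
    "https://site.com/blog/post-4",
    "https://site.com/blog/post-5"
  ],
  "rules": [
    {"pattern": "/blog/.+", "callback": "parse_article", "follow": false}
  ]
}
```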
Final Spider Config
Create `final_spider.json` with:
- Full start_urls (homepage, section pages)
- All URL matching rules with proper `follow` settings
- Complete extractor/callback configuration
- All spider settings (delays, concurrency, etc.)
- `source_url` included when processing from the queue
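A sketch of how `final_spider.json` might differ from the test config (schema and setting names such as the delay/concurrency keys are assumptions, shown only to illustrate the shape):

```json
{
  "start_urls": ["https://site.com/", "https://site.com/blog/"],
  "rules": [
    {"pattern": "/blog/.+", "callback": "parse_article", "follow": true},
    {"pattern": "/(about|contact|search)", "follow": false}
  ],
  "settings": {
    "DOWNLOAD_DELAY": 1,
    "CONCURRENT_REQUESTS": 4
  },
  "source_url": "https://site.com/"
}
```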
Phase 3 Complete When:
- `test_spider.json` created with 5 article URLs, `follow: false`
- `final_spider.json` created with all start_urls, rules, and settings
- `source_url` included in config (if processing from queue)
Phase 4: Execution & Verification
Goal: Test extraction quality on sample articles, then import final spider for production.

Step 4A: Test Extraction (5 Articles)
Verify output
- Title present and accurate
- Content complete (not truncated)
- Author extracted (if available)
- Date parsed correctly
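The checklist above can be scripted against the test crawl's JSONL output; a minimal sketch (the field names are assumed from this checklist, and the sample line is hypothetical):

```python
import json

# Hypothetical line from a test-crawl JSONL export.
line = '{"title": "Example Post", "content": "Full article text.", "author": "A. Writer", "date": "2024-01-15"}'

def verify(record: dict) -> list:
    """Return a list of problems found in one extracted record."""
    problems = []
    for field in ("title", "content", "date"):
        if not record.get(field):
            problems.append(f"missing {field}")
    return problems

record = json.loads(line)
assert verify(record) == []
```

Author is checked separately since it may legitimately be absent; truncated content still needs an eyeball check, since presence alone does not prove completeness.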
Step 4B: Import Final Spider

Testing (agent runs this):
- Export to `DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl`
- Enable checkpoint (Ctrl+C to pause, resume with same command)

Production Crawls (User Runs)
- Can take hours or days for large sites
- Explain: “Full crawls can take hours/days. I can’t run this for you as it would block our session.”
- Provide the exact command:
- Tell them:
  - Crawl output will be exported to `DATA_DIR/<project>/<spider>/crawls/crawl_TIMESTAMP.jsonl`
  - Checkpoint is enabled - press Ctrl+C to pause, run same command to resume
Phase 4 Complete When:
- Test crawl completed with `--limit 5`
- `show` output verified: title, content, author, date extracted correctly
- Final spider imported to database
- Spider ready for production (user will run full crawl)
Settings Reference
Generic Extractors (Default)
Work best on pages with semantic HTML (`<article>`, `<time>`, etc.).
Custom Selectors
JavaScript-Rendered Sites
Cloudflare Bypass
Hybrid mode (default, 20-100x faster):

Sitemap Spider
DeltaFetch (Incremental Crawling)
Infinite Scroll
Parallel Queue Processing
When processing multiple websites from the queue, the agent can work in parallel.

Phases within each website are sequential
Each website goes through Phase 1→2→3→4 in order, but multiple websites can be at different phases simultaneously
Common Pitfalls
Skipping phases
Never skip phases. Each phase builds on the previous one. If you skip Phase 1, you won’t have URL patterns for Phase 2. If you skip Phase 2, you won’t have extraction rules for Phase 3. Always complete 1→2→3→4 sequentially.
Running commands too fast
Run commands ONE AT A TIME. Never chain with `&&`. Read the output before proceeding to the next command.

Forgetting --project flag
ALWAYS use `--project <name>` on ALL spider, queue, crawl, show, and export commands. Without it, the command will fail or use the wrong project.

Running production crawls
NEVER run `crawl` without the `--limit` flag. Production crawls can take hours or days. You MUST NOT run them directly. Always use `--limit 5` for testing.

Reading HTML files directly
NEVER use Read/Grep on HTML files. Always use `./scrapai analyze`. The analyzer provides structured output, selector testing, and field discovery. Raw HTML is hard to parse and easy to misinterpret.