Overview
Migrating to ScrapAI means converting your Python scraping code into JSON configs. The process:

- Analyze existing code to understand extraction logic
- Map to ScrapAI concepts (rules, extractors, callbacks)
- Generate JSON config with equivalent behavior
- Test and verify extraction quality
- Deploy to database and retire old code
Why Migrate?
From README.md:170-179:

Your existing scrapers keep running while you verify. No big bang migration required.

Benefits:
- Database-first management: Change settings across 100 spiders with one SQL query
- Uniform structure: Consistent schema, validation, naming conventions
- Built-in features: Cloudflare bypass, checkpoint, proxy escalation, incremental crawling
- Easy to review: JSON configs are easier to audit than Python code
- AI-assisted updates: Point an agent at a broken spider to auto-fix extraction rules
Migration Workflow
Using an AI agent (Claude Code, Cursor, etc.):

Manual Migration

For direct control:

- Read your existing spider code
- Extract URL patterns, selectors, and extraction logic
- Write equivalent JSON config (see examples below)
- Import: `./scrapai spiders import config.json --project myproject`
- Test: `./scrapai crawl spider_name --project myproject --limit 5`
- Compare output with original spider
- Iterate until quality matches
Scrapy Spider Migration
Original Scrapy Spider
scrapy_spider.py
Equivalent ScrapAI Config
bbc_config.json
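The original config file is not reproduced here; a hedged sketch follows. The rule keys (`allow`, `follow`, `callback`, `css`, `get_all`, `processors`) come from the Key Mappings table; the `extract` nesting and overall shape are assumptions.

```json
{
  "name": "bbc",
  "allowed_domains": ["bbc.com"],
  "start_urls": ["https://www.bbc.com/news"],
  "rules": [
    {"allow": "/news/\\w+$", "follow": true},
    {
      "allow": "/news/articles/",
      "callback": "parse",
      "extract": {
        "title": {"css": "h1::text"},
        "body": {"css": "article p::text", "get_all": true, "processors": [{"type": "join"}]},
        "published": {"css": "time::attr(datetime)"}
      }
    }
  ]
}
```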
Key Mappings
| Scrapy Concept | ScrapAI Equivalent |
|---|---|
| `name` | `name` |
| `allowed_domains` | `allowed_domains` |
| `start_urls` | `start_urls` |
| `LinkExtractor(allow=...)` | `rules[].allow` |
| `LinkExtractor(deny=...)` | `rules[].deny` |
| `Rule(follow=True)` | `rules[].follow: true` |
| `Rule(callback='parse')` | `rules[].callback: "parse"` |
| `response.css('selector::text').get()` | `css: "selector::text"` |
| `response.css('selector::text').getall()` | `css: "selector::text", get_all: true` |
| `response.xpath('//div')` | `xpath: "//div"` |
| `' '.join(texts)` | `processors: [{"type": "join"}]` |
BeautifulSoup Migration
Original BeautifulSoup Script
bs4_scraper.py
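The original script is not reproduced here; the sketch below shows the typical shape of such a scraper. The site, selectors, and file names are hypothetical.

```python
# bs4_scraper.py -- representative sketch (hypothetical selectors)
import json

import requests
from bs4 import BeautifulSoup


def parse_product(html):
    """Extract name and price from a product page."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }


def scrape(start_url):
    """Fetch the listing, follow product links, export JSON -- all manually."""
    listing = requests.get(start_url, timeout=10)
    soup = BeautifulSoup(listing.text, "html.parser")
    items = []
    for link in soup.select("a.product-link"):
        # No retries, no rate limiting; relative hrefs would also need urljoin()
        page = requests.get(link["href"], timeout=10)
        items.append(parse_product(page.text))
    with open("products.json", "w") as f:
        json.dump(items, f)


if __name__ == "__main__":
    scrape("https://example.com/products")
```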
Equivalent ScrapAI Config
products_config.json
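The original config file is not reproduced here; a hedged sketch follows, using the same illustrative selectors. Rule keys follow the Key Mappings table; the `extract` nesting is an assumption.

```json
{
  "name": "products",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": "/products/",
      "callback": "parse",
      "extract": {
        "name": {"css": "h1.product-name::text"},
        "price": {"css": "span.price::text"}
      }
    }
  ]
}
```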
Key Differences
BeautifulSoup:

- Manual HTTP requests
- Manual link extraction
- Manual JSON export
- No retry logic
- No rate limiting

ScrapAI:

- Scrapy handles requests (retries, delays, middleware)
- Automatic link extraction via rules
- Automatic JSONL export
- Built-in retry and error handling
- Configurable rate limiting
Scrapling Migration
From README.md:32:

For single-site scraping with fine-grained control, use Scrapling. ScrapAI is for multi-site fleets.

When to migrate from Scrapling:
- You have 10+ sites to scrape
- Sites have similar structure (e.g., all news sites)
- You want database-driven management
- You need scheduling and monitoring
When to stay with Scrapling:

- Single site with complex interaction
- Heavy JavaScript rendering
- Fine-grained control needed
- Login/auth flows
Example Migration
scrapling_script.py
hn_config.json
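The original config file is not reproduced here; a hedged sketch for a Hacker News crawl follows. Rule keys follow the Key Mappings table; the selectors and `extract` nesting are assumptions.

```json
{
  "name": "hackernews",
  "allowed_domains": ["news.ycombinator.com"],
  "start_urls": ["https://news.ycombinator.com/"],
  "rules": [
    {
      "allow": "item\\?id=",
      "callback": "parse",
      "extract": {
        "title": {"css": "span.titleline a::text"},
        "points": {"css": "span.score::text"}
      }
    }
  ]
}
```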
Processors for Data Cleaning
From core/schemas.py:131-156:
Common Processor Patterns
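As a hedged sketch of how processors chain onto an extracted field: only the `join` processor type is confirmed by the Key Mappings table; the `strip` type and the `separator` key are assumptions.

```json
{
  "body": {
    "css": "article p::text",
    "get_all": true,
    "processors": [
      {"type": "strip"},
      {"type": "join", "separator": " "}
    ]
  }
}
```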
Strip whitespace:

Validation During Migration
All configs go through strict validation before import. From core/schemas.py:215-402:
Spider Name Validation
URL Validation (SSRF Protection)
Callback Validation
Testing After Migration
Compare Output Quality
Verify Extraction Rules
Performance Comparison
Incremental Migration Strategy
Phase 1: Pilot (1-2 weeks)
- Pick 3-5 representative spiders
- Migrate to ScrapAI
- Run both old and new in parallel
- Compare output quality
- Tune extraction rules until quality matches
Phase 2: Batch Migration (2-4 weeks)
- Group remaining spiders by similarity
- Migrate one group at a time
- Reuse patterns from pilot spiders
- Test each batch before moving to next
Phase 3: Cutover (1 week)
- Switch production traffic to ScrapAI
- Keep old spiders as backup for 1 month
- Monitor error rates and data quality
- Retire old code once confident
Phase 4: Optimization (ongoing)
- Tune `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS`
- Enable DeltaFetch for incremental crawling
- Set up Airflow for scheduling
- Add custom callbacks for edge cases
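As a sketch of what Phase 4 tuning might look like in a config: `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS` are standard Scrapy setting names, but the `settings` block shape and the `DELTAFETCH_ENABLED` key are assumptions.

```json
{
  "settings": {
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 8,
    "DELTAFETCH_ENABLED": true
  }
}
```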
Common Pitfalls
Regex Patterns
Problem: Scrapy uses Python regex; ScrapAI uses the same. Solution: Copy patterns directly, but test with `./scrapai crawl --limit 5`.
Relative vs. Absolute URLs
Problem: Old spider might use `urljoin()` for relative URLs.
Solution: Scrapy handles this automatically. Just use `css: "a::attr(href)"`.
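To illustrate what Scrapy's automatic handling replaces, this is the manual resolution step an old spider typically performed before following a link (URLs are illustrative):

```python
from urllib.parse import urljoin

# Manual resolution an old spider would do for each extracted href
base = "https://example.com/news/world"
href = "../articles/123"
print(urljoin(base, href))  # https://example.com/articles/123
```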
Custom Middleware
Problem: Old spider uses custom Scrapy middleware. Solution:

- Proxy rotation: Use ScrapAI's built-in proxy escalation
- Cloudflare: Enable `CLOUDFLARE_ENABLED: true`
- Custom headers: Add to spider settings
- Other middleware: May require framework changes (contribute!)