Overview
Migrating to ScrapAI means converting your Python scraping code into JSON configs. The process:

- Analyze existing code to understand extraction logic
- Map to ScrapAI concepts (rules, extractors, callbacks)
- Generate JSON config with equivalent behavior
- Test and verify extraction quality
- Deploy to database and retire old code
Why Migrate?
From README.md:170-179:

Your existing scrapers keep running while you verify. No big-bang migration required.

Benefits:
- Database-first management: Change settings across 100 spiders with one SQL query
- Uniform structure: Consistent schema, validation, naming conventions
- Built-in features: Cloudflare bypass, checkpoint, proxy escalation, incremental crawling
- Easy to review: JSON configs are easier to audit than Python code
- AI-assisted updates: Point an agent at a broken spider to auto-fix extraction rules
Migration Workflow
Using an AI agent (Claude Code, Cursor, etc.), point the agent at your existing spider code and ask it to generate an equivalent ScrapAI config, then review the result before importing.

Manual Migration
For direct control:

- Read your existing spider code
- Extract URL patterns, selectors, and extraction logic
- Write equivalent JSON config (see examples below)
- Import: `./scrapai spiders import config.json --project myproject`
- Test: `./scrapai crawl spider_name --project myproject --limit 5`
- Compare output with original spider
- Iterate until quality matches
Scrapy Spider Migration
Original Scrapy Spider
scrapy_spider.py
Equivalent ScrapAI Config
bbc_config.json
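The config file is not reproduced here; a sketch assembled from the Key Mappings table below. The `extractors` key and its nesting are assumptions about the schema, and the callback is named `extract_article` because `parse_article` is a reserved name per the validation rules later on this page:

```json
{
  "name": "bbc_news",
  "allowed_domains": ["bbc.com"],
  "start_urls": ["https://www.bbc.com/news"],
  "rules": [
    {
      "allow": ["/news/articles/"],
      "follow": true,
      "callback": "extract_article"
    }
  ],
  "extractors": {
    "extract_article": {
      "title": {"css": "h1::text"},
      "body": {
        "css": "article p::text",
        "get_all": true,
        "processors": [{"type": "join"}]
      }
    }
  }
}
```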
Key Mappings
| Scrapy Concept | ScrapAI Equivalent |
|---|---|
| `name` | `name` |
| `allowed_domains` | `allowed_domains` |
| `start_urls` | `start_urls` |
| `LinkExtractor(allow=...)` | `rules[].allow` |
| `LinkExtractor(deny=...)` | `rules[].deny` |
| `Rule(follow=True)` | `rules[].follow: true` |
| `Rule(callback='parse')` | `rules[].callback: "parse"` |
| `response.css('selector::text').get()` | `css: "selector::text"` |
| `response.css('selector::text').getall()` | `css: "selector::text", get_all: true` |
| `response.xpath('//div')` | `xpath: "//div"` |
| `' '.join(texts)` | `processors: [{"type": "join"}]` |
BeautifulSoup Migration
Original BeautifulSoup Script
bs4_scraper.py
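The script is not reproduced here; a minimal sketch of the kind of BeautifulSoup scraper this section assumes, with parsing split from fetching (URL and selectors are illustrative):

```python
import requests
from bs4 import BeautifulSoup


def parse_products(html):
    """Extract name/price pairs from a product listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "name": card.select_one("h2.title").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
        }
        for card in soup.select("div.product")
    ]


def scrape(url):
    """Fetch one page and parse it: no retries, no rate limiting, no export."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_products(resp.text)
```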
Equivalent ScrapAI Config
products_config.json
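The config file is not reproduced here; a hedged sketch following the same assumed schema as the BBC example above (the `extractors` key and field layout are inferred from the Key Mappings table, and the domain is a placeholder):

```json
{
  "name": "products",
  "allowed_domains": ["shop.example.com"],
  "start_urls": ["https://shop.example.com/catalog"],
  "rules": [
    {
      "allow": ["/product/"],
      "follow": true,
      "callback": "extract_product"
    }
  ],
  "extractors": {
    "extract_product": {
      "name": {"css": "h2.title::text"},
      "price": {"css": "span.price::text"}
    }
  }
}
```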
Key Differences
ScrapAI provides automatic request handling, link extraction, retries, rate limiting, and JSONL export, eliminating manual boilerplate code.

Scrapling Migration
Migrate from Scrapling when managing 10+ sites with similar structure. Keep Scrapling for single sites with complex interactions, heavy JavaScript, or login flows.

Example Migration
scrapling_script.py
hn_config.json
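The config file is not reproduced here; a hedged sketch of a front-page-only Hacker News config under the same assumed schema. The `span.titleline > a` selector reflects Hacker News markup at the time of writing and may drift:

```json
{
  "name": "hackernews",
  "allowed_domains": ["news.ycombinator.com"],
  "start_urls": ["https://news.ycombinator.com/"],
  "rules": [
    {
      "allow": ["news\\.ycombinator\\.com"],
      "follow": false,
      "callback": "extract_front_page"
    }
  ],
  "extractors": {
    "extract_front_page": {
      "titles": {"css": "span.titleline > a::text", "get_all": true},
      "links": {"css": "span.titleline > a::attr(href)", "get_all": true}
    }
  }
}
```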
Processors for Data Cleaning
From core/schemas.py:131-156:
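The schema excerpt is not reproduced here; a hedged sketch of how processors might chain on a field, assuming the schema above. Only the `join` processor type is confirmed by this page; `strip` and the `separator` option are assumptions:

```json
{
  "css": "div.content p::text",
  "get_all": true,
  "processors": [
    {"type": "strip"},
    {"type": "join", "separator": " "}
  ]
}
```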
Common Processor Patterns
Strip whitespace:

Validation During Migration
All configs are validated before import:

- Spider names: Alphanumeric characters, underscores, and hyphens only
- URLs: HTTP/HTTPS only, with SSRF protection (blocks localhost and private IPs)
- Callbacks: Must be valid Python identifiers; cannot use reserved names (`parse`, `parse_article`, etc.)
Testing After Migration
Compare Output Quality
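One way to compare the old and new spiders is to measure per-field coverage across both JSONL outputs; a stdlib sketch (the helper name and file paths are hypothetical):

```python
import json
from collections import Counter


def field_coverage(path):
    """Fraction of JSONL records in which each field is present and non-empty."""
    counts, total = Counter(), 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            total += 1
            for key, value in record.items():
                if value not in (None, "", []):
                    counts[key] += 1
    return {k: v / total for k, v in counts.items()} if total else {}
```

Run it against the old spider's output and the migrated spider's output, and investigate any field whose coverage drops.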
Verify Extraction Rules
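Before importing, you can sanity-check that a rule's allow/deny patterns accept the URLs you expect; a stdlib sketch mimicking LinkExtractor-style filtering (the patterns and URLs are illustrative):

```python
import re

# Patterns copied from a config's rules[].allow / rules[].deny (hypothetical values)
allow = [r"/news/articles/"]
deny = [r"/live/"]


def rule_matches(url):
    """Deny patterns win; otherwise any allow pattern must match."""
    if any(re.search(p, url) for p in deny):
        return False
    return any(re.search(p, url) for p in allow)


assert rule_matches("https://www.bbc.com/news/articles/c123o") is True
assert rule_matches("https://www.bbc.com/news/live/c123o") is False
assert rule_matches("https://www.bbc.com/sport") is False
```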
Performance Comparison
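No benchmark tooling is described on this page; a quick way to compare throughput is records per second from each run's record count and wall-clock time (the numbers below are illustrative):

```python
# Hypothetical run stats: (records written, wall-clock seconds) per stack
runs = {"scrapy_old": (4800, 600), "scrapai_new": (4750, 420)}

for label, (records, seconds) in runs.items():
    print(f"{label}: {records / seconds:.1f} records/sec")
```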
Incremental Migration Strategy
Phase 1: Pilot (1-2 weeks)
- Pick 3-5 representative spiders
- Migrate to ScrapAI
- Run both old and new in parallel
- Compare output quality
- Tune extraction rules until quality matches
Phase 2: Batch Migration (2-4 weeks)
- Group remaining spiders by similarity
- Migrate one group at a time
- Reuse patterns from pilot spiders
- Test each batch before moving to next
Phase 3: Cutover (1 week)
- Switch production traffic to ScrapAI
- Keep old spiders as backup for 1 month
- Monitor error rates and data quality
- Retire old code once confident
Phase 4: Optimization (ongoing)
- Tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS
- Enable DeltaFetch for incremental crawling
- Set up Airflow for scheduling
- Add custom callbacks for edge cases
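The tuning steps above might land in a spider's settings block; a hedged sketch in which the `settings` key and `DELTAFETCH_ENABLED` flag are assumptions beyond what this page confirms (`CLOUDFLARE_ENABLED` appears under Common Pitfalls below):

```json
{
  "settings": {
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 8,
    "DELTAFETCH_ENABLED": true
  }
}
```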
Common Pitfalls
Regex Patterns: Copy patterns directly from Scrapy, then test with `./scrapai crawl --limit 5`.
Relative URLs: Scrapy resolves relative links with urljoin() automatically, and so does ScrapAI. Use `css: "a::attr(href)"` without manual URL joining.
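That automatic resolution behaves like Python's stdlib urljoin:

```python
from urllib.parse import urljoin

# What the framework does for you when it follows a relative href
base = "https://example.com/news/index.html"
print(urljoin(base, "articles/c123"))  # https://example.com/news/articles/c123
print(urljoin(base, "/about"))         # https://example.com/about
```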
Custom Middleware:
- Proxy rotation: Use built-in proxy escalation
- Cloudflare: Enable `CLOUDFLARE_ENABLED: true`
- Custom headers: Add to spider settings
See Also
Custom Callbacks
Write custom extraction logic for complex sites
Security
Understanding config validation and SSRF protection