Migrate existing scrapers from Scrapy, BeautifulSoup, Scrapling, or any Python scraping framework to ScrapAI’s database-driven architecture.

Overview

Migrating to ScrapAI means converting your Python scraping code into JSON configs. The process:
  1. Analyze existing code to understand extraction logic
  2. Map to ScrapAI concepts (rules, extractors, callbacks)
  3. Generate JSON config with equivalent behavior
  4. Test and verify extraction quality
  5. Deploy to database and retire old code

Why Migrate?

From README.md:170-179:
Your existing scrapers keep running while you verify. No big bang migration required.
Benefits:
  • Database-first management: Change settings across 100 spiders with one SQL query
  • Uniform structure: Consistent schema, validation, naming conventions
  • Built-in features: Cloudflare bypass, checkpoint, proxy escalation, incremental crawling
  • Easy to review: JSON configs are easier to audit than Python code
  • AI-assisted updates: Point an agent at a broken spider to auto-fix extraction rules

Migration Workflow

Using an AI agent (Claude Code, Cursor, etc.):
You: "Migrate my spider at scripts/bbc_spider.py to ScrapAI"
Agent: [Reads Python, extracts URL patterns and selectors, writes JSON config, tests, saves to database]

Manual Migration

For direct control:
  1. Read your existing spider code
  2. Extract URL patterns, selectors, and extraction logic
  3. Write equivalent JSON config (see examples below)
  4. Import: ./scrapai spiders import config.json --project myproject
  5. Test: ./scrapai crawl spider_name --project myproject --limit 5
  6. Compare output with original spider
  7. Iterate until quality matches

Scrapy Spider Migration

Original Scrapy Spider

scrapy_spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BBCSpider(CrawlSpider):
    name = 'bbc'
    allowed_domains = ['bbc.com', 'bbc.co.uk']
    start_urls = ['https://www.bbc.com/news']

    rules = (
        # Follow category pages
        Rule(
            LinkExtractor(allow=r'/news/[a-z_]+$'),
            follow=True,
        ),
        # Extract articles
        Rule(
            LinkExtractor(allow=r'/news/articles/[a-z0-9-]+$'),
            callback='parse_article',
            follow=False,
        ),
    )

    def parse_article(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1.article-headline::text').get(),
            'content': ' '.join(response.css('div.article-body p::text').getall()),
            'author': response.css('span.author-name::text').get(),
            'date': response.css('time.date-published::attr(datetime)').get(),
        }

Equivalent ScrapAI Config

bbc_config.json
{
  "name": "bbc_news",
  "source_url": "https://www.bbc.com/news",
  "allowed_domains": ["bbc.com", "bbc.co.uk"],
  "start_urls": ["https://www.bbc.com/news"],
  "rules": [
    {
      "allow": ["/news/[a-z_]+$"],
      "follow": true,
      "priority": 10
    },
    {
      "allow": ["/news/articles/[a-z0-9-]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1
  },
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {
          "css": "h1.article-headline::text"
        },
        "content": {
          "css": "div.article-body p::text",
          "get_all": true,
          "processors": [
            {"type": "join", "separator": " "}
          ]
        },
        "author": {
          "css": "span.author-name::text"
        },
        "published_date": {
          "css": "time.date-published::attr(datetime)",
          "processors": [
            {"type": "parse_datetime"}
          ]
        }
      }
    }
  }
}

Key Mappings

Scrapy Concept → ScrapAI Equivalent
  • name → name
  • allowed_domains → allowed_domains
  • start_urls → start_urls
  • LinkExtractor(allow=...) → rules[].allow
  • LinkExtractor(deny=...) → rules[].deny
  • Rule(follow=True) → rules[].follow: true
  • Rule(callback='parse') → rules[].callback: "parse"
  • response.css('selector::text').get() → css: "selector::text"
  • response.css('selector::text').getall() → css: "selector::text", get_all: true
  • response.xpath('//div') → xpath: "//div"
  • ' '.join(texts) → processors: [{"type": "join"}]
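The mappings above can be sketched as a small helper that turns Scrapy Rule/LinkExtractor arguments into a ScrapAI rule dict. This is an illustrative sketch, not ScrapAI's actual conversion code; the field names follow the JSON examples in this guide:

```python
def scrapy_rule_to_scrapai(allow=None, deny=None, callback=None,
                           follow=False, priority=None):
    """Build a ScrapAI-style rule dict from Scrapy Rule/LinkExtractor args."""
    rule = {}
    if allow:
        rule["allow"] = list(allow)
    if deny:
        rule["deny"] = list(deny)
    if callback:
        rule["callback"] = callback
    rule["follow"] = follow
    if priority is not None:
        rule["priority"] = priority
    return rule

# The two rules from the BBC spider above:
rules = [
    scrapy_rule_to_scrapai(allow=[r"/news/[a-z_]+$"], follow=True, priority=10),
    scrapy_rule_to_scrapai(allow=[r"/news/articles/[a-z0-9-]+$"],
                           callback="parse_article", follow=False, priority=100),
]
```

A helper like this is mostly useful when batch-converting many similar CrawlSpiders; for one-off migrations, writing the JSON by hand is just as fast.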

BeautifulSoup Migration

Original BeautifulSoup Script

bs4_scraper.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = []
for item in soup.select('div.product-card'):
    product = {
        'name': item.select_one('h3.product-name').text.strip(),
        'price': item.select_one('span.price').text.strip().replace('$', ''),
        'url': urljoin(url, item.select_one('a')['href']),
        'image': item.select_one('img')['src'],
    }
    products.append(product)

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)

Equivalent ScrapAI Config

products_config.json
{
  "name": "example_products",
  "source_url": "https://example.com/products",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": ["/products$"],
      "callback": "parse_products",
      "follow": false
    }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0.5
  },
  "callbacks": {
    "parse_products": {
      "extract": {
        "products": {
          "type": "nested_list",
          "selector": "div.product-card",
          "extract": {
            "name": {
              "css": "h3.product-name::text",
              "processors": [{"type": "strip"}]
            },
            "price": {
              "css": "span.price::text",
              "processors": [
                {"type": "strip"},
                {"type": "replace", "old": "$", "new": ""},
                {"type": "cast", "to": "float"}
              ]
            },
            "url": {
              "css": "a::attr(href)"
            },
            "image": {
              "css": "img::attr(src)"
            }
          }
        }
      }
    }
  }
}

Key Differences

ScrapAI handles request dispatch, link extraction, retries, rate limiting, and JSONL export automatically, eliminating the manual boilerplate in the script above (the requests session, urljoin calls, and file writing).

Scrapling Migration

Migrate from Scrapling when managing 10+ sites with similar structure. Keep Scrapling for single sites with complex interactions, heavy JavaScript, or login flows.

Example Migration

scrapling_script.py
from scrapling import Fetcher

fetcher = Fetcher()
page = fetcher.get('https://news.ycombinator.com')

titles = page.css('span.titleline > a').text_content(all=True)
for title in titles:
    print(title)

Equivalent ScrapAI Config

hn_config.json
{
  "name": "hackernews",
  "source_url": "https://news.ycombinator.com",
  "allowed_domains": ["news.ycombinator.com"],
  "start_urls": ["https://news.ycombinator.com"],
  "rules": [
    {
      "allow": ["/$"],
      "callback": "parse_frontpage",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_frontpage": {
      "extract": {
        "stories": {
          "type": "nested_list",
          "selector": "span.titleline > a",
          "extract": {
            "title": {
              "css": "::text"
            },
            "url": {
              "css": "::attr(href)"
            }
          }
        }
      }
    }
  }
}

Processors for Data Cleaning

From core/schemas.py:131-156:
allowed = {
    "strip",
    "replace",
    "regex",
    "cast",
    "join",
    "default",
    "lowercase",
    "parse_datetime",
}
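To make the processor semantics concrete, here is a minimal stand-in interpreter for a processor chain. This is an illustrative sketch only; ScrapAI's real implementation lives in core/ and may handle edge cases (missing values, locale-aware dates) differently:

```python
import re
from datetime import datetime

def apply_processors(value, processors):
    """Apply a ScrapAI-style processor chain to one extracted value (sketch)."""
    for p in processors:
        kind = p["type"]
        if kind == "strip":
            value = value.strip()
        elif kind == "replace":
            value = value.replace(p["old"], p["new"])
        elif kind == "regex":
            m = re.search(p["pattern"], value)
            value = m.group(0) if m else None
        elif kind == "cast":
            value = {"int": int, "float": float, "str": str}[p["to"]](value)
        elif kind == "join":
            value = p.get("separator", " ").join(value)
        elif kind == "default":
            value = value if value else p["value"]
        elif kind == "lowercase":
            value = value.lower()
        elif kind == "parse_datetime":
            value = datetime.fromisoformat(value)
    return value

# The price chain from the products example: " $19.99 " -> 19.99
price = apply_processors(" $19.99 ", [
    {"type": "strip"},
    {"type": "replace", "old": "$", "new": ""},
    {"type": "cast", "to": "float"},
])
```

Processors run in order, so put cleanup steps (strip, replace) before type conversion (cast, parse_datetime).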

Common Processor Patterns

Strip whitespace:
{"type": "strip"}
Remove characters:
{"type": "replace", "old": "$", "new": ""}
Extract with regex:
{"type": "regex", "pattern": "\\d+"}
Convert type:
{"type": "cast", "to": "float"}
Join list:
{"type": "join", "separator": " "}
Default value:
{"type": "default", "value": "Unknown"}
Lowercase:
{"type": "lowercase"}
Parse datetime:
{"type": "parse_datetime"}

Validation During Migration

All configs are validated before import:
  • Spider names: Alphanumeric characters, underscores, and hyphens only
  • URLs: HTTP/HTTPS only, with SSRF protection (blocks localhost and private IPs)
  • Callbacks: Must be valid Python identifiers and cannot shadow reserved names such as parse
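A rough approximation of these checks, for intuition only. The real validation lives in core/schemas.py and is stricter; in particular, a production SSRF guard must also resolve hostnames to IPs before checking them, which this sketch skips:

```python
import ipaddress
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")  # alphanumeric, underscore, hyphen

def valid_spider_name(name):
    return bool(NAME_RE.match(name))

def valid_url(url):
    """HTTP/HTTPS only; reject localhost and private/loopback IP literals."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    if host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname, not an IP literal (DNS check omitted here)
    return not (ip.is_private or ip.is_loopback)
```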

Testing After Migration

Compare Output Quality

# Run old spider
python old_spider.py > old_output.json

# Run new ScrapAI spider
./scrapai crawl new_spider --project myproject --limit 10
./scrapai export new_spider --project myproject --format json > new_output.json

# Compare field coverage
python -c "
import json
old = json.load(open('old_output.json'))
new = json.load(open('new_output.json'))

old_fields = set(old[0].keys())
new_fields = set(new[0].keys())

print('Missing fields:', old_fields - new_fields)
print('Extra fields:', new_fields - old_fields)
"
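Beyond checking which keys exist, it helps to compare how often each field is actually populated, since a selector that silently matches nothing produces present-but-empty fields. A hypothetical helper (swap in your own output files in place of the inline sample records):

```python
import json

def fill_rates(records):
    """Fraction of records in which each field is non-empty."""
    fields = {key for record in records for key in record}
    return {f: sum(1 for r in records if r.get(f)) / len(records)
            for f in fields}

# Inline samples standing in for json.load(open('old_output.json')) etc.
old = [{"title": "a", "author": ""}, {"title": "b", "author": "x"}]
new = [{"title": "a", "author": "x"}, {"title": "b", "author": "y"}]

new_rates = fill_rates(new)
for field, rate in sorted(fill_rates(old).items()):
    print(f"{field}: old={rate:.0%} new={new_rates.get(field, 0):.0%}")
```

A field whose fill rate drops sharply after migration usually means the CSS selector needs adjusting, not that the field is missing from the config.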

Verify Extraction Rules

# Inspect a sample page
./scrapai inspect https://example.com/article --project myproject

# Check if selectors match
grep 'title' output.html
grep 'content' output.html

Performance Comparison

# Time old spider
time python old_spider.py

# Time new spider
time ./scrapai crawl new_spider --project myproject

Incremental Migration Strategy

Phase 1: Pilot (1-2 weeks)

  1. Pick 3-5 representative spiders
  2. Migrate to ScrapAI
  3. Run both old and new in parallel
  4. Compare output quality
  5. Tune extraction rules until quality matches

Phase 2: Batch Migration (2-4 weeks)

  1. Group remaining spiders by similarity
  2. Migrate one group at a time
  3. Reuse patterns from pilot spiders
  4. Test each batch before moving to next

Phase 3: Cutover (1 week)

  1. Switch production traffic to ScrapAI
  2. Keep old spiders as backup for 1 month
  3. Monitor error rates and data quality
  4. Retire old code once confident

Phase 4: Optimization (ongoing)

  1. Tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS
  2. Enable DeltaFetch for incremental crawling
  3. Set up Airflow for scheduling
  4. Add custom callbacks for edge cases
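The throttling knobs from step 1 go in the spider's settings block, alongside the DeltaFetch toggle from step 2. A hedged example: DOWNLOAD_DELAY appears elsewhere in this guide, but CONCURRENT_REQUESTS and DELTAFETCH_ENABLED are the standard Scrapy/scrapy-deltafetch setting names, and ScrapAI's actual keys may differ:

```json
{
  "settings": {
    "DOWNLOAD_DELAY": 0.5,
    "CONCURRENT_REQUESTS": 8,
    "DELTAFETCH_ENABLED": true
  }
}
```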

Common Pitfalls

Regex Patterns: Copy patterns directly from Scrapy, then test with ./scrapai crawl --limit 5.

Relative URLs: Scrapy handles urljoin() automatically. Use css: "a::attr(href)".

Custom Middleware:
  • Proxy rotation: Use built-in proxy escalation
  • Cloudflare: Enable CLOUDFLARE_ENABLED: true
  • Custom headers: Add to spider settings

JavaScript/Dynamic Content: Use ScrapAI’s Playwright extractor:
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright"],
    "PLAYWRIGHT_WAIT_SELECTOR": "div.content"
  }
}

See Also

Custom Callbacks

Write custom extraction logic for complex sites

Security

Understanding config validation and SSRF protection