Migrate existing scrapers from Scrapy, BeautifulSoup, Scrapling, or any Python scraping framework to ScrapAI’s database-driven architecture.

Overview

Migrating to ScrapAI means converting your Python scraping code into JSON configs. The process:
  1. Analyze existing code to understand extraction logic
  2. Map to ScrapAI concepts (rules, extractors, callbacks)
  3. Generate JSON config with equivalent behavior
  4. Test and verify extraction quality
  5. Deploy to database and retire old code

Why Migrate?

From README.md:170-179:
Your existing scrapers keep running while you verify. No big bang migration required.
Benefits:
  • Database-first management: Change settings across 100 spiders with one SQL query
  • Uniform structure: Consistent schema, validation, naming conventions
  • Built-in features: Cloudflare bypass, checkpoint, proxy escalation, incremental crawling
  • Easy to review: JSON configs are easier to audit than Python code
  • AI-assisted updates: Point an agent at a broken spider to auto-fix extraction rules

Migration Workflow

Using an AI agent (Claude Code, Cursor, etc.):
You: "Migrate my spider at scripts/bbc_spider.py to ScrapAI"
Agent: [Reads Python, extracts URL patterns and selectors, writes JSON config, tests, saves to database]

Manual Migration

For direct control:
  1. Read your existing spider code
  2. Extract URL patterns, selectors, and extraction logic
  3. Write equivalent JSON config (see examples below)
  4. Import: ./scrapai spiders import config.json --project myproject
  5. Test: ./scrapai crawl spider_name --project myproject --limit 5
  6. Compare output with original spider
  7. Iterate until quality matches

Scrapy Spider Migration

Original Scrapy Spider

scrapy_spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BBCSpider(CrawlSpider):
    name = 'bbc'
    allowed_domains = ['bbc.com', 'bbc.co.uk']
    start_urls = ['https://www.bbc.com/news']

    rules = (
        # Follow category pages
        Rule(
            LinkExtractor(allow=r'/news/[a-z_]+$'),
            follow=True,
        ),
        # Extract articles
        Rule(
            LinkExtractor(allow=r'/news/articles/[a-z0-9-]+$'),
            callback='parse_article',
            follow=False,
        ),
    )

    def parse_article(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1.article-headline::text').get(),
            'content': ' '.join(response.css('div.article-body p::text').getall()),
            'author': response.css('span.author-name::text').get(),
            'date': response.css('time.date-published::attr(datetime)').get(),
        }

Equivalent ScrapAI Config

bbc_config.json
{
  "name": "bbc_news",
  "source_url": "https://www.bbc.com/news",
  "allowed_domains": ["bbc.com", "bbc.co.uk"],
  "start_urls": ["https://www.bbc.com/news"],
  "rules": [
    {
      "allow": ["/news/[a-z_]+$"],
      "follow": true,
      "priority": 10
    },
    {
      "allow": ["/news/articles/[a-z0-9-]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 100
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 1
  },
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {
          "css": "h1.article-headline::text"
        },
        "content": {
          "css": "div.article-body p::text",
          "get_all": true,
          "processors": [
            {"type": "join", "separator": " "}
          ]
        },
        "author": {
          "css": "span.author-name::text"
        },
        "published_date": {
          "css": "time.date-published::attr(datetime)",
          "processors": [
            {"type": "parse_datetime"}
          ]
        }
      }
    }
  }
}

Key Mappings

Scrapy Concept → ScrapAI Equivalent
name → name
allowed_domains → allowed_domains
start_urls → start_urls
LinkExtractor(allow=...) → rules[].allow
LinkExtractor(deny=...) → rules[].deny
Rule(follow=True) → rules[].follow: true
Rule(callback='parse') → rules[].callback: "parse"
response.css('selector::text').get() → css: "selector::text"
response.css('selector::text').getall() → css: "selector::text", get_all: true
response.xpath('//div') → xpath: "//div"
' '.join(texts) → processors: [{"type": "join"}]

BeautifulSoup Migration

Original BeautifulSoup Script

bs4_scraper.py
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

products = []
for item in soup.select('div.product-card'):
    product = {
        'name': item.select_one('h3.product-name').text.strip(),
        'price': item.select_one('span.price').text.strip().replace('$', ''),
        'url': urljoin(url, item.select_one('a')['href']),
        'image': item.select_one('img')['src'],
    }
    products.append(product)

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)

Equivalent ScrapAI Config

products_config.json
{
  "name": "example_products",
  "source_url": "https://example.com/products",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/products"],
  "rules": [
    {
      "allow": ["/products$"],
      "callback": "parse_products",
      "follow": false
    }
  ],
  "settings": {
    "DOWNLOAD_DELAY": 0.5
  },
  "callbacks": {
    "parse_products": {
      "extract": {
        "products": {
          "type": "nested_list",
          "selector": "div.product-card",
          "extract": {
            "name": {
              "css": "h3.product-name::text",
              "processors": [{"type": "strip"}]
            },
            "price": {
              "css": "span.price::text",
              "processors": [
                {"type": "strip"},
                {"type": "replace", "old": "$", "new": ""},
                {"type": "cast", "to": "float"}
              ]
            },
            "url": {
              "css": "a::attr(href)"
            },
            "image": {
              "css": "img::attr(src)"
            }
          }
        }
      }
    }
  }
}

Key Differences

BeautifulSoup:
  • Manual HTTP requests
  • Manual link extraction
  • Manual JSON export
  • No retry logic
  • No rate limiting
ScrapAI:
  • Scrapy handles requests (retries, delays, middleware)
  • Automatic link extraction via rules
  • Automatic JSONL export
  • Built-in retry and error handling
  • Configurable rate limiting

Scrapling Migration

From README.md:32:
For single-site scraping with fine-grained control, use Scrapling. ScrapAI is for multi-site fleets.
When to migrate from Scrapling:
  • You have 10+ sites to scrape
  • Sites have similar structure (e.g., all news sites)
  • You want database-driven management
  • You need scheduling and monitoring
When to keep Scrapling:
  • Single site with complex interaction
  • Heavy JavaScript rendering
  • Fine-grained control needed
  • Login/auth flows

Example Migration

scrapling_script.py
from scrapling import Fetcher

fetcher = Fetcher()
page = fetcher.get('https://news.ycombinator.com')

titles = page.css('span.titleline > a').text_content(all=True)
for title in titles:
    print(title)

Equivalent ScrapAI Config

hn_config.json
{
  "name": "hackernews",
  "source_url": "https://news.ycombinator.com",
  "allowed_domains": ["news.ycombinator.com"],
  "start_urls": ["https://news.ycombinator.com"],
  "rules": [
    {
      "allow": ["/$"],
      "callback": "parse_frontpage",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_frontpage": {
      "extract": {
        "stories": {
          "type": "nested_list",
          "selector": "span.titleline > a",
          "extract": {
            "title": {
              "css": "::text"
            },
            "url": {
              "css": "::attr(href)"
            }
          }
        }
      }
    }
  }
}

Processors for Data Cleaning

From core/schemas.py:131-156:
allowed = {
    "strip",
    "replace",
    "regex",
    "cast",
    "join",
    "default",
    "lowercase",
    "parse_datetime",
}

Common Processor Patterns

Strip whitespace:
{"type": "strip"}
Remove characters:
{"type": "replace", "old": "$", "new": ""}
Extract with regex:
{"type": "regex", "pattern": "\\d+"}
Convert type:
{"type": "cast", "to": "float"}
Join list:
{"type": "join", "separator": " "}
Default value:
{"type": "default", "value": "Unknown"}
Lowercase:
{"type": "lowercase"}
Parse datetime:
{"type": "parse_datetime"}

Validation During Migration

All configs go through strict validation before import. From core/schemas.py:215-402:

Spider Name Validation

@field_validator("name")
@classmethod
def validate_name(cls, v):
    if not re.match(r"^[a-zA-Z0-9_-]+$", v):
        raise ValueError(
            f"Invalid spider name: {v}. "
            "Only alphanumeric characters, underscores, and hyphens allowed."
        )
    return v
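
In practice, names like bbc_news or example-products pass, while anything containing spaces, dots, or slashes is rejected. A quick standalone check of the same pattern (example names are illustrative):

import re

NAME_PATTERN = r"^[a-zA-Z0-9_-]+$"  # same pattern as the validator above

for name in ("bbc_news", "example-products", "bbc news (v2)"):
    ok = bool(re.match(NAME_PATTERN, name))
    print(name, "valid" if ok else "invalid")
# bbc_news valid
# example-products valid
# bbc news (v2) invalid  (spaces and parentheses are not allowed)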

URL Validation (SSRF Protection)

@field_validator("source_url", "start_urls")
@classmethod
def validate_urls(cls, v):
    allowed_schemes = {"http", "https"}
    
    # Check scheme
    if not any(url.lower().startswith(f"{scheme}://") for scheme in allowed_schemes):
        raise ValueError(
            f"Invalid URL scheme: {url}. Only HTTP and HTTPS are allowed."
        )
    
    # Prevent SSRF to localhost/private IPs
    parsed = urlparse(url)
    hostname = parsed.hostname
    if hostname in ("localhost", "0.0.0.0"):
        raise ValueError(
            f"URL points to localhost: {url}. Blocked to prevent SSRF attacks."
        )
    
    # Check if resolves to private IP
    ip = ipaddress.ip_address(hostname)
    if ip.is_private or ip.is_loopback:
        raise ValueError(
            f"URL points to private IP: {url}. Blocked to prevent SSRF attacks."
        )
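
The private-IP check leans on Python's standard ipaddress module; a small standalone illustration of which literal hosts that check flags (example addresses only):

import ipaddress

for host in ("8.8.8.8", "192.168.1.10", "127.0.0.1"):
    ip = ipaddress.ip_address(host)
    print(host, "blocked" if ip.is_private or ip.is_loopback else "allowed")
# 8.8.8.8 allowed
# 192.168.1.10 blocked   (private range)
# 127.0.0.1 blocked      (loopback)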

Callback Validation

@field_validator("callbacks")
@classmethod
def validate_callbacks(cls, v):
    reserved_names = {
        "parse_article",
        "parse_start_url",
        "start_requests",
        "from_crawler",
        "closed",
        "parse",
    }
    
    for callback_name in v.keys():
        # Must be valid Python identifier
        if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", callback_name):
            raise ValueError(
                f"Invalid callback name: '{callback_name}'. "
                "Must be a valid Python identifier."
            )
        
        # Must not be reserved
        if callback_name in reserved_names:
            raise ValueError(
                f"Callback name '{callback_name}' is reserved and cannot be used."
            )

    return v
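
Before importing, you can sanity-check your own callback names against the identifier rule using the same pattern (example names are hypothetical; the reserved-name check still applies separately):

import re

for name in ("parse_products", "parse-products", "2nd_pass"):
    ok = bool(re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", name))
    print(name, "valid" if ok else "invalid")
# parse_products valid
# parse-products invalid   (hyphen is not a valid identifier character)
# 2nd_pass invalid         (identifiers cannot start with a digit)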

Testing After Migration

Compare Output Quality

# Run old spider
python old_spider.py > old_output.json

# Run new ScrapAI spider
./scrapai crawl new_spider --project myproject --limit 10
./scrapai export new_spider --project myproject --format json > new_output.json

# Compare field coverage
python -c "
import json
old = json.load(open('old_output.json'))
new = json.load(open('new_output.json'))

old_fields = set(old[0].keys())
new_fields = set(new[0].keys())

print('Missing fields:', old_fields - new_fields)
print('Extra fields:', new_fields - old_fields)
"

Verify Extraction Rules

# Inspect a sample page
./scrapai inspect https://example.com/article --project myproject

# Check if selectors match
grep 'title' output.html
grep 'content' output.html

Performance Comparison

# Time old spider
time python old_spider.py

# Time new spider
time ./scrapai crawl new_spider --project myproject

Incremental Migration Strategy

Phase 1: Pilot (1-2 weeks)

  1. Pick 3-5 representative spiders
  2. Migrate to ScrapAI
  3. Run both old and new in parallel
  4. Compare output quality
  5. Tune extraction rules until quality matches

Phase 2: Batch Migration (2-4 weeks)

  1. Group remaining spiders by similarity
  2. Migrate one group at a time
  3. Reuse patterns from pilot spiders
  4. Test each batch before moving to next

Phase 3: Cutover (1 week)

  1. Switch production traffic to ScrapAI
  2. Keep old spiders as backup for 1 month
  3. Monitor error rates and data quality
  4. Retire old code once confident

Phase 4: Optimization (ongoing)

  1. Tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS
  2. Enable DeltaFetch for incremental crawling
  3. Set up Airflow for scheduling
  4. Add custom callbacks for edge cases

Common Pitfalls

Regex Patterns

Problem: Scrapy and ScrapAI both use Python regular expressions, so the syntax itself carries over unchanged, but in ScrapAI the patterns live inside JSON strings, where backslashes must be doubled (\\d instead of \d), as shown below. Solution: Copy patterns directly, double any backslashes for JSON, and verify with ./scrapai crawl --limit 5.
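
For example, a LinkExtractor pattern containing \d (the /video/\d+$ pattern here is purely illustrative) needs its backslash doubled once it is written as a JSON string:

import json

pattern = r"/video/\d+$"                # as written in a Scrapy LinkExtractor
print(json.dumps({"allow": [pattern]}))
# {"allow": ["/video/\\d+$"]}           <- the form required in the ScrapAI config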

Relative vs. Absolute URLs

Problem: Old spider might use urljoin() for relative URLs. Solution: Scrapy handles this automatically. Just use css: "a::attr(href)".

Custom Middleware

Problem: Old spider uses custom Scrapy middleware. Solution:
  • Proxy rotation: Use ScrapAI’s built-in proxy escalation
  • Cloudflare: Enable CLOUDFLARE_ENABLED: true
  • Custom headers: Add to spider settings
  • Other middleware: May require framework changes (contribute!)

Dynamic Content (JavaScript)

Problem: Old spider uses Selenium or Playwright. Solution: Use ScrapAI’s Playwright extractor:
{
  "settings": {
    "EXTRACTOR_ORDER": ["playwright"],
    "PLAYWRIGHT_WAIT_SELECTOR": "div.content"
  }
}

See Also