The spider management commands list, import, and delete the JSON spider configurations stored in the database.

spiders list

List all spiders in the database.

Syntax

./scrapai spiders list [--project <name>]

Options

--project
string
Filter by project name. If omitted, shows spiders from all projects.

Examples

# List all spiders across all projects
./scrapai spiders list

# List spiders in a specific project
./scrapai spiders list --project news

Output

$ ./scrapai spiders list --project news
📋 Available Spiders (DB) - Project: news:
 bbc_co_uk [news] (Active: True) - Created: 2026-02-28 14:30, Updated: 2026-02-28 15:45
    Source: https://bbc.co.uk
 cnn_com [news] (Active: True) - Created: 2026-02-27 09:15, Updated: 2026-02-27 09:15
    Source: https://cnn.com
 reuters_com [news] (Active: True) - Created: 2026-02-26 16:20, Updated: 2026-02-28 11:30
    Source: https://reuters.com

spiders import

Import or update a spider from a JSON configuration file.

Syntax

./scrapai spiders import <file> --project <name> [--skip-validation]

Arguments

file
string
required
Path to JSON spider configuration file. Use - to read from stdin.

Options

--project
string
default:"default"
Project name to associate with this spider.
--skip-validation
flag
Skip Pydantic schema validation (not recommended). Use only for backward compatibility.

Examples

# Import spider from file
./scrapai spiders import bbc_spider.json --project news

# Import from stdin (useful in pipelines)
cat spider.json | ./scrapai spiders import - --project news

# Skip validation (backward compatibility)
./scrapai spiders import old_spider.json --project legacy --skip-validation
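Because `spiders import -` reads from stdin, configurations can be generated programmatically and piped straight into the CLI. The sketch below builds a minimal configuration dict matching the Spider Configuration Format documented below and prints it as JSON; the `build_spider_config` helper and its naming convention are illustrative assumptions, not part of the CLI itself.

```python
import json

def build_spider_config(domain: str, start_path: str = "/") -> dict:
    """Build a minimal spider configuration for a single domain."""
    # Derive a name that satisfies the letters/numbers/hyphens/underscores rule
    name = domain.replace(".", "_").replace("-", "_")
    return {
        "name": name,
        "allowed_domains": [domain],
        "start_urls": [f"https://{domain}{start_path}"],
        "source_url": f"https://{domain}",
        "rules": [],
        "settings": {},
        "callbacks": {},
    }

if __name__ == "__main__":
    # Emit JSON on stdout so it can be piped into `spiders import -`
    print(json.dumps(build_spider_config("bbc.co.uk", "/news"), indent=2))
```

Assuming this is saved as `gen_spider.py` (a hypothetical filename), it could be used as `python gen_spider.py | ./scrapai spiders import - --project news`.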

Spider Configuration Format

{
  "name": "bbc_co_uk",
  "allowed_domains": ["bbc.co.uk"],
  "start_urls": ["https://www.bbc.co.uk/news"],
  "source_url": "https://bbc.co.uk",
  "rules": [
    {
      "allow": ["/news/articles/[^/]+$"],
      "callback": "parse_article",
      "follow": false,
      "priority": 10
    },
    {
      "allow": ["/news/?$"],
      "follow": true,
      "priority": 5
    }
  ],
  "settings": {
    "EXTRACTOR_ORDER": ["newspaper", "trafilatura"],
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS": 8
  },
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {"css": "h1.article-headline::text"},
        "author": {"css": "span.author-name::text"},
        "content": {"css": "div.article-body", "get": "all_text"}
      }
    }
  }
}

Configuration Fields

name
string
required
Spider name (letters, numbers, hyphens, underscores only). Must be unique per project.
allowed_domains
array
required
List of domains this spider can crawl. URLs outside these domains are filtered.
start_urls
array
required
Initial URLs to crawl. Must be valid HTTP/HTTPS URLs.
source_url
string
Original website URL (for documentation purposes).
rules
array
URL pattern matching rules. Each rule defines which URLs to follow and how to process them.
settings
object
Spider-specific settings that override defaults.
callbacks
object
Custom extraction callbacks with CSS/XPath selectors for non-article content.
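For pre-flight checks in a pipeline, the field rules above can be verified before calling the CLI at all. The sketch below mirrors the documented constraints (required fields, the name character set, HTTP/HTTPS start URLs, and the 1-32 range for `CONCURRENT_REQUESTS`); it is a rough approximation for illustration, not the CLI's actual Pydantic schema.

```python
import re
from urllib.parse import urlparse

def check_spider_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks importable."""
    errors = []
    # Required fields per the Configuration Fields table
    for field in ("name", "allowed_domains", "start_urls"):
        if field not in config:
            errors.append(f"missing required field: {field}")
    # Name: letters, numbers, hyphens, underscores only
    name = config.get("name", "")
    if name and not re.fullmatch(r"[a-zA-Z0-9_-]+", name):
        errors.append(f"name: invalid characters in {name!r}")
    # Start URLs must be valid HTTP/HTTPS URLs
    for i, url in enumerate(config.get("start_urls", [])):
        if urlparse(url).scheme not in ("http", "https"):
            errors.append(f"start_urls -> {i}: URL scheme must be http or https")
    # Documented range for concurrent requests
    concurrency = config.get("settings", {}).get("CONCURRENT_REQUESTS")
    if concurrency is not None and not 1 <= concurrency <= 32:
        errors.append("settings -> CONCURRENT_REQUESTS: value must be between 1 and 32")
    return errors
```

A config that passes this check can still fail the real schema validation on import; this only catches the most common mistakes early.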

Output

Successful Import

$ ./scrapai spiders import bbc_spider.json --project news
 Spider 'bbc_co_uk' imported successfully!
   Project: news
   Domains: bbc.co.uk
   Start URLs: 1
   Rules: 2
   Callbacks: 1 (parse_article)

Update Existing Spider

$ ./scrapai spiders import bbc_spider.json --project news
⚠️  Spider 'bbc_co_uk' already exists. Updating...
 Spider 'bbc_co_uk' imported successfully!
   Project: news
   Domains: bbc.co.uk
   Start URLs: 1
   Rules: 2
   Callbacks: 1 (parse_article)
Re-importing a spider replaces its configuration entirely. All rules and settings are deleted and recreated.

Validation Failure

$ ./scrapai spiders import bad_spider.json --project news
 Spider configuration validation failed:
 name: string does not match pattern "^[a-zA-Z0-9_-]+$"
 start_urls -> 0: URL scheme must be http or https
 settings -> CONCURRENT_REQUESTS: value must be between 1 and 32

💡 Use --skip-validation to bypass validation (not recommended)

spiders delete

Delete a spider and all its associated data.

Syntax

./scrapai spiders delete <name> [--project <name>] [--force]

Arguments

name
string
required
Spider name to delete.

Options

--project
string
Project name. If specified, only deletes the spider from that project.
--force, -f
flag
Skip confirmation prompt.

Examples

# Delete spider with confirmation
./scrapai spiders delete bbc_co_uk --project news

# Delete without confirmation
./scrapai spiders delete old_spider --project archive --force

Output

With Confirmation

$ ./scrapai spiders delete bbc_co_uk --project news
Are you sure you want to delete spider 'bbc_co_uk' in project 'news'? (y/N): y
🗑️  Spider 'bbc_co_uk' in project 'news' deleted!

Force Delete

$ ./scrapai spiders delete bbc_co_uk --project news --force
🗑️  Spider 'bbc_co_uk' in project 'news' deleted!
Deleting a spider removes all configuration, rules, settings, and all scraped items associated with this spider. This operation cannot be undone.

Next Steps

Run Crawls

Start crawling with your imported spiders

View Data

Inspect and export scraped items