spiders list
List all spiders in the database.

Syntax
Options
Filter by project name. If omitted, shows spiders from all projects.
Examples
Output
Fields Displayed
- Name: Spider identifier (used in the `crawl` and `show` commands)
- Project: Project tag in brackets
- Active: Whether the spider is enabled (currently always `True`)
- Created: Initial import timestamp
- Updated: Last modification timestamp
- Source: Original website URL (if specified in the config)
spiders import
Import or update a spider from a JSON configuration file.

Syntax
Arguments
Path to the JSON spider configuration file. Use `-` to read from stdin.

Options
Project name to associate with this spider.
Skip Pydantic schema validation (not recommended). Use only for backward compatibility.
Examples
Spider Configuration Format
Configuration Fields
Spider name (letters, numbers, hyphens, underscores only). Must be unique per project.
List of domains this spider can crawl. URLs outside these domains are filtered.
Initial URLs to crawl. Must be valid HTTP/HTTPS URLs.
Original website URL (for documentation purposes).
URL pattern matching rules. Each rule defines which URLs to follow and how to process them.
Spider-specific settings that override defaults.
Custom extraction callbacks with CSS/XPath selectors for non-article content.
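A hypothetical configuration illustrating these fields. Field names mirror the database columns described under Database Storage; the exact key names, rule shape, and setting names (`rules`, `settings`, `CONCURRENT_REQUESTS`, `DOWNLOAD_DELAY`) are assumptions for illustration, not a guaranteed schema:

```json
{
  "name": "example-news",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/news"],
  "source_url": "https://example.com",
  "rules": [
    {
      "allow_patterns": ["/news/\\d+"],
      "deny_patterns": ["/tag/"],
      "callback": "parse_article",
      "follow": true,
      "priority": 10
    }
  ],
  "settings": {
    "CONCURRENT_REQUESTS": 4,
    "DOWNLOAD_DELAY": 1.0
  }
}
```

Check a config against the actual templates in `templates/` before relying on any particular key.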
Validation
All spider configs are validated through Pydantic schemas before import:

- Spider names: must match the pattern `^[a-zA-Z0-9_-]+$`
- URLs: HTTP/HTTPS only, no private IPs (127.0.0.1, 10.x, 172.16.x, 192.168.x), max 2048 chars
- Callback names: Whitelisted names only, reserved names blocked
- Settings: Bounded values (concurrency 1-32, delays 0-60s)
- Extractor order: Valid extractor names only
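The name and URL checks above can be sketched roughly as follows. This is a simplified illustration, not ScrapAI's actual validator; the function names and error handling are invented:

```python
import ipaddress
import re
from urllib.parse import urlparse

# Spider names: letters, numbers, hyphens, underscores only
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def validate_spider_name(name: str) -> None:
    if not NAME_PATTERN.match(name):
        raise ValueError(f"invalid spider name: {name!r}")


def validate_start_url(url: str) -> None:
    # HTTP/HTTPS only, max 2048 chars, no private/loopback IP literals
    if len(url) > 2048:
        raise ValueError("URL exceeds 2048 characters")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    try:
        ip = ipaddress.ip_address(parsed.hostname or "")
    except ValueError:
        return  # hostname is not an IP literal; accepted by this sketch
    if ip.is_private or ip.is_loopback:
        raise ValueError(f"private or loopback IP not allowed: {ip}")
```

Note that a real validator would also need to resolve hostnames to catch private addresses hidden behind DNS; this sketch only rejects literal IPs.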
Output
Successful Import
Update Existing Spider
Re-importing a spider replaces its configuration entirely. All rules and settings are deleted and recreated.
Validation Failure
spiders delete
Delete a spider and all its associated data.

Syntax
Arguments
Spider name to delete.
Options
Project name. If specified, only deletes spider from that project.
Skip confirmation prompt.
Examples
Output
With Confirmation
Force Delete
Database Storage
Spiders are stored across multiple tables:

spiders Table
- `id`: Primary key (auto-increment)
- `name`: Spider name (unique per project)
- `project`: Project name
- `allowed_domains`: JSON array
- `start_urls`: JSON array
- `source_url`: Original website URL
- `active`: Boolean (currently always true)
- `callbacks_config`: JSON object with callback definitions
- `created_at`: Timestamp
- `updated_at`: Timestamp
spider_rules Table
- `spider_id`: Foreign key to `spiders`
- `allow_patterns`: JSON array of URL patterns to allow
- `deny_patterns`: JSON array of URL patterns to deny
- `restrict_xpaths`: JSON array of XPath restrictions
- `restrict_css`: JSON array of CSS restrictions
- `callback`: Callback function name
- `follow`: Boolean (whether to follow links)
- `priority`: Integer (higher = processed first)
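How these rule columns interact can be sketched as follows, assuming (per the `priority` column above) that higher-priority rules are consulted first and a deny pattern beats an allow pattern within a rule. The function and its first-match semantics are an illustrative assumption, not ScrapAI's actual dispatch logic:

```python
import re


def match_rule(url: str, rules: list[dict]):
    """Return the first rule whose allow patterns match `url` and whose
    deny patterns do not, checking higher-priority rules first."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if any(re.search(p, url) for p in rule["deny_patterns"]):
            continue  # a deny pattern vetoes this rule
        if any(re.search(p, url) for p in rule["allow_patterns"]):
            return rule
    return None  # no rule matched: URL is not followed
```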
spider_settings Table
- `spider_id`: Foreign key to `spiders`
- `key`: Setting name
- `value`: Setting value (as string)
- `type`: Value type (`str`, `int`, `bool`, `json`)
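Since every `value` is stored as a string, the `type` column drives the conversion back to a typed value on load. A minimal decoder sketch (the function name and the accepted boolean spellings are assumptions):

```python
import json


def decode_setting(value: str, type_: str):
    """Decode a spider_settings row back into a Python value,
    using the stored `type` column to pick the conversion."""
    if type_ == "str":
        return value
    if type_ == "int":
        return int(value)
    if type_ == "bool":
        return value.lower() in ("1", "true", "yes")
    if type_ == "json":
        return json.loads(value)
    raise ValueError(f"unknown setting type: {type_!r}")
```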
Working with Templates
ScrapAI includes example spider configs in `templates/`:
- News sites (BBC, Reuters)
- E-commerce (product listings)
- Forums (discussion threads)
- Cloudflare-protected sites