Optional queue system for managing multiple website crawls. Use it only when explicitly needed, and always specify --project when working with queues.

Commands Reference

Add Single URL

./scrapai queue add <url> --project NAME [-m "instruction"] [--priority N]
Example:
./scrapai queue add https://example.com --project myproject -m "Focus on research articles" --priority 5

Bulk Add URLs

./scrapai queue bulk <file.csv|file.json> --project NAME [--priority N]
Example:
./scrapai queue bulk sites.json --project myproject

List Queue Items

./scrapai queue list --project NAME
Shows 5 pending/processing items by default

Claim Next Item

./scrapai queue next --project NAME
Claims next highest-priority pending item and marks it as processing.

Update Item Status

IDs are globally unique, so no --project is needed for status updates:
./scrapai queue complete <id>
./scrapai queue fail <id> -m "error description"

Cleanup Queue

Cleanup commands require the --force flag to prevent accidental deletion:
./scrapai queue cleanup --completed --force --project NAME

Bulk File Formats

JSON Format

queue.json
[
  {
    "url": "https://site1.com",
    "custom_instruction": "Focus on research articles",
    "priority": 5
  },
  {
    "url": "https://site2.com",
    "priority": 10
  },
  {
    "url": "https://site3.com",
    "custom_instruction": "Include all news sections",
    "priority": 1
  }
]
Template available: templates/queue-template.json
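The bulk JSON file can also be generated programmatically. A minimal Python sketch (the entries mirror the example above; only "url" is required, while "custom_instruction" and "priority" are optional):

```python
import json
import os
import tempfile

# Entries in the bulk-add JSON format shown above.
entries = [
    {"url": "https://site1.com", "custom_instruction": "Focus on research articles", "priority": 5},
    {"url": "https://site2.com", "priority": 10},
]

# Write the queue file (a temp directory is used here for illustration).
path = os.path.join(tempfile.gettempdir(), "queue.json")
with open(path, "w") as f:
    json.dump(entries, f, indent=2)

# Sanity check before running `queue bulk`: every entry needs a "url".
with open(path) as f:
    loaded = json.load(f)
assert all("url" in e for e in loaded)
```

The resulting file is then passed to ./scrapai queue bulk as shown above.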

CSV Format

queue.csv
url,custom_instruction,priority
https://site1.com,Focus on research articles,5
https://site2.com,,10
https://site3.com,Include all news sections,1
Template available: templates/queue-template.csv
Required columns:
  • url - Website URL (required)
  • custom_instruction - Special instructions for this site (optional)
  • priority - Higher number = higher priority (optional, default: 0)
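The same column rules can be checked programmatically. A small sketch that writes a bulk CSV and validates it (url required, blank priority falling back to the documented default of 0):

```python
import csv
import os
import tempfile

# Rows in the bulk-add CSV format shown above; csv values are strings.
rows = [
    {"url": "https://site1.com", "custom_instruction": "Focus on research articles", "priority": "5"},
    {"url": "https://site2.com", "custom_instruction": "", "priority": "10"},
    {"url": "https://site3.com", "custom_instruction": "Include all news sections", "priority": ""},
]

path = os.path.join(tempfile.gettempdir(), "queue.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "custom_instruction", "priority"])
    writer.writeheader()
    writer.writerows(rows)

# Validate: url is required; a blank priority defaults to 0.
with open(path, newline="") as f:
    parsed = [(r["url"], int(r["priority"] or 0)) for r in csv.DictReader(f)]
for url, _ in parsed:
    assert url, "url column is required"
```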

Queue Processing Workflow

Step 1: Claim next item

./scrapai queue next --project NAME
Note the returned values:
  • ID (for status updates)
  • URL (target website)
  • Project name
  • Custom instruction (if exists)
Step 2: Process custom instruction

If custom_instruction exists, use it to override the default analysis behavior. Example: the instruction “Focus only on research articles” means:
  • Only create rules for research article URL patterns
  • Skip other content sections
Step 3: Run analysis phases

Execute the standard analysis workflow (Phases 1-4). Include "source_url": "<queue_url>" in final_spider.json.
Step 4: Import spider

./scrapai import final_spider.json --project myproject
Always use a --project matching the queue item’s project! Without it, the spider defaults to the “default” project, mixing data across projects.
Step 5: Run crawl

./scrapai crawl spider_name --project myproject
Step 6: Update status on success

./scrapai queue complete <id>
Step 7: Update status on failure

./scrapai queue fail <id> -m "error description"
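The lifecycle above (pending → processing → completed/failed, highest priority claimed first) can be sketched as a small in-memory model. This is not the real scrapai implementation; it only mirrors the documented command semantics:

```python
import heapq
import itertools

class Queue:
    """Minimal model of the queue lifecycle: add -> next -> complete/fail."""

    def __init__(self):
        self._heap = []                 # (-priority, insertion order, id): higher priority pops first
        self._order = itertools.count()
        self.items = {}                 # id -> item dict with a "status" field

    def add(self, item_id, url, priority=0, instruction=None):
        self.items[item_id] = {"url": url, "priority": priority,
                               "instruction": instruction, "status": "pending"}
        heapq.heappush(self._heap, (-priority, next(self._order), item_id))

    def next(self):
        """`queue next`: claim the highest-priority pending item, mark it processing."""
        while self._heap:
            _, _, item_id = heapq.heappop(self._heap)
            item = self.items[item_id]
            if item["status"] == "pending":
                item["status"] = "processing"
                return item_id, item
        return None

    def complete(self, item_id):
        """`queue complete <id>`"""
        self.items[item_id]["status"] = "completed"

    def fail(self, item_id, message):
        """`queue fail <id> -m "..."`"""
        self.items[item_id].update(status="failed", error=message)

q = Queue()
q.add(1, "https://techcrunch.com", priority=10, instruction="Focus on AI and startup news")
q.add(2, "https://arstechnica.com", priority=3)

claimed_id, item = q.next()   # highest priority claimed first
q.complete(claimed_id)        # analysis + crawl succeeded

failed_id, _ = q.next()
q.fail(failed_id, "Site requires authentication")
print(claimed_id, q.items[1]["status"], q.items[2]["status"])  # 1 completed failed
```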

Custom Instruction Handling

When queue next returns a custom_instruction, use it to override default analysis behavior. Examples:
Instruction: “Focus only on research articles”
Action:
  • Only create rules for research article URL patterns
  • Skip other content sections (news, blog posts, etc.)
  • Narrow down link-following rules
Custom instructions allow per-site customization without modifying the base analysis workflow.
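One way instruction handling can narrow link-following rules is sketched below: instruction keywords select which URL patterns are kept. The keyword table here is invented for illustration; the real analysis derives patterns from the site itself.

```python
import re

# Hypothetical mapping from instruction keywords to URL patterns.
INSTRUCTION_PATTERNS = {
    "research articles": r"/research/",
    "news": r"/news/",
}

def allow_patterns(instruction):
    """Keep only the URL patterns mentioned by the instruction (all, if none given)."""
    if not instruction:
        return list(INSTRUCTION_PATTERNS.values())
    return [pat for key, pat in INSTRUCTION_PATTERNS.items()
            if key in instruction.lower()]

patterns = allow_patterns("Focus only on research articles")
urls = ["https://example.com/research/paper-1",
        "https://example.com/news/today",
        "https://example.com/blog/post"]
followed = [u for u in urls if any(re.search(p, u) for p in patterns)]
print(followed)  # ['https://example.com/research/paper-1']
```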

Direct vs Queue Processing

Understand when to use queue vs direct processing:
User says: “Add this website: X” or “Process X”
Action:
  • Process immediately (run Phases 1-4)
  • No queue involved
  • Import and optionally test crawl
# Example flow
./scrapai inspect https://example.com --project proj
# ... run analysis phases ...
./scrapai import final_spider.json --project proj
./scrapai crawl spider_name --project proj --limit 10

Project Isolation

CRITICAL: NEVER omit --project when importing spiders from the queue. Without it, the spider defaults to the “default” project, mixing data across projects. Always match the queue item’s project.
Correct:
./scrapai import final_spider.json --project myproject
./scrapai crawl spider_name --project myproject
Incorrect:
./scrapai import final_spider.json  # Defaults to "default" project!
./scrapai crawl spider_name  # Defaults to "default" project!

Complete Example

1. Create Queue File

sites.json
[
  {
    "url": "https://techcrunch.com",
    "custom_instruction": "Focus on AI and startup news",
    "priority": 10
  },
  {
    "url": "https://theverge.com",
    "custom_instruction": "All technology articles",
    "priority": 5
  },
  {
    "url": "https://arstechnica.com",
    "priority": 3
  }
]

2. Bulk Add to Queue

./scrapai queue bulk sites.json --project tech_news

3. Check Queue

./scrapai queue list --project tech_news
Output:
Queue items for project 'tech_news':

ID: 1
URL: https://techcrunch.com
Status: pending
Priority: 10
Custom instruction: Focus on AI and startup news
Created: 2024-02-24 10:30:00

... (2 more items)

4. Process Queue

# Claim next (highest priority)
./scrapai queue next --project tech_news

# Output:
# Claimed queue item #1
# Project: tech_news
# URL: https://techcrunch.com
# Custom instruction: Focus on AI and startup news
# Priority: 10

# Run analysis workflow (Phases 1-4)
./scrapai inspect https://techcrunch.com --project tech_news
# ... complete analysis phases ...

# Import spider
./scrapai import final_spider.json --project tech_news

# Test crawl
./scrapai crawl techcrunch --project tech_news --limit 10

# Mark as complete
./scrapai queue complete 1

5. Handle Failure

# If crawl fails
./scrapai queue fail 1 -m "Site requires authentication"

# Check failed items
./scrapai queue list --project tech_news --status failed

# Retry if needed
./scrapai queue retry 1

6. Cleanup

# Remove completed items
./scrapai queue cleanup --completed --force --project tech_news

# Or remove all
./scrapai queue cleanup --all --force --project tech_news

Statistics

Check queue statistics:
# Count by status
./scrapai queue list --project tech_news --status pending --count
./scrapai queue list --project tech_news --status completed --count
./scrapai queue list --project tech_news --status failed --count

# View all items
./scrapai queue list --project tech_news --all --limit 100

Best Practices

1. Use descriptive custom instructions

Help guide the analysis process with clear instructions:
./scrapai queue add https://example.com --project proj -m "Focus on research papers, ignore blog posts"
2. Set meaningful priorities

Higher numbers processed first:
  • 10: Urgent, high-value sites
  • 5: Standard priority
  • 1: Low priority, process when available
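The ordering these numbers imply is a simple descending sort, sketched below (the item values are illustrative):

```python
# "Higher number = higher priority": sort descending to get processing order.
items = [
    {"url": "https://a.com", "priority": 1},
    {"url": "https://b.com", "priority": 10},
    {"url": "https://c.com", "priority": 5},
]
processing_order = sorted(items, key=lambda i: i["priority"], reverse=True)
print([i["priority"] for i in processing_order])  # [10, 5, 1]
```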
3. Always specify project

Prevent data mixing across projects:
./scrapai import spider.json --project myproject
./scrapai crawl spider --project myproject
4. Document failures

Help debug issues later:
./scrapai queue fail <id> -m "Requires JavaScript rendering - use --browser flag"
5. Regular cleanup

Remove old completed items:
./scrapai queue cleanup --completed --force --project myproject