Optional queue system for managing multiple website crawls. Use when explicitly needed. Always specify --project when working with queues.
Commands Reference
Add Single URL
./scrapai queue add <url> --project NAME [-m "instruction"] [--priority N]
Example:
./scrapai queue add https://example.com --project myproject -m "Focus on research articles" --priority 5
Bulk Add URLs
./scrapai queue bulk < file.csv | file.json > --project NAME [--priority N]
Example:
./scrapai queue bulk sites.json --project myproject
List Queue Items
Default (shows 5 pending/processing items):
./scrapai queue list --project NAME

More items:
./scrapai queue list --project NAME --limit 20

All statuses (include completed and failed items):
./scrapai queue list --project NAME --all --limit 50

Filter by status (show a specific status only):
./scrapai queue list --project NAME --status pending
./scrapai queue list --project NAME --status processing
./scrapai queue list --project NAME --status failed

Count only (just show the count):
./scrapai queue list --project NAME --count
./scrapai queue list --project NAME --status failed --count
Claim Next Item
./scrapai queue next --project NAME
Claims next highest-priority pending item and marks it as processing.
Update Item Status
IDs are globally unique, so --project is not needed for status updates.
./scrapai queue complete <id>
./scrapai queue fail <id> -m "error description"
Cleanup Queue
Cleanup commands require the --force flag to prevent accidental deletion.

Completed items:
./scrapai queue cleanup --completed --force --project NAME

Failed items:
./scrapai queue cleanup --failed --force --project NAME

All items:
./scrapai queue cleanup --all --force --project NAME
JSON format:
[
  {
    "url": "https://site1.com",
    "custom_instruction": "Focus on research articles",
    "priority": 5
  },
  {
    "url": "https://site2.com",
    "priority": 10
  },
  {
    "url": "https://site3.com",
    "custom_instruction": "Include all news sections",
    "priority": 1
  }
]
Template available: templates/queue-template.json
CSV format:
url,custom_instruction,priority
https://site1.com,Focus on research articles,5
https://site2.com,,10
https://site3.com,Include all news sections,1
Template available: templates/queue-template.csv
Columns:
url - Website URL (required)
custom_instruction - Special instructions for this site (optional)
priority - Higher number = higher priority (optional, default: 0)
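A bulk file matching the JSON format above can be generated straight from the shell. This is a sketch: the URLs, instructions, priorities, and project name are all placeholders.

```shell
# Write a minimal queue file matching the JSON format described above.
# URLs, instructions, and priorities here are placeholders.
cat > sites.json <<'EOF'
[
  {"url": "https://site1.com", "custom_instruction": "Focus on research articles", "priority": 5},
  {"url": "https://site2.com", "priority": 10}
]
EOF

# Then bulk-add it (requires a working scrapai install):
# ./scrapai queue bulk sites.json --project myproject
```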
Queue Processing Workflow
Claim next item
./scrapai queue next --project NAME
Note the returned values:
ID (for status updates)
URL (target website)
Project name
Custom instruction (if exists)
Process custom instruction
If custom_instruction exists, use it to override the default analysis behavior. Example: the instruction "Focus only on research articles" means:
Only create rules for research article URL patterns
Skip other content sections
Run analysis phases
Execute standard analysis workflow (Phases 1-4). Include "source_url": "<queue_url>" in final_spider.json.
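For reference, the source_url field might sit alongside the rest of the spider definition like this; every field other than source_url is a placeholder, since the full final_spider.json schema is not shown here:

```json
{
  "name": "example_spider",
  "source_url": "https://example.com",
  "rules": []
}
```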
Import spider
./scrapai import final_spider.json --project myproject
Always use a --project matching the queue item's project! Without it, the spider defaults to the "default" project, mixing data across projects.
Run crawl
./scrapai crawl spider_name --project myproject
Update status on success
./scrapai queue complete <id>
Update status on failure
./scrapai queue fail <id> -m "error description"
Custom Instruction Handling
When queue next returns a custom_instruction, use it to override default analysis behavior.
Examples:
Instruction: "Focus only on research articles"
Action:
Only create rules for research article URL patterns
Skip other content sections (news, blog posts, etc.)
Narrow down link-following rules
Instruction: "Include all news sections"
Action:
Create rules for all news-related URL patterns
Follow navigation across all news categories
Don’t limit to specific article types
Instruction: "Only crawl 2 levels deep"
Action:
Configure spider with depth limit
Adjust link-following rules accordingly
Instruction: "Use h2.headline for titles"
Action:
Add custom selectors in spider config
Override default extractor configuration
Custom instructions allow per-site customization without modifying the base analysis workflow.
Direct vs Queue Processing
Understand when to use queue vs direct processing:
Direct Processing
User says: "Add this website: X" or "Process X"
Action:
Process immediately (run Phases 1-4)
No queue involved
Import and optionally test crawl

# Example flow
./scrapai inspect https://example.com --project proj
# ... run analysis phases ...
./scrapai import final_spider.json --project proj
./scrapai crawl spider_name --project proj --limit 10

Add to Queue
User says: "Add X to the queue" or "Queue X"
Action:
Add to queue only - do NOT process
User will process later with queue next

./scrapai queue add https://example.com --project proj -m "Focus on articles"

Process from Queue
User says: "Process next" or "Process queue"
Action:
Claim next item with queue next
Then run Phases 1-4
Import and test crawl
Update status (complete/fail)

./scrapai queue next --project proj
# ... run analysis phases ...
./scrapai import final_spider.json --project proj
./scrapai crawl spider_name --project proj --limit 10
./scrapai queue complete <id>
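The claim-to-complete cycle can be wrapped in a small script. This is a sketch, not part of scrapai: it assumes queue next prints a line like "Claimed queue item #1" (as shown in the Complete Example section), and the manual analysis phases still have to happen between claim and import.

```shell
#!/bin/sh
# Hypothetical helper -- not part of scrapai itself.
# Assumes `scrapai queue next` prints a line like "Claimed queue item #1".
extract_queue_id() {
  sed -n 's/.*Claimed queue item #\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sketch of the loop (commented out; needs a live queue, and the analysis
# phases must produce final_spider.json between claim and import):
# OUT=$(./scrapai queue next --project myproject)
# ID=$(printf '%s\n' "$OUT" | extract_queue_id)
# [ -n "$ID" ] || exit 0                      # queue empty
# # ... run Phases 1-4, produce final_spider.json ...
# if ./scrapai import final_spider.json --project myproject \
#    && ./scrapai crawl spider_name --project myproject --limit 10; then
#   ./scrapai queue complete "$ID"
# else
#   ./scrapai queue fail "$ID" -m "import or crawl failed"
# fi
```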
Project Isolation
CRITICAL: NEVER omit --project when importing spiders from the queue. Without it, the spider defaults to the "default" project, mixing data across projects. Always match the queue item's project.
Correct:
./scrapai import final_spider.json --project myproject
./scrapai crawl spider_name --project myproject
Incorrect:
./scrapai import final_spider.json # Defaults to "default" project!
./scrapai crawl spider_name # Defaults to "default" project!
Complete Example
1. Create Queue File
[
  {
    "url": "https://techcrunch.com",
    "custom_instruction": "Focus on AI and startup news",
    "priority": 10
  },
  {
    "url": "https://theverge.com",
    "custom_instruction": "All technology articles",
    "priority": 5
  },
  {
    "url": "https://arstechnica.com",
    "priority": 3
  }
]
2. Bulk Add to Queue
./scrapai queue bulk sites.json --project tech_news
3. Check Queue
./scrapai queue list --project tech_news
Output:
Queue items for project 'tech_news':
ID: 1
URL: https://techcrunch.com
Status: pending
Priority: 10
Custom instruction: Focus on AI and startup news
Created: 2024-02-24 10:30:00
... (2 more items)
4. Process Queue
# Claim next (highest priority)
./scrapai queue next --project tech_news
# Output:
# Claimed queue item #1
# Project: tech_news
# URL: https://techcrunch.com
# Custom instruction: Focus on AI and startup news
# Priority: 10
# Run analysis workflow (Phases 1-4)
./scrapai inspect https://techcrunch.com --project tech_news
# ... complete analysis phases ...
# Import spider
./scrapai import final_spider.json --project tech_news
# Test crawl
./scrapai crawl techcrunch --project tech_news --limit 10
# Mark as complete
./scrapai queue complete 1
5. Handle Failure
# If crawl fails
./scrapai queue fail 1 -m "Site requires authentication"
# Check failed items
./scrapai queue list --project tech_news --status failed
# Retry if needed
./scrapai queue retry 1
6. Cleanup
# Remove completed items
./scrapai queue cleanup --completed --force --project tech_news
# Or remove all
./scrapai queue cleanup --all --force --project tech_news
Statistics
Check queue statistics:
# Count by status
./scrapai queue list --project tech_news --status pending --count
./scrapai queue list --project tech_news --status completed --count
./scrapai queue list --project tech_news --status failed --count
# View all items
./scrapai queue list --project tech_news --all --limit 100
Best Practices
Use descriptive custom instructions
Help guide the analysis process with clear instructions:
./scrapai queue add https://example.com --project proj -m "Focus on research papers, ignore blog posts"
Set meaningful priorities
Higher numbers are processed first:
10: Urgent, high-value sites
5: Standard priority
1: Low priority, process when available
Always specify project
Prevent data mixing across projects:
./scrapai import spider.json --project myproject
./scrapai crawl spider --project myproject
Document failures
Help debug issues later:
./scrapai queue fail <id> -m "Requires JavaScript rendering - use --browser flag"
Regular cleanup
Remove old completed items:
./scrapai queue cleanup --completed --force --project myproject