The inspect command fetches and analyzes a website to help you understand its structure and build scraper configurations. It supports three modes: lightweight HTTP, browser-based (for JavaScript sites), and Cloudflare bypass.
inspect
Inspect a website URL.
Syntax
./scrapai inspect <url> [options]
Arguments
<url>: The website URL to inspect. Must include the http:// or https:// scheme.
Options
--project: Project name (used for saving analysis files).
Directory to save analysis files. Defaults to data/<project>/inspect/.
--proxy-type: Proxy type: none, static, residential, auto.
--no-save-html: Do not save the full HTML to disk.
--browser: Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Automatically handles browser challenges and renders dynamic content.
--log-level: Logging level: debug, info, warning, error, critical.
--log-file: Save logs to the specified file.
Modes
HTTP Mode (Default)
Lightweight HTTP fetch with the requests library:
./scrapai inspect https://example.com --project myproject
Use for:
- Simple websites with server-side rendering
- Static HTML sites
- Fastest inspection method
Output:
⚡ Using lightweight HTTP fetch
Fetching: https://example.com
Status: 200
Content-Type: text/html; charset=utf-8
Content-Length: 45,231 bytes
Analysis saved to: data/myproject/inspect/example_com/
• page.html (full HTML)
• metadata.json (headers, status, timing)
Browser Mode
Uses Playwright for JavaScript-heavy sites:
./scrapai inspect https://spa-site.com --project myproject --browser
Use for:
- Single-page applications (React, Vue, Angular)
- Sites with JavaScript-rendered content
- Dynamic content loading
Output:
🌐 Using browser for JS-rendered content
Launching browser...
Navigating to: https://spa-site.com
Waiting for page load...
Content rendered!
Analysis saved to: data/myproject/inspect/spa_site_com/
• page.html (rendered HTML after JS execution)
• screenshot.png (page screenshot)
• metadata.json
Browser mode waits for JavaScript to execute and renders the final DOM. This is the HTML you should analyze for extraction selectors.
The --browser flag automatically handles both JavaScript rendering and Cloudflare bypass when needed.
Additional browser mode features:
- Automatic Cloudflare challenge detection and bypass
- Cookie extraction for session persistence
- Browser fingerprinting resistance
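Challenge detection can be illustrated with a small heuristic. The markers below ("Just a moment", cf_chl, and similar) are common fingerprints of Cloudflare interstitial pages; this is an illustrative sketch, not scrapai's actual detection logic:

```python
def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic check for a Cloudflare interstitial page.

    Illustrative only: scans for marker strings that commonly appear
    in Cloudflare challenge HTML.
    """
    markers = ("just a moment", "cf-challenge", "challenge-platform", "cf_chl")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)

print(looks_like_cloudflare_challenge("<title>Just a moment...</title>"))  # True
print(looks_like_cloudflare_challenge("<h1>Welcome</h1>"))                 # False
```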
Linux (headless server):
Requires xvfb for browser automation:
xvfb-run -a ./scrapai inspect https://protected-site.com --project myproject --browser
Install xvfb if needed:
sudo apt-get install xvfb
Output when Cloudflare is detected:
🖥️ Browser mode enabled
Solving Cloudflare challenge...
Challenge solved!
Extracting cookies...
Analysis saved to: data/myproject/inspect/protected_site_com/
• page.html (final HTML after bypass)
• cookies.json (Cloudflare session cookies)
• metadata.json
Saved Files
Inspection saves files to data/<project>/inspect/<domain>/:
page.html
Full HTML content:
- HTTP mode: Raw HTML from server
- Browser mode: Rendered HTML after JavaScript execution, with automatic Cloudflare bypass when detected
Use this file with the analyze command to discover CSS selectors.
metadata.json
Request metadata:
{
  "url": "https://example.com",
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx",
    "content-length": "45231"
  },
  "fetch_time": "2026-02-28T15:30:42",
  "elapsed_seconds": 0.823,
  "mode": "http"
}
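Because metadata.json is plain JSON, it is easy to consume in scripts. A minimal stdlib sketch (the fetch_warnings helper is hypothetical, not part of scrapai) that flags responses worth a second look before you build a spider:

```python
import json

def fetch_warnings(meta: dict) -> list:
    """Return a list of warnings for a saved inspect metadata record.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """
    warnings = []
    if meta.get("status_code") != 200:
        warnings.append("non-200 response: %s" % meta.get("status_code"))
    if meta.get("elapsed_seconds", 0) > 5:
        warnings.append("slow fetch; consider --proxy-type auto or --browser")
    return warnings

meta = json.loads('{"url": "https://example.com", "status_code": 200, '
                  '"elapsed_seconds": 0.823, "mode": "http"}')
print(fetch_warnings(meta))  # []
```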
cookies.json (Browser mode with Cloudflare)
Cloudflare session cookies:
{
  "cf_clearance": "abc123...",
  "__cf_bm": "xyz789...",
  "expires_at": "2026-02-28T15:40:42"
}
These cookies can be reused in spider settings to bypass Cloudflare without launching a browser on every request.
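For example, a small stdlib-only sketch that turns a cookies.json dump into a Cookie request header. The cookie_header helper and the literal values are illustrative; note that expires_at is metadata about the session, not a cookie itself:

```python
import json

def cookie_header(cookies_json: str) -> str:
    """Build a Cookie request-header value from an inspect cookies.json dump.

    Hypothetical helper: assumes the flat name/value format shown above,
    plus an 'expires_at' timestamp that must be excluded from the header.
    """
    data = json.loads(cookies_json)
    data.pop("expires_at", None)  # session metadata, not a cookie
    return "; ".join(f"{name}={value}" for name, value in data.items())

raw = '{"cf_clearance": "abc123", "__cf_bm": "xyz789", "expires_at": "2026-02-28T15:40:42"}'
print(cookie_header(raw))  # cf_clearance=abc123; __cf_bm=xyz789
```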
screenshot.png (Browser mode)
Full-page screenshot for visual verification.
Proxy Support
Specify proxy type for inspection:
# No proxy
./scrapai inspect https://example.com --proxy-type none
# Datacenter proxy
./scrapai inspect https://example.com --proxy-type static
# Residential proxy
./scrapai inspect https://example.com --proxy-type residential
# Auto (smart escalation)
./scrapai inspect https://example.com --proxy-type auto
Requires proxy configuration in .env.
Skip HTML Saving
For quick inspection without saving files:
./scrapai inspect https://example.com --no-save-html
Prints the analysis to the console only; no files are written.
Logging
Control logging verbosity:
# Debug logging
./scrapai inspect https://example.com --log-level debug
# Save logs to file
./scrapai inspect https://example.com --log-file inspect.log
analyze
Analyze saved HTML for CSS selector discovery (separate command, not a subcommand of inspect).
Syntax
./scrapai analyze <html_file> [options]
Arguments
<html_file>: Path to the HTML file to analyze.
Options
--test: Test a specific CSS selector.
--find: Find elements by keyword (searches classes and IDs).
Examples
# Analyze HTML structure
./scrapai analyze data/news/inspect/example_com/page.html
# Test a CSS selector
./scrapai analyze page.html --test "article.post h1.title"
# Find elements with keyword
./scrapai analyze page.html --find "author"
Output (Analysis Mode)
$ ./scrapai analyze page.html
📄 Analyzing: page.html
📊 HTML size: 45231 bytes
💡 TIP: Use --find 'keyword' to search for specific elements
============================================================
🏷️ HEADERS (h1, h2)
============================================================
H1 - Found 1:
[1] h1.article-headline
Text: UK economy grows 0.4% in February
H2 - Found 5:
[1] h2.section-title
Text: Economic Growth
[2] h2.section-title
Text: Market Response
...
============================================================
📝 CONTENT CONTAINERS
============================================================
[1] article.main-article
Size: 3,245 chars
Preview: The UK economy grew by 0.4% in February, official figures show...
[2] div.article-body
Size: 2,891 chars
Preview: Economists had expected growth of 0.2%, making this a positive...
============================================================
📅 DATES
============================================================
time.published-date: February 28, 2026
span.updated-time: Updated 2 hours ago
============================================================
✍️ AUTHORS
============================================================
span.author-name: Economics Reporter
a.byline: By John Smith
============================================================
Test a Selector
$ ./scrapai analyze page.html --test "h1.article-headline::text"
🔍 Testing selector: h1.article-headline::text
============================================================
✓ Found 1 element(s)
[1] h1
Classes: ['article-headline']
Text (62 chars): UK economy grows 0.4% in February
Use ::text pseudo-selector to extract text content instead of HTML.
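The --test behavior can be approximated with the standard library alone. The sketch below supports only simple tag.class selectors and is not scrapai's actual parser; it just shows what "match an element and collect its text" amounts to:

```python
from html.parser import HTMLParser

class SelectorTester(HTMLParser):
    """Collect the text content of elements matching a 'tag.class' selector."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0      # >0 while inside a matching element
        self.buf = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # track nested same-name tags
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.depth = 1
            self.buf = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buf).strip())

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)

def match_selector(html, selector):
    tag, cls = selector.split(".", 1)
    parser = SelectorTester(tag, cls)
    parser.feed(html)
    return parser.results

html = '<h1 class="article-headline">UK economy grows 0.4% in February</h1>'
print(match_selector(html, "h1.article-headline"))
# ['UK economy grows 0.4% in February']
```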
Find by Keyword
$ ./scrapai analyze page.html --find "author"
🔎 Finding elements with keyword: 'author'
============================================================
span.author-name
Text: Economics Reporter
div.author-bio
Text: Economics Reporter specializes in UK economic policy and analysis.
a.author-profile
Text: View profile
✓ Found 3 elements
extract-urls
Extract all URLs from a saved HTML file. Useful for understanding URL patterns on a site during analysis.
Syntax
./scrapai extract-urls --file <html_file> [options]
Arguments
--file: Path to the HTML file to extract URLs from.
Options
--output, -o: Output file path. If not specified, URLs are printed to console.
Examples
# Extract URLs to console
./scrapai extract-urls --file data/news/inspect/example_com/page.html
# Extract URLs to file
./scrapai extract-urls --file page.html --output urls.txt
# Short form
./scrapai extract-urls --file page.html -o urls.txt
Output
$ ./scrapai extract-urls --file page.html
https://example.com/
https://example.com/about
https://example.com/contact
https://example.com/articles/2026/02/story-1
https://example.com/articles/2026/02/story-2
...
When using --output, URLs are written one per line to the specified file:
$ ./scrapai extract-urls --file page.html -o urls.txt
✅ Extracted 147 URLs to urls.txt
Use extract-urls after inspect to analyze URL patterns and design spider rules. Look for common patterns like /articles/[year]/[month]/[slug] to write effective regex rules.
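One way to surface those patterns is to normalize each URL path and count occurrences. The url_pattern function below is an illustrative heuristic (digits become [num], hyphenated segments become [slug]), not a scrapai feature:

```python
import re
from collections import Counter

def url_pattern(url: str) -> str:
    """Reduce a URL to a rough path pattern for frequency analysis.

    Illustrative heuristic: numeric segments -> [num],
    hyphenated lowercase segments -> [slug], everything else kept as-is.
    """
    path = re.sub(r"^https?://[^/]+", "", url)
    parts = []
    for segment in path.split("/"):
        if not segment:
            continue
        if segment.isdigit():
            parts.append("[num]")
        elif re.fullmatch(r"[a-z0-9-]+", segment) and "-" in segment:
            parts.append("[slug]")
        else:
            parts.append(segment)
    return "/" + "/".join(parts)

urls = [
    "https://example.com/articles/2026/02/story-1",
    "https://example.com/articles/2026/02/story-2",
    "https://example.com/about",
]
print(Counter(url_pattern(u) for u in urls).most_common())
# [('/articles/[num]/[num]/[slug]', 2), ('/about', 1)]
```

A pattern with a high count and a regular shape (such as /articles/[num]/[num]/[slug]) is a good candidate for a spider's allow rule.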
Workflow: Inspect to Spider Config
1. Inspect the Site
./scrapai inspect https://example.com --project myproject --browser
2. Extract URLs
./scrapai extract-urls --file data/myproject/inspect/example_com/page.html -o urls.txt
Review URL patterns to design effective spider rules.
3. Analyze HTML Structure
./scrapai analyze data/myproject/inspect/example_com/page.html
Note the CSS selectors for title, content, author, etc.
4. Test Selectors
./scrapai analyze page.html --test "h1.article-title::text"
./scrapai analyze page.html --test "div.article-body"
./scrapai analyze page.html --test "span.author-name::text"
5. Create Spider Config
Write spider.json using discovered selectors:
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {"css": "h1.article-title::text"},
        "content": {"css": "div.article-body", "get": "all_text"},
        "author": {"css": "span.author-name::text"}
      }
    }
  }
}
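Before importing, the config can be sanity-checked with a few lines of Python. check_spider_config is a hypothetical helper for illustration (scrapai may perform its own validation on import); it verifies the required top-level keys and that each allow pattern compiles as a regex:

```python
import json
import re

def check_spider_config(text: str) -> list:
    """Lightweight pre-import sanity check for a spider.json string.

    Hypothetical helper: returns a list of problems found (empty if none).
    """
    config = json.loads(text)
    problems = []
    for key in ("name", "allowed_domains", "start_urls"):
        if key not in config:
            problems.append(f"missing key: {key}")
    for rule in config.get("rules", []):
        for pattern in rule.get("allow", []):
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"bad regex {pattern!r}: {exc}")
    return problems

with_ok = '{"name": "x", "allowed_domains": [], "start_urls": []}'
print(check_spider_config(with_ok))  # []
```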
6. Import and Test
./scrapai spiders import spider.json --project myproject
./scrapai crawl example_com --project myproject --limit 5
./scrapai show example_com --project myproject
Linux (Headless Servers)
For Cloudflare bypass on headless servers:
# Install xvfb
sudo apt-get install xvfb
# Run with xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser
Or check if xvfb is available:
which xvfb-run
macOS/Windows
Browser mode uses native display automatically (no xvfb needed).
Troubleshooting
Browser Mode: No Display Available
❌ ERROR: Browser mode requires a display
No display available and xvfb not installed
Solution:
sudo apt-get install xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser
Browser Launch Failed
Playwright Error: Executable doesn't exist at /path/to/chromium
Solution:
.venv/bin/python -m playwright install chromium
Invalid URL
Invalid URL scheme: must be http or https
Solution: Ensure URL includes protocol:
./scrapai inspect https://example.com # Correct
./scrapai inspect example.com # Wrong
Next Steps