The inspect command fetches and analyzes a website to help you understand its structure and build scraper configurations. It supports three modes: lightweight HTTP, browser-based (for JavaScript sites), and Cloudflare bypass.

inspect

Inspect a website URL.

Syntax

./scrapai inspect <url> [options]

Arguments

url
string
required
Website URL to inspect.

Options

--project
string
default:"default"
Project name (used for saving analysis files).
--output-dir
string
Directory to save analysis files. Defaults to data/<project>/inspect/.
--proxy-type
choice
default:"auto"
Proxy type: none, static, residential, auto.
--no-save-html
flag
Do not save the full HTML to disk.
--browser
flag
Use browser automation for JavaScript-rendered sites and Cloudflare bypass. Automatically handles browser challenges and renders dynamic content.
--log-level
choice
default:"info"
Logging level: debug, info, warning, error, critical.
--log-file
string
Path to log file.

Modes

HTTP Mode (Default)

Lightweight HTTP fetch using the requests library:
./scrapai inspect https://example.com --project myproject
Use for:
  • Simple websites with server-side rendering
  • Static HTML sites
  • Fastest inspection method
Output:
 Using lightweight HTTP fetch

Fetching: https://example.com
Status: 200
Content-Type: text/html; charset=utf-8
Content-Length: 45,231 bytes

Analysis saved to: data/myproject/inspect/example_com/
 page.html (full HTML)
 metadata.json (headers, status, timing)

Browser Mode

Uses Playwright for JavaScript-heavy sites:
./scrapai inspect https://spa-site.com --project myproject --browser
Use for:
  • Single-page applications (React, Vue, Angular)
  • Sites with JavaScript-rendered content
  • Dynamic content loading
Output:
🌐 Using browser for JS-rendered content

Launching browser...
Navigating to: https://spa-site.com
Waiting for page load...
Content rendered!

Analysis saved to: data/myproject/inspect/spa_site_com/
 page.html (rendered HTML after JS execution)
 screenshot.png (page screenshot)
 metadata.json
Browser mode waits for JavaScript to execute and renders the final DOM. This is the HTML you should analyze for extraction selectors.
The --browser flag automatically handles both JavaScript rendering and Cloudflare bypass when needed. Additional browser mode features:
  • Automatic Cloudflare challenge detection and bypass
  • Cookie extraction for session persistence
  • Browser fingerprinting resistance
Linux (headless server): Requires xvfb for browser automation:
xvfb-run -a ./scrapai inspect https://protected-site.com --project myproject --browser
Install xvfb if needed:
sudo apt-get install xvfb
Output when Cloudflare is detected:
🖥️  Browser mode enabled

Solving Cloudflare challenge...
Challenge solved!
Extracting cookies...

Analysis saved to: data/myproject/inspect/protected_site_com/
 page.html (final HTML after bypass)
 cookies.json (Cloudflare session cookies)
 metadata.json

Saved Files

Inspection saves files to data/<project>/inspect/<domain>/:
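The directory name is derived from the URL's domain. As a minimal sketch (the slug scheme of replacing dots and hyphens with underscores is inferred from the examples on this page, e.g. `spa-site.com` becoming `spa_site_com`; the `inspect_dir` helper is hypothetical, not part of scrapai):

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def inspect_dir(url: str, project: str = "default", base: str = "data") -> Path:
    """Guess the analysis directory for a URL. The slug scheme (dots and
    hyphens replaced by underscores) is inferred from the examples above."""
    domain = urlparse(url).netloc
    slug = re.sub(r"[^A-Za-z0-9]", "_", domain)
    return Path(base) / project / "inspect" / slug

print(inspect_dir("https://example.com", "myproject").as_posix())
# data/myproject/inspect/example_com
```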

page.html

Full HTML content:
  • HTTP mode: Raw HTML from server
  • Browser mode: Rendered HTML after JavaScript execution, with automatic Cloudflare bypass when detected
Use this file with the analyze command to discover CSS selectors.

metadata.json

Request metadata:
{
  "url": "https://example.com",
  "status_code": 200,
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "server": "nginx",
    "content-length": "45231"
  },
  "fetch_time": "2026-02-28T15:30:42",
  "elapsed_seconds": 0.823,
  "mode": "http"
}
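Because the metadata records status code and headers, you can script a quick check of saved inspections to decide which sites warrant a `--browser` re-run. A sketch (the `needs_browser` heuristic is illustrative, not scrapai's own escalation logic):

```python
import json

def needs_browser(metadata: dict) -> bool:
    """Hypothetical helper: decide from saved metadata whether the site
    probably needs --browser. A 403/503 served by Cloudflare is the usual
    challenge signature; this is a heuristic, not scrapai's own logic."""
    status = metadata.get("status_code", 0)
    server = metadata.get("headers", {}).get("server", "").lower()
    return status in (403, 503) and "cloudflare" in server

meta = json.loads('{"status_code": 200, "headers": {"server": "nginx"}}')
print(needs_browser(meta))  # False
```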

cookies.json (Browser mode with Cloudflare)

Cloudflare session cookies:
{
  "cf_clearance": "abc123...",
  "__cf_bm": "xyz789...",
  "expires_at": "2026-02-28T15:40:42"
}
These cookies can be reused in spider settings to bypass Cloudflare without launching a browser on every request.
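For example, a saved cookies.json can be folded into a Cookie request header for plain HTTP requests. A sketch, assuming the file shape shown above (the `expires_at` field is metadata, not a cookie, so it is skipped):

```python
import json

def cookie_header(saved: dict) -> str:
    """Build a Cookie request header from a saved cookies.json. Assumes the
    file shape shown above; "expires_at" is metadata, not a cookie."""
    return "; ".join(f"{k}={v}" for k, v in saved.items() if k != "expires_at")

saved = json.loads(
    '{"cf_clearance": "abc123", "__cf_bm": "xyz789", "expires_at": "2026-02-28T15:40:42"}'
)
print(cookie_header(saved))  # cf_clearance=abc123; __cf_bm=xyz789
```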

screenshot.png (Browser mode)

Full-page screenshot for visual verification.

Proxy Support

Specify proxy type for inspection:
# No proxy
./scrapai inspect https://example.com --proxy-type none

# Datacenter proxy
./scrapai inspect https://example.com --proxy-type static

# Residential proxy
./scrapai inspect https://example.com --proxy-type residential

# Auto (smart escalation)
./scrapai inspect https://example.com --proxy-type auto
Requires proxy configuration in .env.

Skip HTML Saving

For quick inspection without saving files:
./scrapai inspect https://example.com --no-save-html
Prints the analysis to the console only; nothing is written to disk.

Logging

Control logging verbosity:
# Debug logging
./scrapai inspect https://example.com --log-level debug

# Save logs to file
./scrapai inspect https://example.com --log-file inspect.log

analyze

Analyze saved HTML for CSS selector discovery (separate command, not a subcommand of inspect).

Syntax

./scrapai analyze <html_file> [options]

Arguments

html_file
string
required
Path to HTML file to analyze.

Options

--test
string
Test a specific CSS selector.
--find
string
Find elements by keyword (searches classes and IDs).

Examples

# Analyze HTML structure
./scrapai analyze data/news/inspect/example_com/page.html

# Test a CSS selector
./scrapai analyze page.html --test "article.post h1.title"

# Find elements with keyword
./scrapai analyze page.html --find "author"

Output (Analysis Mode)

$ ./scrapai analyze page.html
📄 Analyzing: page.html
📊 HTML size: 45231 bytes

💡 TIP: Use --find 'keyword' to search for specific elements

============================================================
🏷️  HEADERS (h1, h2)
============================================================

H1 - Found 1:
  [1] h1.article-headline
      Text: UK economy grows 0.4% in February

H2 - Found 5:
  [1] h2.section-title
      Text: Economic Growth
  [2] h2.section-title
      Text: Market Response
  ...

============================================================
📝 CONTENT CONTAINERS
============================================================

  [1] article.main-article
      Size: 3,245 chars
      Preview: The UK economy grew by 0.4% in February, official figures show...

  [2] div.article-body
      Size: 2,891 chars
      Preview: Economists had expected growth of 0.2%, making this a positive...

============================================================
📅 DATES
============================================================
  time.published-date: February 28, 2026
  span.updated-time: Updated 2 hours ago

============================================================
✍️  AUTHORS
============================================================
  span.author-name: Economics Reporter
  a.byline: By John Smith

============================================================

Test a Selector

$ ./scrapai analyze page.html --test "h1.article-headline::text"

🔍 Testing selector: h1.article-headline::text
============================================================
 Found 1 element(s)

[1] h1
    Classes: ['article-headline']
    Text (62 chars): UK economy grows 0.4% in February
Use ::text pseudo-selector to extract text content instead of HTML.
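The idea behind selector testing can be sketched with the stdlib. This rough stand-in supports only the `tag.class::text` shape and non-nested matches; scrapai itself presumably uses a full CSS engine, so treat it as an illustration, not the tool's implementation:

```python
from html.parser import HTMLParser

class SelectorTester(HTMLParser):
    """Collect the text of elements matching a "tag.class::text" selector.
    Handles only that shape and non-nested matches: a deliberate sketch."""

    def __init__(self, selector: str):
        super().__init__()
        tag_part = selector.replace("::text", "")
        self.tag, _, self.cls = tag_part.partition(".")
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

def match_selector(html: str, selector: str):
    tester = SelectorTester(selector)
    tester.feed(html)
    return [t.strip() for t in tester.texts]

html = '<h1 class="article-headline">UK economy grows 0.4% in February</h1>'
print(match_selector(html, "h1.article-headline::text"))
```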

Find by Keyword

$ ./scrapai analyze page.html --find "author"

🔎 Finding elements with keyword: 'author'
============================================================

  span.author-name
    Text: Economics Reporter

  div.author-bio
    Text: Economics Reporter specializes in UK economic policy and analysis.

  a.author-profile
    Text: View profile

 Found 3 elements
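Keyword search amounts to scanning every element's class and id attributes for a substring. A minimal sketch of that idea (the `find_elements` helper and its output format are illustrative, not scrapai internals):

```python
from html.parser import HTMLParser

class KeywordFinder(HTMLParser):
    """Report elements whose class or id attribute contains a keyword,
    mirroring what --find appears to do."""

    def __init__(self, keyword: str):
        super().__init__()
        self.keyword = keyword.lower()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id") and value and self.keyword in value.lower():
                sep = "." if name == "class" else "#"
                self.matches.append(f"{tag}{sep}{value}")

def find_elements(html: str, keyword: str):
    finder = KeywordFinder(keyword)
    finder.feed(html)
    return finder.matches

html = '<span class="author-name">Economics Reporter</span><div class="content">x</div>'
print(find_elements(html, "author"))  # ['span.author-name']
```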

extract-urls

Extract all URLs from a saved HTML file. Useful for understanding URL patterns on a site during analysis.

Syntax

./scrapai extract-urls --file <html_file> [options]

Arguments

--file
string
required
Path to HTML file to extract URLs from.

Options

--output
string
Output file path. If not specified, URLs are printed to console.
-o
string
Short form of --output.

Examples

# Extract URLs to console
./scrapai extract-urls --file data/news/inspect/example_com/page.html

# Extract URLs to file
./scrapai extract-urls --file page.html --output urls.txt

# Short form
./scrapai extract-urls --file page.html -o urls.txt

Output

$ ./scrapai extract-urls --file page.html
https://example.com/
https://example.com/about
https://example.com/contact
https://example.com/articles/2026/02/story-1
https://example.com/articles/2026/02/story-2
...
When using --output, URLs are written one per line to the specified file:
$ ./scrapai extract-urls --file page.html -o urls.txt
 Extracted 147 URLs to urls.txt
Use extract-urls after inspect to analyze URL patterns and design spider rules. Look for common patterns like /articles/[year]/[month]/[slug] to write effective regex rules.
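The extract-then-filter step can be sketched with the stdlib: collect every `<a href>` resolved against the page URL, then keep only the URLs matching a candidate pattern. This is a minimal illustration, assuming the real command behaves like a plain link collector (it may deduplicate or normalize further):

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every <a href>, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(urljoin(self.base, href))

collector = LinkCollector("https://example.com")
collector.feed('<a href="/about">About</a><a href="/articles/2026/02/story-1">S</a>')

# Keep only article-like URLs, using the pattern spotted during review.
article = re.compile(r"/articles/\d{4}/\d{2}/[^/]+$")
print([u for u in collector.urls if article.search(u)])
```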

Workflow: Inspect to Spider Config

1. Inspect the Site

./scrapai inspect https://example.com --project myproject --browser

2. Extract URLs (Optional)

./scrapai extract-urls --file data/myproject/inspect/example_com/page.html -o urls.txt
Review URL patterns to design effective spider rules.

3. Analyze HTML Structure

./scrapai analyze data/myproject/inspect/example_com/page.html
Note the CSS selectors for title, content, author, etc.

4. Test Selectors

./scrapai analyze page.html --test "h1.article-title::text"
./scrapai analyze page.html --test "div.article-body"
./scrapai analyze page.html --test "span.author-name::text"

5. Create Spider Config

Write spider.json using discovered selectors:
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/articles"],
  "rules": [
    {
      "allow": ["/article/[^/]+$"],
      "callback": "parse_article",
      "follow": false
    }
  ],
  "callbacks": {
    "parse_article": {
      "extract": {
        "title": {"css": "h1.article-title::text"},
        "content": {"css": "div.article-body", "get": "all_text"},
        "author": {"css": "span.author-name::text"}
      }
    }
  }
}
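Before importing, it can pay to sanity-check the config: that the required top-level keys are present and the `allow` pattern matches the URLs you expect. A sketch using the key names from the example above (whether scrapai validates the same way is an assumption):

```python
import json
import re

config = json.loads("""
{
  "name": "example_com",
  "allowed_domains": ["example.com"],
  "start_urls": ["https://example.com/articles"],
  "rules": [{"allow": ["/article/[^/]+$"], "callback": "parse_article", "follow": false}],
  "callbacks": {"parse_article": {"extract": {"title": {"css": "h1.article-title::text"}}}}
}
""")

# Required keys are taken from the example config, not a published schema.
required = {"name", "allowed_domains", "start_urls", "rules", "callbacks"}
missing = required - config.keys()
print("missing keys:", missing)  # missing keys: set()

# Dry-run the rule pattern against URLs you expect to match and to skip.
allow = re.compile(config["rules"][0]["allow"][0])
print(bool(allow.search("https://example.com/article/uk-economy")))  # True
print(bool(allow.search("https://example.com/article/2026/02/x")))   # False
```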

6. Import and Test

./scrapai spiders import spider.json --project myproject
./scrapai crawl example_com --project myproject --limit 5
./scrapai show example_com --project myproject

Platform Notes

Linux (Headless Servers)

For Cloudflare bypass on headless servers:
# Install xvfb
sudo apt-get install xvfb

# Run with xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser
Or check if xvfb is available:
which xvfb-run

macOS/Windows

Browser mode uses the native display automatically (no xvfb needed).

Troubleshooting

Browser Mode: No Display Available

 ERROR: Browser mode requires a display
   No display available and xvfb not installed
Solution:
sudo apt-get install xvfb
xvfb-run -a ./scrapai inspect https://site.com --browser

Browser Launch Failed

Playwright Error: Executable doesn't exist at /path/to/chromium
Solution:
.venv/bin/python -m playwright install chromium

Invalid URL

Invalid URL scheme: must be http or https
Solution: Ensure URL includes protocol:
./scrapai inspect https://example.com  # Correct
./scrapai inspect example.com          # Wrong

Next Steps