# Pawgrab

Web scraping API. Returns clean Markdown, HTML, text, or structured JSON from any URL.
## Features

### Scraping
- Single URL scraping with multiple output formats
- Async site crawling (BFS, depth/page limits, Redis job queue)
- Structured extraction via OpenAI, CSS selectors, XPath, or regex
### Browser
- Auto JS detection - curl_cffi first, Patchright fallback for JS-heavy pages
- Anti-bot evasion - TLS fingerprint impersonation, stealth browser profiles
- Proxy rotation with health checking
### Production
- Per-domain and API-level rate limiting
- Robots.txt compliance
- Unified error responses with machine-readable error codes
- SSE heartbeats for reliable streaming through reverse proxies
- Idempotency keys for safe retries on crawl/batch endpoints
- Request ID correlation and response timing headers
- Docker Compose deployment (API + worker + Redis)
## Benchmarks

End-to-end scraping, median of 5 runs, books.toscrape.com, lower is better.

Pawgrab extracts article-quality content via Readability, while raw BS4/lxml return *all* page text (nav, footer, ads, etc.). Pawgrab also adds TLS fingerprint impersonation and anti-bot evasion.
## Quickstart

```bash
pip install pawgrab
patchright install chromium

# Start Redis (needed for /crawl)
docker run -d -p 6379:6379 redis:7-alpine

# Configure
cp .env.example .env
# Set PAWGRAB_OPENAI_API_KEY if you need /extract

# Run
pawgrab serve
```

Or with Docker:

```bash
cp .env.example .env
docker compose up
```

## API

All endpoints are under `/v1`.
### POST /v1/scrape

```bash
curl -X POST http://localhost:8000/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com"}'
```

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to scrape |
| `formats` | array | `["markdown"]` | `markdown`, `html`, `text`, `json` |
| `wait_for_js` | bool/null | `null` | Force JS (`true`), skip (`false`), auto (`null`) |
| `timeout` | int | `30000` | Timeout in ms |
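For Python clients, the request body can be assembled and sanity-checked against the table above before sending. A minimal sketch; the `build_scrape_request` helper and its validation are illustrative, not part of the package:

```python
import json

# Allowed values from the formats column above; "json" selects structured output.
VALID_FORMATS = {"markdown", "html", "text", "json"}

def build_scrape_request(url, formats=("markdown",), wait_for_js=None, timeout=30000):
    """Assemble a /v1/scrape request body, rejecting unknown formats early."""
    unknown = set(formats) - VALID_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return {
        "url": url,
        "formats": list(formats),
        "wait_for_js": wait_for_js,  # None = auto-detect, True/False = force/skip JS
        "timeout": timeout,          # milliseconds
    }

body = build_scrape_request("https://example.com", formats=["markdown", "text"])
print(json.dumps(body))
```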
### POST /v1/crawl

Returns a job ID (HTTP 202). Poll with `GET /v1/crawl/{job_id}`.

```bash
curl -X POST http://localhost:8000/v1/crawl \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "max_pages": 5}'
```

Supports the `Idempotency-Key` header for safe retries.
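The endpoint accepts an opaque `Idempotency-Key`. One convenient convention (an assumption, not mandated by the API) is to derive the key from the request body, so an identical retry automatically reuses the same key and the server can deduplicate it:

```python
import hashlib
import json

def idempotency_key(body: dict) -> str:
    """Derive a stable Idempotency-Key from a JSON request body.
    Canonical serialization (sorted keys, fixed separators) makes the
    key identical for equal bodies regardless of dict insertion order."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

body = {"url": "https://example.com", "max_pages": 5}
headers = {
    "Content-Type": "application/json",
    "Idempotency-Key": idempotency_key(body),
}
```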
### POST /v1/extract

Requires `PAWGRAB_OPENAI_API_KEY`.

```bash
curl -X POST http://localhost:8000/v1/extract \
  -H 'Content-Type: application/json' \
  -d '{"url": "https://example.com", "prompt": "Extract the main heading"}'
```

### POST /v1/search

Searches the web and scrapes each result in parallel.
```bash
curl -X POST http://localhost:8000/v1/search \
  -H 'Content-Type: application/json' \
  -d '{"query": "python web scraping"}'
```

### GET /health

Returns `ok`, `degraded`, or `unhealthy` with per-component checks.

```bash
curl http://localhost:8000/health
```

## Errors

All errors return a consistent JSON shape:
```json
{
  "success": false,
  "error": "Human-readable message",
  "code": "machine_readable_code",
  "details": null,
  "request_id": "a1b2c3d4e5f6"
}
```

Error codes: `validation_error`, `invalid_api_key`, `rate_limited`, `robots_blocked`, `resource_not_found`, `timeout`, `fetch_failed`, `browser_unavailable`, `queue_unavailable`, `llm_unavailable`, `extraction_failed`, `search_failed`, `internal_error`.
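A client can key its retry logic off the machine-readable `code` field. Which codes count as transient is an application-side judgment; the split below is one plausible reading of the list above, not part of the API contract:

```python
# Transient failures worth retrying (an assumption, not an API guarantee);
# validation and robots errors will not succeed on a second attempt.
RETRYABLE = {"rate_limited", "timeout", "fetch_failed",
             "browser_unavailable", "queue_unavailable", "llm_unavailable"}

def should_retry(error_response: dict) -> bool:
    """True only for failed responses whose code marks a transient error."""
    return (not error_response.get("success", True)
            and error_response.get("code") in RETRYABLE)

print(should_retry({"success": False, "code": "timeout"}))         # True
print(should_retry({"success": False, "code": "robots_blocked"}))  # False
```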
## Response headers

Every response includes:

| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier for correlation |
| `X-API-Version` | API version |
| `X-Response-Time` | Request duration (e.g. `42.3ms`) |
| `X-RateLimit-Limit` | Requests allowed per minute |
| `X-RateLimit-Remaining` | Requests remaining in current window |
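The two `X-RateLimit-*` headers are enough for a simple client-side throttle. A sketch; treating the window as a fixed 60 seconds is an assumption, since no reset header is listed above:

```python
def seconds_until_safe(headers: dict, window_seconds: int = 60) -> float:
    """If the per-minute quota is exhausted, wait out the rest of the
    window (pessimistically, the full window); otherwise proceed now."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    return float(window_seconds) if remaining <= 0 else 0.0

print(seconds_until_safe({"X-RateLimit-Limit": "60",
                          "X-RateLimit-Remaining": "0"}))  # 60.0
```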
## CLI

```bash
pawgrab scrape https://example.com
pawgrab scrape https://example.com --format text
pawgrab extract https://example.com --prompt "Extract the main heading"
pawgrab serve --port 8000 --reload
```

## Configuration

All settings are read from env vars with the `PAWGRAB_` prefix. See `.env.example` for the full list.
Key settings:
| Variable | Default | Description |
|---|---|---|
| `PAWGRAB_API_KEY` | empty | API key for Bearer auth (empty = no auth) |
| `PAWGRAB_RATE_LIMIT_RPM` | `60` | Per-domain rate limit (requests/min) |
| `PAWGRAB_API_RATE_LIMIT_RPM` | `600` | API-level rate limit per client (requests/min) |
| `PAWGRAB_REDIS_URL` | `redis://localhost:6379/0` | Redis for job queue and idempotency |
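A deploy script or test harness can mirror the same precedence (env var first, else the table's default). A sketch; the `setting` helper is illustrative, not part of the package:

```python
import os

def setting(name: str, default: str) -> str:
    """Read a PAWGRAB_-prefixed env var, falling back to the given default."""
    return os.environ.get(f"PAWGRAB_{name}", default)

# Defaults mirror the table above.
redis_url = setting("REDIS_URL", "redis://localhost:6379/0")
rate_limit_rpm = int(setting("RATE_LIMIT_RPM", "60"))
```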
