API Reference
Scrape API
Extract clean, structured content from any URL using the 5-layer CEP (Content Extraction Protocol) pipeline. Unlike basic scrapers, CEP handles JavaScript-rendered pages, PDFs, and complex layouts while applying QATBE token-budgeted extraction.
POST /v1/scrape
CEP pipeline layers
The Content Extraction Protocol tries each layer in order, falling back when the previous layer fails or produces insufficient content:
| Layer | Method | Best for |
|---|---|---|
| 1 | CSS selector extraction | Well-structured HTML with known schemas |
| 2 | Readability algorithm | Article pages, blogs, documentation |
| 3 | Headless JS rendering | SPAs, React/Vue/Angular apps |
| 4 | PDF text extraction | PDF files, academic papers |
| 5 | Screenshot OCR | Image-heavy pages, canvas rendering |
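The fallback behaviour described above can be sketched as a simple cascade. This is an illustrative client-side model only: the layer functions, the sufficiency threshold, and the error handling are assumptions, not the actual server implementation.

```python
# Hypothetical sketch of the CEP fallback cascade: try each layer in
# order, falling through when a layer errors or yields too little text.
MIN_CONTENT_CHARS = 200  # assumed threshold for "sufficient" content

def run_cep(url, layers):
    """Try each (layer_number, extractor) in order.

    Returns (layer_number, content) for the first layer that succeeds,
    or (None, "") if every layer fails.
    """
    for number, extract in layers:
        try:
            content = extract(url)
        except Exception:
            continue  # layer failed outright; fall through to the next
        if content and len(content) >= MIN_CONTENT_CHARS:
            return number, content
    return None, ""

# Stub extractors standing in for the real layers:
layers = [
    (1, lambda url: ""),                   # CSS selectors find nothing
    (2, lambda url: "article text " * 50), # Readability succeeds
]
layer, content = run_cep("https://example.com/post", layers)
# layer is 2 here, mirroring "cep_layer_used": 2 in the response below
```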
Request parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | URL to scrape. Must be a valid HTTP(S) URL. |
| `tier` | string | No | Detail level (same as search). Default: `summary`. |
| `token_budget` | integer | No | Overrides `tier`. Range: 100–50,000. |
| `query` | string | No | Query to focus extraction on (boosts query-relevant segments via QATBE). |
| `include_metadata` | boolean | No | Include page metadata (title, description, OG tags). Default: `true`. |
| `include_links` | boolean | No | Include extracted internal/external links. Default: `false`. |
| `wait_for_js` | boolean | No | Force headless JS rendering even for static pages. Default: `false`. |
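The constraints in the table above can be expressed as a small request-body builder. This is a hypothetical client-side helper; the validation rules here (URL scheme check, budget range) are our reading of the docs, not behaviour guaranteed by the API.

```python
# Hypothetical helper for building a /v1/scrape request body.
# Parameter names mirror the table above; defaults are assumptions.
def build_scrape_request(url, tier="summary", token_budget=None,
                         wait_for_js=False):
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be a valid HTTP(S) URL")
    if token_budget is not None and not (100 <= token_budget <= 50_000):
        raise ValueError("token_budget must be in the range 100-50,000")
    body = {"url": url, "tier": tier, "wait_for_js": wait_for_js}
    if token_budget is not None:
        body["token_budget"] = token_budget  # overrides the tier's budget
    return body

body = build_scrape_request("https://app.example.com/dashboard",
                            wait_for_js=True)
```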
Example request
scrape.sh

```bash
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "tier": "detailed",
    "query": "async runtime tutorial"
  }'
```

Response
response.json

```json
{
  "meta": {
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "title": "Hello Tokio — Tokio",
    "description": "A walkthrough of the Hello Tokio application",
    "tokens_used": 3210,
    "duration_ms": 892,
    "cep_layer_used": 2,
    "result_id": "f1e2d3c4-..."
  },
  "content": "Tokio is an asynchronous runtime for the Rust programming language...",
  "segments": [
    {
      "type": "heading",
      "text": "Hello Tokio",
      "level": 1,
      "token_count": 2
    },
    {
      "type": "paragraph",
      "text": "We will start by writing a very basic Tokio application...",
      "token_count": 48
    },
    {
      "type": "code",
      "text": "#[tokio::main]\nasync fn main() {\n    println!(\"Hello, Tokio!\");\n}",
      "language": "rust",
      "token_count": 22
    }
  ]
}
```

Response fields
| Field | Type | Description |
|---|---|---|
| `meta.cep_layer_used` | integer | Which CEP layer (1–5) produced the content |
| `content` | string | Clean extracted text within the token budget |
| `segments` | array | SCS-segmented content blocks with type and token count |
| `segments[].type` | string | One of `heading`, `paragraph`, `code`, `list`, `table`, `quote`, `metadata`, `other` |
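The `segments` array is straightforward to consume on the client. A minimal sketch, with helper names of our own invention (they are not part of the API):

```python
# Illustrative client-side handling of the `segments` array.
def code_segments(segments):
    """Return (language, text) pairs for every code segment."""
    return [(s.get("language", ""), s["text"])
            for s in segments if s["type"] == "code"]

def total_tokens(segments):
    """Sum per-segment token counts (should not exceed the budget)."""
    return sum(s["token_count"] for s in segments)

# Segments as returned in the example response above:
segments = [
    {"type": "heading", "text": "Hello Tokio", "level": 1, "token_count": 2},
    {"type": "paragraph", "text": "We will start...", "token_count": 48},
    {"type": "code", "text": "#[tokio::main]\nasync fn main() {}",
     "language": "rust", "token_count": 22},
]
blocks = code_segments(segments)   # one Rust block
tokens = total_tokens(segments)    # 72
```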
PDF extraction
Point the scrape endpoint at any PDF URL — CEP layer 4 handles extraction automatically:
```bash
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/pdf/2301.00234.pdf", "tier": "detailed"}'
```

JavaScript-rendered pages
CEP automatically detects and uses headless rendering for SPAs. For pages where static extraction is insufficient, force JS rendering:
```json
{
  "url": "https://app.example.com/dashboard",
  "wait_for_js": true,
  "tier": "summary"
}
```