API Reference

Scrape API

Extract clean, structured content from any URL using the 5-layer CEP (Content Extraction Protocol) pipeline. Unlike basic scrapers, CEP handles JavaScript-rendered pages, PDFs, and complex layouts while applying QATBE token-budgeted extraction.

POST /v1/scrape

CEP pipeline layers

The Content Extraction Protocol tries each layer in order, falling back to the next layer whenever the current one fails or produces insufficient content:

| Layer | Method | Best for |
| --- | --- | --- |
| 1 | CSS selector extraction | Well-structured HTML with known schemas |
| 2 | Readability algorithm | Article pages, blogs, documentation |
| 3 | Headless JS rendering | SPAs, React/Vue/Angular apps |
| 4 | PDF text extraction | PDF files, academic papers |
| 5 | Screenshot OCR | Image-heavy pages, canvas rendering |
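
Conceptually, the fallback works like a loop over the layers: try the cheapest one first and only move on when it fails or returns too little. A minimal TypeScript sketch of that ordering (the extractor list and the 200-character threshold are illustrative assumptions, not part of the service):

typescript
// Conceptual sketch of the layer-by-layer fallback; not the actual CEP implementation.
type Extractor = (url: string) => Promise<string | null>;

async function extractWithFallback(
  url: string,
  extractors: Extractor[],   // layers 1-5, in priority order
  minContentLength = 200,    // illustrative "insufficient content" threshold
): Promise<{ content: string; layer: number } | null> {
  for (let i = 0; i < extractors.length; i++) {
    try {
      const content = await extractors[i](url);
      if (content && content.length >= minContentLength) {
        return { content, layer: i + 1 };
      }
      // Too little content: fall through to the next layer.
    } catch {
      // A layer that throws is treated the same as one that returns nothing.
    }
  }
  return null; // every layer failed or produced insufficient content
}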

Request parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | URL to scrape. Must be a valid HTTP(S) URL. |
| tier | string | No | Detail level (same as search). Default: summary. |
| token_budget | integer | No | Overrides the tier's token budget. Range: 100–50,000. |
| query | string | No | Query to focus extraction on (boosts query-relevant segments via QATBE). |
| include_metadata | boolean | No | Include page metadata (title, description, OG tags). Default: true. |
| include_links | boolean | No | Include extracted internal/external links. Default: false. |
| wait_for_js | boolean | No | Force headless JS rendering even for static pages. Default: false. |
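
Taken together, the parameters map onto a request body shaped like the following TypeScript interface. This is a sketch derived from the table above, not an official SDK type:

typescript
// Sketch of the /v1/scrape request body, derived from the parameter table above.
interface ScrapeRequest {
  url: string;                // required; must be a valid HTTP(S) URL
  tier?: string;              // detail level; defaults to "summary"
  token_budget?: number;      // overrides the tier's budget; 100-50,000
  query?: string;             // focuses extraction on query-relevant segments (QATBE)
  include_metadata?: boolean; // default: true
  include_links?: boolean;    // default: false
  wait_for_js?: boolean;      // default: false
}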

Example request

scrape.sh
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "tier": "detailed",
    "query": "async runtime tutorial"
  }'
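
The same request from a TypeScript client using fetch (Node 18+ or the browser), with the same placeholder API key as the curl example:

typescript
// Same request as the curl example, using fetch.
const response = await fetch("https://api.hypersearchx.zuhabul.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer hsx_your_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://tokio.rs/tokio/tutorial/hello-tokio",
    tier: "detailed",
    query: "async runtime tutorial",
  }),
});

const result = await response.json();
console.log(result.meta.cep_layer_used, result.content.slice(0, 80));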

Response

response.json
{
  "meta": {
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "title": "Hello Tokio — Tokio",
    "description": "A walkthrough of the Hello Tokio application",
    "tokens_used": 3210,
    "duration_ms": 892,
    "cep_layer_used": 2,
    "result_id": "f1e2d3c4-..."
  },
  "content": "Tokio is an asynchronous runtime for the Rust programming language...",
  "segments": [
    {
      "type": "heading",
      "text": "Hello Tokio",
      "level": 1,
      "token_count": 2
    },
    {
      "type": "paragraph",
      "text": "We will start by writing a very basic Tokio application...",
      "token_count": 48
    },
    {
      "type": "code",
      "text": "#[tokio::main]\nasync fn main() {\n    println!(\"Hello, Tokio!\");\n}",
      "language": "rust",
      "token_count": 22
    }
  ]
}

Response fields

| Field | Type | Description |
| --- | --- | --- |
| meta.cep_layer_used | integer | Which CEP layer (1–5) produced the content |
| content | string | Clean extracted text within the token budget |
| segments | array | SCS-segmented content blocks with type and token count |
| segments[].type | string | One of heading, paragraph, code, list, table, quote, metadata, other |
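
Sketched as TypeScript types (the names are ours, based on the example response and the field table above), plus a small helper that pulls out only the code segments:

typescript
// Response shape sketched from the example above; type names are illustrative, not an official SDK.
interface ScrapeSegment {
  type: "heading" | "paragraph" | "code" | "list" | "table" | "quote" | "metadata" | "other";
  text: string;
  token_count: number;
  level?: number;    // present on heading segments
  language?: string; // present on code segments
}

interface ScrapeResponse {
  meta: {
    url: string;
    title: string;
    description: string;
    tokens_used: number;
    duration_ms: number;
    cep_layer_used: number; // 1-5
    result_id: string;
  };
  content: string;
  segments: ScrapeSegment[];
}

// Example: pull only the code blocks out of a scraped page.
function codeSegments(res: ScrapeResponse): ScrapeSegment[] {
  return res.segments.filter((s) => s.type === "code");
}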

PDF extraction

Point the scrape endpoint at any PDF URL — CEP layer 4 handles extraction automatically:

bash
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/pdf/2301.00234.pdf", "tier": "detailed"}'

JavaScript-rendered pages

CEP automatically detects and uses headless rendering for SPAs. For pages where static extraction is insufficient, force JS rendering:

json
{
  "url": "https://app.example.com/dashboard",
  "wait_for_js": true,
  "tier": "summary"
}
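
A common pattern is to scrape statically first and retry with wait_for_js: true only when the result comes back too thin; the length threshold below is an arbitrary choice for this sketch, not an API value:

typescript
// Try a static scrape first; retry with wait_for_js only if the result looks too thin.
async function scrapeWithJsFallback(url: string, apiKey: string) {
  const scrape = (waitForJs: boolean) =>
    fetch("https://api.hypersearchx.zuhabul.com/v1/scrape", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url, tier: "summary", wait_for_js: waitForJs }),
    }).then((r) => r.json());

  const first = await scrape(false);
  // 200 characters is an arbitrary "insufficient content" threshold for this sketch.
  if (first.content && first.content.length >= 200) return first;
  return scrape(true);
}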

Next steps