API Reference

Scrape API

Extract clean, structured content from any URL using the 5-layer CEP (Content Extraction Protocol) pipeline. Unlike basic scrapers, CEP handles JavaScript-rendered pages, PDFs, and complex layouts while applying QATBE token-budgeted extraction.

POST /v1/scrape

CEP pipeline layers

The Content Extraction Protocol tries each layer in order, falling back to the next layer whenever the current one fails or produces insufficient content:

| Layer | Method | Best for |
| --- | --- | --- |
| 1 | CSS selector extraction | Well-structured HTML with known schemas |
| 2 | Readability algorithm | Article pages, blogs, documentation |
| 3 | Headless JS rendering | SPAs, React/Vue/Angular apps |
| 4 | PDF text extraction | PDF files, academic papers |
| 5 | Screenshot OCR | Image-heavy pages, canvas rendering |
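
Conceptually, the fallback works like a loop over the layers: try the cheapest one first and only move on when it fails or returns too little. A minimal TypeScript sketch of that ordering (the extractor list and the 200-character threshold are illustrative assumptions, not part of the service):

typescript
// Conceptual sketch of the layer-by-layer fallback; not the actual CEP implementation.
type Extractor = (url: string) => Promise<string | null>;

async function extractWithFallback(
  url: string,
  extractors: Extractor[],   // layers 1-5, in priority order
  minContentLength = 200,    // illustrative "insufficient content" threshold
): Promise<{ content: string; layer: number } | null> {
  for (let i = 0; i < extractors.length; i++) {
    try {
      const content = await extractors[i](url);
      if (content && content.length >= minContentLength) {
        return { content, layer: i + 1 };
      }
      // Too little content: fall through to the next layer.
    } catch {
      // A layer that throws is treated the same as one that returns nothing.
    }
  }
  return null; // every layer failed or produced insufficient content
}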

Request parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | URL to scrape. Must be a valid HTTP(S) URL. |
| tier | string | No | Detail level (same as search). Default: summary. |
| token_budget | integer | No | Overrides the tier's token budget. Range: 100–50,000. |
| query | string | No | Query to focus extraction on (boosts query-relevant segments via QATBE). |
| include_metadata | boolean | No | Include page metadata (title, description, OG tags). Default: true. |
| include_links | boolean | No | Include extracted internal/external links. Default: false. |
| wait_for_js | boolean | No | Force headless JS rendering even for static pages. Default: false. |
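
Taken together, the parameters map onto a request body shaped like the following TypeScript interface. This is a sketch derived from the table above, not an official SDK type:

typescript
// Sketch of the /v1/scrape request body, derived from the parameter table above.
interface ScrapeRequest {
  url: string;                // required; must be a valid HTTP(S) URL
  tier?: string;              // detail level; defaults to "summary"
  token_budget?: number;      // overrides the tier's budget; 100-50,000
  query?: string;             // focuses extraction on query-relevant segments (QATBE)
  include_metadata?: boolean; // default: true
  include_links?: boolean;    // default: false
  wait_for_js?: boolean;      // default: false
}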

Example request

scrape.sh
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "tier": "detailed",
    "query": "async runtime tutorial"
  }'
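
The same request from a TypeScript client using fetch (Node 18+ or the browser), with the same placeholder API key as the curl example:

typescript
// Same request as the curl example, using fetch.
const response = await fetch("https://api.hypersearchx.zuhabul.com/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer hsx_your_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://tokio.rs/tokio/tutorial/hello-tokio",
    tier: "detailed",
    query: "async runtime tutorial",
  }),
});

const result = await response.json();
console.log(result.meta.cep_layer_used, result.content.slice(0, 80));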

Response

response.json
{
  "meta": {
    "url": "https://tokio.rs/tokio/tutorial/hello-tokio",
    "title": "Hello Tokio — Tokio",
    "description": "A walkthrough of the Hello Tokio application",
    "tokens_used": 3210,
    "duration_ms": 892,
    "cep_layer_used": 2,
    "result_id": "f1e2d3c4-..."
  },
  "content": "Tokio is an asynchronous runtime for the Rust programming language...",
  "segments": [
    {
      "type": "heading",
      "text": "Hello Tokio",
      "level": 1,
      "token_count": 2
    },
    {
      "type": "paragraph",
      "text": "We will start by writing a very basic Tokio application...",
      "token_count": 48
    },
    {
      "type": "code",
      "text": "#[tokio::main]\nasync fn main() {\n    println!(\"Hello, Tokio!\");\n}",
      "language": "rust",
      "token_count": 22
    }
  ]
}

Response fields

| Field | Type | Description |
| --- | --- | --- |
| meta.cep_layer_used | integer | Which CEP layer (1–5) produced the content |
| content | string | Clean extracted text within the token budget |
| segments | array | SCS-segmented content blocks with type and token count |
| segments[].type | string | One of heading, paragraph, code, list, table, quote, metadata, other |
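
Sketched as TypeScript types (the names are ours, based on the example response and the field table above), plus a small helper that pulls out only the code segments:

typescript
// Response shape sketched from the example above; type names are illustrative, not an official SDK.
interface ScrapeSegment {
  type: "heading" | "paragraph" | "code" | "list" | "table" | "quote" | "metadata" | "other";
  text: string;
  token_count: number;
  level?: number;    // present on heading segments
  language?: string; // present on code segments
}

interface ScrapeResponse {
  meta: {
    url: string;
    title: string;
    description: string;
    tokens_used: number;
    duration_ms: number;
    cep_layer_used: number; // 1-5
    result_id: string;
  };
  content: string;
  segments: ScrapeSegment[];
}

// Example: pull only the code blocks out of a scraped page.
function codeSegments(res: ScrapeResponse): ScrapeSegment[] {
  return res.segments.filter((s) => s.type === "code");
}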

PDF extraction

Point the scrape endpoint at any PDF URL — CEP layer 4 handles extraction automatically:

bash
curl -X POST https://api.hypersearchx.zuhabul.com/v1/scrape \
  -H "Authorization: Bearer hsx_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://arxiv.org/pdf/2301.00234.pdf", "tier": "detailed"}'

JavaScript-rendered pages

CEP automatically detects and uses headless rendering for SPAs. For pages where static extraction is insufficient, force JS rendering:

json
{
  "url": "https://app.example.com/dashboard",
  "wait_for_js": true,
  "tier": "summary"
}
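
A common pattern is to scrape statically first and retry with wait_for_js: true only when the result comes back too thin; the length threshold below is an arbitrary choice for this sketch, not an API value:

typescript
// Try a static scrape first; retry with wait_for_js only if the result looks too thin.
async function scrapeWithJsFallback(url: string, apiKey: string) {
  const scrape = (waitForJs: boolean) =>
    fetch("https://api.hypersearchx.zuhabul.com/v1/scrape", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url, tier: "summary", wait_for_js: waitForJs }),
    }).then((r) => r.json());

  const first = await scrape(false);
  // 200 characters is an arbitrary "insufficient content" threshold for this sketch.
  if (first.content && first.content.length >= 200) return first;
  return scrape(true);
}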

Next steps