Algorithms

CEP — Content Extraction Protocol

CEP is a 5-layer cascade extraction system that handles any web content — from clean articles to JavaScript-heavy SPAs to PDF documents. Each layer attempts extraction and falls back to the next on failure, ensuring maximum coverage with minimum cost.

The 5 layers

Layer 1: CSS Selectors (~1ms)

Tries known content selectors: article, main, .post-content, .article-body, [itemprop="articleBody"], etc. Works for 70%+ of news and blog sites. Fastest layer — pure DOM traversal, no network calls.

Layer 2: Readability (~5ms)

Mozilla Readability algorithm port. Scores DOM nodes by content density (text-to-tag ratio), cleans boilerplate (navs, footers, ads), and extracts the main content block. Works for most static HTML pages.
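The text-to-tag heuristic at the heart of this layer can be illustrated in a few lines. Real Readability scores individual DOM nodes and propagates scores to parents; this sketch scores a whole fragment, which is enough to show why prose beats navigation boilerplate:

```python
import re

def density(fragment: str) -> float:
    """Ratio of visible text to total content+markup length (0-1)."""
    tags = re.findall(r"<[^>]+>", fragment)
    text = re.sub(r"<[^>]+>", " ", fragment)
    text_len = len(" ".join(text.split()))
    markup_len = sum(len(t) for t in tags)
    total = text_len + markup_len
    return text_len / total if total else 0.0

article = "<p>A long paragraph of real prose that dominates its markup by a wide margin.</p>"
nav = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
# The article fragment scores far higher than the link list.
```

A nav menu is almost all tags and attributes, so its density collapses toward zero, which is exactly the signal used to strip boilerplate.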

Layer 3: Headless JS (~2s)

Chromium-based full-page render via chromiumoxide. Executes JavaScript, waits for DOM mutations to settle, then applies Layers 1 and 2 to the rendered DOM. Required for SPAs (React/Vue/Angular) that render content client-side. Controlled via hsx setup --headless.

Layer 4: PDF Extraction (~50ms)

Detects the application/pdf content type and switches to PDF text extraction. Preserves heading hierarchy, extracts tables, and handles multi-column layouts. Suited to academic papers, reports, and documentation PDFs.

Layer 5: Screenshot OCR (~3s)

Last resort — captures a screenshot with headless Chrome and runs OCR. Handles image-based PDFs, paywalled previews, and sites that actively block extraction. Slowest but most universal.

Layer selection logic

CEP tries layers in order 1→2→3→4→5 until one succeeds. It also uses content-type hints to skip layers when possible:

  • Content-Type: application/pdf → skip to Layer 4
  • URL ends in .pdf → skip to Layer 4
  • Layer 1 confidence < 0.4 → try Layer 2
  • Layer 2 fails (empty body) → try Layer 3 (headless)
  • Layer 3 disabled (no Chrome) → skip to Layer 4/5
  • Layer 4 fails (not a PDF) → try Layer 5 (OCR)
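The selection rules above can be condensed into a short dispatcher. This is a sketch, not CEP's implementation: the `run_cep` name and the `(text, confidence)` layer signature are assumptions; the 0.4 gate and the PDF shortcut follow the rules listed above:

```python
def run_cep(url, content_type, layers, headless_available=True, threshold=0.4):
    """Try extraction layers in order; each layer returns (text, confidence)."""
    order = [1, 2, 3, 4, 5]
    if content_type == "application/pdf" or url.endswith(".pdf"):
        order = [4, 5]                        # PDF hint: skip straight to Layer 4
    for n in order:
        if n == 3 and not headless_available:
            continue                          # no Chrome: fall through to Layer 4/5
        text, conf = layers[n](url)
        if text and conf >= threshold:        # empty body or low confidence -> next layer
            return n, text
    return None, ""
```

Keeping the cascade as a flat ordered list, with hints only pruning it, is what lets cheap layers absorb most traffic while the expensive ones stay available as fallbacks.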

Quality confidence scoring

Each layer reports an extraction confidence score (0–1) based on:

  • Content length — very short extractions score lower
  • Text-to-markup ratio — high markup density suggests boilerplate
  • Sentence completeness — truncated sentences score lower
  • Language detection — non-target-language content scores lower
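One way to combine the four signals is a weighted sum; the weights, the 1000-character length cap, and the end-of-sentence check below are illustrative assumptions, not CEP's actual constants:

```python
def confidence(text: str, markup_ratio: float, target_lang_ok: bool = True) -> float:
    """Combine the four extraction-quality signals into a 0-1 score."""
    length_score = min(len(text) / 1000, 1.0)        # very short extractions score low
    density_score = 1.0 - min(markup_ratio, 1.0)     # heavy markup suggests boilerplate
    complete = text.rstrip().endswith((".", "!", "?", '"'))
    sentence_score = 1.0 if complete else 0.5        # truncated final sentence penalized
    lang_score = 1.0 if target_lang_ok else 0.2      # wrong-language content penalized
    return round(0.35 * length_score + 0.25 * density_score
                 + 0.20 * sentence_score + 0.20 * lang_score, 3)
```

A long, clean, complete extraction in the target language approaches 1.0; a short, markup-heavy, truncated one falls well under the 0.4 gate used in the layer-selection rules.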

QADD integration

After extraction, CEP passes content through QADD (Query-Aware DOM Distillation), which applies 5 pruning steps to reduce token count by 10–20x while preserving query-relevant content:

  1. Remove navigation, ads, footers, and cookie banners
  2. Collapse whitespace and normalize Unicode
  3. Score sentences by query relevance (BM25)
  4. Keep top-K sentences within token budget
  5. Reconstruct coherent paragraphs from retained sentences
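The five steps can be sketched end-to-end with a bare-bones BM25 standing in for step 3. The `qadd` name, the k1/b defaults, word-level token counting, and the greedy budget fill are all illustrative assumptions; whitespace collapse stands in for steps 1-2, which run against the full DOM in practice:

```python
import math
import re

def qadd(text: str, query: str, token_budget: int, k1: float = 1.5, b: float = 0.75) -> str:
    # Steps 1-2 stand-in: collapse whitespace, then split into sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", " ".join(text.split())) if s]
    docs = [re.findall(r"\w+", s.lower()) for s in sentences]
    if not docs:
        return ""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n

    def bm25(terms, d):
        # Step 3: classic BM25, treating each sentence as a "document".
        score = 0.0
        for t in terms:
            tf = d.count(t)
            if tf == 0:
                continue
            df = sum(1 for other in docs if t in other)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        return score

    q = re.findall(r"\w+", query.lower())
    ranked = sorted(range(n), key=lambda i: bm25(q, docs[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        # Step 4: greedily keep top-scoring sentences under the token budget.
        if used + len(docs[i]) <= token_budget:
            kept.add(i)
            used += len(docs[i])
    # Step 5 approximation: re-emit survivors in original document order.
    return " ".join(sentences[i] for i in sorted(kept))
```

Re-emitting kept sentences in document order, rather than score order, is what keeps the distilled output reading as coherent paragraphs instead of a ranked list.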

CLI usage

# Extract content from a URL (auto-detects best layer)
hsx fetch https://example.com/article

# Force specific layer
hsx fetch https://spa-app.com --headless

# Extract text from a PDF
hsx fetch https://arxiv.org/pdf/2301.00001.pdf

# With token budget
hsx fetch https://example.com --tokens 2000
