CEP — Content Extraction Protocol
CEP is a 5-layer cascade extraction system that handles any web content — from clean articles to JavaScript-heavy SPAs to PDF documents. Each layer attempts extraction and falls back to the next on failure, ensuring maximum coverage with minimum cost.
The 5 layers
Layer 1: Selector extraction. Tries known content selectors: article, main, .post-content, .article-body, [itemprop="articleBody"], etc. Works for 70%+ of news and blog sites. Fastest layer — pure DOM traversal, no network calls.
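The selector probe can be sketched as a priority scan over the known selectors. The ordering and the `pick_selector` helper below are illustrative assumptions, not hsx's actual code; a real implementation would query the DOM instead of a closure:

```rust
/// Known content selectors, probed in priority order (the ordering is an
/// assumption for illustration; the protocol text lists them unordered).
const CONTENT_SELECTORS: &[&str] = &[
    "article",
    "main",
    "[itemprop=\"articleBody\"]",
    ".post-content",
    ".article-body",
];

/// Pick the first known selector the page matches. `page_matches` stands
/// in for a real DOM query (e.g. querySelector against a parsed document).
fn pick_selector(page_matches: impl Fn(&str) -> bool) -> Option<&'static str> {
    CONTENT_SELECTORS.iter().copied().find(|sel| page_matches(sel))
}

fn main() {
    // A hypothetical blog page that only matches main and .post-content:
    // the scan still prefers main because it comes first in priority order.
    let page = |sel: &str| sel == "main" || sel == ".post-content";
    assert_eq!(pick_selector(page), Some("main"));
    println!("extract from: {:?}", pick_selector(page));
}
```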
Layer 2: Readability. A port of Mozilla's Readability algorithm. Scores DOM nodes by content density (text-to-tag ratio), cleans boilerplate (navs, footers, ads), and extracts the main content block. Works for most static HTML pages.
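The text-to-tag ratio at the heart of density scoring can be sketched as follows. This heuristic is illustrative only: the real Readability algorithm scores nodes recursively with per-tag weights and link-density penalties, not a flat character count:

```rust
/// Rough content-density score for an HTML fragment: the fraction of
/// characters that are visible text rather than markup. Note this naive
/// scan would miscount a literal '>' inside an attribute value.
fn text_to_markup_ratio(html: &str) -> f64 {
    let mut text_chars = 0usize;
    let mut total_chars = 0usize;
    let mut in_tag = false;
    for c in html.chars() {
        total_chars += 1;
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag && !c.is_whitespace() => text_chars += 1,
            _ => {}
        }
    }
    if total_chars == 0 { 0.0 } else { text_chars as f64 / total_chars as f64 }
}

fn main() {
    let article = "<p>Readability keeps long runs of prose like this one.</p>";
    let nav = "<ul><li><a href=\"/\">Home</a></li><li><a href=\"/x\">X</a></li></ul>";
    // Prose-heavy markup scores higher than navigation boilerplate.
    assert!(text_to_markup_ratio(article) > text_to_markup_ratio(nav));
    println!("article density: {:.2}", text_to_markup_ratio(article));
}
```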
Layer 3: Headless render. Chromium-based full-page render via chromiumoxide. Executes JavaScript, waits for DOM mutations, then applies Layers 1+2. Required for SPAs (React/Vue/Angular) that render content client-side. Controlled via hsx setup --headless.
Layer 4: PDF extraction. Detects the application/pdf content type and switches to PDF text extraction. Preserves heading hierarchy, extracts tables, and handles multi-column layouts. Covers academic papers, reports, and documentation PDFs.
Layer 5: OCR fallback. Last resort — captures a screenshot with headless Chrome and runs OCR. Handles image-based PDFs, paywalled previews, and sites that actively block extraction. Slowest but most universal.
Layer selection logic
CEP tries layers in order 1→2→3→4→5 until one succeeds. It also uses content-type hints to skip layers when possible: for example, a response served as application/pdf goes straight to Layer 4 rather than through the HTML layers.
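The cascade can be sketched as an ordered scan over layer functions that stops at the first acceptable result. The `Layer` and `Extraction` types and the confidence threshold are hypothetical names for illustration, not hsx's actual API:

```rust
/// One extraction attempt: a layer either yields content with a
/// confidence score, or fails and the cascade moves on.
struct Extraction {
    text: String,
    confidence: f64,
}

type Layer = fn(&str) -> Option<Extraction>;

/// Try layers in order 1→5; accept the first result whose confidence
/// clears the threshold. A real implementation would also consult
/// content-type hints to skip straight to, say, the PDF layer.
fn extract(url: &str, layers: &[(&str, Layer)], min_confidence: f64) -> Option<(String, Extraction)> {
    for (name, layer) in layers {
        if let Some(result) = layer(url) {
            if result.confidence >= min_confidence {
                return Some((name.to_string(), result));
            }
        }
    }
    None
}

fn main() {
    // Stub layers: the selector layer fails on this "SPA", headless succeeds.
    let selector_layer: Layer = |_| None;
    let headless_layer: Layer = |_| Some(Extraction { text: "rendered content".into(), confidence: 0.9 });
    let layers: [(&str, Layer); 2] = [("selectors", selector_layer), ("headless", headless_layer)];
    let (winner, result) = extract("https://spa-app.com", &layers, 0.5).unwrap();
    assert_eq!(winner, "headless");
    println!("{} -> {}", winner, result.text);
}
```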
Quality confidence scoring
Each layer reports an extraction confidence score (0–1) based on:
- Content length — very short extractions score lower
- Text-to-markup ratio — high markup density suggests boilerplate
- Sentence completeness — truncated sentences score lower
- Language detection — non-target-language content scores lower
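A score along these lines might blend the four signals as a weighted sum with a language gate. The weights, the length saturation point, and the `Signals` struct below are illustrative assumptions, not CEP's actual formula:

```rust
/// Signals reported by a layer for one extraction.
struct Signals {
    char_count: usize,       // content length
    text_to_markup: f64,     // 0–1, higher means less boilerplate
    complete_sentences: f64, // fraction of sentences ending in terminal punctuation
    target_language: bool,   // language-detection result
}

/// Fold the signals into a 0–1 confidence (purely illustrative weights).
fn confidence(s: &Signals) -> f64 {
    // Length term saturates around ~2000 characters.
    let length = (s.char_count as f64 / 2000.0).min(1.0);
    // Off-language content is heavily penalized but not zeroed out.
    let language = if s.target_language { 1.0 } else { 0.3 };
    (0.3 * length + 0.4 * s.text_to_markup + 0.3 * s.complete_sentences) * language
}

fn main() {
    let article = Signals { char_count: 5000, text_to_markup: 0.8, complete_sentences: 0.95, target_language: true };
    let stub = Signals { char_count: 120, text_to_markup: 0.2, complete_sentences: 0.4, target_language: true };
    // A full article outscores a short boilerplate-heavy stub.
    assert!(confidence(&article) > confidence(&stub));
    println!("article: {:.2}, stub: {:.2}", confidence(&article), confidence(&stub));
}
```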
QADD integration
After extraction, CEP passes content through QADD (Query-Aware DOM Distillation), which applies 5 pruning steps to reduce token count by 10–20x while preserving query-relevant content:
- Remove navigation, ads, footers, and cookie banners
- Collapse whitespace and normalize Unicode
- Score sentences by query relevance (BM25)
- Keep top-K sentences within token budget
- Reconstruct coherent paragraphs from retained sentences
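Steps 3–4 can be sketched with a minimal BM25 scorer that treats each sentence as a document and keeps the top-K in original order. The tokenizer, the use of top-K as a stand-in for a token budget, and the parameters (the common defaults k1 = 1.2, b = 0.75) are illustrative assumptions, not QADD's implementation:

```rust
use std::collections::HashMap;

fn tokens(s: &str) -> Vec<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// BM25 score of one sentence against the query, treating each
/// sentence as a document in the corpus.
fn bm25(query: &[String], sentence: &[String], df: &HashMap<String, usize>, n: usize, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    query.iter().map(|q| {
        let tf = sentence.iter().filter(|t| *t == q).count() as f64;
        if tf == 0.0 { return 0.0; }
        let dfq = *df.get(q).unwrap_or(&0) as f64;
        let idf = ((n as f64 - dfq + 0.5) / (dfq + 0.5) + 1.0).ln();
        idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * sentence.len() as f64 / avg_len))
    }).sum()
}

/// Keep the top-k sentences by query relevance, returned in original
/// document order so paragraphs can be reconstructed coherently.
fn prune<'a>(sentences: &[&'a str], query: &str, k: usize) -> Vec<&'a str> {
    let tokenized: Vec<Vec<String>> = sentences.iter().map(|s| tokens(s)).collect();
    // Document frequency of each term across sentences.
    let mut df: HashMap<String, usize> = HashMap::new();
    for sent in &tokenized {
        let mut seen: Vec<&String> = sent.iter().collect();
        seen.sort();
        seen.dedup();
        for t in seen { *df.entry(t.clone()).or_insert(0) += 1; }
    }
    let n = sentences.len();
    let avg_len = tokenized.iter().map(|s| s.len()).sum::<usize>() as f64 / n as f64;
    let q = tokens(query);
    let mut scored: Vec<(usize, f64)> = tokenized.iter().enumerate()
        .map(|(i, s)| (i, bm25(&q, s, &df, n, avg_len))).collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let mut keep: Vec<usize> = scored.into_iter().take(k).map(|(i, _)| i).collect();
    keep.sort(); // restore original order
    keep.into_iter().map(|i| sentences[i]).collect()
}

fn main() {
    let sents = [
        "CEP extracts content in five layers.",
        "Cookie banners are removed before scoring.",
        "BM25 ranks sentences by query relevance.",
    ];
    let kept = prune(&sents, "BM25 query relevance", 1);
    assert_eq!(kept, vec!["BM25 ranks sentences by query relevance."]);
    println!("{:?}", kept);
}
```

A real token budget would accumulate sentences in score order until the budget is spent rather than taking a fixed count.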
CLI usage
# Extract content from a URL (auto-detects the best layer)
hsx fetch https://example.com/article
# Force a specific layer (here: headless render, Layer 3)
hsx fetch https://spa-app.com --headless
# Extract text from a PDF (Layer 4)
hsx fetch https://arxiv.org/pdf/2301.00001.pdf
# With token budget
hsx fetch https://example.com --tokens 2000