Overview
Parse is a synchronous API that extracts structured content from web pages. Give it a URL or raw HTML, and it returns structured data -- title, author, article text, images, and metadata -- in a single request. No polling, no webhooks. One call and you have your data.
What Parse Does
Send a POST /parse request with either a URL or raw HTML, and Parse returns structured content extracted by AI. The response includes the page's primary content, consumability assessment, and HTTP metadata from the scrape. Everything happens synchronously -- the response contains your extracted data.
How It Works
Parse operates in two modes:
- URL mode: Provide a url. Parse loads the page in a headless browser, then runs AI extraction on the rendered content.
- HTML mode: Provide raw HTML in the html field. Parse skips the browser scrape entirely and runs AI extraction directly on the provided markup.
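The two modes differ only in which field the request body carries. The sketch below builds a POST /parse request for either mode using the Python standard library. The base URL, the API key, and the bearer-token header are placeholders, since this overview does not state the host or the authentication scheme; rejecting requests that set both or neither field is likewise an assumption.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical values: the host and auth scheme are not given in this overview.
BASE_URL = "https://api.example.com"
API_KEY = "YOUR_API_KEY"


def build_parse_request(url: Optional[str] = None,
                        html: Optional[str] = None) -> urllib.request.Request:
    """Build a POST /parse request for URL mode or HTML mode."""
    # Assumption: exactly one of the two fields is sent per request.
    if (url is None) == (html is None):
        raise ValueError("provide exactly one of url or html")
    payload = {"url": url} if url is not None else {"html": html}
    return urllib.request.Request(
        f"{BASE_URL}/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` with a generous `timeout` would block until Parse returns, which matches the synchronous model described above.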
The Extraction Pipeline
Regardless of mode, content flows through three stages:
- Content acquisition -- For URL mode, a headless browser loads and renders the page. For HTML mode, the provided markup is stored directly.
- AI-powered structured extraction -- An LLM with structured output analyzes the content, extracting title, author, body text, images, video, and metadata.
- Consumability evaluation -- The extracted content is assessed to determine whether the page contains meaningful, self-contained content.
Understanding the Response
The response data object contains four top-level fields:
hasPrimaryContent (boolean) -- Quick check for whether the page had extractable content. Useful for filtering before inspecting the full response.
consumability (object) -- Contains isConsumable (boolean) and reason (string) explaining the assessment. See the Consumability section below.
primaryContent (object | null) -- The extracted content. Null when no content could be extracted. Contains:
| Field | Type | Description |
|---|---|---|
| title | string \| null | Page or article title |
| description | string \| null | Summary or meta description |
| author | string \| null | Content author |
| publisher | string \| null | Publishing organization |
| publishedAt | string \| null | Original publication date |
| updatedAt | string \| null | Last update date |
| isSponsored | boolean \| null | Whether the content is sponsored |
| isDigest | boolean \| null | Whether the content is a digest or roundup |
| accessRestrictionType | string[] \| null | Detected access restrictions |
| text | object \| null | Body content with simplifiedHtml |
| video | object \| null | Video URL and duration if present |
| primaryImage | object \| null | Primary image with caption and credit |
| originallyPublished | object \| null | Original source info for syndicated content |
All fields are nullable because not every page has every field.
scrape (object) -- Contains the HTTP status code from the browser scrape. Present only in URL mode; in HTML mode the field is omitted from the response entirely.
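Because primaryContent can be null, every field inside it can be null, and scrape is absent in HTML mode, client code needs to read the response defensively. A minimal sketch, assuming the data object has already been decoded from JSON into a Python dict; the function name and the keys of the returned summary are illustrative, not part of the API.

```python
from typing import Any, Dict


def summarize_parse_data(data: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few fields out of a Parse `data` object, tolerating nulls."""
    primary = data.get("primaryContent") or {}  # null when nothing was extracted
    text = primary.get("text") or {}            # text itself may be null
    return {
        "has_content": bool(data.get("hasPrimaryContent")),
        "title": primary.get("title"),
        "author": primary.get("author"),
        "html": text.get("simplifiedHtml"),
        "was_scraped": "scrape" in data,        # absent in HTML mode
    }
```

Checking `has_content` first avoids inspecting the rest of the summary for pages where nothing was extracted.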
Consumability
Consumability answers the question: does this page contain self-contained, meaningful content that a reader could consume?
Consumable examples: news articles, blog posts, product pages, event listings, documentation pages, recipe pages.
Not consumable examples: homepages with only navigation links, search results pages, error pages, login forms, bot detection pages, index pages.
The reason field provides a natural language explanation of the assessment:
- "Page contains a full news article with headline, byline, and body text."
- "Page is a 404 error with no consumable article content."
- "No primary textual content found on the provided page copy."
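When processing many pages, the isConsumable flag and reason string make it straightforward to keep the useful results and log why the others were dropped. The sketch below assumes a list of already-decoded data objects; the function name and the fallback reason text are illustrative.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_by_consumability(
    results: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Separate consumable results from the reasons the others were rejected."""
    keep: List[Dict[str, Any]] = []
    rejected_reasons: List[str] = []
    for data in results:
        verdict = data.get("consumability") or {}
        if verdict.get("isConsumable"):
            keep.append(data)
        else:
            # Fallback text is a local placeholder, not an API value.
            rejected_reasons.append(verdict.get("reason", "no reason given"))
    return keep, rejected_reasons
```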
Access Restrictions
Parse detects when content is blocked or restricted. The primaryContent.accessRestrictionType field returns an array of restriction types when detected, or null when no restrictions are found.
Restriction types:
| Type | Description |
|---|---|
| subscription-required | Content behind a paywall |
| bot-detected | Page served a bot detection challenge |
| captcha | CAPTCHA presented instead of content |
| adblock-detected | Page blocked content due to ad blocker detection |
| login-required | Content requires authentication |
| geo | Content restricted by geographic location |
| other | Other restriction not covered above |
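Since accessRestrictionType is either an array or null, it helps to normalize it to a list before branching on it. The sketch below does that and adds one illustrative grouping: treating subscription-required and login-required as "credential gated" is a choice made for this example, not a classification the API provides.

```python
from typing import Any, Dict, List

# Illustrative grouping only; the API does not define this category.
CREDENTIAL_GATED = {"subscription-required", "login-required"}


def get_restrictions(data: Dict[str, Any]) -> List[str]:
    """Return detected restriction types, or an empty list when none are reported."""
    primary = data.get("primaryContent") or {}
    return list(primary.get("accessRestrictionType") or [])


def is_credential_gated(data: Dict[str, Any]) -> bool:
    """True when the page appears to need a subscription or login."""
    return any(r in CREDENTIAL_GATED for r in get_restrictions(data))
```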
Content Types
Parse handles a range of page types:
- Text articles -- News stories, blog posts, documentation. Extracts title, author, body text, and metadata.
- Video pages -- Extracts video URL and duration alongside any surrounding text content.
- Image-heavy pages -- Extracts the primary image with caption and credit information.
- Syndicated content -- Detects republished or syndicated articles and provides original source information via originallyPublished.
Idempotency
Parse supports an optional jobId parameter for deduplication:
- jobId uniqueness is scoped per organization -- different orgs can reuse the same jobId.
- Submitting the same jobId within the same org reconnects to the existing workflow result rather than starting a new parse.
- Without a jobId, a random UUID is generated for each request.
- If a previous job with the same jobId failed, Parse returns an error asking you to retry with a new jobId.
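One way to take advantage of this behavior is to derive the jobId deterministically from the input, so that a retried request for the same URL reconnects to the existing result, and to bump a counter when a failed job forces a new jobId. This is a sketch of one possible scheme, not a recommended format: the `parse-` prefix, the SHA-256 hash, and the attempt suffix are all assumptions, as is the idea that hex digits and hyphens are acceptable jobId characters.

```python
import hashlib

MAX_JOB_ID_LENGTH = 256  # documented maximum jobId length


def make_job_id(url: str, attempt: int = 0) -> str:
    """Derive a stable jobId from a URL; raise `attempt` after a failed job."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()  # 64 hex characters
    job_id = f"parse-{digest}-{attempt}"
    if len(job_id) > MAX_JOB_ID_LENGTH:
        raise ValueError("jobId exceeds the documented length limit")
    return job_id
```

Calling `make_job_id(url)` twice yields the same value, while `make_job_id(url, attempt=1)` produces a fresh one for a retry after a failure.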
Processing Time
Processing times are approximate and vary based on page complexity, server load, and network conditions.
| Scenario | Approximate Time |
|---|---|
| Simple pages (e.g., example.com) | ~10-20s |
| Standard articles | ~20-45s |
| Heavy JS-rendered pages | ~30-60s |
| Maximum timeout | 10 minutes |
Limits
| Limit | Value |
|---|---|
| Max HTML size | 2 MB |
| Max request body size | 3 MB |
| Max title length | 1,000 characters |
| Max jobId length | 256 characters |
| Workflow timeout | 10 minutes |
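Checking these limits on the client before sending can save a round trip. The sketch below makes several assumptions that this overview does not settle: that "MB" means 1,048,576 bytes, that sizes are measured on the UTF-8 encoding, and that jobId travels in the JSON body alongside url or html.

```python
import json
from typing import Any, Dict, List

# Assumes binary megabytes and UTF-8 byte length; adjust if the API counts differently.
MAX_HTML_BYTES = 2 * 1024 * 1024
MAX_BODY_BYTES = 3 * 1024 * 1024
MAX_JOB_ID_CHARS = 256


def check_limits(payload: Dict[str, Any]) -> List[str]:
    """Return a list of limit violations for a request payload (empty if none)."""
    problems: List[str] = []
    html = payload.get("html")
    if html is not None and len(html.encode("utf-8")) > MAX_HTML_BYTES:
        problems.append("html exceeds 2MB")
    job_id = payload.get("jobId")
    if job_id is not None and len(job_id) > MAX_JOB_ID_CHARS:
        problems.append("jobId exceeds 256 characters")
    if len(json.dumps(payload).encode("utf-8")) > MAX_BODY_BYTES:
        problems.append("request body exceeds 3MB")
    return problems
```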
Next Steps
- Quickstart: Get parsing working in under 2 minutes
- API Reference: Complete endpoint documentation